nsivabalan commented on issue #2639: URL: https://github.com/apache/hudi/issues/2639#issuecomment-798640139
@afeldman1: Is it possible for you to try Hudi 0.7.0 with EMR? The time increase definitely seems unacceptable. I could not spot any issues on first look, especially since you have run it with Spark 2 and the issue appears only with Spark 3. Can you try setting these configs and let us know how it goes:

```
hoodie.datasource.write.row.writer.enable -> true
DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL
```

I don't expect this to completely solve your perf issue, since your screenshots show delays in fetching small files, but it will give us more signal. Also, can you provide the Spark UI for the stages with Spark 3.0.1 as well?

Regarding the difference between the "insert" and "bulk_insert" operations: with bulk insert you can configure different [sort modes](https://hudi.apache.org/docs/configurations.html#withBulkInsertSortMode):
- GLOBAL_SORT: all incoming records are sorted globally before being partitioned and written to Hudi.
- PARTITION_SORT: records are sorted locally within each RDD partition after the coalesce, then written to Hudi.
- NONE: no ordering is applied.

If your record keys are completely random, you might as well set the sort mode to NONE. NONE is expected to match spark.write.parquet() in terms of the number of files and overhead.

One clarification as well: if you are incrementally ingesting data, why not consider "upsert" as your operation? Or are you just trying to load data into Hudi for the first time and trying out different approaches?
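For reference, here is a sketch of how those options could be passed to a DataFrame write. This is illustrative only: `inputDf`, the table name, and the output path are placeholders, and key/partition field settings from your existing job would still apply.

```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.spark.sql.SaveMode

// Sketch only: inputDf, "my_table", and the S3 path are placeholders.
inputDf.write
  .format("hudi")
  .option("hoodie.table.name", "my_table")
  // enable the row-writer path for bulk_insert
  .option("hoodie.datasource.write.row.writer.enable", "true")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
          DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
  // with random record keys, NONE skips the sort overhead entirely
  .option("hoodie.bulkinsert.sort.mode", "NONE")
  .mode(SaveMode.Append)
  .save("s3://bucket/path/to/table")
```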