nsivabalan edited a comment on issue #2639: URL: https://github.com/apache/hudi/issues/2639#issuecomment-798640139
@afeldman1: is it possible for you to try Hudi 0.7.0 with EMR? The time increase definitely seems unacceptable. Can you try setting these configs and let us know how it goes?

- `hoodie.datasource.write.row.writer.enable` -> `true`
- `DataSourceWriteOptions.OPERATION_OPT_KEY` -> `DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL`

Regarding the difference between the "insert" and "bulk_insert" operations: with bulk insert, you can configure different sort modes.

- `GLOBAL_SORT`: all incoming records are sorted globally before being partitioned and written to Hudi.
- `PARTITION_SORT`: records are sorted locally within each RDD partition after the coalesce, and then written to Hudi.
- `NONE`: no ordering is applied with bulk_insert.

If your record keys are completely random, you might as well set this [sort mode](https://hudi.apache.org/docs/configurations.html#withBulkInsertSortMode) to `NONE`. `NONE` is expected to match `spark.write.parquet()` in terms of number of files and overhead.

Also, can you share the Spark UI stages with Spark 3.0.1 for the job that's taking hours?

I have one clarification as well: if you are incrementally ingesting data, why not consider "upsert" as your operation? Or are you just trying to load data into Hudi for the first time and trying out different approaches?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
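To make the suggestion concrete, here is a rough sketch of a bulk_insert write with the row writer enabled and sort mode `NONE` (the DataFrame `df`, the `basePath`, and the record-key/partition/table names are placeholders, not from this thread; config names assume Hudi 0.7.0):

```scala
// Hypothetical example — df, basePath, "id", "date", and "my_table"
// are illustrative placeholders.
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

df.write.format("hudi")
  // use bulk_insert instead of the default upsert
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
          DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
  // row-writer path avoids the RDD conversion overhead
  .option("hoodie.datasource.write.row.writer.enable", "true")
  // NONE skips sorting entirely; appropriate when record keys are random
  .option("hoodie.bulkinsert.sort.mode", "NONE")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "date")
  .option(HoodieWriteConfig.TABLE_NAME, "my_table")
  .mode(SaveMode.Append)
  .save(basePath)
```

With `GLOBAL_SORT` or `PARTITION_SORT` you would change only the `hoodie.bulkinsert.sort.mode` value; the rest of the write stays the same.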