nsivabalan edited a comment on issue #2639:
URL: https://github.com/apache/hudi/issues/2639#issuecomment-798640139


   @afeldman1 : is it possible for you to try Hudi 0.7.0 with EMR? The time increase definitely seems unacceptable.
   
   Can you try setting these configs and let us know how it goes:
   `hoodie.datasource.write.row.writer.enable -> true`
   `DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL`
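   In case it helps, here is a minimal sketch of what that write could look like (Scala). The table name, record key/partition fields, and target path below are placeholders, not taken from your setup:

   ```scala
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.HoodieWriteConfig

   // bulk_insert with the row writer enabled; `df` is your input DataFrame.
   df.write
     .format("hudi")
     .option(HoodieWriteConfig.TABLE_NAME, "my_table")                    // placeholder table name
     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "uuid")      // placeholder key field
     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "date")  // placeholder partition field
     .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
       DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
     .option("hoodie.datasource.write.row.writer.enable", "true")
     .mode("append")
     .save("s3://my-bucket/hudi/my_table")                                // placeholder path
   ```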
   
   W.r.t. the difference between the "insert" and "bulk_insert" operations: in bulk_insert, you can configure different sort modes.
   - GLOBAL_SORT: all incoming records are sorted globally before being partitioned and written to Hudi.
   - PARTITION_SORT: records are sorted locally within each RDD partition after the coalesce step, then written to Hudi.
   - NONE: no sorting is done; choose this if you don't need any ordering with bulk_insert.
   
   If your record keys are completely random, you might as well set this [sort mode](https://hudi.apache.org/docs/configurations.html#withBulkInsertSortMode) to NONE.
   With NONE, bulk_insert is expected to match spark.write.parquet() in terms of number of files and overhead.
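   For completeness, a sketch of setting the sort mode explicitly (assuming the `hoodie.bulkinsert.sort.mode` key from the config page linked above):

   ```scala
   // Same bulk_insert write as in the earlier snippet, with an explicit sort mode.
   // NONE skips the sorting step entirely, which suits fully random record keys.
   df.write
     .format("hudi")
     // ... same table/key/operation options as in the earlier snippet ...
     .option("hoodie.bulkinsert.sort.mode", "NONE") // or "GLOBAL_SORT" / "PARTITION_SORT"
     .mode("append")
     .save("s3://my-bucket/hudi/my_table")          // placeholder path
   ```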
   
   Also, can you share the Spark UI stages for the Spark 3.0.1 run of the job that's taking hours?
   
   I have one clarification as well: if you are incrementally ingesting data, why not consider "upsert" as your operation? Or are you just trying to load data into Hudi for the first time and trying out different approaches?
   
   
   

