nsivabalan commented on issue #2639: URL: https://github.com/apache/hudi/issues/2639#issuecomment-798640139
@afeldman1: Is it possible for you to try Hudi 0.7.0 with EMR? The time increase definitely seems unacceptable. I could not spot any issues on first look, especially since you have run it with Spark 2 and the issue appears only with Spark 3. Can you try setting these configs and let us know how it goes:

```
hoodie.datasource.write.row.writer.enable -> true
DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL
```

I don't expect this to completely solve your perf issue, since your screenshots show delays in fetching small files, but it will give us more signal. Also, can you provide the Spark UI for the stages with Spark 3.0.1 as well?

Regarding the difference between the "insert" and "bulk_insert" operations: with bulk insert you can configure different [sort modes](https://hudi.apache.org/docs/configurations.html#withBulkInsertSortMode):
- GLOBAL_SORT: all incoming records are sorted globally before being partitioned and written to Hudi.
- PARTITION_SORT: records are sorted locally within each RDD partition after the coalesce, then written to Hudi.
- NONE: no ordering is applied.

If your record keys are completely random, you might as well set the sort mode to NONE. NONE is expected to match spark.write.parquet() in terms of the number of files and overhead.

One clarification as well: if you are incrementally ingesting data, why not consider "upsert" as your operation? Or are you just trying to load data into Hudi for the first time and trying out different approaches?
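For reference, here is a sketch of how those options could be passed to a DataFrame write. This is illustrative only: `inputDf`, the table name, and the output path are placeholders, and key/partition field settings from your existing job would still apply.

```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.spark.sql.SaveMode

// Sketch only: inputDf, "my_table", and the S3 path are placeholders.
inputDf.write
  .format("hudi")
  .option("hoodie.table.name", "my_table")
  // enable the row-writer path for bulk_insert
  .option("hoodie.datasource.write.row.writer.enable", "true")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
          DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
  // with random record keys, NONE skips the sort overhead entirely
  .option("hoodie.bulkinsert.sort.mode", "NONE")
  .mode(SaveMode.Append)
  .save("s3://bucket/path/to/table")
```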