[GitHub] [hudi] xushiyan commented on issue #3933: [SUPPORT] Large amount of disk spill on initial upsert/bulk insert

GitBox Sun, 21 Nov 2021 11:33:20 -0800


xushiyan commented on issue #3933:
URL: https://github.com/apache/hudi/issues/3933#issuecomment-974875070



   @Limess a few questions
   
   - for this dataset, do you want to run bulk insert or upsert? If it's an 
append-only dataset, then bulk insert should be the mode
   - do smaller parquet files matter a lot in this case? `GLOBAL_SORT` is 
expected to use more disk space, as it does a shuffling sort to line up records 
for bulk writing. If you change to the `NONE` sort mode, you'd get more small 
files but faster writes and less disk spill. Small files can be mitigated by 
clustering. There is a trade-off to consider based on your needs.
   - 10036 parallelism may be a bit too high; maybe try some number around 
3600 = # executors x # cores x 2?
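
A rough sketch of how the suggestions above could translate into writer options, in case it helps. The executor and core counts (450 and 4) are hypothetical, picked only so the arithmetic lands on 3600; the helper names are made up for illustration. The option keys are the Hudi config names as I understand them, so please check them against the configuration reference for your Hudi version.

```python
def suggested_parallelism(num_executors: int, cores_per_executor: int, factor: int = 2) -> int:
    """Rule of thumb from the comment: # executors x # cores x 2."""
    return num_executors * cores_per_executor * factor


def bulk_insert_options(parallelism: int, sort_mode: str = "NONE") -> dict:
    """Build Hudi writer options for a bulk insert (illustrative helper only)."""
    return {
        # Append-only data: use bulk insert rather than upsert.
        "hoodie.datasource.write.operation": "bulk_insert",
        # NONE skips the shuffling sort that GLOBAL_SORT performs,
        # trading more small files for less disk spill.
        "hoodie.bulkinsert.sort.mode": sort_mode,
        "hoodie.bulkinsert.shuffle.parallelism": str(parallelism),
        # Inline clustering is one way to compact the resulting small files.
        "hoodie.clustering.inline": "true",
    }


# Hypothetical cluster shape: 450 executors with 4 cores each.
opts = bulk_insert_options(suggested_parallelism(450, 4))
```

These could then be passed to the Spark writer with something like `df.write.format("hudi").options(**opts)`, alongside the usual table name, record key and path settings, which are omitted here.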
   
   @nsivabalan @yihua any other tuning tips that could help here?


