Limess edited a comment on issue #3933: URL: https://github.com/apache/hudi/issues/3933#issuecomment-974883342
Thanks! We're using bulk insert for this job and are happy with the performance vs. regular upsert.

Re: parallelism, we bumped this up after:

1. Reading https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide, which I guess is now out of date:
   > We're setting parallelism based on the Tuning Guide, which states to set it such that it's at least input_data_size/500MB.
2. Observing the disk spill: we found that increasing parallelism reduced it.

Smaller parquet files don't matter too much. If clustering can later fix the small-files/sorting problems, that sounds like a good thing to look at down the line (we haven't investigated clustering at all yet).
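For reference, the Tuning Guide rule of thumb mentioned above can be sketched as a small helper that derives a shuffle parallelism from the input size. This is a hedged illustration: the function name and the 500 MB constant are assumptions based on the guideline quoted here, and `hoodie.bulkinsert.shuffle.parallelism` is the Hudi write option that controls bulk-insert parallelism.

```python
def bulk_insert_parallelism(input_size_bytes: int, target_file_mb: int = 500) -> int:
    """Return a parallelism of at least input_data_size / 500 MB (Tuning Guide rule)."""
    target_bytes = target_file_mb * 1024 * 1024
    return max(1, -(-input_size_bytes // target_bytes))  # ceiling division

# Example: a 1 TiB input would need at least ~2098 tasks.
parallelism = bulk_insert_parallelism(1 * 1024**4)

# The derived value would then be passed as a Hudi write option in Spark,
# alongside the bulk_insert operation:
hudi_options = {
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.bulkinsert.shuffle.parallelism": str(parallelism),
}
```

Increasing this value spreads the same data over more shuffle tasks, which is consistent with the reduced disk spill observed above, at the cost of smaller output parquet files.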