Limess edited a comment on issue #3933: URL: https://github.com/apache/hudi/issues/3933#issuecomment-974883342
Thanks! We're using bulk insert for this job and are happy with the performance vs. regular upsert.

Re: parallelism, we bumped this up after:

1. Reading https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide, which I guess is now out of date:
   > We're setting parallelism based on the Tuning Guide, which states to set it such that it's at least input_data_size/500MB.
2. Observing the disk spill: we found that increasing parallelism reduced it.

Smaller parquet files don't matter too much. If clustering can later fix the small-files/sorting problems, that sounds like a good thing to look at down the line (we haven't investigated clustering at all yet).
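For reference, the Tuning Guide rule of thumb mentioned above can be sketched as a small helper that derives a shuffle parallelism from the input size. This is a hedged illustration: the function name and the 500 MB constant are assumptions based on the guideline quoted here, and `hoodie.bulkinsert.shuffle.parallelism` is the Hudi write option that controls bulk-insert parallelism.

```python
def bulk_insert_parallelism(input_size_bytes: int, target_file_mb: int = 500) -> int:
    """Return a parallelism of at least input_data_size / 500 MB (Tuning Guide rule)."""
    target_bytes = target_file_mb * 1024 * 1024
    return max(1, -(-input_size_bytes // target_bytes))  # ceiling division

# Example: a 1 TiB input would need at least ~2098 tasks.
parallelism = bulk_insert_parallelism(1 * 1024**4)

# The derived value would then be passed as a Hudi write option in Spark,
# alongside the bulk_insert operation:
hudi_options = {
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.bulkinsert.shuffle.parallelism": str(parallelism),
}
```

Increasing this value spreads the same data over more shuffle tasks, which is consistent with the reduced disk spill observed above, at the cost of smaller output parquet files.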