VitoMakarevich opened a new issue, #10997: URL: https://github.com/apache/hudi/issues/10997
**Describe the problem you faced**

We are using Spark 3.3 and Hudi 0.12.2. I need your assistance in improving the `Doing partition and writing data` stage, which for us looks to be the most time-consuming one. We use `snappy` compression (the most lightweight of the available codecs, as far as I know); file size is ~160 MB, which is effectively 80-90 GB with GZIP (the default codec in Hudi for our workload). The files themselves consist of 1.5-2M rows.

Our problem is that, due to the combination of partitioning and the CDC nature of the workload, we must update a lot of files at peak hours. We have clustering in place to group rows together, but thousands of files are still affected. The 75th percentile of an individual file overwrite (a task in the `Doing partition and writing data` stage) takes ~40-60 seconds, and the duration does not correlate with the number of rows updated inside (at the 75th percentile, fewer than 100 rows change in each file). The payload class is almost the default one (only minor changes that should not affect performance, IMO).

Questions:

1. What knobs can we play with? We tried the compression format (`snappy` looks to be the best among `zstd`, which has a memory leak in Spark 3.3 BTW, and `gzip`). We also tried raising `hoodie.write.buffer.limit.bytes` to 32 MB, unfortunately with no visible difference (the configs we touched are sketched at the end of this issue). Are there any others?
2. Do you know of any performance improvements in newer versions (0.12.3-0.14.1) specifically around the file write (`MergeHandle`) task?

**Environment Description**

* Hudi version : 0.12.2
* Spark version : 3.3.0
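For reference, here is a minimal sketch of how we pass the knobs mentioned above through the DataFrame writer. It assumes a standard upsert; the table name, path, and the record key / partition / precombine fields are placeholders, not our actual schema.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Minimal sketch of the write path, assuming a plain DataFrame upsert.
// Table name, path, and field names below are placeholders for illustration.
def upsert(spark: SparkSession, df: DataFrame): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "my_table")                      // placeholder
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "id")      // placeholder
    .option("hoodie.datasource.write.partitionpath.field", "dt")  // placeholder
    .option("hoodie.datasource.write.precombine.field", "ts")     // placeholder
    // The two knobs discussed above: compression codec and write buffer size.
    .option("hoodie.parquet.compression.codec", "snappy")
    .option("hoodie.write.buffer.limit.bytes", (32 * 1024 * 1024).toString)
    .mode(SaveMode.Append)
    .save("s3://bucket/path/my_table")                            // placeholder
}
```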