VitoMakarevich opened a new issue, #10997:
URL: https://github.com/apache/hudi/issues/10997

   **Describe the problem you faced**
   
   We are using Spark 3.3 and Hudi 0.12.2.
   I need your help with improving the `Doing partition and writing data` stage, which for us is the most time-consuming one. We use `snappy` compression (the most lightweight of the available codecs, as far as I know) and a file size of ~160 MB, which is effectively 80-90 GB with GZIP (the default codec in Hudi) for our workload. Each file contains 1.5-2M rows.
   So our problem is that, due to the partitioning plus the CDC nature of the data, we have to update a lot of files at peak hours. We run clustering to group rows together, but thousands of files are still affected. The 75th percentile of an individual file rewrite (a task in the `Doing partition and writing data` stage) takes ~40-60 seconds, and this does not correlate with the number of rows updated in the file (at the 75th percentile it is < 100 changed rows per file). Also, the payload class is almost the default one (minor changes that do not affect performance, IMO).
   Q:
   1. What knobs can we play with?
   We tried the compression format (`snappy` looks to be the best among `zstd`, which BTW has a memory leak in Spark 3.3, and `gzip`).
   We also tried `hoodie.write.buffer.limit.bytes`, raising it to 32 MB, unfortunately with no visible difference. The settings we experimented with are summarized in the sketch after these questions.
   Is there anything else?
   2. Do you know of any performance improvements in newer versions (0.12.3-0.14.1) specifically regarding the file write (`MergeHandle`) task?
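
   For context, here is a minimal sketch (Scala, Spark DataFrame API) of how we set the knobs mentioned above. The table name, record key / partition / precombine fields and the target path are placeholders rather than our real values, and `df` stands for the incoming CDC batch:

   ```scala
   import org.apache.spark.sql.{DataFrame, SaveMode}

   // Write-side knobs we experimented with; identifiers below are placeholders.
   def writeBatch(df: DataFrame): Unit = {
     val hudiOptions = Map(
       "hoodie.table.name"                           -> "my_table", // placeholder
       "hoodie.datasource.write.operation"           -> "upsert",
       "hoodie.datasource.write.recordkey.field"     -> "id",       // placeholder
       "hoodie.datasource.write.partitionpath.field" -> "dt",       // placeholder
       "hoodie.datasource.write.precombine.field"    -> "ts",       // placeholder

       // Base-file compression codec: we compared snappy, gzip and zstd.
       "hoodie.parquet.compression.codec"            -> "snappy",

       // Target base-file size, ~160 MB in our setup.
       "hoodie.parquet.max.file.size"                -> (160L * 1024 * 1024).toString,

       // Write buffer we raised to 32 MB with no visible difference.
       "hoodie.write.buffer.limit.bytes"             -> (32 * 1024 * 1024).toString
     )

     df.write
       .format("hudi")
       .options(hudiOptions)
       .mode(SaveMode.Append)
       .save("s3://bucket/path/to/table") // placeholder path
   }
   ```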
   
   **Environment Description**
   
   * Hudi version : 0.12.2
   
   * Spark version : 3.3.0
   
   

