xushiyan commented on issue #10997:
URL: https://github.com/apache/hudi/issues/10997#issuecomment-2058238697

   > we have clustering to group rows together, but it's still thousands of
   > files affected. 75th percentile of individual file overwrite (task in the
   > Doing partition and writing data stage) takes ~40-60 seconds
   
   Based on this, I think clustering can be tuned further so that files are
rewritten in a way that directs more updates to the same file, which reduces
write amplification. Make sure the number of clustering groups is not capped
at the default of 30 (`hoodie.clustering.plan.strategy.max.num.groups`);
otherwise many files are left out of each clustering plan. COW (copy-on-write)
tables are expected to show high write amplification under heavy updates,
especially when the updates are spread across many files. Also consider a
partitioning scheme that concentrates updates in a few partitions, if
possible. Upgrading to a newer version would also let you try configuring the
executor type.
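
As a rough illustration of the tuning suggested above, the sketch below builds a set of Hudi write options that raise the clustering group limit and set the executor type. The helper name `clustering_write_options`, the value 300, the column name `record_key_col`, and the choice of `SIMPLE` are assumptions made for illustration, not recommendations from this thread. The executor type key is assumed to be `hoodie.write.executor.type`, which should only be available in newer releases. Check every key and value against the documentation for your Hudi version.

```python
def clustering_write_options(max_num_groups=300, sort_columns="record_key_col"):
    """Return illustrative Hudi write options as a plain dict of strings.

    All values are placeholders (assumptions) and need tuning per table.
    """
    return {
        # Run clustering inline with the write, every few commits.
        "hoodie.clustering.inline": "true",
        "hoodie.clustering.inline.max.commits": "4",
        # Raise the per-plan group limit above the default of 30 so that
        # more files are eligible in each clustering plan.
        "hoodie.clustering.plan.strategy.max.num.groups": str(max_num_groups),
        # Sort on the column(s) that updates are keyed on, so that related
        # rows end up in the same files (hypothetical column name).
        "hoodie.clustering.plan.strategy.sort.columns": sort_columns,
        # Executor type; assumed to require a newer Hudi release.
        "hoodie.write.executor.type": "SIMPLE",
    }


options = clustering_write_options()
for key, value in sorted(options.items()):
    print(f"{key}={value}")
```

With PySpark, a dict like this could be passed along with the usual table options, for example `df.write.format("hudi").options(**options)`, before calling `mode(...)` and `save(...)`.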


