GitHub user xushiyan closed a discussion: Make cleaning adaptive to workload
Throughout the lifetime of a table, the number of files written by commits can vary a lot, e.g., a bulk insert followed by upserts/inserts, or periods of traffic spikes. The cleaning process, whether inline or async, should adapt to the workload. For example, the parallelism could be inferred dynamically. Currently, for execution, it is capped at the configured value:

> The clean execution, i.e., the file deletion, is parallelized at file level,
> which is the unit of Spark task distribution. Similarly, the actual
> parallelism cannot exceed the configured value if the number of files is
> larger. If cleaning plan or execution is slow due to limited parallelism, you
> can increase this to tune the performance.

This number can instead be inferred from the planned cleaning tasks. The cleaner utility could also anticipate the cleaning workload and warn, based on the plan, when the configured memory is too low.

GitHub link: https://github.com/apache/hudi/discussions/13846
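As a rough illustration of the idea, the sketch below derives the execution parallelism and a memory warning from the size of the clean plan rather than from a static config value alone. It is a minimal sketch, not Hudi code: the class, method names, per-task batch size, and per-file memory estimate are all hypothetical.

```java
// Minimal sketch (hypothetical names, not Hudi APIs) of sizing the cleaner
// from the clean plan itself.
public final class AdaptiveCleanSizing {

  /**
   * Infer execution parallelism from the number of files the clean plan
   * intends to delete, instead of relying solely on a static config value.
   * filesPerTask is an assumed batch size per Spark task.
   */
  static int inferParallelism(long filesToDelete, int filesPerTask) {
    if (filesToDelete <= 0) {
      return 1;
    }
    // Ceiling division: enough tasks to cover every planned deletion.
    long tasks = (filesToDelete + filesPerTask - 1) / filesPerTask;
    return (int) Math.min(Integer.MAX_VALUE, tasks);
  }

  /**
   * Rough memory check: warn when the plan is large relative to the memory
   * available to the cleaner (the per-file cost is an assumed constant).
   */
  static void warnIfMemoryTooLow(long filesToDelete, long availableMemoryBytes) {
    final long assumedBytesPerFileEntry = 1_024L; // hypothetical estimate
    long estimated = filesToDelete * assumedBytesPerFileEntry;
    if (estimated > availableMemoryBytes) {
      System.err.printf(
          "Clean plan needs ~%d bytes for %d files but only %d bytes are available;"
              + " consider raising the cleaner memory setting.%n",
          estimated, filesToDelete, availableMemoryBytes);
    }
  }

  public static void main(String[] args) {
    // Example: a traffic spike produced 50,000 files; 100 files per task.
    System.out.println(inferParallelism(50_000, 100)); // 500
    // A small incremental clean needs only a single task.
    System.out.println(inferParallelism(37, 100));     // 1
    warnIfMemoryTooLow(50_000, 16L * 1024 * 1024);     // prints a warning
  }
}
```

The point of the sketch is only that both numbers fall out of the plan, so a one-time config choice does not have to anticipate every future workload.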
