GitHub user xushiyan closed a discussion: Make cleaning adaptive to workload
Throughout the lifetime of a table, the number of files written by commits can vary a lot, e.g., a bulk insert followed by upserts/inserts, or periods of traffic spikes. The cleaning process, whether inline or async, should adapt to the workload. For example, the parallelism could be inferred dynamically. Currently, for execution, it is capped at the configured value:

> The clean execution, i.e., the file deletion, is parallelized at file level,
> which is the unit of Spark task distribution. Similarly, the actual
> parallelism cannot exceed the configured value if the number of files is
> larger. If cleaning plan or execution is slow due to limited parallelism, you
> can increase this to tune the performance.

This number can instead be inferred from the planned cleaning tasks. The cleaner utility could also anticipate the cleaning workload and warn, based on the plan, when the configured memory is too low.

GitHub link: https://github.com/apache/hudi/discussions/13846
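As a rough illustration of the idea, the sketch below derives the execution parallelism and a memory warning from the size of the clean plan rather than from a static config value alone. It is a minimal sketch, not Hudi code: the class, method names, per-task batch size, and per-file memory estimate are all hypothetical.

```java
// Minimal sketch (hypothetical names, not Hudi APIs) of sizing the cleaner
// from the clean plan itself.
public final class AdaptiveCleanSizing {

  /**
   * Infer execution parallelism from the number of files the clean plan
   * intends to delete, instead of relying solely on a static config value.
   * filesPerTask is an assumed batch size per Spark task.
   */
  static int inferParallelism(long filesToDelete, int filesPerTask) {
    if (filesToDelete <= 0) {
      return 1;
    }
    // Ceiling division: enough tasks to cover every planned deletion.
    long tasks = (filesToDelete + filesPerTask - 1) / filesPerTask;
    return (int) Math.min(Integer.MAX_VALUE, tasks);
  }

  /**
   * Rough memory check: warn when the plan is large relative to the memory
   * available to the cleaner (the per-file cost is an assumed constant).
   */
  static void warnIfMemoryTooLow(long filesToDelete, long availableMemoryBytes) {
    final long assumedBytesPerFileEntry = 1_024L; // hypothetical estimate
    long estimated = filesToDelete * assumedBytesPerFileEntry;
    if (estimated > availableMemoryBytes) {
      System.err.printf(
          "Clean plan needs ~%d bytes for %d files but only %d bytes are available;"
              + " consider raising the cleaner memory setting.%n",
          estimated, filesToDelete, availableMemoryBytes);
    }
  }

  public static void main(String[] args) {
    // Example: a traffic spike produced 50,000 files; 100 files per task.
    System.out.println(inferParallelism(50_000, 100)); // 500
    // A small incremental clean needs only a single task.
    System.out.println(inferParallelism(37, 100));     // 1
    warnIfMemoryTooLow(50_000, 16L * 1024 * 1024);     // prints a warning
  }
}
```

The point of the sketch is only that both numbers fall out of the plan, so a one-time config choice does not have to anticipate every future workload.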
