[I] [Discussion WIP] Add utility to both schedule and execute table services on MDT without needing to write to data table [hudi]

via GitHub Tue, 16 Jun 2026 16:15:43 -0700


kbuci opened a new issue, #19025:
URL: https://github.com/apache/hudi/issues/19025


   ### Feature Description
   
   **What the feature achieves:**
   - Add a utility API that invokes the same steps as 
`HoodieBackedTableMetadataWriter::performTableServices` to schedule and execute 
compaction, clean, and archival on the MDT. This lets a user ensure that the 
MDT of a dataset has no extra/uncompacted files (which can impact storage 
footprint or read times) without having to perform an unnecessary or "empty" 
write on the data table.
   
     - By default, it should hold the data table's lock, at least while 
validating and scheduling clean/compaction plans. If we add support for a 
separate lock on the MDT, then we can relax this constraint.
     - When executing the compaction plan, however, we can avoid holding the 
table lock the entire time by leveraging the configs in 
https://github.com/apache/hudi/pull/18295 . This is useful when an MDT has an 
RLI and executing compaction can take more than several minutes.
       - We can extend this support to clean as well in the future.
     - As an extra safety measure, we can add a tunable config to control 
whether to `schedule`, `execute`, or `both schedule and execute` clean and 
compaction. This is because, for datasets with an RLI and many record index 
shards, 
[compaction](https://github.com/apache/hudi/issues/17908#issue-3819983507) and 
clean may require a lot of time or Spark executor resources. As a result, a 
writer may not have sufficient Spark resources to execute such a plan within a 
reasonable time bound.
   
   - Note that these will only trigger and perform table services if the 
expected criteria/conditions are met. For example, if there is an older 
inflight instant or not enough accumulated writes, then compaction/clean won't 
be attempted. We just want a writer to be able to run the same steps that 
`HoodieBackedTableMetadataWriter::performTableServices` would go through, 
except without having to write to the data table.
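
The flow described above can be sketched roughly as follows. This is only an 
illustration of the proposed behavior, not Hudi code: the class 
`MdtTableServiceSketch`, the `Mode` enum, the constructor parameters, and the 
log strings are all hypothetical names made up for this sketch, and a plain 
`ReentrantLock` stands in for the data table lock. It assumes scheduling 
happens under the lock and execution happens outside it, as suggested in the 
bullets above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of the proposed utility; none of these names exist in Hudi. */
class MdtTableServiceSketch {

  /** Hypothetical tunable config: which phases this invocation may run. */
  enum Mode { SCHEDULE, EXECUTE, SCHEDULE_AND_EXECUTE }

  // Stand-in for the data table lock (assumption: a simple in-process lock).
  private final ReentrantLock dataTableLock = new ReentrantLock();
  // Plans that were scheduled but not yet executed.
  private final List<String> pendingPlans = new ArrayList<>();
  // Record of what the utility did, in order.
  private final List<String> log = new ArrayList<>();

  // Stand-ins for the criteria that performTableServices would check.
  private final boolean olderInflightInstant;
  private final int accumulatedWrites;
  private final int minWritesBeforeCompaction;

  MdtTableServiceSketch(boolean olderInflightInstant,
                        int accumulatedWrites,
                        int minWritesBeforeCompaction) {
    this.olderInflightInstant = olderInflightInstant;
    this.accumulatedWrites = accumulatedWrites;
    this.minWritesBeforeCompaction = minWritesBeforeCompaction;
  }

  /** Runs the phases allowed by {@code mode} and returns the accumulated log. */
  List<String> run(Mode mode) {
    if (mode != Mode.EXECUTE) {
      // Validate and schedule while holding the data table lock.
      dataTableLock.lock();
      try {
        if (olderInflightInstant) {
          log.add("skip scheduling: older inflight instant");
        } else if (accumulatedWrites < minWritesBeforeCompaction) {
          log.add("skip scheduling: not enough accumulated writes");
        } else {
          pendingPlans.add("compaction");
          log.add("scheduled compaction");
        }
      } finally {
        dataTableLock.unlock();
      }
    }
    if (mode != Mode.SCHEDULE) {
      // Execute outside the lock so a long compaction does not block writers.
      for (String plan : pendingPlans) {
        log.add("executed " + plan);
      }
      pendingPlans.clear();
    }
    return log;
  }

  public static void main(String[] args) {
    MdtTableServiceSketch sketch = new MdtTableServiceSketch(false, 10, 5);
    System.out.println(sketch.run(Mode.SCHEDULE_AND_EXECUTE));
  }
}
```

With a split like this, one job with few resources could call 
`run(Mode.SCHEDULE)` and a separately provisioned job could later call 
`run(Mode.EXECUTE)`, which is the motivation given for the tunable config. If 
a criterion is not met, nothing is scheduled and the execute phase has nothing 
to do.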
   
   **Why this feature is needed:**
   This is similar to the original sub-ask in 
https://github.com/apache/hudi/issues/17908#issuecomment-3923436575 . Our org 
has a use case where we need to run such a utility to avoid a buildup of 
data/instant files in the MDT, which can impact writes and cause storage to 
grow unbounded. Typically this scenario happens when there is a backfill of 
clustering/deletePartition writes on a dataset, and those writes do not 
perform MDT table services. They cannot be configured to do so, since they may 
not have sufficient Spark executors to compact/clean an MDT with a large RLI.
   Currently, we work around this by performing an "empty commit" on the data 
table at a regular cadence (with sufficient Spark resources) to perform this 
MDT "cleanup". But this is not an ideal solution, as it makes observability 
more difficult (distinguishing an "empty" write from an "actual" write) and 
adds more instants to the data table and MDT timelines (the latter required us 
to add the optimization in 
https://github.com/apache/hudi/pull/18215#issue-3955154328 ).
   
   
   ### User Experience
   
   **How users will use this feature:**
   - Configuration changes needed
   - API changes
   - Usage examples
   
   
   ### Hudi RFC Requirements
   
   **RFC PR link:** (if applicable)
   
   **Why RFC is/isn't needed:**
   - Does this change public interfaces/APIs? (Yes/No)
   - Does this change storage format? (Yes/No)
   - Justification:
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
