suryaprasanna opened a new pull request, #17935: URL: https://github.com/apache/hudi/pull/17935
### Describe the issue this Pull Request addresses

Metadata table currently inherits the cleaning policy from the data table, which may not always be optimal for metadata table operations. This PR introduces a dedicated configurable cleaning policy for the metadata table that can be set independently from the data table.

### Summary and Changelog

This PR adds support for configuring the metadata table's cleaning policy independently from the data table. Users can now set `hoodie.metadata.clean.policy` to control how the metadata table performs cleaning operations.

**Changes:**

- Added new config `hoodie.metadata.clean.policy` in `HoodieMetadataConfig` with default value `KEEP_LATEST_FILE_VERSIONS`
- Added `getCleanerPolicy()` getter method in `HoodieMetadataConfig` to retrieve the configured policy
- Added `withCleanerPolicy()` builder method in `HoodieMetadataConfig.Builder` to set the policy
- Modified `HoodieMetadataWriteUtils.createMetadataWriteConfig()` to use metadata table's own cleaning policy instead of inheriting from data table
- Retention values (commits/file versions/hours) are calculated as 1.2x the data table's configured values based on the selected policy

The metadata table now uses its own cleaning policy configuration while still maintaining sensible defaults that scale with the data table's retention settings.

### Impact

Users can now independently configure metadata table cleaning behavior. The default policy (`KEEP_LATEST_FILE_VERSIONS`) is optimal for most metadata table use cases as it ensures efficient file management regardless of the data table's cleaning strategy.

### Risk Level

**low** - This change is backward compatible. The default policy (`KEEP_LATEST_FILE_VERSIONS`) ensures stable metadata table behavior, and retention values are automatically scaled from data table settings.

### Documentation Update

**New config:**

- `hoodie.metadata.clean.policy` (advanced): Determines the cleaner policy for metadata table. Default: `KEEP_LATEST_FILE_VERSIONS`. Supported values: `KEEP_LATEST_COMMITS`, `KEEP_LATEST_FILE_VERSIONS`, `KEEP_LATEST_BY_HOURS`. The retention values (commits/file versions/hours) are automatically calculated as 1.2x the data table's configured values.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable
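
### Illustrative sketch (not the PR's code)

For reviewers who want a concrete picture of the behavior described above, here is a minimal, self-contained sketch of how the policy lookup and the 1.2x retention scaling could fit together. It does not use Hudi classes: `MetadataCleanPolicySketch`, `CleanPolicy`, `resolvePolicy`, `scale`, and `metadataRetention` are hypothetical names introduced for illustration only. The round-up behavior of the scaling is an assumption, since the description only states "1.2x". The actual logic lives in `HoodieMetadataWriteUtils.createMetadataWriteConfig()` and may differ.

```java
import java.util.Properties;

public class MetadataCleanPolicySketch {

    // Policy names as listed in this PR description (hypothetical stand-in enum).
    enum CleanPolicy { KEEP_LATEST_COMMITS, KEEP_LATEST_FILE_VERSIONS, KEEP_LATEST_BY_HOURS }

    // Config key introduced by this PR.
    static final String POLICY_KEY = "hoodie.metadata.clean.policy";

    // Reads the metadata table policy, falling back to the default stated in the PR.
    static CleanPolicy resolvePolicy(Properties props) {
        String name = props.getProperty(POLICY_KEY, CleanPolicy.KEEP_LATEST_FILE_VERSIONS.name());
        return CleanPolicy.valueOf(name);
    }

    // Scales a data table retention value by 1.2x using integer arithmetic.
    // ASSUMPTION: fractional results are rounded up; the PR text does not say.
    static int scale(int dataTableValue) {
        return (dataTableValue * 6 + 4) / 5;
    }

    // Picks the data table value that matches the selected policy and scales it.
    static int metadataRetention(CleanPolicy policy, int commits, int fileVersions, int hours) {
        switch (policy) {
            case KEEP_LATEST_COMMITS:
                return scale(commits);
            case KEEP_LATEST_FILE_VERSIONS:
                return scale(fileVersions);
            case KEEP_LATEST_BY_HOURS:
                return scale(hours);
            default:
                throw new IllegalArgumentException("Unsupported policy: " + policy);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(POLICY_KEY, "KEEP_LATEST_COMMITS");

        // Example data table settings (illustrative values): 10 commits, 3 file versions, 24 hours.
        CleanPolicy policy = resolvePolicy(props);
        System.out.println(policy + " -> " + metadataRetention(policy, 10, 3, 24));
        // prints "KEEP_LATEST_COMMITS -> 12"
    }
}
```

With no value set for `hoodie.metadata.clean.policy`, `resolvePolicy` in this sketch returns `KEEP_LATEST_FILE_VERSIONS`, matching the default stated in the description.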
