[PR] perf: Add dedicated batch size config for LSM timeline migration on u… [hudi]

via GitHub Tue, 23 Jun 2026 04:50:35 -0700


cshuo opened a new pull request, #19052:
URL: https://github.com/apache/hudi/pull/19052


   …pgrade
   
   ### Describe the issue this Pull Request addresses
   
   This PR ports and finalizes https://github.com/apache/hudi/pull/18411.
   
   When a Hudi table is upgraded from table version 7 to 8, the legacy archived 
timeline is migrated into the LSM timeline in 
`SevenToEightUpgradeHandler.upgradeToLSMTimeline()`. Previously this migration 
reused the regular archival batch size (`hoodie.commits.archival.batch`, 
default 10) and ran `compactAndClean()` after every batch. Each batch `write()` 
involves several remote-storage operations (an existence check, a Parquet 
write, and a manifest update), so for tables with hundreds of archived actions 
this produced excessive I/O and significantly inflated the one-time migration 
time.
   
   This PR makes the migration batch size independently configurable with a 
larger default and removes the per-batch compaction during migration, 
addressing [HUDI-18410](https://github.com/apache/hudi/issues/18410).
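
   As a rough illustration of why the batch size matters, the number of 
batched writes is the ceiling of the archived action count divided by the 
batch size. The sketch below is not code from this PR: the class and method 
names are hypothetical, and the figure of 800 archived actions is an 
assumption chosen only for the arithmetic. It compares the old shared default 
(10) with the new dedicated default (500) listed in the changelog.

```java
/** Hypothetical helper, not part of Hudi: counts batched writes for a migration. */
final class MigrationBatchMath {

    /** Ceiling division: how many batches are needed to cover {@code actions} items. */
    static int batchCount(int actions, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        return (actions + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        int archivedActions = 800; // illustrative figure, not taken from the PR
        System.out.println("batch size 10:  " + batchCount(archivedActions, 10) + " writes");
        System.out.println("batch size 500: " + batchCount(archivedActions, 500) + " writes");
    }
}
```

   With these assumed numbers the migration issues 80 writes at batch size 10 
and 2 writes at batch size 500. The sketch counts only `write()` calls; the 
per-write storage operations and the removed per-batch compaction would 
multiply the difference further.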
   
   ### Summary and Changelog
   
   - Introduce config `hoodie.migration.commits.archival.batch` in 
`HoodieArchivalConfig` (default `500`, advanced), with a 
`withMigrationCommitsArchivalBatchSize(int)` builder method and a 
`getMigrationCommitArchivalBatchSize()` accessor on `HoodieWriteConfig`.
   - `SevenToEightUpgradeHandler.upgradeToLSMTimeline()` now reads the new 
migration batch size instead of `getCommitArchivalBatchSize()`, so migration 
batching is decoupled from regular archival batching.
   - Drop the `lsmTimelineWriter.compactAndClean(engineContext)` calls (both 
per-batch and final-batch) from the migration loop.
   - `TestSevenToEightUpgradeHandler`: add tests for the config 
default/override and for migration behavior — batching follows the migration 
batch size and `compactAndClean` is never invoked (via 
`mockStatic`/`mockConstruction`).
   - `TestFlinkWriteClients`: add a test asserting a raw 
`hoodie.migration.commits.archival.batch` set on the Flink `Configuration` 
propagates through `FlinkWriteClients.getHoodieClientConfig()` to 
`HoodieWriteConfig`, independent of the regular archival batch size.
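
   The batching behavior described in the changelog can be sketched as 
follows. This is a minimal, self-contained illustration and not the actual 
`SevenToEightUpgradeHandler` code: the `BatchWriter` interface is a 
hypothetical stand-in for the LSM timeline writer, and the loop simply writes 
consecutive batches of at most the migration batch size with no compaction 
step in between or at the end.

```java
import java.util.ArrayList;
import java.util.List;

final class MigrationLoopSketch {

    /** Hypothetical stand-in for the LSM timeline writer; not Hudi's actual interface. */
    interface BatchWriter<T> {
        void write(List<T> batch);
    }

    /**
     * Writes {@code actions} in consecutive batches of at most {@code batchSize}
     * and returns the number of write calls. No compaction is triggered here.
     */
    static <T> int migrate(List<T> actions, int batchSize, BatchWriter<T> writer) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int writes = 0;
        for (int start = 0; start < actions.size(); start += batchSize) {
            int end = Math.min(start + batchSize, actions.size());
            // Copy the slice so the writer does not hold a view of the source list.
            writer.write(new ArrayList<>(actions.subList(start, end)));
            writes++;
        }
        return writes;
    }

    public static void main(String[] args) {
        List<Integer> actions = new ArrayList<>();
        for (int i = 0; i < 1200; i++) {
            actions.add(i); // placeholder items standing in for archived actions
        }
        int writes = migrate(actions, 500, batch -> System.out.println("wrote " + batch.size()));
        System.out.println("total writes: " + writes);
    }
}
```

   A test along the lines of the one added to `TestSevenToEightUpgradeHandler` 
could assert on the number of write calls for a given batch size, as the 
return value of `migrate` allows here.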
   
   ### Impact
   
   - **Functional impact**: Faster v7→v8 LSM timeline migration for tables with 
large archived timelines, due to fewer/larger batches and no per-batch 
compaction. Default migration batch size changes from 10 (shared) to 500 
(dedicated). No change to normal read/write paths or to regular archival 
behavior.
   - **Maintainability**: Migration batching is now explicit and separated from 
the general archival config, removing the previously overloaded reuse of 
`getCommitArchivalBatchSize()`.
   - **Extensibility**: The dedicated config lets operators tune migration 
throughput per environment without affecting steady-state archival.
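
   To illustrate the decoupling, the sketch below resolves the two keys 
independently from a plain `java.util.Properties` object. It is an assumption 
made for illustration: the class and method names are hypothetical, and Hudi 
itself resolves these values through `HoodieArchivalConfig` and 
`HoodieWriteConfig` rather than through this code. Only the key names and the 
defaults (500 and 10) come from the description above.

```java
import java.util.Properties;

/** Hypothetical lookup helper; Hudi resolves these through its own config classes. */
final class MigrationBatchConfigSketch {
    static final String MIGRATION_KEY = "hoodie.migration.commits.archival.batch";
    static final String ARCHIVAL_KEY = "hoodie.commits.archival.batch";

    /** Batch size used for the one-time v7 to v8 migration (default 500). */
    static int migrationBatchSize(Properties props) {
        return Integer.parseInt(props.getProperty(MIGRATION_KEY, "500"));
    }

    /** Batch size used for regular archival (default 10). */
    static int archivalBatchSize(Properties props) {
        return Integer.parseInt(props.getProperty(ARCHIVAL_KEY, "10"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(ARCHIVAL_KEY, "20"); // tuning archival leaves migration untouched
        System.out.println("archival batch:  " + archivalBatchSize(props));
        System.out.println("migration batch: " + migrationBatchSize(props));
    }
}
```

   In this sketch, setting the regular archival key has no effect on the 
migration batch size, which mirrors the separation the new config is meant 
to provide.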
   
   ### Risk Level
   
   low — Changes are confined to the one-time v7→v8 upgrade path and a new, 
defaulted config; no public API changes. Verified with new unit tests in 
`TestSevenToEightUpgradeHandler` (batching count and absence of 
`compactAndClean`) and `TestFlinkWriteClients` (Flink config propagation); both 
classes pass (16 and 21 tests respectively).
   
   ### Documentation Update
   
   New advanced config `hoodie.migration.commits.archival.batch` (default 500) 
is self-documented via `withDocumentation` and will surface in the generated 
configuration reference. No separate docs page change required.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
