cshuo opened a new pull request, #19052: URL: https://github.com/apache/hudi/pull/19052
…pgrade

### Describe the issue this Pull Request addresses

This PR ports and finalizes https://github.com/apache/hudi/pull/18411.

When a Hudi table is upgraded from table version 7 to 8, the legacy archived timeline is migrated into the LSM timeline in `SevenToEightUpgradeHandler.upgradeToLSMTimeline()`. Previously this migration reused the regular archival batch size (`hoodie.commits.archival.batch`, default 10) and ran `compactAndClean()` after every batch. Each `write()` involves several remote-storage operations (exists check, parquet write, manifest update), so for tables with hundreds of archived actions this produced excessive I/O and significantly inflated the one-time migration time.

This PR makes the migration batch size independently configurable with a larger default and removes the per-batch compaction during migration, addressing [HUDI-18410](https://github.com/apache/hudi/issues/18410).

### Summary and Changelog

- Introduce config `hoodie.migration.commits.archival.batch` in `HoodieArchivalConfig` (default `500`, advanced), with a `withMigrationCommitsArchivalBatchSize(int)` builder method and a `getMigrationCommitArchivalBatchSize()` accessor on `HoodieWriteConfig`.
- `SevenToEightUpgradeHandler.upgradeToLSMTimeline()` now reads the new migration batch size instead of `getCommitArchivalBatchSize()`, so migration batching is decoupled from regular archival batching.
- Drop the `lsmTimelineWriter.compactAndClean(engineContext)` calls (both per-batch and final-batch) from the migration loop.
- `TestSevenToEightUpgradeHandler`: add tests for the config default/override and for migration behavior: batching follows the migration batch size and `compactAndClean` is never invoked (checked via `mockStatic`/`mockConstruction`).
- `TestFlinkWriteClients`: add a test asserting that a raw `hoodie.migration.commits.archival.batch` set on the Flink `Configuration` propagates through `FlinkWriteClients.getHoodieClientConfig()` to `HoodieWriteConfig`, independent of the regular archival batch size.

### Impact

- **Functional impact**: Faster v7→v8 LSM timeline migration for tables with large archived timelines, due to fewer, larger batches and no per-batch compaction. The default migration batch size changes from 10 (shared with regular archival) to 500 (dedicated). No change to normal read/write paths or to regular archival behavior.
- **Maintainability**: Migration batching is now explicit and separated from the general archival config, removing the previously overloaded reuse of `getCommitArchivalBatchSize()`.
- **Extensibility**: The dedicated config lets operators tune migration throughput per environment without affecting steady-state archival.

### Risk Level

low

Changes are confined to the one-time v7→v8 upgrade path and a new, defaulted config; no public API changes. Verified with new unit tests in `TestSevenToEightUpgradeHandler` (batching count and absence of `compactAndClean`) and `TestFlinkWriteClients` (Flink config propagation); both classes pass (16 and 21 tests respectively).

### Documentation Update

New advanced config `hoodie.migration.commits.archival.batch` (default 500) is self-documented via `withDocumentation` and will surface in the generated configuration reference. No separate docs page change required.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable
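
For reviewers who want a feel for the effect of the batch-size change, here is a minimal, self-contained sketch. It is not the code in `SevenToEightUpgradeHandler`. The class `MigrationBatchingSketch` and its methods `migrationBatchSize`, `partition`, and `countBatches` are hypothetical names invented for illustration, and the figure of 1200 archived actions is an arbitrary example. Only the config key and its default of 500 come from this PR. The sketch assumes one `write()` per batch, as described above, and simply counts batches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Illustrative only: hypothetical class, not part of Hudi.
public class MigrationBatchingSketch {

    // Config key and default taken from the PR description.
    static final String MIGRATION_BATCH_KEY = "hoodie.migration.commits.archival.batch";
    static final int MIGRATION_BATCH_DEFAULT = 500;

    /** Reads the migration batch size from raw properties, falling back to the default. */
    public static int migrationBatchSize(Properties props) {
        String raw = props.getProperty(MIGRATION_BATCH_KEY, String.valueOf(MIGRATION_BATCH_DEFAULT));
        return Integer.parseInt(raw.trim());
    }

    /** Splits the actions into consecutive batches of at most batchSize elements. */
    public static <T> List<List<T>> partition(List<T> actions, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < actions.size(); start += batchSize) {
            int end = Math.min(start + batchSize, actions.size());
            batches.add(new ArrayList<>(actions.subList(start, end)));
        }
        return batches;
    }

    /** Number of batches (and, under the one-write-per-batch assumption, write() calls). */
    public static int countBatches(int archivedActions, int batchSize) {
        List<Integer> actions = new ArrayList<>();
        for (int i = 0; i < archivedActions; i++) {
            actions.add(i);
        }
        return partition(actions, batchSize).size();
    }

    public static void main(String[] args) {
        int archived = 1200; // arbitrary example size
        // Old behavior: shared archival batch size of 10.
        System.out.println("batch=10  -> batches=" + countBatches(archived, 10));
        // New behavior: dedicated migration batch size, default 500.
        int batch = migrationBatchSize(new Properties());
        System.out.println("batch=" + batch + " -> batches=" + countBatches(archived, batch));
    }
}
```

With 1200 archived actions, the sketch reports 120 batches at the old shared size of 10 and 3 batches at the new default of 500. Under the old behavior each of those batches was also followed by a `compactAndClean()` call, which this PR removes from the migration loop.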
