kbuci opened a new pull request, #18288: URL: https://github.com/apache/hudi/pull/18288
### Describe the issue this Pull Request addresses

If `getClusteringPlan` is called after the target instant is rolled back by a concurrent writer, a runtime exception is thrown. This causes the following important use cases to fail:

- **Ingestion** checking whether other replacecommits are from clustering (via `ClusteringUtils.getAllFileGroupsInPendingClusteringPlans`)
- **Clustering jobs** calling `ClusteringUtils.getAllPendingClusteringPlans` to find failed clustering attempts to roll back
- **File system view initialization** calling `ClusteringUtils.getAllFileGroupsInPendingClusteringPlans` to track file groups involved in pending clustering

In all of these cases, a concurrent writer can roll back the instant after the timeline is loaded but before `getClusteringPlan` is called, so the requested metadata file no longer exists.

### Summary and Changelog

Update `ClusteringUtils.getClusteringPlan` to gracefully handle the case where a clustering/replacecommit instant is rolled back by a concurrent writer between timeline load and metadata read.

- The method that directly reads requested replace metadata now catches both `IOException` and `HoodieIOException`
- When a `HoodieTableMetaClient` is available, the active timeline is reloaded on error and the instant's presence is re-checked. If the instant is no longer in the timeline, the error is suppressed and an empty `Option` is returned instead of throwing
- When `metaClient` is not available (e.g. callers using the timeline-only overload), the original exception behavior is preserved
- A new overload accepting `Option<HoodieTableMetaClient>` is introduced to allow callers to opt into error recovery
- Added unit tests covering: non-existent instant, deleted requested file (simulated rollback), and `getAllPendingClusteringPlans` gracefully skipping a rolled-back instant

### Impact

No breaking public API changes.
The existing `getClusteringPlan(HoodieTableMetaClient, HoodieInstant)` and `getClusteringPlan(HoodieTimeline, HoodieInstant, InstantGenerator)` signatures are unchanged. A new overload `getClusteringPlan(HoodieTimeline, HoodieInstant, InstantGenerator, Option<HoodieTableMetaClient>)` is added.

Behavioral change: `getClusteringPlan` now returns `Option.empty()` instead of throwing when the instant was concurrently rolled back and `metaClient` is available for verification. This also prevents file system view initialization from failing when it calls `getAllFileGroupsInPendingClusteringPlans` during a concurrent rollback.

### Risk Level

Low. The fix only changes error handling behavior in a narrow race condition (concurrent rollback during metadata read). The happy path is unaffected. The error recovery path (reload timeline + check instant presence) is consistent with how other parts of the codebase handle concurrent modifications.

### Documentation Update

None.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable
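
For readers who want the recovery path from the Summary and Changelog in concrete form, the following is a minimal sketch of the "reload the timeline and re-check the instant" pattern. It is not the code in this PR and uses no Hudi classes: `RollbackTolerantPlanReader`, `PlanReader`, and `getPlan` are hypothetical names, instants and plans are plain strings, `java.util.Optional` stands in for Hudi's `Option`, `UncheckedIOException` stands in for `HoodieIOException`, and a `Supplier<Set<String>>` stands in for reloading the active timeline through a `HoodieTableMetaClient`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Optional;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical stand-ins only: none of these types are Hudi classes.
class RollbackTolerantPlanReader {

    /** Stand-in for the method that reads the requested replace metadata. */
    interface PlanReader {
        String read(String instant) throws IOException;
    }

    /**
     * Reads the plan for {@code instant}. On an I/O failure, and only when a
     * timeline reloader is supplied, the timeline is reloaded and the instant's
     * presence is re-checked. A missing instant is treated as a concurrent
     * rollback and yields an empty result; every other failure is rethrown.
     */
    static Optional<String> getPlan(
            PlanReader reader,
            String instant,
            Optional<Supplier<Set<String>>> timelineReloader) {
        try {
            return Optional.of(reader.read(instant));
        } catch (IOException | UncheckedIOException e) {
            // Recovery is only attempted when the timeline can be reloaded.
            if (timelineReloader.isPresent()) {
                Set<String> reloaded = timelineReloader.get().get();
                if (!reloaded.contains(instant)) {
                    // The instant disappeared between timeline load and
                    // metadata read: skip it instead of failing the caller.
                    return Optional.empty();
                }
            }
            // Either nothing to verify against, or the instant still exists,
            // so the failure is genuine and must surface.
            throw new IllegalStateException(
                    "Error reading clustering plan for " + instant, e);
        }
    }
}
```

In this sketch, passing `Optional.empty()` as the reloader mirrors the timeline-only overload, where the original exception behavior is preserved. Rethrowing when the instant is still present after the reload keeps genuine read failures visible rather than masking them as rollbacks.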
