nsivabalan opened a new pull request, #18829:
URL: https://github.com/apache/hudi/pull/18829
### Change Logs
- Fixes a correctness bug where `INSERT_OVERWRITE` and
`INSERT_OVERWRITE_TABLE` could replace file groups that were in pending
clustering, because the clustering update strategies only inspected explicit
record-level updates and never saw the file groups being wholesale replaced.
- Extends `UpdateStrategy` with a `fileGroupsToBeReplaced` set populated by
the replace-style commit executors.
- `SparkRejectUpdateStrategy` (the default) now unions explicit updates with
the to-be-replaced set before checking against pending clustering, and throws
`HoodieClusteringUpdateException` if they overlap.
### Implementation
- `UpdateStrategy` / `BaseSparkUpdateStrategy` carry both the
pending-clustering set and the to-be-replaced set.
- `BaseSparkCommitActionExecutor` exposes `getFileGroupsBeingReplaced()`
(empty by default) and passes both sets when instantiating the update strategy.
- `SparkInsertOverwriteCommitActionExecutor` overrides the hook to compute
the file groups in the partitions being overwritten (static via
`STATIC_OVERWRITE_PARTITION_PATHS`, dynamic via the input record partitions).
- `SparkInsertOverwriteTableCommitActionExecutor` overrides the hook to
enumerate every file group across every partition, matching the table-wide
replace semantics of its `getPartitionToReplacedFileIds`.
- Flink `ConsistentBucketUpdateStrategy` /
`FlinkConsistentBucketUpdateStrategy` pass an empty `fileGroupsToBeReplaced` to
the new 4-arg `super` constructor so the Flink build keeps compiling.
### What is NOT covered (intentionally, in this PR)
- `SparkAllowUpdateStrategy` and
`SparkConsistentBucketDuplicateUpdateStrategy` accept the new parameter today
but do not consult it; tightening those is left to a follow-up. These are
opt-in non-default strategies.
- `DELETE_PARTITION` already has a separate pre-existing check
(`DeletePartitionUtils.checkForPendingTableServiceActions`) and bypasses the
`clusteringHandleUpdate` pipeline, so it is unaffected by this change.
### Impact
`INSERT_OVERWRITE` / `INSERT_OVERWRITE_TABLE` against a partition that has
pending clustering will now correctly abort with
`HoodieClusteringUpdateException` ("Not allowed to update the clustering file
group ...") under the default `SparkRejectUpdateStrategy`, instead of silently
winning the race against the clustering plan.
### Risk level
low. Default-strategy behavior strictly tightens detection of a known race.
Allow/bucket strategies and Flink subclasses are pass-through.
### Test Coverage
`TestInsertOverwriteWithClustering` (new) covers, all on COPY_ON_WRITE:
- static & dynamic `INSERT_OVERWRITE` against pending clustering on
overlapping partitions — must reject
- `INSERT_OVERWRITE` against a non-overlapping partition — must succeed
without rolling back clustering
- multi-step scenarios with selective overlap (overwrite one partition, then
a second that overlaps clustering)
- `DELETE_PARTITION` against a non-overlapping partition — must succeed
without rolling back clustering
- direct validation of `getFileGroupsBeingReplaced` and
`getPartitionToReplacedFileIds` for static, dynamic, multi-file-group, and
empty-partition cases
12 tests, all passing locally against current master (`mvn -pl
hudi-client/hudi-spark-client -Pspark3.5
-Dtest=TestInsertOverwriteWithClustering test`).
### Documentation Update
No new configs or user-facing config-key changes. Existing
`hoodie.clustering.updates.strategy` default behavior is tightened only — users
on the default see a clearer error in a previously-buggy scenario.
### Contributor's checklist
- [x] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added
- [x] CI passed
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]