nsivabalan opened a new issue, #18830:
URL: https://github.com/apache/hudi/issues/18830

   **Describe the problem you faced**
   
   When a Hudi table has a pending clustering plan and an `INSERT_OVERWRITE` 
(or `INSERT_OVERWRITE_TABLE`) operation targets the same partition(s), the 
operation proceeds and replaces the file groups that clustering was scheduled 
against. The clustering update strategies (`SparkRejectUpdateStrategy`, 
`SparkAllowUpdateStrategy`) only inspect explicit *record-level* updates to 
detect a conflict. `INSERT_OVERWRITE` does not tag records with existing file 
groups; instead, it declares entire partitions as replaced via 
`getPartitionToReplacedFileIds`. The strategies never see the to-be-replaced 
file groups, so `SparkRejectUpdateStrategy` (the default) does not throw, and the 
overwrite is admitted. With 
`hoodie.clustering.rollback.pending.replacecommit=true`, this can also lead to 
clustering being rolled back repeatedly (pipeline starvation).
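
   To illustrate the gap, here is a simplified Python model of a record-level 
conflict check. It is not Hudi code: `Record`, `record_level_conflicts`, and the 
`fg-*` file IDs are hypothetical names chosen for this sketch. It assumes, as 
described above, that the check only considers file groups that incoming records 
are already tagged with.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple

# (partition path, file ID); a hypothetical stand-in for a file group identifier.
FileGroupId = Tuple[str, str]


@dataclass(frozen=True)
class Record:
    key: str
    partition: str
    # File ID the record is tagged with; None when the record carries no
    # existing location, which is the INSERT_OVERWRITE case in this model.
    current_file_id: Optional[str] = None


def record_level_conflicts(
    records: Iterable[Record], pending_clustering: Set[FileGroupId]
) -> Set[FileGroupId]:
    """Return pending-clustering file groups touched by tagged records."""
    touched = {
        (r.partition, r.current_file_id)
        for r in records
        if r.current_file_id is not None
    }
    return touched & pending_clustering


# File groups in partition "p" that a pending clustering plan covers.
pending = {("p", "fg-1"), ("p", "fg-2")}

# An upsert-style record tagged to an existing file group is detected.
upsert_records = [Record("k1", "p", current_file_id="fg-1")]

# Overwrite-style records are untagged, so the check sees nothing to reject.
overwrite_records = [Record("k1", "p"), Record("k2", "p")]
```

   In this model, `record_level_conflicts(overwrite_records, pending)` is empty 
even though the overwrite would replace both `fg-1` and `fg-2`.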
   
   **To Reproduce**
   
   1. Configure a table with 
`hoodie.clustering.updates.strategy=org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy`
 (the default).
   2. Ingest some data into partition `p`.
   3. Schedule clustering on `p` (do not run it).
   4. Issue `INSERT_OVERWRITE` against partition `p`.
   5. The overwrite completes; `SparkRejectUpdateStrategy` does not detect the 
conflict.
   
   **Expected behavior**
   
   `SparkRejectUpdateStrategy` should throw `HoodieClusteringUpdateException` 
because the file groups being replaced overlap with pending clustering. The same 
is expected for `INSERT_OVERWRITE_TABLE` against any partition that has pending 
clustering.
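
   A minimal sketch of the expected check, again as a Python model rather than 
Hudi code. It assumes the replaced file IDs are available as a 
partition-to-file-IDs mapping (the shape suggested by 
`getPartitionToReplacedFileIds`). `ClusteringConflictError`, 
`find_replaced_conflicts`, and `reject_if_conflicting` are hypothetical names; 
the exception only stands in for `HoodieClusteringUpdateException`.

```python
from typing import Dict, List, Set, Tuple

# (partition path, file ID); a hypothetical stand-in for a file group identifier.
FileGroupId = Tuple[str, str]


class ClusteringConflictError(Exception):
    """Hypothetical stand-in for HoodieClusteringUpdateException."""


def find_replaced_conflicts(
    partition_to_replaced_file_ids: Dict[str, List[str]],
    pending_clustering: Set[FileGroupId],
) -> Set[FileGroupId]:
    """Return file groups that are both replaced and pending clustering."""
    replaced = {
        (partition, file_id)
        for partition, file_ids in partition_to_replaced_file_ids.items()
        for file_id in file_ids
    }
    return replaced & pending_clustering


def reject_if_conflicting(
    partition_to_replaced_file_ids: Dict[str, List[str]],
    pending_clustering: Set[FileGroupId],
) -> None:
    """Raise if the overwrite would replace any pending-clustering file group."""
    conflicts = find_replaced_conflicts(
        partition_to_replaced_file_ids, pending_clustering
    )
    if conflicts:
        raise ClusteringConflictError(
            f"Replaced file groups overlap with pending clustering: {sorted(conflicts)}"
        )
```

   With a check of this shape, an overwrite of partition `p` would be rejected 
while a clustering plan on `p` is pending, and overwrites of partitions with no 
pending clustering would pass through unchanged.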
   
   **Environment Description**
   
   * Hudi version: master
   * Spark version: 3.5
   * Storage: any
   
   **Additional context**
   
   `DELETE_PARTITION` already has its own pre-existing check 
(`DeletePartitionUtils.checkForPendingTableServiceActions`) and is unaffected. 
PR #18829 addresses this.

