kbuci commented on issue #17907:
URL: https://github.com/apache/hudi/issues/17907#issuecomment-3986554330

   @nsivabalan Thanks for the feedback. 
   > the additional wait time may not buy us much and could only reduce the 
probability of clustering failing. I would recommend to leverage early conflict 
deduction to save compute costs on clustering.
   
   After reviewing further, I still think that for our org's use case we would 
want a way to wait some time for a requested instant to transition to inflight 
(before proceeding to finalize the write and commit the transaction).
   This is because, for our upsert-heavy RLI-enabled datasets, there can be only a 
few hundred files per partition, but a ~10 minute window between an ingestion's 
`requested -> inflight` transition and an hour-long window between `inflight -> commit`. 
So if clustering finishes quickly, before an ingestion instant has transitioned 
to inflight, it would be preferable to wait a few extra minutes in the 
Spark driver; otherwise a retry of the clustering would run yet another 
Spark stage to cluster the files. This would mean that any clustering job 
"unlucky" enough to finish its write inside that 10 minute window would 
be guaranteed to fail. 
   Ideally this would be addressed by longer-term solutions such as 
user/application-side orchestration changes, or the ability to concurrently 
cluster and write without a write conflict (like 1.x compaction, though that 
would likely only apply to MOR tables). But we would still like a 
way to implement/use this "knob" in the short term. Could we add a new API to 
the clustering strategy or write conflict strategy classes to support this (similar to 
`checkPrecondition` for the clustering plan class)? The default impl could be a 
no-op, and we could implement/set our own implementation.
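   To make the shape of the proposal concrete, here is a minimal sketch of what such a hook could look like. Everything here is hypothetical (the interface name `PreCommitWaitStrategy`, the method `waitBeforeCommit`, and the `BooleanSupplier` condition are all made up for illustration and are not existing Hudi APIs); the only point is that the default is a no-op, matching current behavior, while an org-specific impl can poll with a bounded time budget:

   ```java
   import java.util.function.BooleanSupplier;

   // Hypothetical hook, modeled loosely on the clustering plan strategy's
   // checkPrecondition: default is a no-op, so behavior is unchanged unless
   // a custom implementation is configured.
   interface PreCommitWaitStrategy {
       // conditionMet would be wired to a timeline check, e.g. "no concurrent
       // ingestion instant is still in requested state".
       default void waitBeforeCommit(BooleanSupplier conditionMet, long maxWaitMs) {
           // no-op by default
       }
   }

   // Example org-specific impl: poll until the condition holds or the
   // wait budget is exhausted, then proceed to commit either way.
   class BoundedWaitStrategy implements PreCommitWaitStrategy {
       private final long pollIntervalMs;

       BoundedWaitStrategy(long pollIntervalMs) {
           this.pollIntervalMs = pollIntervalMs;
       }

       @Override
       public void waitBeforeCommit(BooleanSupplier conditionMet, long maxWaitMs) {
           long deadline = System.currentTimeMillis() + maxWaitMs;
           while (!conditionMet.getAsBoolean() && System.currentTimeMillis() < deadline) {
               try {
                   Thread.sleep(pollIntervalMs);
               } catch (InterruptedException e) {
                   Thread.currentThread().interrupt();
                   return; // give up waiting, let the caller proceed
               }
           }
       }
   }
   ```

   The design choice here is that the wait is best-effort: it only reduces the window for the conflict described above, and the existing conflict detection still runs afterwards.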


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to