kbuci commented on issue #17907: URL: https://github.com/apache/hudi/issues/17907#issuecomment-3986554330
@nsivabalan Thanks for the feedback.

> the additional wait time may not buy us much and could only reduce the probability of clustering failing. I would recommend to leverage early conflict deduction to save compute costs on clustering.

After reviewing further, I still think that for our org's use case we would want a way to wait some amount of time for a requested instant to transition to inflight, before proceeding to finalize the write and commit the transaction.

This is because for our upsert-heavy, RLI-enabled datasets there can be only a few hundred files per partition, but there is a ~10 minute window for an ingestion instant's `requested -> inflight` transition and an hour-long window for `inflight -> commit`. So if clustering finishes before an ingestion instant has transitioned to inflight, it would be preferable to wait a few extra minutes in the Spark driver, since otherwise a retry of the clustering would have to run another Spark stage to cluster the files again. In effect, any clustering job "unlucky" enough to finish its write inside that 10 minute window would be guaranteed to fail.

Ideally this would be resolved by longer-term solutions such as user/application-side orchestration changes, or the ability to concurrently cluster and write without a write conflict (like 1.x compaction, though that would likely only apply to MOR tables). But we would still like a way to implement/use this "knob" in the short term. Could we add a new API to the clustering strategy or write conflict strategy classes to support this (similar to `checkPrecondition` for the clustering plan class)? The default implementation could be a no-op, and we could implement/set our own.
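To make the proposal concrete, here is a minimal sketch of the kind of hook we have in mind. All names here (`waitForRequestedToTransition`, the `Supplier`-based timeline view, the timeout values) are hypothetical, not existing Hudi APIs: the idea is simply that before finalizing the clustering commit, the driver polls the timeline for ingestion instants still in `requested` state, up to a bounded wait, with a no-op default (`maxWaitMs = 0`):

```java
import java.util.Set;
import java.util.function.Supplier;

public class RequestedInstantWait {

  /**
   * Hypothetical hook: poll until no ingestion instants remain in the
   * REQUESTED state, or until maxWaitMs elapses. A default of maxWaitMs = 0
   * makes this a no-op; our implementation would set a few minutes.
   *
   * @param requestedInstants supplies the set of still-requested instants
   *                          (in Hudi this would come from the active timeline)
   * @return true if the timeline cleared within the wait budget
   */
  public static boolean waitForRequestedToTransition(
      Supplier<Set<String>> requestedInstants,
      long maxWaitMs,
      long pollIntervalMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + maxWaitMs;
    while (System.currentTimeMillis() < deadline) {
      if (requestedInstants.get().isEmpty()) {
        return true; // safe to finalize the clustering commit now
      }
      Thread.sleep(pollIntervalMs);
    }
    // one final check after the budget is exhausted
    return requestedInstants.get().isEmpty();
  }

  public static void main(String[] args) throws InterruptedException {
    // Simulated timeline: the requested instant transitions to inflight
    // after ~50 ms, well inside the 1 s wait budget.
    long start = System.currentTimeMillis();
    Supplier<Set<String>> timeline = () ->
        System.currentTimeMillis() - start < 50
            ? Set.of("20240101000000.commit.requested")
            : Set.of();
    boolean cleared = waitForRequestedToTransition(timeline, 1000, 10);
    System.out.println(cleared);
  }
}
```

With the default no-op configuration, current behavior is unchanged; jobs like ours could opt in to a short bounded wait instead of failing and re-running the clustering stage.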
