vinishjail97 opened a new pull request, #18692: URL: https://github.com/apache/hudi/pull/18692
### Change Logs

Adds a `ServiceLoader`-based SPI that lets an external writer perform the write phase of clustering for a single input group end-to-end (scan, sort, write) and return the resulting `WriteStatus`es. Default behavior is unchanged when no provider is on the classpath -- the registry resolves to `Option.empty()` and `SparkSortAndSizeExecutionStrategy` runs its existing path.

**New types:**

- `ClusteringGroupWriter` -- the SPI. Single async entry point `runClusteringForGroupAsync(ClusteringGroupWriteContext)` returning `Option<CompletableFuture<HoodieData<WriteStatus>>>`. Returning `Option.empty()` declines the group; the caller then falls back to the default path for that group. `isEnabled()` defaults to `true` and lets a provider self-gate via its own configuration without leaking it into Hudi.
- `ClusteringGroupWriteContext` -- immutable parameter bundle (clustering group, strategy params, instant time, executor, schema, table, write config). Builder-shaped so the SPI signature can grow without breaking implementers.
- `ClusteringGroupWriterRegistry` -- one `ServiceLoader` resolve per JVM, cached. Returns `Option.empty()` when no provider is registered. Also has a test-only override slot guarded by an `AtomicReference`.

**Strategy hooks:**

- `MultipleSparkJobExecutionStrategy` -- new protected `shouldForceRowWriter()` (default `false`), `OR`-ed into the existing `canUseRowWriter` check. Subclasses delegating to a Dataset-only external writer override this to opt into the Row path. The standard Row-writer compatibility check still gates whether the path is actually taken.
- `SparkSortAndSizeExecutionStrategy` -- `shouldForceRowWriter()` returns `true` iff a writer is registered and reports `isEnabled()`. `runClusteringForGroupAsyncAsRow` consults the SPI first; when the writer is absent or disabled, or when it returns `Option.empty()`, the call falls through to `super.runClusteringForGroupAsyncAsRow` for that group.

### Impact

No user-facing behavior change without a provider.
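
To illustrate the registry behavior described above, here is a minimal, self-contained sketch; it is not the code in this PR. `GroupWriter`, `GroupWriterRegistry`, `setForTesting` and `clearForTesting` are hypothetical stand-ins, and `java.util.Optional` is used in place of Hudi's `Option`. It assumes the first provider found by `ServiceLoader` wins.

```java
import java.util.Iterator;
import java.util.Optional;
import java.util.ServiceLoader;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the SPI. The real interface also exposes the
// async write entry point and uses Hudi types.
interface GroupWriter {
  // Providers can self-gate through their own configuration.
  default boolean isEnabled() {
    return true;
  }
}

// Sketch of a registry: one ServiceLoader lookup per JVM, cached,
// plus a test-only override slot.
final class GroupWriterRegistry {
  // null means "no override installed"; an empty Optional forces "no provider".
  private static final AtomicReference<Optional<GroupWriter>> OVERRIDE =
      new AtomicReference<>();

  private GroupWriterRegistry() {}

  // Holder idiom: the JVM initializes RESOLVED lazily, once, and thread-safely.
  private static final class Holder {
    static final Optional<GroupWriter> RESOLVED = load();
  }

  private static Optional<GroupWriter> load() {
    Iterator<GroupWriter> it = ServiceLoader.load(GroupWriter.class).iterator();
    // Assumption: the first provider on the classpath wins.
    return it.hasNext() ? Optional.of(it.next()) : Optional.empty();
  }

  static Optional<GroupWriter> get() {
    Optional<GroupWriter> override = OVERRIDE.get();
    return override != null ? override : Holder.RESOLVED;
  }

  static void setForTesting(GroupWriter writer) {
    OVERRIDE.set(Optional.ofNullable(writer));
  }

  static void clearForTesting() {
    OVERRIDE.set(null);
  }

  // Mirrors the described rule: true iff a writer is registered and enabled.
  static boolean shouldForceRowWriter() {
    return get().map(GroupWriter::isEnabled).orElse(false);
  }
}
```

With no `META-INF/services` entry for `GroupWriter` on the classpath, `get()` returns an empty `Optional` and `shouldForceRowWriter()` returns `false`, which corresponds to the unchanged default path.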
With a provider:

- Each clustering group is delegated to the writer end-to-end (scan, sort, write). Hudi keeps ownership of plan generation, instant management, and the replace-commit; only the per-group write phase is pluggable.
- Per-group fallback (rather than per-plan) lets a provider decline a single group it cannot serve (for example, MOR groups with delta log files) without disabling acceleration for the whole table.

### Risk level

low

The SPI is opt-in (provider must be on the classpath and report `isEnabled()`). The registry returns `Option.empty()` when no provider is registered, which short-circuits before any new code runs in the existing path. The `shouldForceRowWriter()` hook is `false` by default.

### Documentation Update

N/A -- internal SPI for downstream integrators; no user-facing docs change.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [x] CI passed
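
For reference, the per-group fallback described under Impact can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: `ExternalGroupWriter`, `PerGroupFallback`, `runGroup` and `BASE_FILES_ONLY` are hypothetical names, a group is modeled as a list of file names, a write result as a `String`, and `java.util.Optional` replaces Hudi's `Option`.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Hypothetical, simplified SPI: the real one takes a context object and
// returns Hudi write statuses.
interface ExternalGroupWriter {
  // Optional.empty() means "declined"; the caller then uses the default path
  // for this group only.
  Optional<CompletableFuture<String>> writeGroupAsync(List<String> groupFiles);
}

final class PerGroupFallback {
  private PerGroupFallback() {}

  // Stand-in for the existing built-in write path.
  static CompletableFuture<String> defaultWrite(List<String> groupFiles) {
    return CompletableFuture.completedFuture("default:" + groupFiles.size());
  }

  // Consult the external writer first; fall back per group, not per plan.
  static CompletableFuture<String> runGroup(
      Optional<ExternalGroupWriter> writer, List<String> groupFiles) {
    return writer
        .flatMap(w -> w.writeGroupAsync(groupFiles))
        .orElseGet(() -> defaultWrite(groupFiles));
  }

  // Example provider that declines any group containing a log file.
  // Matching on ".log." is an illustrative simplification.
  static final ExternalGroupWriter BASE_FILES_ONLY = groupFiles -> {
    for (String file : groupFiles) {
      if (file.contains(".log.")) {
        return Optional.empty();
      }
    }
    return Optional.of(
        CompletableFuture.completedFuture("external:" + groupFiles.size()));
  };
}
```

In this sketch, a plan with one base-file-only group and one group with log files sends the first group to the external writer and the second to `defaultWrite`, so a declined group does not affect the others.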
