[PR] [HUDI-XXXXX] Add ClusteringGroupWriter SPI for pluggable clustering writes [hudi]

via GitHub Tue, 05 May 2026 18:15:50 -0700


vinishjail97 opened a new pull request, #18692:
URL: https://github.com/apache/hudi/pull/18692


   ### Change Logs
   
   Adds a `ServiceLoader`-based SPI that lets an external writer perform the 
write phase of clustering for a single input group end-to-end (scan, sort, 
write) and return the resulting `WriteStatus`es. Default behavior is unchanged 
when no provider is on the classpath -- the registry resolves to 
`Option.empty()` and `SparkSortAndSizeExecutionStrategy` runs its existing path.
   
   **New types:**
   
   - `ClusteringGroupWriter` -- the SPI. Single async entry point 
`runClusteringForGroupAsync(ClusteringGroupWriteContext)` returning 
`Option<CompletableFuture<HoodieData<WriteStatus>>>`. Returning 
`Option.empty()` declines the group; the caller then falls back to the default 
path for that group only. `isEnabled()` defaults to `true` and lets a provider 
self-gate via its own configuration without leaking it into Hudi.
   - `ClusteringGroupWriteContext` -- immutable parameter bundle (clustering 
group, strategy params, instant time, executor, schema, table, write config). 
Builder-shaped so the SPI signature can grow without breaking implementers.
   - `ClusteringGroupWriterRegistry` -- resolves the provider through 
`ServiceLoader` once per JVM and caches the result. Returns `Option.empty()` 
when no provider is registered. Also exposes a test-only override slot held in 
an `AtomicReference`.
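
The SPI and registry described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `GroupWriterSketch`, `GroupWriterRegistrySketch`, and `setForTesting` are hypothetical names, `java.util.Optional` stands in for Hudi's `Option`, a `String` stands in for `ClusteringGroupWriteContext`, and `List<String>` stands in for `HoodieData<WriteStatus>`.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.ServiceLoader;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Stand-in for the SPI. Assumptions: String replaces ClusteringGroupWriteContext,
// List<String> replaces HoodieData<WriteStatus>, Optional replaces Hudi's Option.
interface GroupWriterSketch {
  // An empty Optional means "this writer declines the group".
  Optional<CompletableFuture<List<String>>> runClusteringForGroupAsync(String groupId);

  // Providers may override this to gate themselves on their own configuration.
  default boolean isEnabled() {
    return true;
  }
}

final class GroupWriterRegistrySketch {
  // Test-only override slot (hypothetical name).
  private static final AtomicReference<GroupWriterSketch> TEST_OVERRIDE =
      new AtomicReference<>();

  // Holder idiom: the ServiceLoader lookup runs once, on first access, and is cached.
  private static final class Holder {
    static final Optional<GroupWriterSketch> LOADED = loadFirst();
  }

  private static Optional<GroupWriterSketch> loadFirst() {
    Iterator<GroupWriterSketch> it =
        ServiceLoader.load(GroupWriterSketch.class).iterator();
    return it.hasNext() ? Optional.of(it.next()) : Optional.empty();
  }

  // Empty when no provider is registered and no test override is set.
  static Optional<GroupWriterSketch> get() {
    GroupWriterSketch override = TEST_OVERRIDE.get();
    return override != null ? Optional.of(override) : Holder.LOADED;
  }

  static void setForTesting(GroupWriterSketch writer) {
    TEST_OVERRIDE.set(writer);
  }
}
```

With standard `ServiceLoader` mechanics, a provider jar would register its implementation in a `META-INF/services/` file named after the fully qualified name of the SPI interface; the package of `ClusteringGroupWriter` is not stated in this description.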
   
   **Strategy hooks:**
   
   - `MultipleSparkJobExecutionStrategy` -- new protected 
`shouldForceRowWriter()` (default `false`), `OR`-ed into the existing 
`canUseRowWriter` check. Subclasses delegating to a Dataset-only external 
writer override this to opt into the Row path. The standard Row-writer 
compatibility check still gates whether the path is actually taken.
   - `SparkSortAndSizeExecutionStrategy` -- `shouldForceRowWriter()` returns 
`true` iff a writer is registered and its `isEnabled()` returns `true`. 
`runClusteringForGroupAsyncAsRow` consults the SPI first; if no writer is 
registered, the writer is disabled, or the writer returns `Option.empty()`, it 
falls through to `super.runClusteringForGroupAsyncAsRow` for that group.
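
The interaction between the two hooks can be illustrated with a simplified sketch. Everything here is an assumption made for illustration: the class names, the `runGroupAsRow` method, the boolean parameters of `canUseRowWriter`, and in particular the exact boolean expression, which this description does not spell out. `Optional` again stands in for Hudi's `Option`, and `List<String>` for `HoodieData<WriteStatus>`.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for the ClusteringGroupWriter SPI.
interface ExternalWriterSketch {
  Optional<CompletableFuture<List<String>>> runClusteringForGroupAsync(String groupId);

  default boolean isEnabled() {
    return true;
  }
}

// Stand-in for MultipleSparkJobExecutionStrategy.
abstract class BaseStrategySketch {
  // New hook: false by default, so existing subclasses behave as before.
  protected boolean shouldForceRowWriter() {
    return false;
  }

  // Assumed shape: the hook is OR-ed with the configured flag, and the standard
  // compatibility check still decides whether the Row path is actually taken.
  final boolean canUseRowWriter(boolean rowWriterConfigured, boolean rowWriterCompatible) {
    return (rowWriterConfigured || shouldForceRowWriter()) && rowWriterCompatible;
  }

  // Stand-in for the default per-group Row write path.
  CompletableFuture<List<String>> runGroupAsRow(String groupId) {
    return CompletableFuture.completedFuture(
        Collections.singletonList("default:" + groupId));
  }
}

// Stand-in for SparkSortAndSizeExecutionStrategy.
final class SortAndSizeSketch extends BaseStrategySketch {
  private final Optional<ExternalWriterSketch> writer;

  SortAndSizeSketch(Optional<ExternalWriterSketch> writer) {
    this.writer = writer;
  }

  @Override
  protected boolean shouldForceRowWriter() {
    return writer.map(ExternalWriterSketch::isEnabled).orElse(false);
  }

  @Override
  CompletableFuture<List<String>> runGroupAsRow(String groupId) {
    // Absent writer, disabled writer, or a declined group all end in the default path.
    return writer
        .filter(ExternalWriterSketch::isEnabled)
        .flatMap(w -> w.runClusteringForGroupAsync(groupId))
        .orElseGet(() -> super.runGroupAsRow(groupId));
  }
}
```

Because the fallback is decided inside the per-group method, one declined group does not affect how the remaining groups in the same plan are written.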
   
   ### Impact
   
   No user-facing behavior change without a provider. With a provider:
   - Each clustering group is delegated to the writer end-to-end (scan, sort, 
write). Hudi keeps ownership of plan generation, instant management, and the 
replace-commit; only the per-group write phase is pluggable.
   - Per-group fallback (rather than per-plan) lets a provider decline a single 
group it cannot serve (for example, MOR groups with delta log files) without 
disabling acceleration for the whole table.
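
From the provider side, declining a single group might look like the sketch below. It is illustrative only: `GroupSketch`, `WriterSpiSketch`, `AcceleratedWriterSketch`, and the `deltaLogFiles` field are hypothetical stand-ins for the real clustering group and SPI types, and `Optional` and `List<String>` replace Hudi's `Option` and `HoodieData<WriteStatus>`.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Hypothetical, heavily reduced view of a clustering input group.
final class GroupSketch {
  final String id;
  final int deltaLogFiles;

  GroupSketch(String id, int deltaLogFiles) {
    this.id = id;
    this.deltaLogFiles = deltaLogFiles;
  }
}

// Hypothetical stand-in for the ClusteringGroupWriter SPI.
interface WriterSpiSketch {
  Optional<CompletableFuture<List<String>>> runClusteringForGroupAsync(GroupSketch group);

  default boolean isEnabled() {
    return true;
  }
}

// A provider that gates itself on its own flag and declines groups it cannot serve.
final class AcceleratedWriterSketch implements WriterSpiSketch {
  private final boolean enabled;

  AcceleratedWriterSketch(boolean enabled) {
    this.enabled = enabled;
  }

  @Override
  public boolean isEnabled() {
    return enabled;
  }

  @Override
  public Optional<CompletableFuture<List<String>>> runClusteringForGroupAsync(
      GroupSketch group) {
    if (group.deltaLogFiles > 0) {
      // Decline: the caller falls back to its default path for this group only.
      return Optional.empty();
    }
    // Placeholder for the scan, sort, and write phases performed by the provider.
    return Optional.of(CompletableFuture.supplyAsync(
        () -> Collections.singletonList("accelerated:" + group.id)));
  }
}
```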
   
   ### Risk level
   
   low
   
   The SPI is opt-in: a provider must be on the classpath and its 
`isEnabled()` must return `true`. The registry returns `Option.empty()` when no 
provider is registered, which short-circuits before any new code runs in the 
existing path. The `shouldForceRowWriter()` hook returns `false` by default.
   
   ### Documentation Update
   
   N/A -- internal SPI for downstream integrators; no user-facing docs change.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [x] CI passed
   


