[PR] [SPARK-50593][SQL] SPJ: Support truncate transform via generalized ReducibleFunction API [spark]

via GitHub Thu, 14 May 2026 11:43:09 -0700


metanil opened a new pull request, #55885:
URL: https://github.com/apache/spark/pull/55885


   ### What changes were proposed in this pull request?
   
   This PR adds Storage Partitioned Join (SPJ) support for the `truncate` 
partition transform. The approach generalizes the `ReducibleFunction` API to 
accept arbitrary parameters via a new `ReducibleParameters` container, so SPJ 
can reason about any parameterized transform (bucket, truncate, future ones) 
through one code path.
   
   Key changes:
   - **New public API** `ReducibleParameters` in 
`org.apache.spark.sql.connector.catalog.functions` — a typed parameter 
container.
   - **Generalized reducer** `ReducibleFunction.reducer(ReducibleParameters, 
ReducibleFunction, ReducibleParameters)`. The old `reducer(int, ..., int)` is 
marked `@Deprecated` but preserved as the default fallback, so existing 
connector implementations (e.g., Iceberg 1.10.0) continue to work unchanged.
   - **`TransformExpression` refactor**: literal parameters (e.g., bucket 
`numBuckets`, truncate `width`) now live inside `children` rather than a 
bespoke `numBucketsOpt: Option[Int]` field. `collectLeaves()` is overridden to 
filter literal parameters and return only column references.
   - **Generic SPJ path** in `TransformExpression`: it extracts 
`ReducibleParameters` from literal children and delegates to the new API, so 
the compatibility checks (`isCompatible`, `reducers`) work uniformly for bucket, 
truncate, etc.
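
The backward-compatibility fallback described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `Params`, `Fn`, `Reducer`, and `LegacyBucket` are hypothetical stand-ins for `ReducibleParameters`, `ReducibleFunction`, its reducer type, and a connector that implements only the old single-int signature. The real signatures may differ.

```java
import java.util.Arrays;
import java.util.List;

class ReducerFallbackSketch {

    /** Hypothetical stand-in for ReducibleParameters: an ordered list of literal arguments. */
    static final class Params {
        final List<Object> values;
        Params(Object... values) { this.values = Arrays.asList(values); }
    }

    /** Maps a partition value of one function onto the coarser, shared partitioning. */
    interface Reducer {
        int reduce(int partitionValue);
    }

    /** Hypothetical stand-in for ReducibleFunction. */
    interface Fn {
        /** Legacy single-int entry point; null means "not reducible". */
        @Deprecated
        default Reducer reducer(int thisParam, Fn other, int otherParam) {
            return null;
        }

        /** Generalized entry point; the default falls back to the legacy signature. */
        default Reducer reducer(Params mine, Fn other, Params theirs) {
            boolean singleInts = mine.values.size() == 1 && theirs.values.size() == 1
                    && mine.values.get(0) instanceof Integer
                    && theirs.values.get(0) instanceof Integer;
            if (!singleInts) {
                return null; // the legacy API cannot express these parameters
            }
            int a = (Integer) mine.values.get(0);
            int b = (Integer) theirs.values.get(0);
            return reducer(a, other, b);
        }
    }

    /** A connector-style function that only knows the old API. */
    static final class LegacyBucket implements Fn {
        @Override
        public Reducer reducer(int thisBuckets, Fn other, int otherBuckets) {
            boolean reducible = other instanceof LegacyBucket
                    && otherBuckets > 0
                    && thisBuckets > otherBuckets
                    && thisBuckets % otherBuckets == 0;
            // (hash % 8) % 4 == hash % 4, so a finer bucket id folds into a coarser one.
            return reducible ? bucket -> bucket % otherBuckets : null;
        }
    }

    public static void main(String[] args) {
        Fn bucket = new LegacyBucket();
        Reducer r = bucket.reducer(new Params(8), bucket, new Params(4));
        System.out.println("bucket 6 of 8 reduces to bucket " + r.reduce(6) + " of 4");
    }
}
```

In this sketch the caller only ever uses the generalized overload, and a function that never overrides it still participates because the default method unwraps the single integer parameter and forwards it.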
   
   ### Why are the changes needed?
   
   Today, a join on tables partitioned by `truncate(col, N)` always shuffles, 
even when both sides share identical partitioning. The write side was fixed by 
[SPARK-40295](https://issues.apache.org/jira/browse/SPARK-40295) (`Allow v2 
functions with literal args in write distribution and ordering`), but the 
read/join side was never enabled.
   
   Previous work in [#49211](https://github.com/apache/spark/pull/49211) 
(@szehon-ho) explored direct support for transforms with literal arguments by 
adjusting the SPJ paths to recognize them. This PR generalizes the reducer API 
so the compatibility check is function-agnostic, with a default method that 
delegates to the deprecated single-int signature for backward compatibility.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, for connector/catalog authors:
   - New public class `ReducibleParameters`.
   - New overload `ReducibleFunction.reducer(ReducibleParameters, ...)` with a 
default that delegates to the deprecated single-int signature for backward 
compatibility.
   - No action required for existing connectors; they keep working via the 
default fallback. Iceberg 1.10.0 (which implements only the old API) is 
verified via a dedicated `LegacyBucketFunction` test fixture.
   
   For end users, queries joining tables partitioned by compatible `truncate` 
transforms (identical widths, or reducible pairs like `truncate(3)` and 
`truncate(5)`) now avoid a shuffle via SPJ.
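
To illustrate why a pair such as `truncate(3)` and `truncate(5)` can be reducible, here is a small sketch that assumes string-prefix truncation semantics. That assumption is not stated in this description, and integer truncation would follow different rules. The class and method names are hypothetical. Under this assumption, truncating a width-5 key again to width 3 gives the same value as truncating the original string to width 3, so the wider side can be reduced to the narrower one.

```java
class TruncateReduceSketch {

    /** Prefix truncation: keep at most `width` leading chars (UTF-16 units, for simplicity). */
    static String truncate(String s, int width) {
        return s.length() <= width ? s : s.substring(0, width);
    }

    /** Hypothetical reducer: fold a key from the wider transform onto the narrower width. */
    static String reduce(String widerKey, int narrowerWidth) {
        return truncate(widerKey, narrowerWidth);
    }

    public static void main(String[] args) {
        String[] rows = {"spark", "sparrow", "sp", "iceberg"};
        for (String row : rows) {
            String narrowKey = truncate(row, 3);            // side partitioned by width 3
            String reducedKey = reduce(truncate(row, 5), 3); // side partitioned by width 5, reduced
            System.out.println(row + ": " + narrowKey + " vs " + reducedKey);
        }
    }
}
```

Because both keys agree for every input, rows that would join always land in matching partition groups after the reduction, which is the property SPJ needs to skip the shuffle.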
   
   ### How was this patch tested?
   
   Added five new tests to `KeyGroupedPartitioningSuite`.
   
   cc @szehon-ho @aokolnychyi @sunchao @peter-toth
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes — used only for test cases and Javadoc/Scaladoc comments.
   
   Generated-by: Claude Code (Opus 4.7)



