metanil opened a new pull request, #55885: URL: https://github.com/apache/spark/pull/55885
### What changes were proposed in this pull request?

This PR adds Storage Partitioned Join (SPJ) support for the `truncate` partition transform. The approach generalizes the `ReducibleFunction` API to accept arbitrary parameters via a new `ReducibleParameters` container, so SPJ can reason about any parameterized transform (bucket, truncate, future ones) through one code path.

Key changes:

- **New public API** `ReducibleParameters` in `org.apache.spark.sql.connector.catalog.functions`: a typed parameter container.
- **Generalized reducer** `ReducibleFunction.reducer(ReducibleParameters, ReducibleFunction, ReducibleParameters)`. The old `reducer(int, ..., int)` is marked `@Deprecated` but preserved as the default fallback, so existing connector implementations (e.g., Iceberg 1.10.0) continue to work unchanged.
- **`TransformExpression` refactor**: literal parameters (e.g., bucket `numBuckets`, truncate `width`) now live inside `children` rather than a bespoke `numBucketsOpt: Option[Int]` field. `collectLeaves()` is overridden to filter literal parameters and return only column references.
- Generic path in `TransformExpression` that extracts `ReducibleParameters` from literal children and delegates to the new API; compatibility checks (`isCompatible`, `reducers`) work uniformly for bucket, truncate, etc.

### Why are the changes needed?

Today a join on tables partitioned by `truncate(col, N)` always shuffles, even when both sides share identical partitioning. The write-side was fixed by [SPARK-40295](https://issues.apache.org/jira/browse/SPARK-40295) (`Allow v2 functions with literal args in write distribution and ordering`), but the read/join side was never enabled. Previous work in [#49211](https://github.com/apache/spark/pull/49211) (@szehon-ho) explored direct support for transforms with literal arguments by adjusting the SPJ paths to recognize them.
This PR generalizes the reducer API so the compatibility check is function-agnostic, with a default method that delegates to the deprecated single-int signature for backward compatibility.

### Does this PR introduce _any_ user-facing change?

Yes, for connector/catalog authors:

- New public class `ReducibleParameters`.
- New overload `ReducibleFunction.reducer(ReducibleParameters, ...)` with a default that delegates to the deprecated single-int signature for backward compatibility.
- No action required for existing connectors; they keep working via the default fallback. Iceberg 1.10.0 (which implements only the old API) is verified via a dedicated `LegacyBucketFunction` test fixture.

For end users, queries joining tables partitioned by compatible `truncate` transforms (identical widths, or reducible pairs like `truncate(3)` and `truncate(5)`) now avoid shuffle via SPJ.

### How was this patch tested?

Five new tests in `KeyGroupedPartitioningSuite`.

cc @szehon-ho @aokolnychyi @sunchao @peter-toth

### Was this patch authored or co-authored using generative AI tooling?

Yes, used only for test cases and Javadoc/Scaladoc comments.

Generated-by: Claude Code (Opus 4.7)
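
To illustrate why a pair such as `truncate(3)` and `truncate(5)` can be treated as reducible, here is a minimal, language-agnostic sketch in Python. It is not the code from this PR and does not use the Spark API. It assumes string prefix semantics for `truncate` (the description above does not say which types are covered, and other types may follow different rules), and the names `truncate` and `reducer` below are hypothetical helpers chosen for illustration.

```python
from typing import Callable, Optional


def truncate(value: str, width: int) -> str:
    """Partition key under assumed prefix semantics: the first `width` characters."""
    if width <= 0:
        raise ValueError("width must be positive")
    return value[:width]


def reducer(from_width: int, to_width: int) -> Optional[Callable[[str], str]]:
    """Hypothetical reducer: map truncate(from_width) keys onto truncate(to_width) keys.

    A wider prefix can always be shortened to a narrower one, so a mapping
    exists when to_width <= from_width. The reverse direction would need
    characters that were already discarded, so None is returned.
    """
    if from_width <= 0 or to_width <= 0:
        raise ValueError("widths must be positive")
    if to_width > from_width:
        return None
    return lambda key: key[:to_width]


# Usage: one side is partitioned by width 3, the other by width 5.
rows = ["apple", "apricot", "banana", "ab"]
narrow_keys = {truncate(r, 3) for r in rows}

to_narrow = reducer(5, 3)
assert to_narrow is not None
reduced_wide_keys = {to_narrow(truncate(r, 5)) for r in rows}

# After reduction both sides group rows under the same keys.
assert narrow_keys == reduced_wide_keys
```

Under these assumptions, only the wider side needs a reducer. Once its keys are shortened, rows that would match in the join fall under identical partition keys on both sides, which is the property a shuffle-free join relies on.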
