peter-toth opened a new pull request, #55519:
URL: https://github.com/apache/spark/pull/55519

   ### What changes were proposed in this pull request?
   
   When a `KeyedPartitioning` passes through a 
`PartitioningPreservingUnaryExecNode` (e.g. `ProjectExec`), the previous 
implementation projected the partitioning as a whole expression via 
`multiTransformDown`. If any expression position could not be mapped to an 
output attribute, the entire `KeyedPartitioning` was silently dropped, 
resulting in `UnknownPartitioning`.
   
   This PR replaces that approach with a per-position projection algorithm 
implemented in two new private helpers (`projectKeyedPartitionings` and 
`projectOtherPartitionings`), with the main `outputPartitioning` reduced to a 
simple split, project, and combine:
   
   1. For each expression position (0..N-1), collect the unique expressions at 
that position across all input `KeyedPartitioning`s (using `ExpressionSet` to 
deduplicate semantically equal expressions), then project each through the 
output aliases via `projectExpression`.
   2. Positions with at least one projected alternative are *projectable*; they 
define the maximum achievable granularity. Positions that cannot be expressed 
in the output are dropped (narrowing).
   3. The shared `partitionKeys` are projected to the subset of projectable 
positions via `KeyedPartitioning.projectKeys`.
   4. The final `KeyedPartitioning`s are the cross-product of per-position 
alternatives, computed lazily via `MultiTransform.generateCartesianProduct`, 
deduplicated, and bounded by a single outer `take(aliasCandidateLimit)`.
   
   All resulting `KeyedPartitioning`s at the same granularity share the same 
`partitionKeys` object, preserving the invariant required by 
`GroupPartitionsExec`.
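
The four steps above can be sketched in plain Scala under strong simplifying assumptions: column names stand in for Catalyst expressions, `distinct` stands in for `ExpressionSet`, and `KP`, `projectKeyed`, and `cartesian` are hypothetical names introduced for illustration only. The `projectExpression` stand-in here is a simple alias lookup and does not reflect the real helper's signature.

```scala
object KeyedProjectionSketch {
  // Illustrative stand-in for KeyedPartitioning: key expressions plus the
  // partition key rows shared by all partitionings of the same granularity.
  final case class KP(expressions: Seq[String], partitionKeys: Seq[Seq[Any]])

  // Hypothetical stand-in for projectExpression: returns every output
  // attribute that aliases `expr`, or nothing if it is not in the output.
  def projectExpression(expr: String, aliases: Map[String, Seq[String]]): Seq[String] =
    aliases.getOrElse(expr, Seq.empty)

  // Lazily enumerates the cross-product of the per-position alternatives.
  def cartesian(alternatives: Seq[Seq[String]]): Iterator[Seq[String]] =
    alternatives.foldLeft(Iterator(Seq.empty[String])) { (acc, alts) =>
      acc.flatMap(prefix => alts.iterator.map(alt => prefix :+ alt))
    }

  def projectKeyed(inputs: Seq[KP], aliases: Map[String, Seq[String]], limit: Int): Seq[KP] =
    if (inputs.isEmpty) Seq.empty
    else {
      val numPositions = inputs.head.expressions.length
      // Step 1: per position, collect the distinct input expressions and
      // project each one through the output aliases.
      val perPosition: Seq[Seq[String]] = (0 until numPositions).map { i =>
        inputs.map(_.expressions(i)).distinct
          .flatMap(e => projectExpression(e, aliases)).distinct
      }
      // Step 2: keep only positions with at least one projected alternative.
      val projectable = perPosition.zipWithIndex.collect {
        case (alts, i) if alts.nonEmpty => i
      }
      if (projectable.isEmpty) Seq.empty
      else {
        // Step 3: narrow the shared keys once, so that every result below
        // references the same key object.
        val sharedKeys: Seq[Seq[Any]] =
          inputs.head.partitionKeys.map(row => projectable.map(i => row(i)))
        // Step 4: bounded cross-product. Alternatives are already distinct
        // per position, so the product contains no duplicates.
        cartesian(projectable.map(i => perPosition(i)))
          .take(limit)
          .map(exprs => KP(exprs, sharedKeys))
          .toList
      }
    }
}
```

With input keys `(a, b)` and an alias map of `Map("a" -> Seq("a", "a1"))`, this sketch returns two one-column partitionings, on `a` and on `a1`, that reference the same narrowed key sequence. It does not model how duplicate key rows produced by narrowing are handled downstream.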
   
   ### Why are the changes needed?
   
   Without narrowing, a `ProjectExec` that drops any one column of a 
multi-column partition key causes the entire `KeyedPartitioning` to be lost. 
This breaks storage-partitioned join optimisations that rely on the 
partitioning surviving projection.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Added unit tests in `ProjectedOrderingAndPartitioningSuite` covering:
   - Full-granularity alias substitution (existing behaviour, unchanged)
   - 2->1 narrowing without aliases
   - 2->1 narrowing with alias, verifying shared `partitionKeys` object identity
   - 3->2 narrowing with alias
   - `PartitioningCollection` where one `KeyedPartitioning` can be fully 
projected and another cannot
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Sonnet 4.6
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

