naveenp2708 opened a new pull request, #55475:
URL: https://github.com/apache/spark/pull/55475

   ### What changes were proposed in this pull request?
   
   This PR fixes SPARK-46367. When `ProjectExec` aliases a column (e.g.
`id AS new_id`), the `KeyedPartitioning` reported by `outputPartitioning` still
references the original column's ExprId. As a result, `EnsureRequirements`
cannot match a `ClusteredDistribution` on the aliased column and inserts an
unnecessary shuffle `Exchange`.
   
   The fix adds direct ExprId-based remapping of `KeyedPartitioning`
expressions through column aliases in `PartitioningPreservingUnaryExecNode`.
It introduces two new helpers:
   - `buildExprIdAliasMap`: builds ExprId → Attribute map from alias entries
   - `remapKeyedPartitioning`: substitutes attributes in KeyedPartitioning 
expressions via the alias map, recursing into transform expressions
   
   If a partitioning expression references an attribute that is neither
aliased nor present in the output set, the partitioning is dropped, consistent
with the existing filter logic.
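
   The remapping described above can be illustrated with a minimal, self-contained sketch. It does not use Spark's Catalyst classes: `Attr`, `Alias`, `Transform`, `Lit`, and the `AliasRemapSketch` object are toy stand-ins invented for this example, ExprIds are modeled as plain `Long` values, and the helper signatures are assumptions rather than the ones in the patch. Only the helper names `buildExprIdAliasMap` and `remapKeyedPartitioning` are taken from the description.

```scala
// Toy stand-ins for Catalyst expressions. These are NOT Spark's real classes.
sealed trait Expr
final case class Attr(name: String, exprId: Long) extends Expr
final case class Lit(value: Int) extends Expr
// A partition transform such as bucket(4, id), modeled as a name plus arguments.
final case class Transform(name: String, args: Seq[Expr]) extends Expr
// A projection entry `child AS name`; the alias gets its own ExprId.
final case class Alias(child: Expr, name: String, exprId: Long) {
  def toAttribute: Attr = Attr(name, exprId)
}

object AliasRemapSketch {
  // ExprId of each directly aliased attribute -> output attribute of the alias.
  def buildExprIdAliasMap(aliases: Seq[Alias]): Map[Long, Attr] =
    aliases.collect { case a @ Alias(Attr(_, id), _, _) => id -> a.toAttribute }.toMap

  // Rewrites one expression. Returns None when it references an attribute that
  // is neither in the output set nor aliased.
  private def remapExpr(
      e: Expr,
      aliasMap: Map[Long, Attr],
      outputIds: Set[Long]): Option[Expr] = e match {
    case a: Attr if outputIds.contains(a.exprId) => Some(a)
    case Attr(_, id) => aliasMap.get(id)
    case l: Lit => Some(l)
    case Transform(n, args) =>
      // Recurse into transform arguments; fail if any argument cannot be remapped.
      val remapped = args.map(remapExpr(_, aliasMap, outputIds))
      if (remapped.forall(_.isDefined)) Some(Transform(n, remapped.flatten)) else None
  }

  // Remaps all partitioning keys, or drops the partitioning (None) if any key fails.
  def remapKeyedPartitioning(
      keys: Seq[Expr],
      aliasMap: Map[Long, Attr],
      outputIds: Set[Long]): Option[Seq[Expr]] = {
    val remapped = keys.map(remapExpr(_, aliasMap, outputIds))
    if (remapped.forall(_.isDefined)) Some(remapped.flatten) else None
  }

  def main(args: Array[String]): Unit = {
    val id = Attr("id", 1L)
    val aliasMap = buildExprIdAliasMap(Seq(Alias(id, "new_id", 2L)))
    // The projection outputs only new_id (ExprId 2), so `id` must be remapped.
    println(remapKeyedPartitioning(Seq(Transform("bucket", Seq(Lit(4), id))), aliasMap, Set(2L)))
  }
}
```

   In this toy model, a key on `id` is rewritten to reference `new_id`, while a key on an attribute that is neither aliased nor projected yields `None`, mirroring the "partitioning is dropped" case.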
   
   ### Why are the changes needed?
   
   Storage-partitioned join (SPJ) queries that alias columns and then
aggregate incur unnecessary shuffles, degrading performance. The bug has been
present since Spark 3.5.0 and persists on current master after the
KeyGroupedPartitioning → KeyedPartitioning refactor.
   
   ### Does this PR introduce any user-facing change?
   
   Yes. SPJ queries with column aliases will avoid unnecessary shuffles for 
downstream aggregations and dedup operations.
   
   ### How was this patch tested?
   
   Added a reproduction test to `KeyGroupedPartitioningSuite`. All 211 related
tests pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

