naveenp2708 opened a new pull request, #55475: URL: https://github.com/apache/spark/pull/55475
### What changes were proposed in this pull request?

Fix for SPARK-46367.

When `ProjectExec` aliases a column (e.g. `id AS new_id`), the `KeyedPartitioning` from `outputPartitioning` still references the old column's ExprId. `EnsureRequirements` cannot match `ClusteredDistribution` on the aliased column and inserts an unnecessary Exchange shuffle.

This fix adds direct ExprId-based remapping of `KeyedPartitioning` expressions through column aliases in `PartitioningPreservingUnaryExecNode`. Two new helpers:

- `buildExprIdAliasMap`: builds an ExprId → Attribute map from alias entries
- `remapKeyedPartitioning`: substitutes attributes in `KeyedPartitioning` expressions via the alias map, recursing into transform expressions

Non-aliased attributes absent from the output set cause the partitioning to be dropped, consistent with existing filter logic.

### Why are the changes needed?

SPJ queries with column aliases followed by aggregation insert unnecessary shuffles, degrading performance. The bug has been present since Spark 3.5.0 and persists on current master after the KeyGroupedPartitioning → KeyedPartitioning refactor.

### Does this PR introduce any user-facing change?

Yes. SPJ queries with column aliases will avoid unnecessary shuffles for downstream aggregations and dedup operations.

### How was this patch tested?

Added reproduction test in `KeyGroupedPartitioningSuite`. All 211 related tests pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
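The alias remapping described under "What changes were proposed" can be illustrated with a small toy model. This is only a sketch and not the code in the PR: it is written in Python with no Spark dependency, and `Attribute`, `Transform`, `Alias`, and `remap_expr` are hypothetical stand-ins for the Catalyst classes. The two function names mirror the helpers named above, in snake_case. It assumes that an alias takes precedence when a column is both aliased and passed through unchanged, which the description does not specify.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Set, Tuple, Union


@dataclass(frozen=True)
class Attribute:
    """Stand-in for a Catalyst attribute: a name plus a unique expression id."""
    name: str
    expr_id: int


@dataclass(frozen=True)
class Transform:
    """Stand-in for a partition transform such as bucket(id)."""
    name: str
    children: Tuple["Expr", ...]


Expr = Union[Attribute, Transform]


@dataclass(frozen=True)
class Alias:
    """Stand-in for `child AS output.name` in a projection."""
    child: Attribute
    output: Attribute


def build_expr_id_alias_map(aliases: Sequence[Alias]) -> Dict[int, Attribute]:
    # Map the aliased column's expression id to the attribute the projection exposes.
    return {a.child.expr_id: a.output for a in aliases}


def remap_expr(expr: Expr, alias_map: Dict[int, Attribute],
               output_ids: Set[int]) -> Optional[Expr]:
    if isinstance(expr, Attribute):
        if expr.expr_id in alias_map:
            # Assumption: the alias wins if the column is also passed through.
            return alias_map[expr.expr_id]
        # Non-aliased attributes must still be in the output, else give up.
        return expr if expr.expr_id in output_ids else None
    # Recurse into transform expressions, e.g. bucket(id) -> bucket(new_id).
    children = [remap_expr(c, alias_map, output_ids) for c in expr.children]
    if any(c is None for c in children):
        return None
    return Transform(expr.name, tuple(children))


def remap_keyed_partitioning(keys: Sequence[Expr], aliases: Sequence[Alias],
                             output: Sequence[Attribute]) -> Optional[Tuple[Expr, ...]]:
    """Return the remapped partitioning keys, or None if the partitioning is dropped."""
    alias_map = build_expr_id_alias_map(aliases)
    output_ids = {a.expr_id for a in output}
    remapped = [remap_expr(k, alias_map, output_ids) for k in keys]
    if any(r is None for r in remapped):
        return None
    return tuple(remapped)


# Example: the child is partitioned by bucket(id); the projection is `id AS new_id`.
id_attr = Attribute("id", 1)
new_id = Attribute("new_id", 2)
ts = Attribute("ts", 3)

kept = remap_keyed_partitioning(
    [Transform("bucket", (id_attr,))], [Alias(id_attr, new_id)], [new_id])

# Example: partitioned by ts, which the projection neither aliases nor outputs.
dropped = remap_keyed_partitioning([ts], [Alias(id_attr, new_id)], [new_id])
```

In this model `kept` refers to `new_id`, so a downstream requirement keyed on `new_id` can be matched without a shuffle, while `dropped` is `None` because `ts` is not visible after the projection.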
