peter-toth opened a new pull request, #56088:
URL: https://github.com/apache/spark/pull/56088
### What changes were proposed in this pull request?
- Add `Expression.collectAttributes(): Seq[Attribute]` for cases that need a
positional `Seq` (preserving order and duplicates).
- Migrate SPJ-related uses of `_.collectLeaves()` in `partitioning.scala`
and `EnsureRequirements.scala` to:
- `Expression.references` / `AttributeSet.fromAttributeSets(...)` where
set semantics suffice (existence/size checks, single-attribute lookup).
- The new `Expression.collectAttributes()` where positional
`Seq[Attribute]` is required (`RowOrdering.create`, `reorder`,
`attributes.zip(clustering)`).
- Drop the now-redundant `.map(_.asInstanceOf[Attribute])` cast at the
`EnsureRequirements` site that previously typed the result of `collectLeaves`.
### Why are the changes needed?
`TreeNode.collectLeaves()` returns every node in the tree where
`children.isEmpty` — including `Literal`s. SPJ planning has always wanted
attributes only, but with the existing partition expression layout
(`TransformExpression.children = [col]`, parameters carried in a sidecar
`numBucketsOpt: Option[Int]` field), the difference didn't surface in practice.
Follow-up work that puts literal parameters directly into
`TransformExpression.children` (e.g. `bucket(Literal(numBuckets), col)`,
`truncate(col, Literal(width))`) — see SPARK-50593 / #55885 — would otherwise
force `TransformExpression` to override `collectLeaves` to filter literals,
breaking the universal `TreeNode.collectLeaves` contract for one expression
type.
Fixing the abstraction layer by using the right helper at each call site is
cleaner than overriding `collectLeaves` inside `TransformExpression`.
### Does this PR introduce _any_ user-facing change?
No. Pure refactor — for current partition expression shapes (no literal
children) `_.collectLeaves()`, `_.references`, and `_.collectAttributes()`
produce identical results.
### How was this patch tested?
Covered by existing test suites that exercise the migrated call sites:
`KeyGroupedPartitioningSuite`, `EnsureRequirementsSuite`,
`ProjectedOrderingAndPartitioningSuite`.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.7
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]