zhengruifeng opened a new pull request, #55582:
URL: https://github.com/apache/spark/pull/55582
### What changes were proposed in this pull request?
Fix an `AMBIGUOUS_COLUMN_REFERENCE` regression introduced by SPARK-55070
when a DataFrame is referenced both directly in a join and also nested under a
natural/USING join elsewhere in the same plan.
The fix is a single-pass implementation: it introduces a
`DFColumnCandidate(expr, depth, hidden)` case class that is threaded through
`resolveDataFrameColumnByPlanId` / `resolveDataFrameColumnRecursively`.
The walk tracks whether each candidate ever passed through an ancestor that
exposes it only via `p.metadataOutput`, by latching
`hidden = h || r.references.subsetOf(AttributeSet(p.metadataOutput))`
at each step. At merge time, the candidates are partitioned by `hidden`:
- if any regular (`hidden = false`) candidate exists, run the merge over
regulars only and ignore hidden ones (e.g. the natural/USING-join hidden key);
- otherwise run the same merge over hidden candidates.
The depth-0 direct-match tiebreaker in the `foldLeft` is preserved.
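
The latch and the partition-then-merge step described above can be modelled in a few lines of plain Python. This is only a toy sketch, not the Scala code in the patch: `Candidate`, `latch_hidden`, `merge`, `pick` and `AmbiguousColumnReference` are hypothetical names, attributes are represented as strings, and `merge` is a deliberately simplified stand-in that does not reproduce the depth-based tiebreaker of the real `foldLeft`.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, List, Optional


class AmbiguousColumnReference(Exception):
    """Toy stand-in for the AMBIGUOUS_COLUMN_REFERENCE error."""


@dataclass(frozen=True)
class Candidate:
    """Toy model of DFColumnCandidate(expr, depth, hidden); names are hypothetical."""
    expr: str
    references: FrozenSet[str]
    depth: int
    hidden: bool = False


def latch_hidden(c: Candidate, metadata_output: FrozenSet[str]) -> Candidate:
    # Once a candidate is marked hidden it stays hidden, mirroring
    # `hidden = h || r.references.subsetOf(AttributeSet(p.metadataOutput))`.
    return replace(c, hidden=c.hidden or c.references <= metadata_output)


def merge(cands: List[Candidate]) -> Optional[Candidate]:
    # Simplified merge (assumption): a single candidate wins, several are
    # ambiguous. The real merge also applies a depth-based tiebreaker.
    if not cands:
        return None
    if len(cands) > 1:
        raise AmbiguousColumnReference([c.expr for c in cands])
    return cands[0]


def pick(cands: List[Candidate]) -> Optional[Candidate]:
    # Partition by `hidden`: regular candidates take priority; hidden ones
    # are only considered when no regular candidate exists.
    regular = [c for c in cands if not c.hidden]
    hidden = [c for c in cands if c.hidden]
    return merge(regular or hidden)


# Direct use of `dim` versus the same column hidden by a USING join.
direct = Candidate("dim.dim_id", frozenset({"dim_id"}), depth=2)
nested = latch_hidden(
    Candidate("dim.dim_id", frozenset({"dim_id"}), depth=3),
    metadata_output=frozenset({"dim_id"}),
)
chosen = pick([direct, nested])
```

In this model `chosen` is the direct candidate, because the hidden one is dropped before the merge runs; with only `nested` present, the hidden candidate would still be returned.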
### Why are the changes needed?
SPARK-55070 broadened the ancestor filter in
`resolveDataFrameColumnRecursively` from `p.outputSet` to `p.output ++
p.metadataOutput` so that `rhs["join_key"]` works after a natural/USING join
(where one join key is hidden in `Project.hiddenOutputTag`). But when the same
DataFrame is both used directly in a join and also nested under a
natural/USING-join wrapper elsewhere in the plan, the broadened filter lets
both candidates through `resolveDataFrameColumnByPlanId`'s merge, tripping
`throw ambiguousColumnReferences(u)`.
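
As a rough illustration of why the broadened filter lets the second candidate through, the check can be modelled with plain Python sets. The function name `passes_ancestor_filter`, the `broadened` flag and the string attribute names are all hypothetical, and the real filter operates on Catalyst attribute sets rather than strings.

```python
from typing import FrozenSet


def passes_ancestor_filter(
    references: FrozenSet[str],
    output: FrozenSet[str],
    metadata_output: FrozenSet[str],
    broadened: bool,
) -> bool:
    # Before SPARK-55070 (modelled by broadened=False) only the regular
    # output counted; afterwards hidden metadata columns count as well.
    visible = output | metadata_output if broadened else output
    return references <= visible


# Toy model of the node produced by events.join(dim, "dim_id", "left"):
# the join key from `dim` is hidden and only reachable via metadata output.
node_output = frozenset({"events.dim_id", "events.txn_id"})
node_metadata = frozenset({"dim.dim_id"})
refs = frozenset({"dim.dim_id"})

old_result = passes_ancestor_filter(refs, node_output, node_metadata, broadened=False)
new_result = passes_ancestor_filter(refs, node_output, node_metadata, broadened=True)
```

Under the narrow filter the nested candidate is discarded at the USING-join node; under the broadened filter it survives and later meets the direct candidate at the merge.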
For example, queries like:
```python
enriched = events.join(dim, "dim_id", "left")  # USING join hides dim's dim_id
result = (
    fact
    .join(dim, fact["fk"] == dim["dim_id"], "left")  # direct use of dim
    .join(enriched, "txn_id", "full_outer")
    .select(dim["dim_id"])  # previously AMBIGUOUS
)
```
now resolve `dim["dim_id"]` to the direct-usage output candidate.
### Does this PR introduce _any_ user-facing change?
Yes, this is a bug fix. Queries that referenced a DataFrame both directly in a
join and nested under a natural/USING join (where the wrapper moves one of the
columns into `metadataOutput`) previously raised `AMBIGUOUS_COLUMN_REFERENCE`.
They now resolve to the direct-usage candidate.
### How was this patch tested?
- New test
`test_select_regular_column_with_reused_dataframe_hidden_in_natural_join` added
to `ColumnTestsMixin` in `python/pyspark/sql/tests/test_column.py`.
- Existing pyspark column-resolution tests should keep passing, including
`test_self_join`, `test_self_join_II/III/IV`, and `test_select_join_keys`.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.7)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]