[PR] [SPARK-56632][SQL][CONNECT] Fix AMBIGUOUS_COLUMN_REFERENCE regression for reused DataFrame in natural join [spark]

via GitHub Tue, 28 Apr 2026 04:05:50 -0700


zhengruifeng opened a new pull request, #55582:
URL: https://github.com/apache/spark/pull/55582


   ### What changes were proposed in this pull request?
   
   Fix an `AMBIGUOUS_COLUMN_REFERENCE` regression introduced by SPARK-55070 when a DataFrame is referenced both directly in a join and also nested under a natural/USING join elsewhere in the same plan.
   
   Single-pass implementation: introduce a `DFColumnCandidate(expr, depth, hidden)` case class threaded through `resolveDataFrameColumnByPlanId` / `resolveDataFrameColumnRecursively`. The walk tracks whether each candidate ever passed through an ancestor that exposes it only via `p.metadataOutput`, by latching `hidden = h || r.references.subsetOf(AttributeSet(p.metadataOutput))` at each step. At merge time, the candidates are partitioned by `hidden`:
   
   - if any regular (`hidden = false`) candidate exists, run the merge over regulars only and ignore hidden ones (e.g. the natural/USING-join hidden key);
   - otherwise run the same merge over hidden candidates.
   
   The depth-0 direct-match tiebreaker in the `foldLeft` is preserved.
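
The partition-then-merge step described above can be sketched in plain Python. This is a toy model, not the Scala code in the patch: `Candidate`, `latch_hidden`, `merge`, and `resolve` are hypothetical names, attributes are modeled as strings, and the merge rule is a simplified assumption that only mimics "a depth-0 direct match wins, otherwise differing candidates are ambiguous".

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Candidate:
    """Toy stand-in for DFColumnCandidate(expr, depth, hidden)."""
    expr: str      # resolved attribute, modeled here as a plain string
    depth: int     # toy depth counter; 0 means a direct match
    hidden: bool   # True once the walk crossed a metadataOutput-only ancestor


class AmbiguousColumnReference(Exception):
    """Toy stand-in for the AMBIGUOUS_COLUMN_REFERENCE error."""


def latch_hidden(h: bool, refs: FrozenSet[str], metadata_output: FrozenSet[str]) -> bool:
    # Mirrors `h || r.references.subsetOf(AttributeSet(p.metadataOutput))`:
    # once a candidate is marked hidden, it stays hidden further up the plan.
    return h or refs <= metadata_output


def merge(candidates: List[Candidate]) -> Optional[Candidate]:
    # ASSUMPTION: simplified merge rule, not the real foldLeft.
    if not candidates:
        return None
    direct = [c for c in candidates if c.depth == 0]
    pool = direct if len(direct) == 1 else candidates  # depth-0 tiebreaker
    if len({c.expr for c in pool}) > 1:
        raise AmbiguousColumnReference(", ".join(c.expr for c in pool))
    return pool[0]


def resolve(candidates: List[Candidate]) -> Optional[Candidate]:
    # Partition by `hidden`; hidden candidates are used only as a fallback.
    regular = [c for c in candidates if not c.hidden]
    hidden = [c for c in candidates if c.hidden]
    return merge(regular if regular else hidden)


# Hypothetical candidates for dim["dim_id"] in the reused-DataFrame scenario.
direct = Candidate("dim.dim_id#direct", depth=2, hidden=False)
via_using = Candidate("dim.dim_id#hidden", depth=3, hidden=True)

picked = resolve([direct, via_using])   # regular candidate wins
fallback = resolve([via_using])         # hidden key still resolves on its own
```

In this model, calling `merge([direct, via_using])` directly (no partitioning) raises `AmbiguousColumnReference`, which corresponds to the regression; `resolve` avoids it by dropping hidden candidates whenever a regular one exists.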
   
   ### Why are the changes needed?
   
   SPARK-55070 broadened the ancestor filter in `resolveDataFrameColumnRecursively` from `p.outputSet` to `p.output ++ p.metadataOutput` so that `rhs["join_key"]` works after a natural/USING join (where one join key is hidden in `Project.hiddenOutputTag`). But when the same DataFrame is both used directly in a join and also nested under a natural/USING-join wrapper elsewhere in the plan, the broadened filter lets both candidates reach the merge in `resolveDataFrameColumnByPlanId`, which then hits `throw ambiguousColumnReferences(u)`.
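
To illustrate why the broadened filter produces two candidates, here is a small set-based model in Python. It is only a sketch under stated assumptions: the function names and the attribute strings are hypothetical, and each ancestor is reduced to a pair of `output` and `metadata_output` sets.

```python
from typing import Dict, FrozenSet, List, Tuple

Attrs = FrozenSet[str]


def passes_original_filter(refs: Attrs, output: Attrs) -> bool:
    # Models the pre-SPARK-55070 check against `p.outputSet`.
    return refs <= output


def passes_broadened_filter(refs: Attrs, output: Attrs, metadata_output: Attrs) -> bool:
    # Models the check against `p.output ++ p.metadataOutput`.
    return refs <= (output | metadata_output)


# Hypothetical ancestors of the two usages of `dim`: (output, metadata_output).
ancestors: Dict[str, Tuple[Attrs, Attrs]] = {
    # dim joined directly: dim_id is a regular output column.
    "direct_join": (frozenset({"fact.fk", "dim.dim_id"}), frozenset()),
    # dim under a USING join: dim's dim_id is only in metadataOutput.
    "using_wrapper": (
        frozenset({"events.dim_id", "events.txn_id"}),
        frozenset({"dim.dim_id"}),
    ),
}

refs: Attrs = frozenset({"dim.dim_id"})  # what dim["dim_id"] references

survivors_original: List[str] = [
    name for name, (out, _) in ancestors.items()
    if passes_original_filter(refs, out)
]
survivors_broadened: List[str] = [
    name for name, (out, meta) in ancestors.items()
    if passes_broadened_filter(refs, out, meta)
]

print(survivors_original)   # ['direct_join']
print(survivors_broadened)  # ['direct_join', 'using_wrapper']
```

With the original filter only the direct usage survives; with the broadened one both do, so the merge sees two distinct candidates for the same reference.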
   
   For example, queries like:
   
   ```python
   enriched = events.join(dim, "dim_id", "left")   # USING join hides dim's dim_id
   result = (fact
     .join(dim, fact["fk"] == dim["dim_id"], "left")  # direct use of dim
     .join(enriched, "txn_id", "full_outer")
     .select(dim["dim_id"]))                          # previously AMBIGUOUS
   ```
   
   now resolve `dim["dim_id"]` to the direct-usage output candidate.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, this is a bug fix. Queries that referenced a DataFrame both directly in a join and nested under a natural/USING join (where the wrapper hides one of the columns in `metadataOutput`) previously raised `AMBIGUOUS_COLUMN_REFERENCE`. They now resolve to the direct-usage candidate.
   
   ### How was this patch tested?
   
   - New `test_select_regular_column_with_reused_dataframe_hidden_in_natural_join` added to `ColumnTestsMixin` in `python/pyspark/sql/tests/test_column.py`.
   - Existing PySpark column-resolution tests should keep passing, including `test_self_join`, `test_self_join_II/III/IV`, and `test_select_join_keys`.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Opus 4.7)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

