zhengruifeng opened a new pull request, #55556:
URL: https://github.com/apache/spark/pull/55556
### What changes were proposed in this pull request?
Replace the single broadened ancestor walk in `resolveDataFrameColumn` with
a two-walk pattern, mirroring the `outputAttributes.resolve(...) orElse
outputMetadataAttributes.resolve(...)` precedence used by `LogicalPlan.resolve`
/ `LogicalPlan.resolveChildren`:
- **Metadata access** (`df["_metadata"]`, `IS_METADATA_COL` tagged): a
single walk filtered by `p.metadataOutput`.
- **Regular access** (`df["col"]`): walk first with the strict filter
`p.outputSet` (the pre-`a84a39a` behavior). The strict filter drops candidates
that are hidden at an ancestor, e.g. the right side's join key after a
natural/USING join. If the strict walk resolves the column, use that result.
Otherwise retry with the broad filter `p.output ++ p.metadataOutput` to handle
the SPARK-55070 `rhs["join_key"]` case, where the only valid resolution is via
`p.metadataOutput`.
The filter choice is threaded as a `getAllowed: LogicalPlan => AttributeSet`
argument through `resolveDataFrameColumnByPlanId` /
`resolveDataFrameColumnRecursively`; no change to the `foldLeft` merge logic.
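The two-walk precedence can be illustrated with a small toy model in plain Python. This is only a sketch, not the Scala implementation: `Plan`, `walk`, `resolve`, `strict`, and `broad` are hypothetical names, attributes are modelled as `name#id` strings, and the sketch assumes that the two occurrences of the reused DataFrame yield distinct candidate attributes and that a hidden attribute stays reachable through each ancestor's metadata output.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Plan:
    """Toy stand-in for a logical plan node (not Spark's LogicalPlan)."""
    output: Set[str]
    metadata_output: Set[str] = field(default_factory=set)
    children: List["Plan"] = field(default_factory=list)
    plan_id: Optional[int] = None


class AmbiguousColumn(Exception):
    pass


def walk(plan: Plan, plan_id: int, col: str,
         get_allowed: Callable[[Plan], Set[str]]) -> Set[str]:
    """Collect candidates for `col` that survive every ancestor's filter."""
    if plan.plan_id == plan_id:
        return {a for a in plan.output if a.split("#")[0] == col}
    found: Set[str] = set()
    for child in plan.children:
        found |= walk(child, plan_id, col, get_allowed)
    # Ancestor filter: keep only candidates this node still exposes.
    return found & get_allowed(plan)


def strict(p: Plan) -> Set[str]:
    return p.output


def broad(p: Plan) -> Set[str]:
    return p.output | p.metadata_output


def resolve(plan: Plan, plan_id: int, col: str) -> Optional[str]:
    """Strict walk first; fall back to the broad walk only if it finds nothing."""
    for get_allowed in (strict, broad):
        candidates = walk(plan, plan_id, col, get_allowed)
        if len(candidates) > 1:
            raise AmbiguousColumn(col)
        if candidates:
            return next(iter(candidates))
    return None


# `dim` (plan id 7) is used twice: directly, and under a USING join.
dim_direct = Plan(output={"dim_id#1"}, plan_id=7)
dim_nested = Plan(output={"dim_id#2"}, plan_id=7)
events = Plan(output={"txn_id#3", "dim_id#4"})
fact = Plan(output={"txn_id#5", "fk#6"})

# USING join on dim_id: dim's own key is hidden in the metadata output.
enriched = Plan(output={"txn_id#3", "dim_id#4"},
                metadata_output={"dim_id#2"},
                children=[events, dim_nested])
direct_join = Plan(output={"txn_id#5", "fk#6", "dim_id#1"},
                   children=[fact, dim_direct])
top = Plan(output={"txn_id#9", "fk#6", "dim_id#1", "dim_id#4"},
           metadata_output={"txn_id#5", "txn_id#3", "dim_id#2"},
           children=[direct_join, enriched])

print(resolve(top, 7, "dim_id"))       # strict walk keeps only the direct candidate
print(walk(top, 7, "dim_id", broad))   # broad walk alone lets both candidates through
print(resolve(enriched, 7, "dim_id"))  # strict finds nothing, broad fallback resolves
```

In this model the strict walk drops the nested candidate at `enriched`, so only the direct-usage attribute reaches the root, while the broad walk is consulted only when the strict walk finds nothing, which covers the `rhs["join_key"]` case.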
### Why are the changes needed?
Follow-up fix to `a84a39a` (SPARK-55070). That commit broadened the ancestor
filter in `resolveDataFrameColumnRecursively` from `p.outputSet` to `p.output
++ p.metadataOutput` so that `rhs["join_key"]` works after a natural/USING join
(where one join key is hidden in `Project.hiddenOutputTag`). But when the same
DataFrame is both used directly in a join and also nested under a
natural/USING-join wrapper elsewhere in the plan, the broadened filter lets
both candidates through `resolveDataFrameColumnByPlanId`'s merge, tripping
`throw ambiguousColumnReferences(u)`.
For example, queries like:
```python
enriched = events.join(dim, "dim_id", "left")  # USING join hides dim's dim_id
result = (
    fact
    .join(dim, fact["fk"] == dim["dim_id"], "left")  # direct use of dim
    .join(enriched, "txn_id", "full_outer")
    .select(dim["dim_id"])  # previously AMBIGUOUS
)
```
now resolve `dim["dim_id"]` to the direct-usage output candidate.
### Does this PR introduce _any_ user-facing change?
Yes, this is a bug fix. Queries that referenced a DataFrame both directly in a
join and nested under a natural/USING join (where the wrapper hides one of the
columns in `metadataOutput`) previously raised `AMBIGUOUS_COLUMN_REFERENCE`.
They now resolve to the direct-usage candidate.
### How was this patch tested?
- New `test_select_regular_column_with_reused_dataframe_hidden_in_natural_join`
added to `ColumnTestsMixin` in `python/pyspark/sql/tests/test_column.py`.
- Existing pyspark column-resolution tests should keep passing, including
`test_self_join`, `test_self_join_II/III/IV`, and `test_select_join_keys`.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.7)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]