[PR] [SPARK-XXXXX][SQL][CONNECT] Prefer output match over hidden-column match in DataFrame column resolution [spark]

via GitHub Sun, 26 Apr 2026 22:52:45 -0700


zhengruifeng opened a new pull request, #55556:
URL: https://github.com/apache/spark/pull/55556


   ### What changes were proposed in this pull request?
   
   Replace the single broadened ancestor walk in `resolveDataFrameColumn` with 
a two-walk pattern, mirroring the `outputAttributes.resolve(...) orElse 
outputMetadataAttributes.resolve(...)` precedence used by `LogicalPlan.resolve` 
/ `LogicalPlan.resolveChildren`:
   
   - **Metadata access** (`df["_metadata"]`, `IS_METADATA_COL` tagged): a 
single walk filtered by `p.metadataOutput`.
   - **Regular access** (`df["col"]`): walk first with the strict filter 
`p.outputSet` (the pre-`a84a39a` behavior). The strict filter drops candidates 
that are hidden at an ancestor, e.g. the right side's join key after a 
natural/USING join. If the strict walk resolves, use its result. Otherwise, 
retry with the broad filter `p.output ++ p.metadataOutput` to handle the 
SPARK-55070 `rhs["join_key"]` case, where the only valid resolution is via 
`p.metadataOutput`.
   
   The filter choice is threaded as a `getAllowed: LogicalPlan => AttributeSet` 
argument through `resolveDataFrameColumnByPlanId` / 
`resolveDataFrameColumnRecursively`; no change to the `foldLeft` merge logic.
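
The strict-then-broad order can be illustrated with a toy model. The sketch below is not Spark's Scala implementation: `Plan`, `walk`, `resolve`, the attribute strings, and the rule that any two surviving candidates are ambiguous are all hypothetical simplifications, chosen only to show how a `get_allowed` filter changes which candidates survive an ancestor walk.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List, Optional


@dataclass
class Plan:
    """Toy plan node (hypothetical); attributes are strings like 'dim_id#1'."""
    name: str
    output: FrozenSet[str]
    metadata_output: FrozenSet[str] = frozenset()
    children: List["Plan"] = field(default_factory=list)
    plan_id: Optional[int] = None


class AmbiguousColumn(Exception):
    """Stand-in for the ambiguous-column error in this toy model."""


Allowed = Callable[[Plan], FrozenSet[str]]


def strict(p: Plan) -> FrozenSet[str]:
    return p.output                      # analogue of p.outputSet


def broad(p: Plan) -> FrozenSet[str]:
    return p.output | p.metadata_output  # analogue of p.output ++ p.metadataOutput


def walk(p: Plan, plan_id: int, col: str, get_allowed: Allowed) -> Optional[str]:
    if p.plan_id == plan_id:
        # Resolve at the tagged node by matching the name before '#'.
        found = next((a for a in sorted(p.output) if a.split("#")[0] == col), None)
    else:
        found = None
        for child in p.children:
            hit = walk(child, plan_id, col, get_allowed)
            if hit is not None and found is not None:
                raise AmbiguousColumn(col)  # simplified merge rule (assumption)
            found = hit if found is None else found
    # Each ancestor keeps the candidate only if its filter still exposes it.
    return found if found is not None and found in get_allowed(p) else None


def resolve(root: Plan, plan_id: int, col: str) -> Optional[str]:
    hit = walk(root, plan_id, col, strict)  # first walk: output only
    return hit if hit is not None else walk(root, plan_id, col, broad)


DIM = 1  # plan id shared by both uses of the same DataFrame
dim_direct = Plan("dim", frozenset({"dim_id#1"}), plan_id=DIM)
dim_nested = Plan("dim", frozenset({"dim_id#2"}), plan_id=DIM)
fact = Plan("fact", frozenset({"fk#3", "txn_id#4"}))
events = Plan("events", frozenset({"txn_id#5", "dim_id#6"}))

# USING join on dim_id: dim's key is hidden in metadata_output.
enriched = Plan("using_join", frozenset({"txn_id#5", "dim_id#6"}),
                frozenset({"dim_id#2"}), [events, dim_nested])
direct = Plan("join", frozenset({"fk#3", "txn_id#4", "dim_id#1"}),
              children=[fact, dim_direct])
top = Plan("using_join", frozenset({"fk#3", "txn_id#7", "dim_id#1", "dim_id#6"}),
           frozenset({"txn_id#4", "txn_id#5", "dim_id#2"}), [direct, enriched])
```

In this model, `walk(top, DIM, "dim_id", broad)` raises `AmbiguousColumn` because both uses of `dim` survive, while `resolve(top, DIM, "dim_id")` returns the direct-usage attribute `dim_id#1`. On `enriched` alone the strict walk finds nothing, so `resolve` falls back to the broad walk and returns the hidden `dim_id#2`.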
   
   ### Why are the changes needed?
   
   Follow-up fix to `a84a39a` (SPARK-55070). That commit broadened the ancestor 
filter in `resolveDataFrameColumnRecursively` from `p.outputSet` to `p.output 
++ p.metadataOutput` so that `rhs["join_key"]` works after a natural/USING join 
(where one join key is hidden in `Project.hiddenOutputTag`). But when the same 
DataFrame is both used directly in a join and also nested under a 
natural/USING-join wrapper elsewhere in the plan, the broadened filter lets 
both candidates through `resolveDataFrameColumnByPlanId`'s merge, tripping 
`throw ambiguousColumnReferences(u)`.
   
   For example, queries like:
   
   ```python
   enriched = events.join(dim, "dim_id", "left")   # USING join hides dim's dim_id
   result = (fact
     .join(dim, fact["fk"] == dim["dim_id"], "left")  # direct use of dim
     .join(enriched, "txn_id", "full_outer")
     .select(dim["dim_id"]))                          # previously AMBIGUOUS
   ```
   
   now resolve `dim["dim_id"]` to the direct-usage output candidate.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, this is a bug fix. Queries that referenced a DataFrame both directly 
in a join and nested under a natural/USING join (where the wrapper moves one of 
the columns into `metadataOutput`) previously raised 
`AMBIGUOUS_COLUMN_REFERENCE`. They now resolve to the direct-usage candidate.
   
   ### How was this patch tested?
   
   - New 
`test_select_regular_column_with_reused_dataframe_hidden_in_natural_join` added 
to `ColumnTestsMixin` in `python/pyspark/sql/tests/test_column.py`.
   - Existing pyspark column-resolution tests should keep passing, including 
`test_self_join`, `test_self_join_II/III/IV`, and `test_select_join_keys`.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Opus 4.7)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


