[PR] [TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution [spark]

via GitHub Sun, 17 May 2026 21:05:10 -0700


zhengruifeng opened a new pull request, #55947:
URL: https://github.com/apache/spark/pull/55947


   ### What changes were proposed in this pull request?
   
   Add Connect-only test coverage that pins known Connect/Classic divergences 
in DataFrame column resolution, so that future tightening cannot silently break 
patterns that customer workflows depend on.
   
   Two additions:
   
   **`python/pyspark/sql/tests/connect/test_connect_column.py`** — 14 focused 
parity tests, each exercising both `strict=true` and `strict=false` modes of 
`spark.sql.analyzer.strictDataFrameColumnResolution`:
   
   - Shadowing: `test_resolve_after_chained_withcolumn_shadow`, 
`test_resolve_after_select_alias_shadow`, 
`test_resolve_after_withcolumnrenamed`, `test_resolve_after_drop`
   - Pass-through operators: `test_resolve_through_filter`, 
`test_resolve_through_sort`, `test_resolve_through_distinct`
   - Aggregation / pivot: `test_resolve_after_groupby_count`, 
`test_resolve_after_agg_alias_shadow`, `test_resolve_after_pivot`
   - Set operations: `test_resolve_after_union`, `test_resolve_after_intersect`
   - Self-join: `test_resolve_self_join_alias`
   - Subquery / temp view: `test_resolve_after_subquery_view`
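
As an illustration of the "both modes" pattern described above, here is a minimal sketch of running one test body under `strict=true` and `strict=false` and restoring the previous setting afterwards. It is not taken from the PR: `StubConf`, `conf_override`, and `BothModesExample` are hypothetical names, and the stub only assumes a config object with `get`/`set`/`unset` methods so that the sketch runs without a Spark session. Only the config key comes from the description.

```python
import unittest
from contextlib import contextmanager

# Config key named in the PR description.
STRICT_KEY = "spark.sql.analyzer.strictDataFrameColumnResolution"


class StubConf:
    """Hypothetical stand-in for a session config (get/set/unset only)."""

    def __init__(self):
        self._values = {}

    def get(self, key, default=None):
        return self._values.get(key, default)

    def set(self, key, value):
        self._values[key] = value

    def unset(self, key):
        self._values.pop(key, None)


@contextmanager
def conf_override(conf, key, value):
    """Set `key` to `value` inside the block, then restore the old state."""
    previous = conf.get(key, None)
    conf.set(key, value)
    try:
        yield
    finally:
        if previous is None:
            conf.unset(key)  # key was not set before; remove it again
        else:
            conf.set(key, previous)


class BothModesExample(unittest.TestCase):
    def setUp(self):
        self.conf = StubConf()
        self.seen = []

    def test_runs_in_both_modes(self):
        for strict in ("true", "false"):
            with self.subTest(strict=strict), conf_override(
                self.conf, STRICT_KEY, strict
            ):
                # A real test would build a DataFrame and assert on
                # column resolution here; the sketch only records the mode.
                self.seen.append(self.conf.get(STRICT_KEY))
        self.assertEqual(self.seen, ["true", "false"])
        # The override must not leak out of the test.
        self.assertIsNone(self.conf.get(STRICT_KEY))
```

Using `subTest` keeps the two modes in one test method while still reporting a failure per mode, which matches the one-test-per-pattern layout listed above.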
   
   **`python/pyspark/sql/tests/connect/test_parity_dataframe.py`** — 3 
mixed-surface layered programs (4-6 chained operators combining filters, joins, 
aggregations, set ops, window functions, UDFs and temporary views) running 
under non-strict mode, to catch interactions between analyzer rules that 
single-operator tests would miss:
   
   - `test_layered_filter_join_agg_shadow`
   - `test_layered_temp_view_subquery_udf`
   - `test_layered_union_window_pivot_shadow`
   
   Tests are intentionally placed in Connect-only suites (per the postmortem 
follow-up): keeping them out of Classic-shared mixins prevents them from being 
removed as "diverging from Classic" during routine cleanup.
   
   ### Why are the changes needed?
   
   The ES-1853063 postmortem (Spark Connect DataFrame column resolution 
regression on DBR 18.0 -> 18.1) called out two systemic gaps:
   
   1. No declared Connect-specific contract for the lenient `df["col"]` -> 
`col("col")` fallback used by customer workflows.
   2. No fleet observability for Connect-only analysis paths.
   
   apache/spark#55531 added the `strictDataFrameColumnResolution` config and 
one shadowing test. This PR widens the coverage to other divergence patterns 
(shadowing variants, set ops, aggregation, pivot, self-join, subquery-as-table) 
and adds layered mixed-surface regression programs as requested in the JIRA 
follow-up, so that any tightening of Connect's column resolution surfaces a clear 
test failure rather than a silent customer-visible regression.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. Test-only change.
   
   ### How was this patch tested?
   
   The tests themselves are the change. The assertions follow the contract 
documented in apache/spark#55531 and `docs/spark-connect-gotchas.md` (added in 
apache/spark#55756). Assertions for less-common patterns (set ops, pivot, 
subquery_view) are best-effort predictions of current behavior and may need 
adjustment based on CI results.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Anthropic), claude-opus-4-7


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


