zhengruifeng opened a new pull request, #55947:
URL: https://github.com/apache/spark/pull/55947
### What changes were proposed in this pull request?
Add Connect-only test coverage that pins known Connect/Classic divergences
in DataFrame column resolution, so that future tightening of the analyzer
cannot silently break patterns that customer workflows depend on.
Two additions:
**`python/pyspark/sql/tests/connect/test_connect_column.py`** — 14 focused
parity tests, each exercising both `strict=true` and `strict=false` modes of
`spark.sql.analyzer.strictDataFrameColumnResolution`:
- Shadowing: `test_resolve_after_chained_withcolumn_shadow`,
`test_resolve_after_select_alias_shadow`,
`test_resolve_after_withcolumnrenamed`, `test_resolve_after_drop`
- Pass-through operators: `test_resolve_through_filter`,
`test_resolve_through_sort`, `test_resolve_through_distinct`
- Aggregation / pivot: `test_resolve_after_groupby_count`,
`test_resolve_after_agg_alias_shadow`, `test_resolve_after_pivot`
- Set operations: `test_resolve_after_union`, `test_resolve_after_intersect`
- Self-join: `test_resolve_self_join_alias`
- Subquery / temp view: `test_resolve_after_subquery_view`
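As a rough illustration of "exercising both modes", the sketch below runs one test body under each value of the flag and restores the previous setting afterwards. It is not the helper used in the PR: `FakeConf`, `strict_mode` and `run_in_both_modes` are hypothetical names, and `FakeConf` is only a dict-backed stand-in with `get`/`set`/`unset` methods so the example runs without a Spark session. In a real suite the conf object would presumably be `spark.conf`, or the suite's own conf-override utility.

```python
from contextlib import contextmanager

# Config key taken from the PR description.
STRICT_KEY = "spark.sql.analyzer.strictDataFrameColumnResolution"


class FakeConf:
    """Hypothetical stand-in for a session conf, so the sketch runs without Spark."""

    def __init__(self):
        self._values = {}

    def get(self, key, default=None):
        return self._values.get(key, default)

    def set(self, key, value):
        self._values[key] = value

    def unset(self, key):
        self._values.pop(key, None)


@contextmanager
def strict_mode(conf, strict):
    """Temporarily set the strict-resolution flag; restore the old value on exit."""
    previous = conf.get(STRICT_KEY, None)
    conf.set(STRICT_KEY, "true" if strict else "false")
    try:
        yield
    finally:
        if previous is None:
            conf.unset(STRICT_KEY)
        else:
            conf.set(STRICT_KEY, previous)


def run_in_both_modes(conf, body):
    """Call body(strict) once per mode and collect the results keyed by mode."""
    results = {}
    for strict in (True, False):
        with strict_mode(conf, strict):
            results[strict] = body(strict)
    return results


conf = FakeConf()
# The body here only reads the flag back; a real test would build a DataFrame
# and assert on the resolved column instead.
seen = run_in_both_modes(conf, lambda strict: conf.get(STRICT_KEY))
```

Restoring the previous value in a `finally` block keeps one test's mode from leaking into the next, which matters when the same session is shared across a suite.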
**`python/pyspark/sql/tests/connect/test_parity_dataframe.py`** — 3
mixed-surface layered programs (4-6 chained operators combining filters, joins,
aggregations, set ops, window functions, UDFs and temporary views) running
under non-strict mode, to catch interactions between analyzer rules that
single-operator tests would miss:
- `test_layered_filter_join_agg_shadow`
- `test_layered_temp_view_subquery_udf`
- `test_layered_union_window_pivot_shadow`
Tests are intentionally placed in Connect-only suites (per the postmortem
follow-up): keeping them out of Classic-shared mixins prevents them from being
removed as "diverging from Classic" during routine cleanup.
### Why are the changes needed?
The ES-1853063 postmortem (Spark Connect DataFrame column resolution
regression on DBR 18.0 -> 18.1) called out two systemic gaps:
1. No declared Connect-specific contract for the lenient `df["col"]` ->
`col("col")` fallback used by customer workflows.
2. No fleet observability for Connect-only analysis paths.
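To make the first gap concrete, here is a toy model of one plausible reading of the lenient fallback: a reference created as `df["col"]` remembers the DataFrame it came from, and when that origin cannot be found in the plan being analyzed, non-strict mode retries the lookup by name as if the user had written `col("col")`. This is not Spark's analyzer. `Plan`, `ColumnRef` and `resolve` are hypothetical, and the exact conditions under which Connect falls back are an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Plan:
    """Toy plan node: an id, the column names it outputs, and an optional child."""
    plan_id: int
    output: Tuple[str, ...]
    child: Optional["Plan"] = None


@dataclass(frozen=True)
class ColumnRef:
    """Toy column reference; origin_id is set for df["col"], None for col("col")."""
    name: str
    origin_id: Optional[int] = None


def find_plan(plan, plan_id):
    """Walk down the single-child lineage looking for a node with this id."""
    while plan is not None:
        if plan.plan_id == plan_id:
            return plan
        plan = plan.child
    return None


def resolve(ref, plan, strict):
    """Return how the reference resolved, or raise LookupError if it cannot."""
    if ref.origin_id is not None:
        origin = find_plan(plan, ref.origin_id)
        if origin is not None and ref.name in origin.output:
            return "by-origin"
        if strict:
            # Assumed strict behavior: no fallback once the origin lookup fails.
            raise LookupError(f"cannot resolve {ref.name!r} against its origin")
    # Lenient fallback (or a plain col("col")): look the name up in the output.
    if ref.name in plan.output:
        return "by-name"
    raise LookupError(f"column {ref.name!r} not found")


base = Plan(plan_id=1, output=("id", "v"))
current = Plan(plan_id=2, output=("id", "v"), child=base)
foreign_ref = ColumnRef("v", origin_id=9)  # taken from an unrelated DataFrame
```

In this model `resolve(foreign_ref, current, strict=False)` succeeds by name while the strict call raises, which is the kind of divergence the parity tests are meant to pin.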
apache/spark#55531 added the `strictDataFrameColumnResolution` config and
one shadowing test. This PR widens the coverage to other divergence patterns
(shadowing variants, set ops, aggregation, pivot, self-join, subquery-as-table)
and adds layered mixed-surface regression programs, as requested in the JIRA
follow-up, so that any tightening of Connect's column resolution surfaces as a
clear test failure rather than a silent customer-visible regression.
### Does this PR introduce _any_ user-facing change?
No. Test-only change.
### How was this patch tested?
The tests themselves are the change. The assertions follow the contract
documented in apache/spark#55531 and `docs/spark-connect-gotchas.md` (added in
apache/spark#55756). Assertions for the less common patterns (set operations,
pivot, subquery / temp view) are best-effort predictions of current behavior
and may need adjustment once CI results are available.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic), claude-opus-4-7
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]