zhengruifeng opened a new pull request, #55848:
URL: https://github.com/apache/spark/pull/55848

   ### What changes were proposed in this pull request?
   
   Add a new gotcha section to `docs/spark-connect-gotchas.md` describing how 
Spark Connect resolves DataFrame column references (`df["col"]`) via plan-id 
tagging, and how this diverges from Spark Classic once a column has been 
shadowed by `withColumn` or `select`.
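
As a rough mental model of the plan-id tagging described above, the following toy sketch may help. It is not Spark's analyzer code: `Attr`, `Plan`, `resolve`, and the `strict` flag are hypothetical names invented for illustration, and the real resolution rules are more involved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Attr:
    """A column with a unique expression id, so same-named columns stay distinct."""
    name: str
    expr_id: int


@dataclass
class Plan:
    """A toy plan node: its plan id, its output attributes, and an optional child."""
    plan_id: int
    output: List[Attr]
    child: Optional["Plan"] = None


def find_plan(plan: Optional[Plan], plan_id: int) -> Optional[Plan]:
    """Walk down the child chain looking for the node tagged with plan_id."""
    while plan is not None:
        if plan.plan_id == plan_id:
            return plan
        plan = plan.child
    return None


def by_name(plan: Plan, name: str) -> Optional[Attr]:
    """Plain name-based lookup in a node's output."""
    return next((a for a in plan.output if a.name == name), None)


def resolve(current: Plan, name: str, tagged_plan_id: int, strict: bool = True) -> Attr:
    """Resolve a plan-id-tagged reference such as df["col"] against `current`."""
    tagged = find_plan(current, tagged_plan_id)
    attr = by_name(tagged, name) if tagged is not None else None
    if attr is not None and attr in current.output:
        return attr  # the tagged column is still visible: normal case
    if strict:
        raise LookupError(f"cannot resolve DataFrame column {name!r}")
    # Lenient mode (assumed simplification): ignore the tag, resolve by name only.
    fallback = by_name(current, name)
    if fallback is None:
        raise LookupError(f"cannot resolve DataFrame column {name!r}")
    return fallback


# df has plan id 1; withColumn("col", ...) yields a new node whose "col" is a new attribute.
df = Plan(plan_id=1, output=[Attr("col", 100)])
shadowed = Plan(plan_id=2, output=[Attr("col", 200)], child=df)
```

In this model, `resolve(shadowed, "col", 1)` raises because the attribute tagged with `df`'s plan id is no longer in the output, while `resolve(shadowed, "col", 1, strict=False)` silently returns the new, shadowing attribute.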
   
   The section covers:
   
   - Why `df.withColumn("col", ...).select(df["col"])` fails on both Spark 
Classic (`MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION`) and Spark 
Connect (`CANNOT_RESOLVE_DATAFRAME_COLUMN`).
   - Why users may have observed this query succeeding on older Spark Connect 
builds (a lenient name-based fallback applied when plan-id resolution did not 
match a tagged ancestor).
   - The recommended fix: use an untagged `F.col("col")` reference after column 
shadowing.
   - The opt-in escape hatch: 
`spark.sql.analyzer.strictDataFrameColumnResolution=false` (introduced in 
SPARK-56614 / apache/spark#55531) to re-enable the lenient fallback.
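
The recommended fix and the escape hatch listed above might look roughly like this in PySpark. This is a sketch under assumptions: the helper names are hypothetical, the column `col` is assumed to be numeric, and the config is assumed to be settable on a live session at runtime.

```python
# Config key quoted from the PR description.
STRICT_RESOLUTION_CONF = "spark.sql.analyzer.strictDataFrameColumnResolution"


def select_shadowed_column(df):
    """Shadow "col" with withColumn, then select it through an untagged reference."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.sql import functions as F

    shadowed = df.withColumn("col", F.col("col") + 1)
    # shadowed.select(df["col"]) would carry df's plan id and point at the
    # pre-shadowing column, which is the failing pattern described above.
    # F.col("col") carries no plan id, so it is resolved by name against `shadowed`.
    return shadowed.select(F.col("col"))


def enable_lenient_fallback(spark):
    """Escape hatch: turn strict resolution off to re-enable the name-based fallback."""
    spark.conf.set(STRICT_RESOLUTION_CONF, "false")
```

Given a session `spark` and a DataFrame with a numeric `col` column, `select_shadowed_column(df)` returns the shadowed column without touching the config. `enable_lenient_fallback(spark)` is only needed when existing code cannot be rewritten to use untagged references.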
   
   Also adds a "DataFrame column references" row to the summary table at the 
end of the document.
   
   ### Why are the changes needed?
   
   The plan-id-based column resolution path is a Spark Connect-specific 
contract that is not documented anywhere user-facing. Users migrating workloads 
to Spark Connect have encountered surprises when patterns that previously 
"worked" stop resolving, with an error class 
(`CANNOT_RESOLVE_DATAFRAME_COLUMN`) and a config 
(`strictDataFrameColumnResolution`) whose connection to their code is not 
obvious. This adds explicit guidance and a code-level mitigation alongside the 
other Connect-vs-Classic gotchas already documented in this file.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. Documentation-only change.
   
   ### How was this patch tested?
   
   Documentation-only change; no automated tests. Verified the markdown renders 
correctly and is consistent with the existing four-gotcha layout in 
`docs/spark-connect-gotchas.md`.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Anthropic), claude-opus-4-7


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

