cloud-fan commented on code in PR #55848:
URL: https://github.com/apache/spark/pull/55848#discussion_r3241530942


##########
docs/spark-connect-gotchas.md:
##########
@@ -418,6 +418,46 @@ println(structColumnFields)
 
 This approach is significantly faster when dealing with a large number of 
columns because it avoids creating and analyzing numerous DataFrames.
 
+## 5. DataFrame column references after column shadowing
+
+In Spark Connect, a DataFrame column reference such as `df["c"]` is tagged 
with the plan id of `df`. At analysis time the server resolves the reference by 
looking for the tagged ancestor in the plan and pulling the matching attribute 
from it. Spark Classic does not use plan ids; it resolves column references 
against the immediate child's output by attribute id and name.
+
+The two resolution strategies diverge once a column has been shadowed by 
another operator that produces an attribute with the same name:
+
+```python
+import pyspark.sql.functions as sf
+
+df = spark.sql("SELECT 1 AS c")
+df.withColumn("c", sf.col("c").cast("string")).select(df["c"]).collect()
+```
+
+`withColumn("c", ...)` does not mutate `df`; it returns a new DataFrame whose 
`c` is a new attribute that hides the original. The trailing `df["c"]` still 
refers to the *original* `c` attribute, which is no longer in the projection 
list.
+
+* **Spark Classic** has always rejected this query at analysis time with 
`MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION`, because the 
original attribute is not present in the operator's child output.
+* **Spark Connect** rejects it with `CANNOT_RESOLVE_DATAFRAME_COLUMN` by 
default. The plan-id-tagged reference does not match any attribute in the 
current plan. But when the SQL config 
`spark.sql.analyzer.strictDataFrameColumnResolution` (added in Spark 4.2.0, 
default `true`) is set to `false`, the analyzer still tries plan-id-based 
resolution first, and only when that fails does it fall back to name-based 
resolution: the tagged `df["c"]` is then resolved by name against the projected 
`c` from `withColumn`, and the query succeeds.
+
+### Recommended way
+
+If you hit any of the confusing failures mentioned above, it is recommended to 
switch to `sf.col` first. `sf.col("c")` is an untagged name reference that 
resolves against the most recent projection or `withColumn`, rather than 
`df["c"]` which is a tagged reference to `df`'s original column:

Review Comment:
   `sf.col` is confusing here as we didn't mention `import pyspark.sql.functions as sf` before
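
For readers of this thread, the difference between the tagged `df["c"]` and the untagged `sf.col("c")` described in the quoted section can be illustrated with a toy model in plain Python. This is only a sketch and not Spark's analyzer: the classes `Attr` and `Plan` and the functions `find_plan` and `resolve` are hypothetical, and the `strict` flag merely mimics the behavior the section attributes to `spark.sql.analyzer.strictDataFrameColumnResolution`.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional

_ids = count(1)  # hypothetical source of unique attribute ids


@dataclass(frozen=True)
class Attr:
    name: str
    attr_id: int


@dataclass
class Plan:
    plan_id: int
    output: List[Attr]
    child: Optional["Plan"] = None


def find_plan(plan: Optional[Plan], plan_id: int) -> Optional[Plan]:
    """Walk down the (linear) plan and return the node tagged with plan_id."""
    while plan is not None:
        if plan.plan_id == plan_id:
            return plan
        plan = plan.child
    return None


def resolve(child: Plan, name: str, plan_id: Optional[int] = None,
            strict: bool = True) -> Attr:
    """Resolve a column reference against `child`, the operator's input.

    A reference with a plan_id models df["c"]; one without models sf.col("c").
    """
    if plan_id is not None:
        tagged = find_plan(child, plan_id)
        if tagged is not None:
            for attr in tagged.output:
                # The tagged attribute must still be visible in the input.
                if attr.name == name and attr in child.output:
                    return attr
        if strict:
            raise LookupError("CANNOT_RESOLVE_DATAFRAME_COLUMN")
    # Name-based resolution: used for untagged references, and as the
    # fallback for tagged references when strict is False.
    for attr in child.output:
        if attr.name == name:
            return attr
    raise LookupError(f"cannot resolve {name}")


# Models: df = spark.sql("SELECT 1 AS c")
df = Plan(plan_id=1, output=[Attr("c", next(_ids))])
# Models: df.withColumn("c", ...), which emits a new attribute also named c
shadowed = Plan(plan_id=2, output=[Attr("c", next(_ids))], child=df)

untagged = resolve(shadowed, "c")                          # like sf.col("c")
relaxed = resolve(shadowed, "c", plan_id=1, strict=False)  # falls back to name
try:
    resolve(shadowed, "c", plan_id=1)                      # like df["c"], strict
    strict_error = None
except LookupError as exc:
    strict_error = str(exc)
```

In this model the untagged reference and the non-strict tagged reference both land on the shadowing attribute, while the strict tagged reference fails because the original attribute is no longer in the input's output.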



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

