ss created SPARK-38109:
--------------------------

             Summary: pyspark DataFrame.replace() is sensitive to column name 
case in pyspark 3.2 but not in 3.1
                 Key: SPARK-38109
                 URL: https://issues.apache.org/jira/browse/SPARK-38109
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.2.1, 3.2.0
            Reporter: ss


The `subset` argument for `DataFrame.replace()` accepts one or more column 
names. In pyspark 3.2 the case of the column names must match the column names 
in the schema exactly, or the replacements will not take place. In earlier 
versions (3.1.2 was tested) the argument is case-insensitive.

Minimal example:

```python
replace_dict = {'wrong': 'right'}
df = spark.createDataFrame(
  [['wrong', 'wrong']], 
  schema=['case_matched', 'case_unmatched']
)
df2 = df.replace(replace_dict, subset=['case_matched', 'Case_Unmatched'])
```
In pyspark 3.2 (tested 3.2.0 and 3.2.1 via pip on Windows, and 3.2.0 on 
Databricks) the result is:

|case_matched|case_unmatched|
|-|-|
|right|wrong|

While in pyspark 3.1 (tested 3.1.2 via pip on Windows and 3.1.2 on Databricks) 
the result is:

|case_matched|case_unmatched|
|-|-|
|right|right|

I believe the expected behaviour is that shown in pyspark 3.1, since in all 
other situations column names are accepted case-insensitively.
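As a workaround on 3.2, the subset names can be normalized against `df.columns` before calling `replace()`. A minimal sketch (the helper name `resolve_subset` is hypothetical, not part of the pyspark API):

```python
def resolve_subset(columns, subset):
    """Map user-supplied subset names onto the DataFrame's actual column
    names, matching case-insensitively. If two schema columns differ only
    by case, the later one wins."""
    lookup = {c.lower(): c for c in columns}
    resolved = []
    for name in subset:
        try:
            resolved.append(lookup[name.lower()])
        except KeyError:
            raise ValueError(f"Column {name!r} not found in schema") from None
    return resolved

# Usage with the DataFrame from the example above:
# df2 = df.replace(replace_dict,
#                  subset=resolve_subset(df.columns,
#                                        ['case_matched', 'Case_Unmatched']))
```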



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
