Ying Wang created SPARK-26240:
---------------------------------

             Summary: [pyspark] Renaming illegal column names with withColumnRenamed does not update the schema, causing pyspark.sql.utils.AnalysisException
                 Key: SPARK-26240
                 URL: https://issues.apache.org/jira/browse/SPARK-26240
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.2.1
         Environment: Ubuntu 16.04 LTS (x86_64/deb)
            Reporter: Ying Wang


I am unfamiliar with the internals of Spark, but I tried to ingest a Parquet file with illegal column headers. When I called df = df.withColumnRenamed($COLUMN_NAME, $NEW_COLUMN_NAME) and then called df.show(), pyspark errored out, with the failing attribute in the error message being the old column name.

Steps to reproduce:

- Create a Parquet file from pandas using this dataframe schema:

```python
In [10]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 16 columns):
Record_ID            1000 non-null int64
registration_dttm    1000 non-null object
id                   1000 non-null int64
first_name           984 non-null object
last_name            1000 non-null object
email                984 non-null object
gender               933 non-null object
ip_address           1000 non-null object
cc                   709 non-null float64
country              1000 non-null object
birthdate            803 non-null object
salary               932 non-null float64
title                803 non-null object
comments             179 non-null object
Unnamed: 14          10 non-null object
Unnamed: 15          9 non-null object
dtypes: float64(2), int64(2), object(12)
memory usage: 132.8+ KB
```

- Open a pyspark shell with `pyspark` and read in the Parquet file with `spark.read.format('parquet').load('/path/to/file.parquet')`.
- Call `spark_df.show()` and note the error with column 'Unnamed: 14'.
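For context, the character rule Spark enforces can be sketched in plain Python (this mirrors the rule quoted in the AnalysisException, not Spark's actual implementation). Note that the space in 'Unnamed: 14' is the offending character; the renamed 'Unnamed:_14' would itself pass this check, which suggests the exception keeps firing because the underlying Parquet scan still resolves the original attribute name beneath the rename projection:

```python
# Characters Spark rejects in Parquet attribute names, per the
# AnalysisException message (a sketch, not Spark's internal code).
INVALID_CHARS = set(" ,;{}()\n\t=")

def has_invalid_chars(name: str) -> bool:
    """Return True if a column name contains a character Spark would reject."""
    return any(ch in INVALID_CHARS for ch in name)

print(has_invalid_chars("Unnamed: 14"))   # True -- the space is invalid
print(has_invalid_chars("Unnamed:_14"))   # False -- ':' is not in the set
```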
- Rename the column, replacing the illegal space character with an underscore: `spark_df = spark_df.withColumnRenamed('Unnamed: 14', 'Unnamed:_14')`.
- Call `spark_df.show()` again, and note that the error message still references attribute 'Unnamed: 14':

```python
>>> df = spark.read.parquet('/home/yingw787/Downloads/userdata1.parquet')
>>> newdf = df.withColumnRenamed('Unnamed: 14', 'Unnamed:_14')
>>> newdf.show()
Traceback (most recent call last):
  File "/home/yingw787/anaconda2/envs/scratch/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/yingw787/anaconda2/envs/scratch/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o32.showString.
: org.apache.spark.sql.AnalysisException: Attribute name "Unnamed: 14" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
...
```

I would have expected there to be a way to read in Parquet files such that illegal column names can be changed after the Spark dataframe has been generated, so this looks like unintended behavior. Please let me know if I am wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
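One possible workaround (an illustrative sketch, not a fix for the Spark behavior above): sanitize the column names on the pandas side, before the Parquet file is ever written, so Spark never sees an illegal attribute name. `sanitize_column_name` below is a hypothetical helper; in practice it could be applied via `df.rename(columns=sanitize_column_name)` before calling `df.to_parquet(...)`:

```python
import re

# Characters rejected by Spark's Parquet support, per the
# AnalysisException quoted in the report above.
INVALID_CHARS_PATTERN = re.compile(r"[ ,;{}()\n\t=]")

def sanitize_column_name(name: str) -> str:
    """Replace every Spark-invalid character with an underscore
    (hypothetical helper, not part of pandas or pyspark)."""
    return INVALID_CHARS_PATTERN.sub("_", name)

print(sanitize_column_name("Unnamed: 14"))  # -> 'Unnamed:_14'
print(sanitize_column_name("Record_ID"))    # unchanged
```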