Hi,

I have a simple ETL application where the data source schema needs to be
sanitized: column names may contain special characters that have to be
removed, for example "some{column}" becomes "some_column".
Normally I'd just alias the columns, but in this case the schema can have
thousands of deeply nested columns. Creating a new StructType feels more
intuitive and simpler, but the only way I know of to apply the new schema is
to create a new DataFrame -
spark.createDataFrame(df.rdd, sanitized_schema). This makes the
deserialization and re-serialization of the DataFrame the most expensive
operation in this "simple" ETL app.

To make things worse, since it's a PySpark application, the RDD is treated as
a Python RDD and all the data moves from the JVM to Python and back, without
any real transformation.
This is resolved by creating the new DataFrame on the JVM side only:

from pyspark.sql import DataFrame

# Parse the sanitized schema on the JVM and rebuild the DataFrame there,
# so no rows ever cross into Python:
jschema = spark._sc._jvm.org.apache.spark.sql.types.DataType.fromJson(sanitized_schema.json())
sanitized_df = DataFrame(spark._jsparkSession.createDataFrame(df._jdf.rdd(), jschema), spark)

Is there another way to do a bulk rename operation? I'd like to avoid
building some uber "select" statement with aliases, or chaining multiple
withColumnRenamed calls, as much as possible, mainly for maintenance reasons.
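
(For reference, the kind of bulk-alias select I mean is something like the
line below; it only covers top-level columns, and the nested structs would
have to be rebuilt on top of it, which is the maintenance burden I'd rather
avoid.)

# Bulk-alias select, top-level columns only (sanitize_name as in the sketch above)
renamed_df = df.select([df[c].alias(sanitize_name(c)) for c in df.columns])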

Thanks