Shay Elbaz <shay.el...@gm.com> wrote on Sun, 14 Aug 2022 at 09:22:

> Hi,
>
> I have a simple ETL application, where the data source schema needs to be
> sanitized. Column names might include special characters that need to be
> removed. For example, from "some{column}" to "some_column".
> Normally I'd just alias the columns, but in this case the schema can have
> thousands of deeply nested columns. Creating a new StructType feels more
> intuitive and simpler, but the only way I know of to apply the new schema
> is to create a new dataframe -
> *spark.createDataFrame(old_df.rdd, new_schema)*. This makes the
> deserialization and re-serialization of the dataframe the most expensive
> operation in that "simple" ETL app.
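>
> For concreteness, building the sanitized StructType means walking the type
> tree recursively, roughly like this (an illustrative sketch; the helper names
> and the exact replacement rule are just placeholders):
>
> import re
> from pyspark.sql.types import ArrayType, MapType, StructField, StructType
>
> def sanitize_name(name):
>     # replace runs of characters that are not alphanumeric or underscore
>     return re.sub(r"[^0-9A-Za-z_]+", "_", name).strip("_")
>
> def sanitize_schema(dt):
>     # rename every struct field, however deeply nested
>     if isinstance(dt, StructType):
>         return StructType([
>             StructField(sanitize_name(f.name), sanitize_schema(f.dataType),
>                         f.nullable, f.metadata)
>             for f in dt.fields
>         ])
>     if isinstance(dt, ArrayType):
>         return ArrayType(sanitize_schema(dt.elementType), dt.containsNull)
>     if isinstance(dt, MapType):
>         return MapType(sanitize_schema(dt.keyType),
>                        sanitize_schema(dt.valueType), dt.valueContainsNull)
>     return dt
>
> sanitized_schema = sanitize_schema(df.schema)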
>
> *To make things worse,* since it's a PySpark application, the RDD is
> treated as a Python RDD and all the data moves from the JVM to Python and
> back without any real transformation.
> This is resolved by creating the new DataFrame on the JVM side only:
>
> from pyspark.sql import DataFrame
>
> # convert the sanitized Python schema to a JVM StructType via its JSON form
> jschema = spark._sc._jvm.org.apache.spark.sql.types.DataType.fromJson(sanitized_schema.json())
> # build the new DataFrame entirely on the JVM, so rows never cross into Python
> sanitized_df = DataFrame(spark._jsparkSession.createDataFrame(df._jdf.rdd(), jschema), spark)
>
> Is there another way to do a bulk rename operation? I'd like to avoid
> creating some uber "select" statement with aliases, or multiple
> withColumnRenamed operations, as much as possible, mainly for maintenance
> reasons.
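>
> For the top-level columns alone, that "uber select" would look roughly like
> this (illustrative only, reusing the sanitize_name helper sketched above):
>
> from pyspark.sql import functions as F
>
> renamed = df.select([F.col(c).alias(sanitize_name(c)) for c in df.columns])
>
> Every nested struct would additionally have to be rebuilt field by field
> inside that select, which is exactly the maintenance burden I want to avoid.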
>
> Thanks
>
