Shay Elbaz <shay.el...@gm.com> wrote on Sun, 14 Aug 2022 at 09:22:
> Hi,
>
> I have a simple ETL application where the data source schema needs to be
> sanitized. Column names might include special characters that need to be
> removed, for example from "some{column}" to "some_column".
> Normally I'd just alias the columns, but in this case the schema can have
> thousands of deeply nested columns. Creating a new StructType feels more
> intuitive and simpler, but the only way I know of to apply the new schema
> is to create a new dataframe:
> *spark.createDataFrame(old_df.rdd, new_schema)*. This makes the
> deserialization and re-serialization of the dataframe the most expensive
> operation in that "simple" ETL app.
>
> *To make things worse,* since it's a PySpark application, the RDD is
> treated as a Python RDD, so all the data moves from the JVM to Python and
> back without any real transformation.
> This can be avoided by creating the new DF on the JVM side only:
>
> jschema = spark._sc._jvm.org.apache.spark.sql.types.DataType.fromJson(sanitized_schema.json())
> sanitized_df = DataFrame(spark._jsparkSession.createDataFrame(df._jdf.rdd(), jschema), spark)
>
> Is there another way to do a bulk rename operation? I'd like to avoid
> creating some uber "select" statement with aliases, or multiple
> withColumnRenamed operations, as much as possible, mainly for maintenance
> reasons.
>
> Thanks
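One possible way to do the bulk rename without the RDD round trip is sketched below, under a few assumptions: the helpers sanitize_name and sanitize_type are hypothetical (not Spark APIs), the replacement rule is "anything that is not a letter, digit, or underscore becomes _", and the nested renames rely on Spark casting struct fields by position, so the field names in the target StructType are free to differ from the source. Top-level column names containing dots would additionally need backtick quoting when passed to col().

    import re

    from pyspark.sql.functions import col
    from pyspark.sql.types import ArrayType, DataType, MapType, StructField, StructType


    def sanitize_name(name: str) -> str:
        # Replace every character that is not a letter, digit, or underscore.
        return re.sub(r"[^0-9A-Za-z_]", "_", name)


    def sanitize_type(dtype: DataType) -> DataType:
        # Recurse into structs, arrays, and maps so deeply nested fields are covered.
        if isinstance(dtype, StructType):
            return StructType([
                StructField(sanitize_name(f.name), sanitize_type(f.dataType), f.nullable, f.metadata)
                for f in dtype.fields
            ])
        if isinstance(dtype, ArrayType):
            return ArrayType(sanitize_type(dtype.elementType), dtype.containsNull)
        if isinstance(dtype, MapType):
            return MapType(sanitize_type(dtype.keyType), sanitize_type(dtype.valueType),
                           dtype.valueContainsNull)
        return dtype


    # Rename top-level columns with alias() and rename nested fields by casting each
    # column to its sanitized struct/array/map type; no RDD round trip is involved.
    sanitized_df = df.select([
        col(f.name).cast(sanitize_type(f.dataType)).alias(sanitize_name(f.name))
        for f in df.schema.fields
    ])

This still goes through a single generated select, but it is derived mechanically from df.schema rather than written and maintained by hand, and the data never leaves the JVM as an RDD.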