Rafal Wojdyla created SPARK-38904:
-------------------------------------

             Summary: Low cost DataFrame schema swap util
                 Key: SPARK-38904
                 URL: https://issues.apache.org/jira/browse/SPARK-38904
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.2.1
            Reporter: Rafal Wojdyla
This question is related to [https://stackoverflow.com/a/37090151/1661491]. Let's assume I have a pyspark DataFrame with a certain schema, and I would like to overwrite that schema with a new schema that I *know* is compatible. I could do:

{code:python}
df: DataFrame
new_schema = ...
df.rdd.toDF(schema=new_schema)
{code}

Unfortunately this triggers computation, as described in the link above. Is there a way to do this at the metadata level (or lazily), without eagerly triggering computation or conversions?

Edit, notes:
 * the schema can be arbitrarily complicated (nested etc.)
 * the new schema includes updates to description, nullability and additional metadata (bonus points for updates to the type)
 * I would like to avoid writing a custom query expression generator, *unless* there's one already built into Spark that can generate a query based on the schema/{{StructType}}

Copied from: [https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]

See POC of workaround/util in https://github.com/ravwojdyla/spark-schema-utils

Also posted in [https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
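One lazy workaround along the lines the issue asks for can be sketched with plain {{select}}/{{alias}} expressions, which stay at the query-plan level and avoid the eager {{df.rdd.toDF(...)}} round-trip. This is a minimal sketch, not the POC from the linked repo: {{apply_schema_lazily}} is a hypothetical helper name, it only handles top-level fields (nested structs would need recursion), and it cannot change nullability, which Spark derives itself.

{code:python}
# Hedged sketch: lazily apply per-field metadata (and, as a bonus, type
# casts) from a new schema via select/alias, without touching the RDD.
# "apply_schema_lazily" is a made-up name, not a Spark API.
from pyspark.sql import DataFrame, functions as F
from pyspark.sql.types import StructType


def apply_schema_lazily(df: DataFrame, new_schema: StructType) -> DataFrame:
    cols = []
    for field in new_schema.fields:
        col = F.col(field.name)
        # Cast only when the target type differs from the current one.
        if df.schema[field.name].dataType != field.dataType:
            col = col.cast(field.dataType)
        # alias(..., metadata=...) attaches the field metadata lazily.
        cols.append(col.alias(field.name, metadata=field.metadata))
    return df.select(cols)
{code}

Because this builds an ordinary projection, it composes with the rest of the query plan and triggers no job until an action runs; it does not cover the nullability updates mentioned above.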