[ https://issues.apache.org/jira/browse/SPARK-38904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523636#comment-17523636 ]
Rafal Wojdyla commented on SPARK-38904:
---------------------------------------

[~hyukjin.kwon] ok, I'll give it a shot and ping you if I get stuck. If you have any immediate tips in the meantime, I'd appreciate them.

> Low cost DataFrame schema swap util
> -----------------------------------
>
>                 Key: SPARK-38904
>                 URL: https://issues.apache.org/jira/browse/SPARK-38904
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: Rafal Wojdyla
>            Priority: Major
>
> This question is related to [https://stackoverflow.com/a/37090151/1661491].
> Let's assume I have a PySpark DataFrame with a certain schema, and I would
> like to overwrite that schema with a new schema that I *know* is compatible.
> I could do:
> {code:python}
> df: DataFrame
> new_schema = ...
> df.rdd.toDF(schema=new_schema)
> {code}
> Unfortunately this triggers computation, as described in the link above. Is
> there a way to do this at the metadata level (or lazily), without eagerly
> triggering computation or conversions?
> Edit, note:
> * the schema can be arbitrarily complicated (nested etc.)
> * the new schema includes updates to descriptions, nullability and
> additional metadata (bonus points for updates to the type)
> * I would like to avoid writing a custom query expression generator,
> *unless* there is one already built into Spark that can generate a query
> based on the schema/{{StructType}}
> Copied from:
> [https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan]
> See POC of workaround/util in
> [https://github.com/ravwojdyla/spark-schema-utils]
> Also posted in
> [https://lists.apache.org/thread/5ds0f7chzp1s3h10tvjm3r96g769rvpj]
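As a side note for anyone landing on this issue: below is a minimal sketch of one lazy alternative, assuming the schema changes are limited to top-level column descriptions/metadata and castable type updates. The helper name {{with_schema}} is hypothetical (it is not a Spark API); it emits a single {{select}}, which only adds a projection to the logical plan and so does not trigger a job:

{code:python}
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StructType

# Hypothetical helper, not a Spark API: rebuild top-level columns using the
# types and metadata from new_schema. The select adds a single projection to
# the logical plan, so nothing is computed until an action runs.
def with_schema(df: DataFrame, new_schema: StructType) -> DataFrame:
    return df.select(*(
        F.col(field.name)
        .cast(field.dataType)  # type update; the cast must be legal
        .alias(field.name, metadata=field.metadata)  # description/metadata update
        for field in new_schema.fields
    ))
{code}

Caveats: this does not change nullability (Spark derives that from the plan), and nested struct fields would have to be rebuilt field by field, which is exactly the kind of expression generator the issue hopes to avoid.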