[ https://issues.apache.org/jira/browse/SPARK-35876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-35876.
----------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 33106
[https://github.com/apache/spark/pull/33106]

> array_zip unexpected column names
> ---------------------------------
>
>                 Key: SPARK-35876
>                 URL: https://issues.apache.org/jira/browse/SPARK-35876
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.2
>            Reporter: Derk Crezee
>            Assignee: Kousuke Saruta
>            Priority: Major
>             Fix For: 3.2.0
>
>
> When I'm using the arrays_zip function in combination with renamed columns,
> I get an unexpected schema written to disk.
> {code:python}
> from pyspark.sql import Row, SparkSession
> from pyspark.sql.functions import arrays_zip, col
>
> spark = SparkSession.builder.getOrCreate()
>
> data = [
>     Row(a1=["a", "a"], b1=["b", "b"]),
> ]
>
> # Rename both columns, then zip the renamed columns.
> df = (
>     spark.sparkContext.parallelize(data).toDF()
>     .withColumnRenamed("a1", "a2")
>     .withColumnRenamed("b1", "b2")
>     .withColumn("zipped", arrays_zip(col("a2"), col("b2")))
> )
>
> # In-memory schema: the struct fields carry the renamed names.
> df.printSchema()
> # root
> #  |-- a2: array (nullable = true)
> #  |    |-- element: string (containsNull = true)
> #  |-- b2: array (nullable = true)
> #  |    |-- element: string (containsNull = true)
> #  |-- zipped: array (nullable = true)
> #  |    |-- element: struct (containsNull = false)
> #  |    |    |-- a2: string (nullable = true)
> #  |    |    |-- b2: string (nullable = true)
>
> # After a round trip through disk, the struct fields carry the old names.
> df.write.save("test.parquet")
> spark.read.load("test.parquet").printSchema()
> # root
> #  |-- a2: array (nullable = true)
> #  |    |-- element: string (containsNull = true)
> #  |-- b2: array (nullable = true)
> #  |    |-- element: string (containsNull = true)
> #  |-- zipped: array (nullable = true)
> #  |    |-- element: struct (containsNull = true)
> #  |    |    |-- a1: string (nullable = true)
> #  |    |    |-- b1: string (nullable = true){code}
> I would expect the schema of the DataFrame written to disk to be the same as
> that printed out.
> It seems that the struct fields are written with the old column names
> instead of the renamed ones.
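The mismatch described in the report can be checked mechanically by comparing the nested field names of the two schemas. The following is only a minimal sketch under stated assumptions: the helper names `field_paths` and `name_mismatches` are hypothetical, and the two schema dictionaries are written by hand to mirror the two `printSchema()` outputs quoted above, in the dictionary shape that `df.schema.jsonValue()` returns in PySpark. It does not reproduce the bug itself and needs no Spark installation.

```python
def field_paths(dtype, prefix=""):
    """Return dotted paths of every struct field nested anywhere in dtype."""
    if not isinstance(dtype, dict):  # primitive type, e.g. "string"
        return []
    if dtype["type"] == "struct":
        paths = []
        for field in dtype["fields"]:
            path = f"{prefix}.{field['name']}" if prefix else field["name"]
            paths.append(path)
            paths.extend(field_paths(field["type"], path))
        return paths
    if dtype["type"] == "array":
        # Array elements do not add a path segment of their own.
        return field_paths(dtype["elementType"], prefix)
    if dtype["type"] == "map":
        return (field_paths(dtype["keyType"], prefix)
                + field_paths(dtype["valueType"], prefix))
    return []


def name_mismatches(expected, actual):
    """Return (missing, unexpected) field paths of actual relative to expected."""
    exp, act = set(field_paths(expected)), set(field_paths(actual))
    return sorted(exp - act), sorted(act - exp)


def column(name, dtype):
    """Build one struct field entry (hypothetical helper for this sketch)."""
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


def report_schema(first, second, contains_null):
    """Hand-written stand-in for the schemas printed in the report."""
    strings = {"type": "array", "elementType": "string", "containsNull": True}
    zipped = {
        "type": "array",
        "containsNull": contains_null,
        "elementType": {
            "type": "struct",
            "fields": [column(first, "string"), column(second, "string")],
        },
    }
    return {
        "type": "struct",
        "fields": [column("a2", strings), column("b2", strings),
                   column("zipped", zipped)],
    }


in_memory = report_schema("a2", "b2", contains_null=False)  # df.printSchema()
reread = report_schema("a1", "b1", contains_null=True)      # after write + load

missing, unexpected = name_mismatches(in_memory, reread)
print("missing after round trip:", missing)
print("unexpected after round trip:", unexpected)
```

With a real session, one would presumably pass `df.schema.jsonValue()` and `spark.read.load("test.parquet").schema.jsonValue()` instead of the hand-written dictionaries. The check compares names only, so the `containsNull` difference visible in the report is deliberately ignored.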