Derk Crezee created SPARK-35876:
-----------------------------------

             Summary: array_zip unexpected column names
                 Key: SPARK-35876
                 URL: https://issues.apache.org/jira/browse/SPARK-35876
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.1.2
            Reporter: Derk Crezee

When I use the arrays_zip function on renamed columns, the schema written to disk differs from the schema of the DataFrame in memory.

{code:python}
from pyspark.sql import Row
from pyspark.sql.functions import arrays_zip, col

# Assumes an active SparkSession named `spark`.
data = [
    Row(a1=["a", "a"], b1=["b", "b"]),
]
df = (
    spark.sparkContext.parallelize(data).toDF()
    .withColumnRenamed("a1", "a2")
    .withColumnRenamed("b1", "b2")
    .withColumn("zipped", arrays_zip(col("a2"), col("b2")))
)

df.printSchema()
# root
#  |-- a2: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- b2: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- zipped: array (nullable = true)
#  |    |-- element: struct (containsNull = false)
#  |    |    |-- a2: string (nullable = true)
#  |    |    |-- b2: string (nullable = true)

df.write.save("test.parquet")
spark.read.load("test.parquet").printSchema()
# root
#  |-- a2: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- b2: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- zipped: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- a1: string (nullable = true)
#  |    |    |-- b1: string (nullable = true)
{code}

I would expect the schema of the DataFrame written to disk to match the one printed before the write. Instead, the struct fields inside the zipped column are written with the original column names (a1, b1) rather than the renamed ones (a2, b2).
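The mismatch described above can also be checked programmatically rather than by reading two printSchema() outputs. The following is only a minimal sketch: it assumes schemas are supplied as the plain dicts that StructType.jsonValue() returns, and the helpers field_paths, renamed_fields and example_schema are hypothetical names introduced here, not part of Spark. The example schema is hand-written to mirror the zipped column from the report so that the snippet runs without a Spark session.

```python
from typing import Iterator, List, Tuple


def field_paths(dtype, prefix: str = "") -> Iterator[str]:
    """Yield the dotted path of every named field in a schema JSON value."""
    if not isinstance(dtype, dict):
        return  # primitive types are plain strings such as "string"
    kind = dtype.get("type")
    if kind == "struct":
        for field in dtype["fields"]:
            path = f"{prefix}.{field['name']}" if prefix else field["name"]
            yield path
            yield from field_paths(field["type"], path)
    elif kind == "array":
        # Array elements have no name of their own; keep the parent's path.
        yield from field_paths(dtype["elementType"], prefix)
    elif kind == "map":
        yield from field_paths(dtype["valueType"], prefix)


def renamed_fields(before: dict, after: dict) -> Tuple[List[str], List[str]]:
    """Return (paths only in `before`, paths only in `after`)."""
    b, a = set(field_paths(before)), set(field_paths(after))
    return sorted(b - a), sorted(a - b)


def example_schema(first: str, second: str) -> dict:
    """Hand-written stand-in for the schema of the 'zipped' column."""
    def string_field(name: str) -> dict:
        return {"name": name, "type": "string", "nullable": True, "metadata": {}}

    return {
        "type": "struct",
        "fields": [{
            "name": "zipped",
            "type": {
                "type": "array",
                "elementType": {
                    "type": "struct",
                    "fields": [string_field(first), string_field(second)],
                },
                "containsNull": True,
            },
            "nullable": True,
            "metadata": {},
        }],
    }


# In-memory schema uses a2/b2; the schema read back from disk uses a1/b1.
missing, unexpected = renamed_fields(
    example_schema("a2", "b2"), example_schema("a1", "b1")
)
print(missing)     # ['zipped.a2', 'zipped.b2']
print(unexpected)  # ['zipped.a1', 'zipped.b1']
```

With live DataFrames, the two arguments would presumably be df.schema.jsonValue() and spark.read.load("test.parquet").schema.jsonValue(); two empty lists would then indicate that the field names survived the round trip.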