[ https://issues.apache.org/jira/browse/SPARK-21021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michel Lemay updated SPARK-21021:
---------------------------------
Description:

When reading back a partitioned parquet folder, column order gets messed up. Consider the following example:

{code:scala}
case class Event(f1: String, f2: String, f3: String)

val df = Seq(Event("v1", "v2", "v3")).toDF
df.write.partitionBy("f1", "f2").parquet("out")

val schema: StructType = StructType(
  StructField("f1", StringType, true) ::
  StructField("f2", StringType, true) ::
  StructField("f3", StringType, true) :: Nil)

val dfRead = spark.read.schema(schema).parquet("out")
dfRead.show
+---+---+---+
| f3| f1| f2|
+---+---+---+
| v3| v1| v2|
+---+---+---+

dfRead.columns
Array[String] = Array(f3, f1, f2)

schema.fields
Array(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true))
{code}

This makes it really hard to have compatible schemas when reading from multiple sources.
> Reading partitioned parquet does not respect specified schema column order
> --------------------------------------------------------------------------
>
>                 Key: SPARK-21021
>                 URL: https://issues.apache.org/jira/browse/SPARK-21021
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Michel Lemay
>            Priority: Minor
>
> When reading back a partitioned parquet folder, column order gets messed up.
> Consider the following example:
> {code:scala}
> case class Event(f1: String, f2: String, f3: String)
>
> val df = Seq(Event("v1", "v2", "v3")).toDF
> df.write.partitionBy("f1", "f2").parquet("out")
>
> val schema: StructType = StructType(
>   StructField("f1", StringType, true) ::
>   StructField("f2", StringType, true) ::
>   StructField("f3", StringType, true) :: Nil)
>
> val dfRead = spark.read.schema(schema).parquet("out")
> dfRead.show
> +---+---+---+
> | f3| f1| f2|
> +---+---+---+
> | v3| v1| v2|
> +---+---+---+
>
> dfRead.columns
> Array[String] = Array(f3, f1, f2)
>
> schema.fields
> Array(StructField(f1,StringType,true), StructField(f2,StringType,true),
> StructField(f3,StringType,true))
> {code}
> This makes it really hard to have compatible schemas when reading from
> multiple sources.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
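A possible workaround (not part of the original report, and assuming an active SparkSession bound to `spark` plus the `schema` value from the example above): re-select the columns in the order given by the user-specified schema after reading, so the partition columns return to their declared positions. A minimal sketch:

{code:scala}
import org.apache.spark.sql.functions.col

// Read back the partitioned parquet output from the example above.
val dfRead = spark.read.schema(schema).parquet("out")

// Re-select the columns in the order declared in the user-specified schema;
// this restores f1, f2, f3 regardless of where the partition columns were
// appended by the reader.
val dfOrdered = dfRead.select(schema.fieldNames.map(col): _*)
{code}

After the `select`, `dfOrdered.columns` matches `schema.fieldNames`, which keeps downstream unions and joins against other sources consistent.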