I see a Jira: https://issues.apache.org/jira/browse/SPARK-21021
On Thu, Apr 11, 2019 at 9:08 AM Dávid Szakállas <david.szakal...@gmail.com> wrote:
> +dev for more visibility. Is this a known issue? Is there a plan for a fix?
>
> Thanks,
> David
>
> Begin forwarded message:
>
> *From: *Dávid Szakállas <david.szakal...@gmail.com>
> *Subject: *Dataset schema incompatibility bug when reading column partitioned data
> *Date: *2019. March 29. 14:15:27 CET
> *To: *u...@spark.apache.org
>
> We observed the following bug on Spark 2.4.0:
>
> scala> spark.createDataset(Seq((1,2))).write.partitionBy("_1").parquet("foo.parquet")
>
> scala> val schema = StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType)))
>
> scala> spark.read.schema(schema).parquet("foo.parquet").as[(Int, Int)].show
> +---+---+
> | _2| _1|
> +---+---+
> |  2|  1|
> +---+---+
>
> That is, when reading column-partitioned Parquet files, the explicitly specified schema is not adhered to; instead, the partitioning columns are appended to the end of the column list. This is quite a severe issue, as some operations, such as union, fail if the columns of two datasets are in a different order. Thus we have to work around the issue with a select:
>
> val columnNames = schema.fields.map(_.name)
> ds.select(columnNames.head, columnNames.tail: _*)
>
> Thanks,
> David Szakallas
> Data Engineer | Whitepages, Inc.
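The core of the select-based workaround is just "reorder the columns back into the order of the requested schema". Here is a minimal plain-Scala sketch of that reordering logic, without a Spark dependency; the object name `SchemaReorder` and all parameter names are illustrative, not anything from the Spark API:

```scala
// Sketch of the workaround's core idea: Spark hands back the partition
// column appended at the end, so we map each row's values back into the
// order of the schema the caller asked for.
object SchemaReorder {
  def reorder(requestedSchema: Seq[String],
              actualColumns: Seq[String],
              row: Seq[Any]): Seq[Any] = {
    // Pair each value with the column name Spark actually returned it under.
    val byName = actualColumns.zip(row).toMap
    // Emit values in the order of the requested schema.
    requestedSchema.map(byName)
  }
}
```

With the example from the report, columns come back as (_2, _1) with values (2, 1); reordering against the requested schema (_1, _2) yields (1, 2). In real Spark code the same effect is what `ds.select(columnNames.head, columnNames.tail: _*)` achieves at the plan level.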