When writing a DataFrame into partitioned parquet files, the partition columns are removed from the data.
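Concretely, the round trip looks roughly like this (a sketch against the Spark 1.x API; df, somePath, sqlContext, and pathToOneLeafDir are placeholders, not verified code):

```scala
// Sketch only (Spark 1.x API). df, somePath, and sqlContext are assumed
// to already exist; pathToOneLeafDir stands for the path of a single
// partition directory under somePath.
import org.apache.spark.sql.SaveMode

// The partition column values are encoded in the directory names and
// dropped from the parquet files themselves:
df.write
  .mode(SaveMode.Append)
  .partitionBy("year", "month", "day", "hour")
  .parquet(somePath)

// Reading the root path triggers Spark's partition discovery, which
// reconstructs the columns from the directory names:
val full = sqlContext.read.parquet(somePath)

// Reading a single leaf directory (or reading the files with another
// tool, e.g. Apache Drill) bypasses discovery, so the partition columns
// are missing from the schema or come back as NULL:
val leaf = sqlContext.read.parquet(pathToOneLeafDir)
```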
For example:

    df.write.mode(SaveMode.Append).partitionBy("year", "month", "day", "hour").parquet(somePath)

This creates a directory structure like:

    events
    |-> 2016
        |-> 1
            |-> 15
                |-> 10
                    |-> part-r-00000-b6d0b99e-8673-4a12-8ab4-379421b008c8.gz.parquet

The column values for the partitions are represented in the directory structure but removed from the actual parquet files. When reading the data back, those values are NULL unless partition discovery is enabled in Spark. We did not realize this and lost two weeks of data analysis because those fields were returned as NULL.

The logic that removes the columns from the data schema is in ResolvedDataSource (around line 182):

    val dataSchema = StructType(data.schema.filterNot(f => partitionColumns.contains(f.name)))

When using this data in something like Apache Drill, it returns NULL for all of the partition fields, which makes it hard to work with.

Why does it do this (is it required by the Parquet spec?), and would this be considered a defect, or could there at least be a configuration option to keep the columns in the data schema?

Any help would be appreciated.

Thanks,
Ivan

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Partitioned-parquet-files-missing-partition-columns-from-data-tp16033.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.