When writing a DataFrame into partitioned parquet files, the partition
columns are removed from the data. 

For example:

 df.write
   .mode(SaveMode.Append)
   .partitionBy("year", "month", "day", "hour")
   .parquet(somePath)

This creates a directory structure like:

events
  |-> year=2016
    |-> month=1
      |-> day=15
        |-> hour=10
          |-> part-r-00000-b6d0b99e-8673-4a12-8ab4-379421b008c8.gz.parquet

The column values for the partitions are represented in the directory
structure but removed from the actual parquet files. When the data is read
back, those values come back as NULL unless partition discovery is enabled
in Spark. We did not realize this and lost two weeks of data analysis
because those fields were returned as NULL.
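
For reference, here is a rough sketch of reading from the dataset root so
that partition discovery rebuilds the columns (Spark 1.x style; the app name
and root path below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReadPartitionedEvents {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-partitioned-events"))
    val sqlContext = new SQLContext(sc)

    // Reading from the dataset root lets partition discovery rebuild
    // year/month/day/hour from the directory names.
    val events = sqlContext.read.parquet("/data/events")

    // The partition columns appear in the schema even though they are
    // not stored inside the parquet files themselves.
    events.printSchema()
    println(events.filter("year = 2016 AND month = 1").count())

    sc.stop()
  }
}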

The logic that removes the columns from the data schema is in
ResolvedDataSource (~ line 182):

val dataSchema = StructType(data.schema.filterNot(f =>
  partitionColumns.contains(f.name)))
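
As a toy illustration (not the actual ResolvedDataSource code path), the
effect is simply that the schema written to the files is the DataFrame
schema minus the partition columns; the field names here are made up:

import org.apache.spark.sql.types._

// Hypothetical full schema of the events DataFrame.
val fullSchema = StructType(Seq(
  StructField("eventId", StringType),
  StructField("year", IntegerType),
  StructField("month", IntegerType),
  StructField("day", IntegerType),
  StructField("hour", IntegerType)))
val partitionColumns = Seq("year", "month", "day", "hour")

// Same filter as above: only "eventId" survives into the file schema.
val dataSchema = StructType(fullSchema.filterNot(f => partitionColumns.contains(f.name)))
println(dataSchema.treeString)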

When this data is then used with something like Apache Drill, all of the
partition fields come back as NULL, which makes the data hard to work with.
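
One possible workaround (a sketch only; the *_part column names are made up)
is to partition by copies of the columns so the original values stay inside
the parquet files, at the cost of an extra set of columns and longer
directory names:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Partition by duplicated columns so year/month/day/hour remain in the data files.
val withCopies = df
  .withColumn("year_part", col("year"))
  .withColumn("month_part", col("month"))
  .withColumn("day_part", col("day"))
  .withColumn("hour_part", col("hour"))

withCopies.write
  .mode(SaveMode.Append)
  .partitionBy("year_part", "month_part", "day_part", "hour_part")
  .parquet(somePath)

Tools that do not use Spark's partition discovery (Drill, in our case) would
then still see the real values in every row, while Spark can continue to
prune on the *_part directories.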

Why does Spark do this (is it required by the Parquet spec?), and would this
be considered a defect, or could there at least be a configuration option to
keep the partition columns in the data schema?

Any help would be appreciated.

Thanks,

Ivan
            


