Maybe I'm missing something, but I thought Parquet was generally a write-once format, and the sqlContext interface to it seems that way as well.

d1.saveAsParquetFile("/foo/d1")

// another day, another table, with same schema
d2.saveAsParquetFile("/foo/d2")

Will give a directory structure like:

/foo/d1/_metadata
/foo/d1/part-r-1.parquet
/foo/d1/part-r-2.parquet
/foo/d1/_SUCCESS

/foo/d2/_metadata
/foo/d2/part-r-1.parquet
/foo/d2/part-r-2.parquet
/foo/d2/_SUCCESS

// ParquetFileReader will fail, because /foo/d1 is a directory, not a parquet partition
sqlContext.parquetFile("/foo")

// works, but has the noted lack of pushdown
sqlContext.parquetFile("/foo/d1").unionAll(sqlContext.parquetFile("/foo/d2"))

Is there another alternative?

On Tue, Sep 9, 2014 at 1:29 PM, Michael Armbrust <mich...@databricks.com> wrote:

> I think usually people add these directories as multiple partitions of the
> same table instead of union. This actually allows us to efficiently prune
> directories when reading in addition to standard column pruning.
>
> On Tue, Sep 9, 2014 at 11:26 AM, Gary Malouf <malouf.g...@gmail.com>
> wrote:
>
>> I'm kind of surprised this was not run into before. Do people not
>> segregate their data by day/week in the HDFS directory structure?
>>
>> On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> Thanks!
>>>
>>> On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>>
>>> > Opened
>>> >
>>> > https://issues.apache.org/jira/browse/SPARK-3462
>>> >
>>> > I'll take a look at ColumnPruning and see what I can do
>>> >
>>> > On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust
>>> > <mich...@databricks.com> wrote:
>>> >
>>> >> On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger <c...@koeninger.org>
>>> >> wrote:
>>> >>>
>>> >>> Is there a reason in general not to push projections and predicates
>>> >>> down into the individual ParquetTableScans in a union?
>>> >>>
>>> >>
>>> >> This would be a great case to add to ColumnPruning. Would be awesome
>>> >> if you could open a JIRA or even a PR :)