Re: parquet predicate / projection pushdown into unionAll

Cody Koeninger Tue, 09 Sep 2014 12:02:23 -0700

Maybe I'm missing something, but I thought parquet was generally a write-once
format, and the sqlContext interface to it seems to work that way as well.


d1.saveAsParquetFile("/foo/d1")

// another day, another table, with same schema
d2.saveAsParquetFile("/foo/d2")

This will give a directory structure like:

/foo/d1/_metadata
/foo/d1/part-r-1.parquet
/foo/d1/part-r-2.parquet
/foo/d1/_SUCCESS

/foo/d2/_metadata
/foo/d2/part-r-1.parquet
/foo/d2/part-r-2.parquet
/foo/d2/_SUCCESS

// ParquetFileReader will fail, because /foo/d1 is a directory, not a
// parquet partition
sqlContext.parquetFile("/foo")

// works, but has the noted lack of pushdown
sqlContext.parquetFile("/foo/d1").unionAll(sqlContext.parquetFile("/foo/d2"))
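
For illustration only, the pushdown in question can be sketched on a toy plan tree in plain Scala. The names Plan, Scan, Union, Project, Filter and PushThroughUnion below are hypothetical; they are not Catalyst's classes or any actual Spark patch. The sketch assumes both sides of the union expose the same column names, as d1 and d2 do above.

```scala
// Toy logical plan. These types are illustrative stand-ins, NOT Spark's Catalyst classes.
sealed trait Plan
case class Scan(path: String) extends Plan
case class Union(left: Plan, right: Plan) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan
case class Filter(predicate: String, child: Plan) extends Plan

object PushThroughUnion {
  // Rewrites the plan bottom-up so that any Project or Filter sitting above a
  // Union is copied onto each branch, ending up directly above the scans.
  // Assumption: both branches of every Union share the same column names.
  def apply(plan: Plan): Plan = plan match {
    case Project(cols, child) =>
      apply(child) match {
        case Union(l, r) => Union(apply(Project(cols, l)), apply(Project(cols, r)))
        case other       => Project(cols, other)
      }
    case Filter(pred, child) =>
      apply(child) match {
        case Union(l, r) => Union(apply(Filter(pred, l)), apply(Filter(pred, r)))
        case other       => Filter(pred, other)
      }
    case Union(l, r) => Union(apply(l), apply(r))
    case s: Scan     => s
  }
}

object PushThroughUnionDemo {
  // The shape produced by parquetFile(d1).unionAll(parquetFile(d2)) followed
  // by a filter and a projection; column names "a" and "b" are made up.
  val before: Plan =
    Project(Seq("a"), Filter("b > 1", Union(Scan("/foo/d1"), Scan("/foo/d2"))))

  val after: Plan = PushThroughUnion(before)

  def main(args: Array[String]): Unit = {
    println(before)
    println(after)
  }
}
```

After the rewrite, each Scan has its own Filter and Project directly above it, which is the position from which a scan operator could apply column pruning and predicate pushdown per directory.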


Is there another alternative?



On Tue, Sep 9, 2014 at 1:29 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> I think usually people add these directories as multiple partitions of the
> same table instead of union.  This actually allows us to efficiently prune
> directories when reading in addition to standard column pruning.
>
> On Tue, Sep 9, 2014 at 11:26 AM, Gary Malouf <malouf.g...@gmail.com>
> wrote:
>
>> I'm kind of surprised this was not run into before.  Do people not
>> segregate their data by day/week in the HDFS directory structure?
>>
>>
>> On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> Thanks!
>>>
>>> On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>>
>>> > Opened
>>> >
>>> > https://issues.apache.org/jira/browse/SPARK-3462
>>> >
>>> > I'll take a look at ColumnPruning and see what I can do
>>> >
>>> > On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust <mich...@databricks.com>
>>> > wrote:
>>> >
>>> >> On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger <c...@koeninger.org>
>>> >> wrote:
>>> >>>
>>> >>> Is there a reason in general not to push projections and predicates
>>> >>> down into the individual ParquetTableScans in a union?
>>> >>>
>>> >>
>>> >> This would be a great case to add to ColumnPruning.  Would be awesome
>>> >> if you could open a JIRA or even a PR :)
>>> >>
>>> >
>>> >
>>>
>>
>>
>
