Parquet datasource optimization for distinct query

pnpritchard Wed, 16 Dec 2015 12:18:16 -0800

I have a Parquet dataset that is partitioned by a column, as shown in
http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery.
I like this storage layout because the datasource can "push down" filters
on the partition column, making some queries a lot faster.


However, there is an additional "push-down" optimization that could be done,
specifically for distinct queries. For example, a distinct on the partition
column only requires scanning the directory structure, because the values are
encoded in the directory names (column=value); no files need to be read. This
can make a huge difference when there is a lot of data.
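
As a rough illustration of the idea (plain Python, not Spark's API), the
sketch below collects the distinct values of a partition column purely from
directory names. It assumes a local filesystem and the column=value directory
layout from the partition discovery docs; the function name
distinct_partition_values is hypothetical, and values are returned as the raw
strings found in the directory names.

```python
import os
import tempfile


def distinct_partition_values(root, column):
    """Return the sorted distinct values of a partition column.

    Only directory names are inspected; no data files are opened.
    Assumes directories are named "<column>=<value>" somewhere under root.
    """
    prefix = column + "="
    values = set()
    for _dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            if name.startswith(prefix):
                values.add(name[len(prefix):])
    return sorted(values)


if __name__ == "__main__":
    # Build a small throwaway partitioned layout to try the function on.
    with tempfile.TemporaryDirectory() as root:
        for gender, country in [("male", "US"), ("male", "CN"), ("female", "US")]:
            os.makedirs(os.path.join(root, "gender=" + gender, "country=" + country))
        print(distinct_partition_values(root, "gender"))
        print(distinct_partition_values(root, "country"))
```

On a real dataset the listing would go through whatever filesystem API holds
the data (for example HDFS) rather than os.walk, but the cost would still
scale with the number of partition directories rather than with the data size.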

This kind of optimization doesn't seem possible to implement currently
because the ParquetRelation only receives an array of Filter, and not the
full logical plan (which would include the distinct expression). Are there
any plans to change this? Has anyone else encountered this and figured out a
workaround?

Thanks!
Nick




