I have a Parquet file that is partitioned by a column, as shown in http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery. I like this storage technique because the data source can "push down" filters on the partition column, so queries that filter on it only read the matching directories and run a lot faster.
However, there is an additional "push-down" optimization that could be done, specifically for distinct queries. For example, a distinct on the partition column only requires scanning the directory structure; no files need to be read. This can make a huge difference when there is a lot of data. This kind of optimization doesn't seem possible to implement currently, because the ParquetRelation only receives an array of Filter and not the full logical plan (which would include the distinct expression). Are there any plans to change this? Has anyone else encountered this and figured out a workaround?

Thanks!
Nick
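
P.S. To make the idea concrete, here is a minimal sketch of the directory-only scan described above. It is not Spark code and does not use any Spark API. It assumes the table lives on a local filesystem and uses the `column=value` directory layout from the partition discovery docs; the function name `distinct_partition_values` and its parameters are made up for illustration.

```python
from pathlib import Path


def distinct_partition_values(table_root, column):
    """Return the distinct values of a partition column using directory names only.

    Assumption: partitions are stored as directories named "<column>=<value>"
    somewhere under table_root. No data files are opened.
    """
    prefix = column + "="
    values = set()
    # rglob walks the tree recursively, so nested partition columns
    # (e.g. year=2015/key=a) are found at any depth.
    for path in Path(table_root).rglob(prefix + "*"):
        if path.is_dir():
            # Strip "<column>=" and keep the raw value as a string.
            values.add(path.name[len(prefix):])
    return sorted(values)
```

Calling `distinct_partition_values("/data/events", "key")` on a hypothetical table root would return the partition values as strings. The sketch has obvious limits: values are not cast to the column's type, any escaping in directory names is left as-is, and an empty partition directory is still reported even though a real distinct query would return nothing for it.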