[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106124#comment-15106124 ]

Simeon Simeonov commented on SPARK-12890:
-----------------------------------------

I've run into this issue with a multi-level partitioned table loaded via 
`sqlContext.read.parquet()`. I'm not sure Spark actually reads any row data 
from the Parquet files, but it does touch every Parquet file (perhaps to read 
the footer metadata?). I discovered this by accident: the table tree contained 
invalid Parquet files left over from a failed job, and Spark threw an error, 
which surprised me because I expected it not to look at any of the data files 
when the query could be satisfied entirely from the partition columns.

This is an important issue because it affects query speed for very large 
partitioned tables.
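
For reference, here is a minimal sketch of the setup I'm describing (the path, 
partition column names, and table name are hypothetical; Spark 1.6-era API). 
The final query refers only to partition columns, so in principle it could be 
answered from the discovered partition values alone:

// Table laid out on disk as /data/events/date=YYYY-MM-DD/hour=HH/part-*.parquet
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc is an existing SparkContext

// Partition discovery picks up `date` and `hour` from the directory names.
val events = sqlContext.read.parquet("/data/events")
events.registerTempTable("events")

// This query touches only partition columns, yet Spark still looks at every
// Parquet file under /data/events (and errors if any file is corrupt).
sqlContext.sql("SELECT max(date) FROM events").show()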

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-12890
>                 URL: https://issues.apache.org/jira/browse/SPARK-12890
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Prakash Chockalingam
>
> I have a SQL query that references only partition fields. The query ends up 
> scanning all the data, which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



