Ameen Tayyebi created SPARK-22913:
-------------------------------------

             Summary: Hive Partition Pruning, Fractional and Timestamp types
                 Key: SPARK-22913
                 URL: https://issues.apache.org/jira/browse/SPARK-22913
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Ameen Tayyebi
             Fix For: 2.3.0


Spark currently pushes the predicates it has in the SQL query to Hive 
Metastore. This only applies to predicates that are placed on top of 
partitioning columns. As more and more hive metastore implementations come 
around, this is an important optimization to allow data to be prefiltered to 
only relevant partitions. Consider the following example:

Table:
create external table data (key string, quantity long)
partitioned by (processing-date timestamp)

Query:
select * from data where processing-date = '2017-10-23 00:00:00'

Currently, no filters will be pushed to the hive metastore for the above query. 
The reason is that the code that tries to compute predicates to be sent to hive 
metastore, only deals with integral and string column types. It doesn't know 
how to handle fractional and timestamp columns.

I have tables in my metastore (AWS Glue) with millions of partitions of type 
timestamp and double. In my specific case, it takes Spark's master node about 
6.5 minutes to download all partitions for the table, and then filter the 
partitions client-side. The actual processing time of my query is only 6 
seconds. In other words, without partition pruning, I'm looking at 6.5 minutes 
of processing and with partition pruning, I'm looking at 6 seconds only.

I have a fix for this developed locally that I'll provide shortly as a pull 
request.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to