[ https://issues.apache.org/jira/browse/SPARK-22913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-22913:
------------------------------------

    Assignee: Apache Spark

> Hive Partition Pruning, Fractional and Timestamp types
> -------------------------------------------------------
>
>                 Key: SPARK-22913
>                 URL: https://issues.apache.org/jira/browse/SPARK-22913
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Ameen Tayyebi
>            Assignee: Apache Spark
>             Fix For: 2.3.0
>
>
> Spark currently pushes the predicates in a SQL query down to the Hive metastore. This applies only to predicates placed on partitioning columns. As more and more Hive metastore implementations appear, this becomes an important optimization, because it allows the data to be pre-filtered to only the relevant partitions. Consider the following example:
> Table:
> create external table data (key string, quantity long)
> partitioned by (processing_date timestamp)
> Query:
> select * from data where processing_date = '2017-10-23 00:00:00'
> Currently, no filters are pushed to the Hive metastore for the above query. The reason is that the code that computes the predicates to send to the metastore only handles integral and string column types; it does not know how to handle fractional and timestamp columns.
> I have tables in my metastore (AWS Glue) with millions of partitions of type timestamp and double. In my specific case, it takes Spark's master node about 6.5 minutes to download all partitions for the table and then filter them client-side. The actual processing time of my query is only 6 seconds. In other words, without partition pruning I'm looking at 6.5 minutes of processing; with partition pruning, only 6 seconds.
> I have a fix for this developed locally that I'll provide shortly as a pull request.
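
A rough, self-contained Scala sketch of the kind of conversion involved, assuming a simplified predicate model. The names (PartitionPredicate, EqualTo, And, renderLiteral, toMetastoreFilter) are illustrative only, not Spark's actual internals; the point is that once timestamp and fractional literals can be rendered into the metastore filter string, the predicate can be pushed down instead of fetching every partition and filtering client-side.

import java.sql.Timestamp

// Hypothetical, simplified predicate model over partition columns.
sealed trait PartitionPredicate
case class EqualTo(column: String, value: Any) extends PartitionPredicate
case class And(left: PartitionPredicate, right: PartitionPredicate) extends PartitionPredicate

object MetastoreFilterSketch {
  // Render a literal in a form the metastore filter string can compare against.
  private def renderLiteral(value: Any): Option[String] = value match {
    case i: Int       => Some(i.toString)                 // integral: already supported
    case l: Long      => Some(l.toString)
    case s: String    => Some("\"" + s + "\"")            // string: already supported
    case d: Double    => Some(d.toString)                 // fractional: what the fix adds
    case f: Float     => Some(f.toString)
    case t: Timestamp => Some("\"" + t.toString + "\"")   // timestamp: what the fix adds
    case _            => None                             // unsupported -> do not push down
  }

  // Convert a predicate tree to a metastore filter string, dropping unsupported parts.
  def toMetastoreFilter(pred: PartitionPredicate): Option[String] = pred match {
    case EqualTo(col, v) => renderLiteral(v).map(lit => s"$col = $lit")
    case And(l, r) =>
      (toMetastoreFilter(l), toMetastoreFilter(r)) match {
        case (Some(a), Some(b)) => Some(s"($a and $b)")
        case (Some(a), None)    => Some(a)  // partial pushdown is still safe under AND
        case (None, Some(b))    => Some(b)
        case (None, None)       => None
      }
  }

  def main(args: Array[String]): Unit = {
    val pred = EqualTo("processing_date", Timestamp.valueOf("2017-10-23 00:00:00"))
    // Before the change, a timestamp literal produced no filter and every partition
    // had to be downloaded; with it, a filter like this can be sent to the metastore.
    println(toMetastoreFilter(pred)) // Some(processing_date = "2017-10-23 00:00:00.0")
  }
}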