[ https://issues.apache.org/jira/browse/SPARK-22247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Patrick Duin resolved SPARK-22247.
----------------------------------
    Resolution: Duplicate
 Fix Version/s: 2.3.0

> Hive partition filter very slow
> -------------------------------
>
>                 Key: SPARK-22247
>                 URL: https://issues.apache.org/jira/browse/SPARK-22247
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 2.0.2, 2.1.1
>            Reporter: Patrick Duin
>            Priority: Minor
>             Fix For: 2.3.0
>
> I found an issue where filtering partitions using a DataFrame results in
> very bad performance.
> To reproduce:
> Create a Hive table with a lot of partitions and write a Spark query on
> that table that filters on the partition column.
> In my use case I've got a table with about 30k partitions.
> I filter the partitions using some Scala via spark-shell:
> {{table.filter("partition=x or partition=y")}}
> This results in the Hive Thrift API call {{#get_partitions('db', 'table',
> -1)}}, which is very slow (minutes) and loads all metastore partitions
> into memory.
> Doing a simpler filter:
> {{table.filter("partition=x")}}
> results in the Hive Thrift API call {{#get_partitions_by_filter('db',
> 'table', 'partition = "x"', -1)}}, which is very fast (seconds) and only
> fetches partition x into memory.
> If possible, Spark should translate the filter into the more performant
> Thrift call, or fall back to a more scalable solution that filters out
> partitions without having to load them all into memory first (for
> instance, fetching the partitions in batches).
> I've posted my original question on
> [SO|https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions]
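For reference, a minimal spark-shell sketch of the two filter shapes described above; the table name {{db.table}}, the partition column {{partition}}, and the values {{x}}/{{y}} are placeholders rather than the reporter's actual schema:

{code:scala}
// Spark 2.x spark-shell sketch; `db.table` and its `partition` column are
// placeholder names standing in for the reporter's ~30k-partition Hive table.
val table = spark.table("db.table")

// OR across two partition values: reported to trigger the unfiltered Thrift
// call get_partitions('db', 'table', -1), loading all partition metadata
// into memory (minutes).
table.filter("partition = 'x' or partition = 'y'").count()

// Single equality predicate: reported to trigger
// get_partitions_by_filter('db', 'table', 'partition = "x"', -1), fetching
// only the matching partition (seconds).
table.filter("partition = 'x'").count()
{code}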