[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15610024#comment-15610024 ]
Franck Tago commented on SPARK-15616:
-------------------------------------

Does Spark currently support pruned statistics for Hive table partitions? Consider a plan of the form:

    hive table 1 --- filter ---
                               \
                                join
                               /
    large hive table 2 --------

where both hive table 1 and hive table 2 exceed spark.sql.autoBroadcastJoinThreshold, but the filter removes all data except a single partition that is small enough to fit under spark.sql.autoBroadcastJoinThreshold. Will Spark perform a broadcast join in this case? My observation on Spark 2.0.0 and 2.0.1 is that it will not. Any ideas?

> Metastore relation should fallback to HDFS size of partitions that are
> involved in Query if statistics are not available.
> -------------------------------------------------------------------------
>
>                 Key: SPARK-15616
>                 URL: https://issues.apache.org/jira/browse/SPARK-15616
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Lianhui Wang
>
> Currently, if some partitions of a partitioned table are used in a join
> operation, we rely on the table size returned by the Metastore to decide
> whether the join can be converted to a broadcast join.
> If a Filter can prune some partitions, Hive can prune them before deciding
> whether to use a broadcast join, based on the HDFS size of only the
> partitions involved in the query. Spark SQL should do the same, which would
> improve join performance for partitioned tables.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
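The sizing decision at the heart of this issue can be sketched in plain Python: the planner compares a relation's estimated size against spark.sql.autoBroadcastJoinThreshold, and the proposed improvement is to fall back to the summed HDFS size of only the partitions that survive pruning when Metastore table statistics are unavailable. This is a hypothetical illustration of the decision logic, not Spark's actual implementation; all function names, partition names, and sizes below are made up.

```python
# Hypothetical sketch of the size-estimation fallback proposed in SPARK-15616.
# Names and numbers are illustrative only; this is not Spark's real code path.

DEFAULT_AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024  # 10 MB, Spark's default

def estimated_size(table_stats_bytes, partition_hdfs_bytes, pruned_partitions):
    """Return the size estimate used for the broadcast-join decision.

    table_stats_bytes: Metastore-level total table size, or None if unavailable.
    partition_hdfs_bytes: dict mapping partition name -> HDFS size in bytes.
    pruned_partitions: the partitions selected by the query's filter.
    """
    if table_stats_bytes is not None:
        # Pre-SPARK-15616 behavior: use the whole-table statistic, even though
        # the filter may only touch a small subset of partitions.
        return table_stats_bytes
    # Proposed fallback: sum HDFS sizes of only the partitions involved.
    return sum(partition_hdfs_bytes[p] for p in pruned_partitions)

def can_broadcast(size_bytes, threshold=DEFAULT_AUTO_BROADCAST_JOIN_THRESHOLD):
    """A relation is broadcastable when its estimate fits under the threshold."""
    return 0 <= size_bytes <= threshold

# A 3-partition table totalling 30 MB, where the filter keeps one 2 MB partition.
parts = {
    "dt=2016-01-01": 14 << 20,
    "dt=2016-01-02": 14 << 20,
    "dt=2016-01-03": 2 << 20,
}

whole_table = estimated_size(30 << 20, parts, ["dt=2016-01-03"])
pruned_only = estimated_size(None, parts, ["dt=2016-01-03"])

print(can_broadcast(whole_table))  # False: the 30 MB whole-table stat exceeds 10 MB
print(can_broadcast(pruned_only))  # True: the single 2 MB pruned partition fits
```

This matches the behavior Franck observed: with whole-table statistics present, the 30 MB estimate blocks the broadcast join even though the filtered data is only 2 MB. In a real session one would confirm the chosen strategy by lowering or raising spark.sql.autoBroadcastJoinThreshold and inspecting the physical plan via explain() for a broadcast join node.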