[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15610024#comment-15610024 ]

Franck Tago commented on SPARK-15616:
-------------------------------------

Does Spark currently support pruning statistics for Hive table partitions?

The question: consider a query plan that consists of

    hive table 1 --- filter ---+
                               +--- join
    large hive table 2 --------+

where both hive table 1 and hive table 2 exceed
spark.sql.autoBroadcastJoinThreshold.

However, the filter prunes the data down to a single partition that is small
enough to fit under spark.sql.autoBroadcastJoinThreshold.

Will Spark perform a broadcast join in this case?

My observation with Spark 2.0.0 and Spark 2.0.1 is that it will not.

Any ideas?
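
A minimal way to check what the planner actually does in this case (a sketch
only; the table names and the partition column dt are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Filter on the partition column so only one small partition survives.
    val small = spark.table("hive_table1").filter("dt = '2016-10-01'")
    val joined = small.join(spark.table("hive_table2"), "id")

    // If Spark used the post-pruning size, the physical plan would show a
    // BroadcastHashJoin; my observation on 2.0.x is a SortMergeJoin instead.
    joined.explain()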

> Metastore relation should fallback to HDFS size of partitions that are 
> involved in Query if statistics are not available.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15616
>                 URL: https://issues.apache.org/jira/browse/SPARK-15616
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Lianhui Wang
>
> Currently, if some partitions of a partitioned table are used in a join 
> operation, we rely on the table size returned by the Metastore to decide 
> whether we can convert the operation to a broadcast join. 
> If a Filter can prune some partitions, the decision to use a broadcast join 
> should be made according to the HDFS size of only the partitions involved 
> in the query, as Hive does. Spark SQL needs this to improve join 
> performance for partitioned tables.
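
Until such a fallback exists, one workaround (a sketch, reusing the same
hypothetical table names as above) is to force the plan with the broadcast()
hint from org.apache.spark.sql.functions, which bypasses the Metastore size
estimate entirely:

    import org.apache.spark.sql.functions.broadcast

    // Explicitly mark the pruned side as broadcastable.
    val joined = spark.table("hive_table2")
      .join(broadcast(spark.table("hive_table1").filter("dt = '2016-10-01'")), "id")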



