[ https://issues.apache.org/jira/browse/SPARK-20475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lijia Liu closed SPARK-20475.
-----------------------------
    Resolution: Fixed
 Fix Version/s: 2.0.3

> Whether use "broadcast join" depends on hive configuration
> ----------------------------------------------------------
>
>                 Key: SPARK-20475
>                 URL: https://issues.apache.org/jira/browse/SPARK-20475
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Lijia Liu
>             Fix For: 2.0.3
>
> Currently, broadcast join in Spark only works when:
> 1. The value of "spark.sql.autoBroadcastJoinThreshold" is greater than 0 (the default is 10485760).
> 2. The size of one of the Hive tables is less than "spark.sql.autoBroadcastJoinThreshold". To get the size of a Hive table from the Hive metastore, "hive.stats.autogather" must be set to true in Hive, or the command "ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan" must have been run.
> Hive, by contrast, calculates the size of the file or directory backing the Hive table to decide whether to use a map-side join, and does not depend on the Hive metastore.
> This leads to two problems:
> 1. Spark will not use "broadcast join" when the Hive parameter "hive.stats.autogather" is not set to true or the command "ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan" has not been run, because the size information of the Hive table has not been saved in the Hive metastore. How Spark behaves therefore depends on the configuration of Hive.
> 2. For some reason, we set "hive.stats.autogather" to false in our Hive. For the same SQL, Hive is 4 times faster than Spark, because Hive used "map side join" but Spark did not use "broadcast join".
> Is it possible for Spark to use the same mechanism as Hive to look up the size of a Hive table?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
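The two conditions described in the issue can be illustrated with a minimal Spark 2.x sketch. This assumes a `SparkSession` built with Hive support and two illustrative Hive tables, `small_t` and `big_t`; the table names and the session variable are assumptions for the example, not from the issue.

```scala
// Sketch of the two preconditions for a broadcast join in Spark 2.x.
// Assumes a Hive-enabled SparkSession and hypothetical tables small_t / big_t.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("broadcast-join-preconditions")
  .enableHiveSupport()
  .getOrCreate()

// Condition 1: the threshold must be greater than 0
// (10485760 bytes, i.e. 10 MB, is the default).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760L)

// Condition 2: the Hive metastore must hold size statistics for the table.
// If hive.stats.autogather is false on the Hive side, the statistics have
// to be computed explicitly:
spark.sql("ANALYZE TABLE small_t COMPUTE STATISTICS noscan")

// Only when both conditions hold can the planner choose a broadcast join;
// explain() shows whether BroadcastHashJoin appears in the physical plan.
spark.sql("SELECT * FROM big_t b JOIN small_t s ON b.k = s.k").explain()
```

Without the `ANALYZE TABLE` step (or `hive.stats.autogather=true` in Hive), the metastore has no size estimate, so Spark falls back to a sort-merge join regardless of the actual file size, which is the behavior the issue reports.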