[ https://issues.apache.org/jira/browse/SPARK-20475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lijia Liu closed SPARK-20475.
-----------------------------
    Resolution: Fixed
 Fix Version/s: 2.0.3

> Whether use "broadcast join" depends on hive configuration
> ----------------------------------------------------------
>
>                 Key: SPARK-20475
>                 URL: https://issues.apache.org/jira/browse/SPARK-20475
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Lijia Liu
>             Fix For: 2.0.3
>
> Currently, broadcast join in Spark only works when:
> 1. The value of "spark.sql.autoBroadcastJoinThreshold" is greater than 0 (the default is 10485760).
> 2. The size of one of the Hive tables is less than "spark.sql.autoBroadcastJoinThreshold". To get the size of a Hive table from the Hive metastore, "hive.stats.autogather" must be set to true in Hive, or the command "ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan" must have been run.
> Hive, by contrast, calculates the size of the file or directory backing the Hive table to decide whether to use a map-side join, and does not depend on the Hive metastore.
> This leads to two problems:
> 1. Spark will not use "broadcast join" when the Hive parameter "hive.stats.autogather" is not set to true or the command "ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan" has not been run, because the size information of the Hive table has not been saved in the Hive metastore. How Spark behaves therefore depends on the configuration of Hive.
> 2. For some reason, we set "hive.stats.autogather" to false in our Hive. For the same SQL, Hive is 4 times faster than Spark, because Hive used "map side join" but Spark did not use "broadcast join".
> Is it possible for Spark to use the same mechanism as Hive to look up the size of a Hive table?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
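The two conditions described in the issue can be illustrated with a minimal Spark 2.x sketch. This assumes a `SparkSession` built with Hive support and two illustrative Hive tables, `small_t` and `big_t`; the table names and the session variable are assumptions for the example, not from the issue.

```scala
// Sketch of the two preconditions for a broadcast join in Spark 2.x.
// Assumes a Hive-enabled SparkSession and hypothetical tables small_t / big_t.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("broadcast-join-preconditions")
  .enableHiveSupport()
  .getOrCreate()

// Condition 1: the threshold must be greater than 0
// (10485760 bytes, i.e. 10 MB, is the default).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760L)

// Condition 2: the Hive metastore must hold size statistics for the table.
// If hive.stats.autogather is false on the Hive side, the statistics have
// to be computed explicitly:
spark.sql("ANALYZE TABLE small_t COMPUTE STATISTICS noscan")

// Only when both conditions hold can the planner choose a broadcast join;
// explain() shows whether BroadcastHashJoin appears in the physical plan.
spark.sql("SELECT * FROM big_t b JOIN small_t s ON b.k = s.k").explain()
```

Without the `ANALYZE TABLE` step (or `hive.stats.autogather=true` in Hive), the metastore has no size estimate, so Spark falls back to a sort-merge join regardless of the actual file size, which is the behavior the issue reports.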