[ https://issues.apache.org/jira/browse/SPARK-22334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215107#comment-16215107 ]
Apache Spark commented on SPARK-22334:
--------------------------------------

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/19560

> Check table size from HDFS in case the size in metastore is wrong.
> ------------------------------------------------------------------
>
>                 Key: SPARK-22334
>                 URL: https://issues.apache.org/jira/browse/SPARK-22334
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: jin xing
>
> Currently Spark uses the table property 'totalSize' to decide whether to use a
> broadcast join. Table properties come from the metastore, but they can be
> wrong: Hive sometimes fails to update them even after producing data
> successfully (e.g. on a NameNode timeout; see
> https://github.com/apache/hive/blob/branch-1.2/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java#L180).
> If 'totalSize' in the table properties is much smaller than the table's real
> size on HDFS, Spark can launch a broadcast join by mistake and fail with an
> OOM.
> Could we add a defensive config that checks the size on HDFS whenever
> 'totalSize' is below {{spark.sql.autoBroadcastJoinThreshold}}?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
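The proposed defense could be sketched as below. This is only an illustration of the decision logic, not Spark's actual planner code: `HdfsSizeLookup`, `effectiveSize`, and the fake sizes are all hypothetical names; a real implementation would query something like Hadoop's `FileSystem.getContentSummary` for the path size.

```java
// Sketch of the proposed defense check (illustrative names, not Spark's API).
public class TableSizeCheck {

    // Hypothetical hook for asking HDFS how big the table's directory really is.
    interface HdfsSizeLookup {
        long sizeInBytes(String tablePath);
    }

    /**
     * Size to use for broadcast-join planning. If the metastore's 'totalSize'
     * is below the broadcast threshold, double-check against HDFS and take the
     * larger value, so a stale metastore entry cannot trigger a broadcast join
     * on a huge table. Above the threshold the metastore value is trusted,
     * since the plan will not broadcast either way.
     */
    static long effectiveSize(long metastoreTotalSize, long broadcastThreshold,
                              String tablePath, HdfsSizeLookup hdfs) {
        if (metastoreTotalSize < broadcastThreshold) {
            return Math.max(metastoreTotalSize, hdfs.sizeInBytes(tablePath));
        }
        return metastoreTotalSize;
    }

    public static void main(String[] args) {
        HdfsSizeLookup fakeHdfs = path -> 5_000_000_000L; // pretend HDFS reports 5 GB
        long threshold = 10L * 1024 * 1024;               // 10 MB, the default threshold
        // Stale metastore claims 1 KB; the check falls back to the HDFS size.
        System.out.println(effectiveSize(1024L, threshold, "/warehouse/t", fakeHdfs));
        // prints 5000000000
    }
}
```

The extra HDFS round trip only happens for tables that the metastore claims are small enough to broadcast, so the common (large-table) path pays no cost.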