GitHub user jinxing64 opened a pull request: https://github.com/apache/spark/pull/19560
[SPARK-22334][SQL] Check table size from HDFS in case the size in metastore is wrong.

## What changes were proposed in this pull request?

Currently we use the table property `totalSize` to decide whether to use a broadcast join. Table properties come from the metastore, but they can be wrong: Hive sometimes fails to update table properties even after producing data successfully (e.g. on a NameNode timeout, see https://github.com/apache/hive/blob/branch-1.2/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java#L180). If `totalSize` in the table properties is much smaller than the table's real size on HDFS, Spark can launch a broadcast join by mistake and suffer an OOM. Could we add a defense config and check the size from HDFS when `totalSize` is below `spark.sql.autoBroadcastJoinThreshold`?

## How was this patch tested?

Tests added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinxing64/spark SPARK-22334

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19560.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19560

----

commit 98975adf129c2359156e97e338f0e3e4f623372b
Author: jinxing <jinxing6...@126.com>
Date: 2017-10-23T13:02:42Z

    Check table size from HDFS in case the size in metastore is wrong.

----
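The guard proposed above can be sketched as follows. This is a minimal, self-contained illustration, not the actual patch: `effectiveTableSize`, `verifyFromHdfs`, and `fetchHdfsSize` are hypothetical names, and in Spark the real on-disk size would come from something like Hadoop's `FileSystem.getContentSummary` rather than a plain callback.

```scala
// Hedged sketch of the defense check: only consult HDFS when the
// metastore size is small enough to trigger a broadcast join, since an
// undersized 'totalSize' is exactly what can cause an accidental OOM.
def effectiveTableSize(
    metastoreSize: Long,
    broadcastThreshold: Long,        // e.g. spark.sql.autoBroadcastJoinThreshold
    verifyFromHdfs: Boolean)         // hypothetical defense config flag
    (fetchHdfsSize: => Long): Long = {
  if (verifyFromHdfs && metastoreSize <= broadcastThreshold) {
    // Trust the larger of the two values: a stale metastore entry is
    // usually too small, never too large.
    math.max(metastoreSize, fetchHdfsSize)
  } else {
    // Metastore size already rules out a broadcast join, or the
    // defense config is off; skip the extra NameNode round trip.
    metastoreSize
  }
}
```

Keeping the HDFS lookup lazy (a by-name parameter here) matters: `getContentSummary` walks the table's directory on the NameNode, so the check should only run when the metastore value would actually change the join strategy.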