GitHub user jinxing64 opened a pull request:

    https://github.com/apache/spark/pull/19560

    [SPARK-22334][SQL] Check table size from HDFS in case the size in metastore is wrong

    ## What changes were proposed in this pull request?
    
    Currently, Spark uses the table property 'totalSize' to decide whether to use a
broadcast join. Table properties come from the metastore, but they can be wrong:
Hive sometimes fails to update them even after producing data successfully
(e.g. on a NameNode timeout, see
https://github.com/apache/hive/blob/branch-1.2/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java#L180).
If 'totalSize' in the table properties is much smaller than the table's real size
on HDFS, Spark can mistakenly launch a broadcast join and suffer an OOM.
    Could we add a defensive config and check the size from HDFS when 'totalSize'
is below `spark.sql.autoBroadcastJoinThreshold`? A sketch of the idea follows.
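    
    A minimal sketch of the proposed defensive check, assuming the standard Hadoop
`FileSystem.getContentSummary` API; the helper name and parameters below are
illustrative, not identifiers from this patch:
    
    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    
    // Illustrative helper: pick the size used for the broadcast-join decision,
    // falling back to the real on-disk size when the metastore value is small
    // enough that it could trigger a (possibly wrong) broadcast join.
    def effectiveTableSize(
        metastoreTotalSize: Long,
        tableLocation: Path,
        hadoopConf: Configuration,
        broadcastThreshold: Long): Long = {
      if (metastoreTotalSize >= broadcastThreshold) {
        // The metastore size already rules out a broadcast join; no need to hit HDFS.
        metastoreTotalSize
      } else {
        // 'totalSize' may be stale (e.g. Hive failed to update it), so verify
        // against the file system before allowing a broadcast join.
        val fs = tableLocation.getFileSystem(hadoopConf)
        val sizeOnDisk = fs.getContentSummary(tableLocation).getLength
        math.max(metastoreTotalSize, sizeOnDisk)
      }
    }
    ```
    
    Since `getContentSummary` walks the table's directory tree on the NameNode,
gating the check behind a config and only issuing it when the metastore size
falls under the threshold keeps the common path cheap.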
    
    ## How was this patch tested?
    
    Tests added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinxing64/spark SPARK-22334

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19560.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19560
    
----
commit 98975adf129c2359156e97e338f0e3e4f623372b
Author: jinxing <jinxing6...@126.com>
Date:   2017-10-23T13:02:42Z

    Check table size from HDFS in case the size in metastore is wrong.

----

