GitHub user sujithjay opened a pull request:

    https://github.com/apache/spark/pull/22168

    [SPARK-24985][SQL][WIP] Fix OOM in Full Outer Join in case of data skew

    ## What issue does this pull request address?
    JIRA: [https://issues.apache.org/jira/browse/SPARK-24985](https://issues.apache.org/jira/browse/SPARK-24985)
    In Full Outer Joins of large tables, data skew around the join keys of
    either joined table can cause OOM exceptions. While it is possible to work
    around this by increasing the heap size, Spark should be resilient to such
    issues, since skew can occur arbitrarily.
    
    ## What changes were proposed in this pull request?
    
    #16909 introduced `ExternalAppendOnlyUnsafeRowArray` and changed
    `SortMergeJoinExec` to use it for every join type except Full Outer Join.
    This PR changes Full Outer Join to use `ExternalAppendOnlyUnsafeRowArray`
    as well.
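
    To illustrate the general idea of a spillable append-only buffer, here is a
    minimal Python sketch. It is not Spark's implementation: the class
    `SpillableRowBuffer` and its parameter `in_memory_threshold` are hypothetical
    names chosen for illustration, and pickling to a temporary file stands in
    for whatever spill mechanism the real array uses.

```python
import os
import pickle
import tempfile


class SpillableRowBuffer:
    """Append-only row buffer that spills to disk past a row-count threshold.

    Hypothetical illustration only; not the Spark class or its API.
    """

    def __init__(self, in_memory_threshold):
        # Maximum number of rows held in memory before a spill (assumed name).
        self.in_memory_threshold = in_memory_threshold
        self._rows = []          # rows currently held in memory
        self._spill_file = None  # created lazily on the first spill
        self._spilled = 0        # number of rows already written to disk

    def add(self, row):
        # Spill the in-memory rows first if the buffer is already full.
        if len(self._rows) >= self.in_memory_threshold:
            self._spill()
        self._rows.append(row)

    def _spill(self):
        if self._spill_file is None:
            self._spill_file = tempfile.TemporaryFile()
        # Always append at the end, even if a reader moved the file position.
        self._spill_file.seek(0, os.SEEK_END)
        for row in self._rows:
            pickle.dump(row, self._spill_file)
        self._spilled += len(self._rows)
        self._rows = []

    @property
    def spilled(self):
        return self._spilled > 0

    def __len__(self):
        return self._spilled + len(self._rows)

    def __iter__(self):
        # Yield spilled rows first, then in-memory rows, in insertion order.
        if self._spill_file is not None:
            self._spill_file.seek(0)
            for _ in range(self._spilled):
                yield pickle.load(self._spill_file)
        yield from list(self._rows)

    def close(self):
        if self._spill_file is not None:
            self._spill_file.close()
            self._spill_file = None


# Usage: a skewed join key may produce many matching rows; memory use stays
# bounded by the threshold because older rows are written to disk.
buf = SpillableRowBuffer(in_memory_threshold=3)
for i in range(10):
    buf.add((i, "v%d" % i))
rows = list(buf)
buf.close()
```

    In this sketch, at most `in_memory_threshold` rows are held in memory at
    any time, which is the property that keeps a heavily skewed key from
    exhausting the heap.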
    
    ## How was this patch tested?
    #### Unit testing
    - Changed a test case in `JoinSuite`.
    - Existing tests in `OuterJoinSuite` were used to verify correctness.
    
    #### Stress testing
    - This is still a work in progress. I plan to verify this patch using a
    production workload.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sujithjay/spark SPARK-24985

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22168.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22168
    
----
commit ce70a4ef4d6410d0a738a5440dd2b7d91c7e4822
Author: sujithjay <sujith@...>
Date:   2018-08-21T08:20:48Z

    [SPARK-24985][SQL] Fix OOM in Full Outer Join in presence of data skew.
    
    Change SortMergeJoinExec to use ExternalAppendOnlyUnsafeRowArray for Full
    Outer Join. This spills data to disk if the number of buffered rows exceeds
    a threshold, thus preventing OOM errors.
    
    Change corresponding test case in JoinSuite.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
