GitHub user uzadude opened a pull request:

    https://github.com/apache/spark/pull/19683

    [SPARK-21657][SQL] optimize explode quadratic memory consumpation

    ## What changes were proposed in this pull request?
    
    The issue has been raised in two Jira tickets: 
[SPARK-21657](https://issues.apache.org/jira/browse/SPARK-21657), 
[SPARK-16998](https://issues.apache.org/jira/browse/SPARK-16998). Basically, 
what happens is that in collection generators like explode/inline we create 
many rows from each row. Currently each exploded row contains also the column 
on which it was created. This causes, for example, if we have a 10k array in 
one row that this array will get copy 10k times - to each of the row. this 
results a qudratic memory consumption. However, it is a common case that the 
original column gets projected out after the explode, so we can avoid 
duplicating it.
    In this solution we propose to identify this situation in the optimizer and 
turn on a flag for omitting the original column in the generation process.
    
    ## How was this patch tested?
    
    1. We added a benchmark test to MiscBenchmark that shows x16 improvement in 
runtimes.
    2. We ran some of the other tests in MiscBenchmark and they show 15% 
improvements.
    3. We ran this code on a specific case from our production data with rows 
containing arrays of size ~200k and it reduced the runtime from 6 hours to 3 
mins.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uzadude/spark optimize_explode

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19683.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19683
    
----
commit ce7c3694a99584348957dc756234bb667466be4e
Author: oraviv <ora...@paypal.com>
Date:   2017-11-07T11:34:21Z

    [SPARK-21657][SQL] optimize explode quadratic memory consumpation

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to