[ 
https://issues.apache.org/jira/browse/SPARK-13872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194014#comment-15194014
 ] 

Ian edited comment on SPARK-13872 at 3/14/16 7:53 PM:
------------------------------------------------------

A Spark plan illustrating the scenario has been attached.
1. A Cartesian product is probably a poor choice in the first place, but it is a valid query. 
    Slow performance is expected, but not an OOM.
2. The plan is also unusual in that the SortMergeJoin runs in the same stage as the 
CartesianProduct. If the SortMergeJoin ran in a separate 
stage, the OOM could be avoided. On one hand, from a query planning point of view, 
is it optimal to run the SortMergeJoin and the CartesianProduct in the same stage? 
On the other hand, from a correctness point of view, no matter how the 
execution is planned, an OOM should not happen.



> Memory leak in SortMergeOuterJoin
> ---------------------------------
>
>                 Key: SPARK-13872
>                 URL: https://issues.apache.org/jira/browse/SPARK-13872
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Ian
>         Attachments: Screen Shot 2016-03-11 at 5.42.32 PM.png
>
>
> SortMergeJoin composes its partition/iterator from 
> org.apache.spark.sql.execution.Sort, which in turn delegates the sorting to 
> UnsafeExternalRowSorter.
> UnsafeExternalRowSorter's implementation cleans up its resources when:
> 1. org.apache.spark.sql.catalyst.util.AbstractScalaRowIterator is fully 
> iterated.
> 2. the task finishes execution.
> In the outer join case of SortMergeJoin, when the left or right iterator is not 
> fully iterated, the only chance for the resources to be cleaned up is at the 
> end of the Spark task run. This is probably fine most of the time. However, when a 
> SortMergeOuterJoin is nested within a CartesianProduct, the "deferred" 
> resource cleanup becomes a non-negligible memory leak, amplified by the 
> repeated iteration driven by CartesianRDD.
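
The two cleanup paths described above, and the way a surrounding loop amplifies 
deferred cleanup, can be modelled with a small toy simulation. This is a minimal 
Python sketch and not Spark code: TaskContext, SorterIterator and run_task are 
hypothetical names invented for illustration, "memory" is just a count of 
buffered rows, the row counts are made up, and the close_early flag is an 
assumed remedy rather than the fix adopted by the project.

```python
from itertools import islice


class TaskContext:
    """Hypothetical stand-in for a task: tracks rows held by live sorters."""

    def __init__(self):
        self.held = 0       # rows currently buffered across all sorters
        self.peak = 0       # high-water mark of `held`
        self._sorters = []

    def allocate(self, sorter, rows):
        self.held += rows
        self.peak = max(self.peak, self.held)
        self._sorters.append(sorter)

    def release(self, rows):
        self.held -= rows

    def complete(self):
        # Cleanup path 2: anything still alive is freed when the task ends.
        for sorter in self._sorters:
            sorter.free()


class SorterIterator:
    """Hypothetical iterator that buffers sorted rows until it is freed."""

    def __init__(self, rows, ctx):
        self._rows = sorted(rows)
        self._pos = 0
        self._ctx = ctx
        self._freed = False
        ctx.allocate(self, len(self._rows))

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._rows):
            # Cleanup path 1: the iterator has been fully iterated.
            self.free()
            raise StopIteration
        row = self._rows[self._pos]
        self._pos += 1
        return row

    def free(self):
        # Idempotent, so both cleanup paths can call it safely.
        if not self._freed:
            self._freed = True
            self._ctx.release(len(self._rows))
            self._rows = []


def run_task(outer_rows, inner_rows, consume=None, close_early=False):
    """Return (peak rows held, rows held after the task) for one simulated task.

    consume=None drains each inner iterator; an integer stops after that many
    rows, as an outer join may when the other side is exhausted first.
    """
    ctx = TaskContext()
    for _ in range(outer_rows):          # loop driven by the Cartesian product
        it = SorterIterator(range(inner_rows), ctx)
        for _ in islice(it, consume):
            pass
        if close_early:
            it.free()                    # assumed remedy: free when abandoned
    peak = ctx.peak
    ctx.complete()
    return peak, ctx.held


if __name__ == "__main__":
    print(run_task(100, 1000, consume=10))
    print(run_task(100, 1000, consume=10, close_early=True))
    print(run_task(100, 1000))
```

With 100 outer iterations and 1000 buffered rows per sorter, abandoning each 
iterator after 10 rows gives a peak of 100000 held rows, while draining it or 
freeing it early keeps the peak at 1000. In all three cases nothing is held 
once the task completes, which matches the observation that the resources are 
eventually released but too late to prevent the OOM.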



