mcdull_zhang created SPARK-38867:
------------------------------------

             Summary: Avoid OOM when bufferedPlan has a lot of duplicate keys 
in SortMergeJoin codegen
                 Key: SPARK-38867
                 URL: https://issues.apache.org/jira/browse/SPARK-38867
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.1
            Reporter: mcdull_zhang


WholeStageCodegenExec is wrapped in BufferedRowIterator.

BufferedRowIterator uses a LinkedList to hold the output of 
WholeStageCodegenExec.

When the parent of SortMergeJoin cannot codegen, SortMergeJoin needs to append 
the output to this LinkedList.

SortMergeJoin processes a record in streamedPlan each time. If all records in 
bufferedPlan can match this record, all records in bufferedPlan will be saved 
in LinkedList, resulting in OOM.

The above situation is very common in our internal use, so it is best to add a 
configuration to the codegen code. If there are enough pieces in the 
LinkedList, stop SortMergeJoin and let the parent consume it first.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to