[ https://issues.apache.org/jira/browse/SPARK-38867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-38867:
------------------------------------

    Assignee: Apache Spark

> Avoid OOM when bufferedPlan has a lot of duplicate keys in SortMergeJoin codegen
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-38867
>                 URL: https://issues.apache.org/jira/browse/SPARK-38867
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: mcdull_zhang
>            Assignee: Apache Spark
>            Priority: Minor
>
> WholeStageCodegenExec is wrapped in a BufferedRowIterator, which uses a
> LinkedList to hold the output of WholeStageCodegenExec.
> When the parent of SortMergeJoin does not support codegen, SortMergeJoin must
> append its output to this LinkedList.
> SortMergeJoin processes one streamedPlan record at a time. If every record in
> bufferedPlan matches that record, all of the bufferedPlan records are saved in
> the LinkedList before the parent can consume any of them, resulting in OOM.
> This situation is very common in our internal use, so it would be best to add
> a configuration to the generated code: once the LinkedList holds enough rows,
> pause SortMergeJoin and let the parent consume them first.
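
The proposal in the description (pause the producer once the LinkedList is large enough, then let the parent drain it) can be illustrated with a small standalone Java sketch. This is a toy model under stated assumptions, not Spark's generated code: the class JoinBufferSketch, the method peakQueueSize, and the parameter maxBufferedRows are hypothetical names, and neither BufferedRowIterator nor the real SortMergeJoin codegen is used.

```java
import java.util.LinkedList;

// Toy model (hypothetical names): one streamed row matches `matches`
// buffered rows, and each match is appended to an output queue that a
// non-codegen parent drains.
class JoinBufferSketch {

    // Returns the largest number of rows the queue held at any point.
    // maxBufferedRows is an assumed threshold; Integer.MAX_VALUE models
    // the unbounded behaviour described in the issue.
    static int peakQueueSize(int matches, int maxBufferedRows) {
        if (maxBufferedRows < 1) {
            throw new IllegalArgumentException("maxBufferedRows must be >= 1");
        }
        LinkedList<Integer> queue = new LinkedList<>();
        int peak = 0;
        int next = 0; // next match to emit; kept across pauses so the join can resume

        while (next < matches || !queue.isEmpty()) {
            // Producer: append matches until done or the threshold is reached.
            while (next < matches && queue.size() < maxBufferedRows) {
                queue.add(next++);
                peak = Math.max(peak, queue.size());
            }
            // Parent: consume everything produced so far.
            while (!queue.isEmpty()) {
                queue.remove();
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        // Unbounded: every match is queued before the parent runs.
        System.out.println(peakQueueSize(100_000, Integer.MAX_VALUE)); // 100000
        // Bounded: the producer pauses at the threshold.
        System.out.println(peakQueueSize(100_000, 1024)); // 1024
    }
}
```

The point the sketch makes is that the producer has to remember where it stopped (here the index next), so that it can resume after the parent has drained the queue. With that in place, peak memory is bounded by the threshold rather than by the number of duplicate keys.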