mcdull_zhang created SPARK-38867:
------------------------------------

             Summary: Avoid OOM when bufferedPlan has a lot of duplicate keys in SortMergeJoin codegen
                 Key: SPARK-38867
                 URL: https://issues.apache.org/jira/browse/SPARK-38867
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.1
            Reporter: mcdull_zhang

WholeStageCodegenExec is wrapped in a BufferedRowIterator, which uses a LinkedList to hold the output rows of WholeStageCodegenExec. When the parent of SortMergeJoin does not support codegen, SortMergeJoin has to append its output rows to this LinkedList.

SortMergeJoin processes one record from streamedPlan at a time. If every record in bufferedPlan matches that record, all records in bufferedPlan are saved in the LinkedList at once, which can result in an OOM.

This situation is very common in our internal use, so it would be best to add a configuration to the generated code: once the LinkedList holds enough rows, SortMergeJoin stops and lets the parent consume them first.
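
The proposed behaviour can be illustrated with a small standalone model in plain Java. This is only a sketch of the idea and not Spark's generated code: the class `BoundedJoinOutput`, the `maxPendingRows` limit (standing in for the proposed configuration, whose name the issue does not specify), and the use of strings as rows are all hypothetical. For simplicity it assumes every buffered record matches every streamed record, which is the worst case described above.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical model of a join that appends output rows to a LinkedList
// and pauses once the list holds maxPendingRows rows.
final class BoundedJoinOutput {
    private final LinkedList<String> currentRows = new LinkedList<>();
    private final Iterator<String> streamedRows;
    private final List<String> bufferedMatches;
    private final int maxPendingRows; // hypothetical limit, not a real Spark config
    private String streamedRow;       // streamed record being joined, or null
    private int nextMatch;            // resume position within bufferedMatches
    private int peakPendingRows;      // largest list size observed so far

    BoundedJoinOutput(List<String> streamed, List<String> bufferedMatches, int maxPendingRows) {
        if (maxPendingRows < 1) {
            throw new IllegalArgumentException("maxPendingRows must be >= 1");
        }
        this.streamedRows = streamed.iterator();
        this.bufferedMatches = bufferedMatches;
        this.maxPendingRows = maxPendingRows;
    }

    // Produces join output until the list is full or the input is exhausted.
    // The join position (streamedRow, nextMatch) is kept in fields so the
    // next call resumes where this one stopped.
    private void processNext() {
        while (currentRows.size() < maxPendingRows) {
            if (streamedRow == null || nextMatch >= bufferedMatches.size()) {
                if (!streamedRows.hasNext()) {
                    streamedRow = null;
                    return; // no more input
                }
                streamedRow = streamedRows.next();
                nextMatch = 0;
            }
            if (nextMatch < bufferedMatches.size()) {
                currentRows.add(streamedRow + "-" + bufferedMatches.get(nextMatch++));
                peakPendingRows = Math.max(peakPendingRows, currentRows.size());
            }
        }
    }

    boolean hasNext() {
        if (currentRows.isEmpty()) {
            processNext(); // refill only after the parent has drained the list
        }
        return !currentRows.isEmpty();
    }

    String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return currentRows.remove();
    }

    int peakPendingRows() {
        return peakPendingRows;
    }

    public static void main(String[] args) {
        // 3 streamed records, each matching 1000 buffered records.
        BoundedJoinOutput out = new BoundedJoinOutput(
            List.of("s0", "s1", "s2"), Collections.nCopies(1000, "b"), 16);
        int total = 0;
        while (out.hasNext()) {
            out.next();
            total++;
        }
        System.out.println("rows=" + total + " peak=" + out.peakPendingRows());
    }
}
```

In this model the parent still receives every joined row, but the list never grows beyond `maxPendingRows`, regardless of how many buffered records share the same key. The key design point is that the join position must be saved between calls so that production can resume mid-match after the parent drains the list.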