[ https://issues.apache.org/jira/browse/SPARK-24985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155839#comment-17155839 ]
Apache Spark commented on SPARK-24985: -------------------------------------- User 'sidedoorleftroad' has created a pull request for this issue: https://github.com/apache/spark/pull/29071 > Executing SQL with "Full Outer Join" on top of large tables when there is > data skew met OOM > ------------------------------------------------------------------------------------------- > > Key: SPARK-24985 > URL: https://issues.apache.org/jira/browse/SPARK-24985 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.1 > Reporter: sheperd huang > Priority: Major > > When we run SQL with "Full Outer Join" on large tables when there is data > skew, we found it's quite easy to hit OOM. We once thought we hit > https://issues.apache.org/jira/browse/SPARK-13450. But taking a look at fix > in [https://github.com/apache/spark/pull/16909,] we found that PR hasn't > handled the "Full Outer Join" case. > The root cause of the OOM is there are a lot of rows with the same key. > See below code: > {code:java} > private def findMatchingRows(matchingKey: InternalRow): Unit = { > leftMatches.clear() > rightMatches.clear() > leftIndex = 0 > rightIndex = 0 > while (leftRowKey != null && keyOrdering.compare(leftRowKey, matchingKey) > == 0) { > leftMatches += leftRow.copy() > advancedLeft() > } > while (rightRowKey != null && keyOrdering.compare(rightRowKey, matchingKey) > == 0) { > rightMatches += rightRow.copy() > advancedRight() > } > {code} > It seems we haven't limited the data added to leftMatches and rightMatches. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org