[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14080385#comment-14080385 ]
Matei Zaharia commented on SPARK-983:
-------------------------------------

Now that the ExternalSorter class from SPARK-2045 is in, I've submitted a much smaller PR that reuses it: https://github.com/apache/spark/pull/1677. Thanks, both [~msiddalingaiah] and [~andrew xia], for your previous patches on this.

> Support external sorting for RDD#sortByKey()
> --------------------------------------------
>
>                 Key: SPARK-983
>                 URL: https://issues.apache.org/jira/browse/SPARK-983
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 0.9.0
>            Reporter: Reynold Xin
>            Priority: Critical
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a
> buffer to hold the entire partition, then sorts it. This will cause an OOM if
> an entire partition cannot fit in memory, which is especially problematic for
> skewed data. Rather than OOMing, the behavior should be similar to the
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
> where we fall back to disk if we detect memory pressure.
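
For readers unfamiliar with the technique the issue asks for, here is a minimal sketch of an external sort by key, written in Python for brevity. It is not Spark's ExternalSorter and does not reflect the code in the PR: the names `external_sort_by_key`, `_spill`, `_read_run` and the `max_in_memory` parameter are hypothetical, and a simple record-count threshold stands in for real memory-pressure detection. The idea is to buffer records, spill a sorted run to disk whenever the buffer fills, and merge all runs at the end.

```python
import heapq
import pickle
import tempfile


def _spill(sorted_records):
    """Write one sorted run to a temporary file and return the open file."""
    run = tempfile.TemporaryFile()
    for record in sorted_records:
        pickle.dump(record, run)
    run.seek(0)
    return run


def _read_run(run):
    """Yield the records of a spilled run in the order they were written."""
    try:
        while True:
            try:
                yield pickle.load(run)
            except EOFError:
                return
    finally:
        run.close()


def external_sort_by_key(records, max_in_memory=100_000):
    """Return an iterator over (key, value) pairs in key order.

    Assumption: a record count (max_in_memory) is used as a stand-in for
    memory pressure; a real implementation would estimate buffer size.
    """
    runs = []
    buffer = []
    for record in records:
        buffer.append(record)
        if len(buffer) >= max_in_memory:
            # Buffer is "full": sort it and spill it to disk as one run.
            buffer.sort(key=lambda kv: kv[0])
            runs.append(_spill(buffer))
            buffer = []
    # The remaining records stay in memory as the final run.
    buffer.sort(key=lambda kv: kv[0])
    streams = [_read_run(run) for run in runs] + [iter(buffer)]
    # k-way merge of the sorted runs, reading each run lazily.
    return heapq.merge(*streams, key=lambda kv: kv[0])
```

For example, `list(external_sort_by_key([(3, "c"), (1, "a"), (2, "b")], max_in_memory=2))` spills one run of two records and returns `[(1, 'a'), (2, 'b'), (3, 'c')]`. Only one buffer's worth of records plus one record per spilled run needs to be held in memory at a time, which is what keeps a skewed partition from exhausting the heap.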