[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028715#comment-14028715 ]
Madhu Siddalingaiah edited comment on SPARK-983 at 6/12/14 1:52 AM:
--------------------------------------------------------------------

[Aaron Davidson|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ilikerps], can you make a recommendation on how to fill in this [fitsInMemory|https://github.com/msiddalingaiah/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/SortedParitionsRDD.scala#L78] method? I have the disk spill/merge all working, I just need to complete the spill condition. I looked at [SizeTrackingAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/SizeTrackingAppendOnlyMap.scala], but it's not completely clear to me how it's working.

Thanks!


was (Author: msiddalingaiah):
[Aaron Davidson|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ilikerps], can you make a recommendation on how to fill in this method?

https://github.com/msiddalingaiah/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/SortedParitionsRDD.scala#L78

I have the disk spill/merge all working, I just need to complete the spill condition. I looked at [SizeTrackingAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/SizeTrackingAppendOnlyMap.scala], but it's not completely clear to me how it's working.

Thanks!

> Support external sorting for RDD#sortByKey()
> --------------------------------------------
>
>                 Key: SPARK-983
>                 URL: https://issues.apache.org/jira/browse/SPARK-983
>             Project: Spark
>          Issue Type: New Feature
>    Affects Versions: 0.9.0
>            Reporter: Reynold Xin
>            Assignee: Madhu Siddalingaiah
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a
> buffer to hold the entire partition, then sorts it. This will cause an OOM if
> an entire partition cannot fit in memory, which is especially problematic for
> skewed data.
> Rather than OOMing, the behavior should be similar to the
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
> where we fall back to disk if we detect memory pressure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
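
For readers following the issue, the spill/merge approach described above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in SortedParitionsRDD.scala and not how SizeTrackingAppendOnlyMap estimates size. It assumes plain Int keys, and the `fitsInMemory` shown here is a hypothetical stand-in that uses a fixed element budget (`maxInMemory`) in place of a real memory estimate. All names in the block (`ExternalSortSketch`, `spill`, `readRun`, `merge`) are made up for the example.

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream,
  DataOutputStream, EOFException, File, FileInputStream, FileOutputStream}
import scala.collection.mutable

object ExternalSortSketch {
  // Hypothetical spill condition: an element-count budget stands in for a
  // real estimate of the buffer's memory footprint.
  def fitsInMemory(buffered: Int, maxInMemory: Int): Boolean =
    buffered < maxInMemory

  // Write one sorted run to a temporary file (removed when the JVM exits).
  private def spill(run: Array[Int]): File = {
    val file = File.createTempFile("sort-run-", ".bin")
    file.deleteOnExit()
    val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(file)))
    try run.foreach(v => out.writeInt(v)) finally out.close()
    file
  }

  // Read a spilled run back lazily; the stream is closed at end of file.
  private def readRun(file: File): Iterator[Int] = new Iterator[Int] {
    private val in = new DataInputStream(new BufferedInputStream(new FileInputStream(file)))
    private var nextValue: Option[Int] = advance()
    private def advance(): Option[Int] =
      try Some(in.readInt()) catch { case _: EOFException => in.close(); None }
    def hasNext: Boolean = nextValue.isDefined
    def next(): Int = { val v = nextValue.get; nextValue = advance(); v }
  }

  // K-way merge of sorted iterators using a min-heap keyed on each head value.
  private def merge(sources: Seq[Iterator[Int]]): Iterator[Int] = {
    // PriorityQueue is a max-heap, so the ordering is reversed.
    val ord = Ordering.by[(Int, Iterator[Int]), Int](_._1).reverse
    val heap = mutable.PriorityQueue.empty[(Int, Iterator[Int])](ord)
    sources.filter(_.hasNext).foreach(it => heap.enqueue((it.next(), it)))
    new Iterator[Int] {
      def hasNext: Boolean = heap.nonEmpty
      def next(): Int = {
        val (value, it) = heap.dequeue()
        if (it.hasNext) heap.enqueue((it.next(), it))
        value
      }
    }
  }

  // Buffer input until the spill condition fails, spill a sorted run,
  // then merge all spilled runs with whatever is still in memory.
  def sort(input: Iterator[Int], maxInMemory: Int): Iterator[Int] = {
    val runs = mutable.ArrayBuffer.empty[File]
    val buffer = mutable.ArrayBuffer.empty[Int]
    for (x <- input) {
      if (!fitsInMemory(buffer.size, maxInMemory)) {
        runs += spill(buffer.toArray.sorted)
        buffer.clear()
      }
      buffer += x
    }
    val inMemory = buffer.toArray.sorted.iterator
    if (runs.isEmpty) inMemory
    else merge(runs.map(f => readRun(f)).toSeq :+ inMemory)
  }
}
```

Calling `ExternalSortSketch.sort(Iterator(5, 3, 9, 1, 4, 8, 2), 3)` spills two runs of three elements each and merges them with the one element left in memory. A real spill condition would replace the element count with an estimate of the bytes held by the buffer, which is the open question in the comment above.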