[ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009875#comment-14009875
 ] 

Mark Hamstra edited comment on SPARK-983 at 5/27/14 4:40 PM:
-------------------------------------------------------------

I'm hoping these can be kept orthogonal, but I think that it is worth noting 
the existence of SPARK-1021 and the fact that sortByKey as it currently exists 
breaks Spark's "transformations of RDDs are lazy" contract.  I'm currently 
working on that issue, which is undoubtedly going to require at least some 
merge work to be compatible with the resolution of this issue.


was (Author: markhamstra):
I'm hoping these can be kept orthogonal, but I think that it is worth noting 
the existence of https://issues.apache.org/jira/browse/SPARK-1021 and the fact 
that sortByKey as it currently exists breaks Spark's "transformations of RDDs 
are lazy" contract.  I'm currently working on that issue, which is undoubtedly 
going to require at least some merge work to be compatible with the resolution 
of this issue.

> Support external sorting for RDD#sortByKey()
> --------------------------------------------
>
>                 Key: SPARK-983
>                 URL: https://issues.apache.org/jira/browse/SPARK-983
>             Project: Spark
>          Issue Type: New Feature
>    Affects Versions: 0.9.0
>            Reporter: Reynold Xin
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.
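
For readers unfamiliar with the technique the description asks for, the following is a minimal sketch of an external merge sort over (key, value) pairs, written in plain Python rather than against Spark's API. It is not Spark's implementation and not the design proposed in this issue. The names external_sort_by_key, _spill, _read_run and max_in_memory are hypothetical, and a fixed record count stands in for real memory-pressure detection, which is an assumption made only to keep the example short.

```python
import heapq
import pickle
import tempfile
from itertools import islice


def _spill(sorted_run):
    """Write one already-sorted run to a temporary file and rewind it."""
    f = tempfile.TemporaryFile()
    for record in sorted_run:
        pickle.dump(record, f)
    f.seek(0)
    return f


def _read_run(f):
    """Lazily yield the records of a spilled run; close the file when done."""
    try:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return
    finally:
        f.close()


def external_sort_by_key(records, max_in_memory=10000):
    """Return an iterator over (key, value) pairs ordered by key.

    Assumption: the buffer is capped at a fixed record count
    (max_in_memory) instead of reacting to measured memory pressure.
    Every run is spilled, even if the whole input would fit in one buffer.
    """
    records = iter(records)
    runs = []
    while True:
        # Buffer only a bounded slice of the partition, never all of it.
        buffer = list(islice(records, max_in_memory))
        if not buffer:
            break
        buffer.sort(key=lambda kv: kv[0])
        runs.append(_spill(buffer))
    # k-way merge of the sorted runs; only one record per run is held at a time.
    return heapq.merge(*(_read_run(f) for f in runs), key=lambda kv: kv[0])
```

Calling external_sort_by_key(pairs, max_in_memory=3) on seven pairs would produce three spilled runs that are merged lazily as the result is consumed. A real implementation would presumably spill only when needed and would clean up files on failure; both are left out here.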



--
This message was sent by Atlassian JIRA
(v6.2#6252)
