[ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006664#comment-14006664
 ] 

Madhu Siddalingaiah commented on SPARK-983:
-------------------------------------------

[OrderedRDDFunctions|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala]
currently does a shuffle step using RangePartitioner, then an in-memory sort of 
each partition by key. If we separate the partition sort and make it available as 
an independent API call, it could serve two purposes: sortByKey() and 
sortPartitions(). Then we could improve sortPartitions() to fall back to disk 
like 
[ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala].
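
To make the idea concrete, here is a rough sketch in plain Python (standard library only, not Spark code). The names sort_partition, sort_by_key, bounds and max_in_memory are hypothetical, and the record-count threshold is only a stand-in for real memory-pressure detection. It shows the partition sort as a standalone step that spills sorted runs to disk and merges them, with sort_by_key built on top of it after a range-partitioning step:

```python
import heapq
import pickle
import tempfile
from bisect import bisect_left
from operator import itemgetter


def _spill(sorted_records):
    """Write one sorted run to a temporary file and return the open file."""
    f = tempfile.TemporaryFile()
    for record in sorted_records:
        pickle.dump(record, f)
    return f


def _read_run(f):
    """Yield the records of one spilled run in the order they were written."""
    f.seek(0)
    try:
        while True:
            yield pickle.load(f)
    except EOFError:
        f.close()  # closing a TemporaryFile also deletes it


def sort_partition(records, max_in_memory=10000):
    """Sort one partition of (key, value) pairs by key.

    max_in_memory is a hypothetical record-count limit standing in for
    real memory-pressure detection.
    """
    key = itemgetter(0)
    runs, buffer = [], []
    for record in records:
        buffer.append(record)
        if len(buffer) >= max_in_memory:
            # Buffer is "full": sort it, spill it as one run, start over.
            buffer.sort(key=key)
            runs.append(_spill(buffer))
            buffer = []
    buffer.sort(key=key)
    if not runs:
        return iter(buffer)  # everything fit in memory
    # k-way merge of the spilled runs plus the in-memory remainder.
    return heapq.merge(*[_read_run(f) for f in runs], buffer, key=key)


def sort_by_key(records, bounds, max_in_memory=10000):
    """Range-partition by key, then sort each partition independently.

    bounds is a sorted list of split points playing the role of
    RangePartitioner: partition i receives keys <= bounds[i], and the
    last partition receives everything larger.
    """
    partitions = [[] for _ in range(len(bounds) + 1)]
    for record in records:
        partitions[bisect_left(bounds, record[0])].append(record)
    for part in partitions:
        yield from sort_partition(part, max_in_memory)
```

For example, list(sort_by_key(pairs, bounds=[250, 500, 750], max_in_memory=100)) returns the pairs in key order while never sorting more than 100 records at a time in memory. A caller that only needs per-partition order would call sort_partition directly and skip the range-partitioning step.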

The above approach would address this JIRA feature and support the equivalent 
of Hadoop secondary sort in a scalable way. There are plenty of time 
series-like use cases that could benefit from it. There's a lot more to it, but 
I'll code something up locally and see how it goes...
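
As an illustration of the secondary-sort use case, here is a minimal sketch in plain Python (again not Spark code). The (series_id, timestamp, value) record layout and the names secondary_sort and num_partitions are assumptions made for the example. Records are partitioned on the primary key only, then each partition is sorted on the composite key, so each series comes out contiguous and in time order:

```python
from itertools import groupby
from operator import itemgetter


def secondary_sort(records, num_partitions=4):
    """Group (series_id, timestamp, value) records by series_id.

    Yields (series_id, [(timestamp, value), ...]) with the points of each
    series in timestamp order.
    """
    # Partition on the primary key only, so a series never straddles
    # two partitions.
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[hash(rec[0]) % num_partitions].append(rec)

    for part in partitions:
        # Sort within the partition on the composite (primary, secondary)
        # key. This in-memory sort marks the spot where a partition sort
        # that spills to disk would go.
        part.sort(key=itemgetter(0, 1))
        # Equal series_ids are now adjacent, so groups can be streamed
        # one at a time instead of being collected in a map.
        for series_id, group in groupby(part, key=itemgetter(0)):
            yield series_id, [(ts, value) for _, ts, value in group]
```

The order in which series are emitted depends on the hash partitioning, but the points inside each series are always in timestamp order, which is the property time-series processing usually needs.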

> Support external sorting for RDD#sortByKey()
> --------------------------------------------
>
>                 Key: SPARK-983
>                 URL: https://issues.apache.org/jira/browse/SPARK-983
>             Project: Spark
>          Issue Type: New Feature
>    Affects Versions: 0.9.0
>            Reporter: Reynold Xin
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
