[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007479#comment-14007479 ]

Madhu Siddalingaiah commented on SPARK-983:
-------------------------------------------

I have the beginnings of a SortedIterator working for data that will fit in 
memory. It does more or less the same thing as partition sort in 
OrderedRDDFunctions, but it's an iterator. If we know that a partition cannot 
fit in memory, it's possible to split it up into chunks, sort each chunk, write 
to disk, and merge the chunks on disk.

To determine when to split, is it reasonable to use 
Runtime.getRuntime().freeMemory() / maxMemory() along with configuration 
limits? I could keep adding to an in-memory sortable list until some memory 
threshold is reached, then sort and spill to disk, repeating until all data has 
been sorted and spilled. After that it's a basic merge operation.
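The sort/spill/merge loop described above can be sketched roughly as follows. This is a minimal illustration only, sorting Ints with a fixed chunk size standing in for the memory threshold; ExternalSortSketch and externalSort are hypothetical names, not existing Spark code:

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable
import scala.io.Source

// Hypothetical sketch, not Spark code. Spills each sorted chunk to a
// temp file as text, then merges the spill files with a min-heap.
object ExternalSortSketch {
  def externalSort(input: Iterator[Int], chunkSize: Int): Iterator[Int] = {
    // Phase 1: take a fixed-size chunk at a time (standing in for a
    // memory threshold), sort it in memory, and spill it to disk.
    val spills = mutable.ArrayBuffer.empty[File]
    for (chunk <- input.grouped(chunkSize)) {
      val f = File.createTempFile("spill", ".txt")
      f.deleteOnExit()
      val w = new PrintWriter(f)
      chunk.sorted.foreach(w.println)
      w.close()
      spills += f
    }
    // Phase 2: k-way merge. The heap holds (head value, rest of spill),
    // ordered so the smallest head is dequeued first.
    val heap = mutable.PriorityQueue.empty[(Int, Iterator[Int])](
      Ordering.by[(Int, Iterator[Int]), Int](_._1).reverse)
    for (f <- spills) {
      val it = Source.fromFile(f).getLines().map(_.toInt)
      if (it.hasNext) heap.enqueue((it.next(), it))
    }
    // Lazily stream merged output, refilling the heap as spills drain.
    new Iterator[Int] {
      def hasNext: Boolean = heap.nonEmpty
      def next(): Int = {
        val (v, rest) = heap.dequeue()
        if (rest.hasNext) heap.enqueue((rest.next(), rest))
        v
      }
    }
  }
}
```

In a real implementation the fixed chunkSize would be replaced by the memory check discussed above, and records would be spilled in a binary serialized form rather than as text.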

Any comments?

> Support external sorting for RDD#sortByKey()
> --------------------------------------------
>
>                 Key: SPARK-983
>                 URL: https://issues.apache.org/jira/browse/SPARK-983
>             Project: Spark
>          Issue Type: New Feature
>    Affects Versions: 0.9.0
>            Reporter: Reynold Xin
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
