[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()

Aaron Davidson (JIRA) Sun, 25 May 2014 20:52:21 -0700

    [ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008557#comment-14008557 ]


Aaron Davidson commented on SPARK-983:
--------------------------------------

[~pwendell] or [~matei], any opinions on memory management best practices? 
Adding a new memoryFraction for sorting will only exacerbate the problems we 
already see with the existing memoryFraction settings, but I'm not sure we can 
rely on Runtime.freeMemory() even as an interim solution. 

Perhaps this feature could draw from the same pool as shuffle.memoryFraction, 
as it's used for a similar purpose, and that pool already implements some 
notion of memory sharing.
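
To make the shared-pool idea concrete, here is a minimal sketch of a pool that 
both shuffle and sort could draw from. This is not Spark's actual 
implementation: the class name SharedMemoryPool, its methods, and the 0.3 
fraction are all hypothetical. It sizes the pool from Runtime.maxMemory() 
rather than Runtime.freeMemory(), on the assumption that freeMemory() is 
unreliable because it only covers the currently allocated heap and does not 
count garbage that has not been collected yet.

```java
// Hypothetical sketch of a memory pool shared by shuffle and sort.
// None of these names come from Spark; they are illustrative only.
class SharedMemoryPool {
    private final long capacityBytes;
    private long usedBytes = 0;

    SharedMemoryPool(long capacityBytes) {
        if (capacityBytes < 0) throw new IllegalArgumentException("negative capacity");
        this.capacityBytes = capacityBytes;
    }

    // Size the pool as a fraction of the maximum heap (assumed policy).
    static SharedMemoryPool fromFraction(double fraction) {
        return new SharedMemoryPool((long) (Runtime.getRuntime().maxMemory() * fraction));
    }

    // Grant the request only if it fits; a caller that is refused should spill.
    synchronized boolean tryAcquire(long bytes) {
        if (bytes < 0) throw new IllegalArgumentException("negative request");
        if (bytes > capacityBytes - usedBytes) return false;
        usedBytes += bytes;
        return true;
    }

    // Return memory to the pool once a buffer has been spilled or freed.
    synchronized void release(long bytes) {
        usedBytes = Math.max(0, usedBytes - bytes);
    }

    synchronized long available() {
        return capacityBytes - usedBytes;
    }

    public static void main(String[] args) {
        SharedMemoryPool pool = new SharedMemoryPool(100);
        System.out.println("shuffle gets 60: " + pool.tryAcquire(60));
        System.out.println("sort gets 60:    " + pool.tryAcquire(60)); // refused, sort must spill
        pool.release(60);                                              // shuffle finishes
        System.out.println("sort retries 60: " + pool.tryAcquire(60));
        System.out.println("default pool bytes: " + fromFraction(0.3).available());
    }
}
```

With a single pool, a refused request becomes the spill signal for whichever 
consumer asked, so no separate sort-specific fraction has to be tuned.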

> Support external sorting for RDD#sortByKey()
> --------------------------------------------
>
>                 Key: SPARK-983
>                 URL: https://issues.apache.org/jira/browse/SPARK-983
>             Project: Spark
>          Issue Type: New Feature
>    Affects Versions: 0.9.0
>            Reporter: Reynold Xin
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
> where we fall back to disk if we detect memory pressure.
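
The requested behavior can be sketched as a classic external merge sort: 
buffer records, spill a sorted run to disk when the buffer fills, then merge 
the runs. The sketch below is not the ExternalAppendOnlyMap code and not a 
proposed patch. The class ExternalSortSketch and its methods are hypothetical, 
it sorts plain integers rather than key-value pairs, and a fixed record count 
(maxInMemory) stands in for real memory-pressure detection.

```java
import java.io.*;
import java.util.*;
import java.util.function.IntConsumer;

// Hypothetical sketch: external merge sort with a record-count spill threshold.
class ExternalSortSketch {

    // Sorts `input`, buffering at most `maxInMemory` records at a time.
    static void externalSort(Iterator<Integer> input, int maxInMemory, IntConsumer sink) {
        if (maxInMemory < 1) throw new IllegalArgumentException("maxInMemory must be >= 1");
        List<Iterator<Integer>> runs = new ArrayList<>();
        List<Integer> buffer = new ArrayList<>();
        while (input.hasNext()) {
            buffer.add(input.next());
            if (buffer.size() >= maxInMemory) {   // stand-in for "memory pressure"
                Collections.sort(buffer);
                runs.add(readRun(spill(buffer)));
                buffer.clear();
            }
        }
        Collections.sort(buffer);                 // the last partial run stays in memory
        runs.add(buffer.iterator());
        merge(runs, sink);
    }

    // Writes one sorted run to a temp file: a count followed by the values.
    static File spill(List<Integer> sortedRun) {
        try {
            File f = File.createTempFile("sort-run", ".bin");
            f.deleteOnExit();
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(f)))) {
                out.writeInt(sortedRun.size());
                for (int v : sortedRun) out.writeInt(v);
            }
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Streams a spilled run back from disk one value at a time.
    static Iterator<Integer> readRun(File f) {
        try {
            DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(f)));
            int n = in.readInt();
            return new Iterator<Integer>() {
                int left = n;
                public boolean hasNext() { return left > 0; }
                public Integer next() {
                    if (left <= 0) throw new NoSuchElementException();
                    try {
                        int v = in.readInt();
                        if (--left == 0) in.close();
                        return v;
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // K-way merge: heap entries are {head value, index of the run it came from}.
    static void merge(List<Iterator<Integer>> runs, IntConsumer sink) {
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < runs.size(); i++) {
            if (runs.get(i).hasNext()) heap.add(new int[] {runs.get(i).next(), i});
        }
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            sink.accept(top[0]);
            Iterator<Integer> src = runs.get(top[1]);
            if (src.hasNext()) heap.add(new int[] {src.next(), top[1]});
        }
    }

    public static void main(String[] args) {
        List<Integer> out = new ArrayList<>();
        externalSort(Arrays.asList(5, 3, 9, 1, 7, 2, 8).iterator(), 3, v -> out.add(v));
        System.out.println(out);
    }
}
```

Only one buffer of records plus one head value per run is held in memory 
during the sort, so a skewed partition costs extra disk I/O instead of an OOM.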



--
This message was sent by Atlassian JIRA
(v6.2#6252)
