[ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008554#comment-14008554
 ] 

Aaron Davidson commented on SPARK-983:
--------------------------------------

Historically, we have not used Runtime.freeMemory(), instead favoring the use 
of "memoryFractions", such as spark.shuffle.memoryFraction and 
spark.storage.memoryFraction, which are fractions of Runtime.maxMemory(). One 
problem with Runtime.freeMemory() is that the JVM can happily sit near 0 
freeMemory, waiting for a significant new allocation before it bothers to 
garbage collect. If you actually asked for some memory, though, it would be 
made available.
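The fraction-based accounting can be sketched as follows (a minimal illustration; only spark.shuffle.memoryFraction, spark.storage.memoryFraction, and the Runtime API come from the discussion above -- the class and method names here are mine, and the fraction values are illustrative defaults, not a recommendation):

```java
public class FractionBudget {
    // Derive a component's budget from the stable ceiling maxMemory(),
    // as the memoryFraction settings do. freeMemory() is deliberately
    // avoided: it fluctuates with GC timing and can legitimately sit
    // near zero even when plenty of memory is reclaimable.
    static long budget(double fraction) {
        return (long) (Runtime.getRuntime().maxMemory() * fraction);
    }

    public static void main(String[] args) {
        long max = Runtime.getRuntime().maxMemory();
        long shuffle = budget(0.2); // in the spirit of spark.shuffle.memoryFraction
        long storage = budget(0.6); // in the spirit of spark.storage.memoryFraction
        // The two budgets are fixed fractions of the heap ceiling, so
        // together they always leave some headroom when the fractions sum to < 1.
        System.out.println(shuffle + storage < max);
    }
}
```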

The downside to using the memoryFractions, though, is twofold:
(1) It can lead to OOMs if misconfigured, even when the sum of the 
memoryFractions is < 1, because Java + Spark need some amount of working 
memory just to keep running (and to garbage collect).
(2) It can lead to underutilized memory: if, for instance, the user is not 
caching any RDDs or currently doing a shuffle, all of that reserved memory 
sits idle nevertheless.
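A concrete instance of (1), with purely illustrative numbers (the heap size and fractions below are assumptions for the sake of the arithmetic, not anyone's actual configuration):

```java
public class Headroom {
    public static void main(String[] args) {
        long heap = 1L << 30;                // suppose a 1 GiB executor heap
        double storage = 0.6, shuffle = 0.3; // fractions sum to 0.9 < 1: looks safe
        long reserved = (long) (heap * (storage + shuffle));
        long headroom = heap - reserved;     // what Java + Spark have left to run in
        System.out.println(headroom / (1 << 20) + " MiB headroom");
        // Roughly 102 MiB of working memory remains; heavy GC pressure or a
        // few large temporary allocations can exhaust this and OOM, even
        // though the configured fractions sum to less than 1.
    }
}
```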

In my opinion, the "right" solution is a memory manager for Spark that 
allocates memory to the various components (e.g., storage, shuffle, and 
sorting) and is willing to give one component more memory when the others are 
not using theirs. However, this hasn't even been designed or agreed upon as 
the correct course, let alone implemented.
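Since no such design exists yet, the following is purely a sketch of the lending idea under my own assumptions -- every name here is hypothetical, and it reflects no actual Spark code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: components request memory from a shared pool, and a
// component may exceed its "fair share" whenever the others are idle,
// because grants are limited only by what is currently unclaimed.
public class LendingMemoryManager {
    private final long total;
    private final Map<String, Long> used = new HashMap<>();

    LendingMemoryManager(long totalBytes) { this.total = totalBytes; }

    // Grant up to `bytes` to `component`, capped by overall free memory.
    synchronized long acquire(String component, long bytes) {
        long inUse = used.values().stream().mapToLong(Long::longValue).sum();
        long granted = Math.min(bytes, total - inUse);
        used.merge(component, granted, Long::sum);
        return granted;
    }

    // Return memory to the pool so other components can borrow it.
    synchronized void release(String component, long bytes) {
        used.merge(component, -bytes, Long::sum);
    }

    public static void main(String[] args) {
        LendingMemoryManager m = new LendingMemoryManager(100);
        System.out.println(m.acquire("shuffle", 80)); // 80: storage is idle
        System.out.println(m.acquire("storage", 40)); // 20: only 20 remain
        m.release("shuffle", 80);
        System.out.println(m.acquire("storage", 40)); // 40: shuffle gave it back
    }
}
```

The point of the sketch is only the reallocation behavior: the second storage request is squeezed while shuffle is busy, but succeeds in full once shuffle releases its share.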


> Support external sorting for RDD#sortByKey()
> --------------------------------------------
>
>                 Key: SPARK-983
>                 URL: https://issues.apache.org/jira/browse/SPARK-983
>             Project: Spark
>          Issue Type: New Feature
>    Affects Versions: 0.9.0
>            Reporter: Reynold Xin
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
