[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008554#comment-14008554 ]
Aaron Davidson commented on SPARK-983: -------------------------------------- Historically, we have not used Runtime.freeMemory(), instead favoring the use of "memoryFractions", such as spark.shuffle.memoryFraction and spark.storage.memoryFraction, which are fractions of Runtime.maxMemory(). One problem with Runtime.freeMemory() is that the JVM could very happily sit near 0 freeMemory, just waiting for significant new allocation to occur before freeing up some memory. If you just asked for some memory, though, it would be made available. The downside to using the memoryFractions, though, is twofold: (1) It can lead to OOMs if misconfigured, even if the sum of memoryFractions is < 1, because Java + Spark need some amount of working memory to keep running (and to garbage collect). (2) It can lead to underutilized memory, if, for instance, the user is not caching any RDDs or currently doing a shuffle, as all that memory will remain reserved nevertheless. In my opinion, the "right" solution is to have a memory manager for Spark, which allocates memory to various components (e.g., storage, shuffle, and sorting), which is willing to give one component more memory if other ones are not using it. However, this hasn't even been designed or agreed upon as a correct course, let alone implemented. > Support external sorting for RDD#sortByKey() > -------------------------------------------- > > Key: SPARK-983 > URL: https://issues.apache.org/jira/browse/SPARK-983 > Project: Spark > Issue Type: New Feature > Affects Versions: 0.9.0 > Reporter: Reynold Xin > > Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a > buffer to hold the entire partition, then sorts it. This will cause an OOM if > an entire partition cannot fit in memory, which is especially problematic for > skewed data. Rather than OOMing, the behavior should be similar to the > [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala], > where we fallback to disk if we detect memory pressure. -- This message was sent by Atlassian JIRA (v6.2#6252)