[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009145#comment-14009145 ]
Patrick Wendell commented on SPARK-983:
---------------------------------------

We are actually looking at this problem in a few different places in the code base (we did this already for the external aggregations, and we also have SPARK-1777). Relying on GCs to decide when to spill is an interesting approach, but I'd rather have control of the heuristics ourselves. I think you'd get thrashing behavior where a GC occurs and suddenly a million threads start writing to disk. In the past we've used a different mechanism (the size estimator) which approximates memory usage. It might make sense to introduce a simple memory allocation mechanism that is shared between the external aggregation maps, partition unrolling, etc. This is something where a design doc would be helpful.

> Support external sorting for RDD#sortByKey()
> --------------------------------------------
>
>                 Key: SPARK-983
>                 URL: https://issues.apache.org/jira/browse/SPARK-983
>             Project: Spark
>          Issue Type: New Feature
>    Affects Versions: 0.9.0
>            Reporter: Reynold Xin
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a
> buffer to hold the entire partition, then sorts it. This will cause an OOM if
> an entire partition cannot fit in memory, which is especially problematic for
> skewed data. Rather than OOMing, the behavior should be similar to the
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
> where we fall back to disk if we detect memory pressure.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
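The "simple memory allocation mechanism that is shared between the external aggregation maps, partition unrolling, etc." could be sketched roughly as below. This is a minimal illustration only, not Spark's actual design: the `SpillPool` name, the fixed 64 MB budget, and the `tryReserve`/`release` methods are all hypothetical.

```scala
// Hypothetical sketch of a shared memory pool: spillable collections ask
// the pool before growing, so spilling is driven by our own bookkeeping
// rather than by reacting to GC events.
object SpillPool {
  private val maxBytes = 64L * 1024 * 1024 // assumed budget, for illustration
  private var usedBytes = 0L

  // Returns true if the caller may keep `bytes` more data in memory;
  // false means the caller should spill its collection to disk.
  def tryReserve(bytes: Long): Boolean = synchronized {
    if (usedBytes + bytes <= maxBytes) { usedBytes += bytes; true }
    else false
  }

  // Called after a collection spills or is freed, returning its reservation.
  def release(bytes: Long): Unit = synchronized {
    usedBytes = math.max(0L, usedBytes - bytes)
  }
}
```

Each external map or sorter would call `tryReserve` as its size estimate grows (e.g. via the size estimator mentioned above) and spill only when the shared budget is exhausted, which avoids the every-thread-spills-at-once thrashing a GC-triggered signal would cause.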