[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009145#comment-14009145 ]

Patrick Wendell commented on SPARK-983:
---------------------------------------

We are actually looking at this problem in a few different places in the code 
base (we did this already for the external aggregations, and we also have 
SPARK-1777).

Relying on GCs to decide when to spill is an interesting approach, but I'd 
rather keep control of the heuristics ourselves. I think you'd get thrashing 
behavior where a GC occurs and suddenly a million threads start writing to 
disk at once. In the past we've used a different mechanism (the size 
estimator), which approximates memory usage.
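For illustration, a size-estimator-driven spill policy might look like the sketch below. Everything here is hypothetical: the class and method names (SpillableBuffer, estimateSize, spillToDisk) and the flat per-record cost are invented for this example, not Spark's actual code, where a real estimator samples object graphs and amortizes the cost across insertions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: spill decisions driven by an approximate size
// estimate rather than by GC activity.
public class SpillableBuffer<T> {
    private final List<T> buffer = new ArrayList<>();
    private final long thresholdBytes;
    private long estimatedBytes = 0;
    private int spillCount = 0;

    public SpillableBuffer(long thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    // Crude per-record estimate; a real size estimator samples the
    // object graph instead of assuming a flat cost.
    private long estimateSize(T record) {
        return 64; // placeholder flat cost per record
    }

    public void insert(T record) {
        buffer.add(record);
        estimatedBytes += estimateSize(record);
        if (estimatedBytes > thresholdBytes) {
            spillToDisk();
        }
    }

    private void spillToDisk() {
        // In a real implementation: serialize the buffer to a temp file.
        buffer.clear();
        estimatedBytes = 0;
        spillCount++;
    }

    public int spillCount() { return spillCount; }
    public int inMemoryCount() { return buffer.size(); }
}
```

The point of the sketch is that spills happen at a threshold we choose, independently of when the GC happens to run.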

It might make sense to introduce a simple memory allocation mechanism that is 
shared between the external aggregation maps, partition unrolling, etc. This is 
something where a design doc would be helpful.
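As a rough illustration of what a shared allocation mechanism could look like, here is a sketch in which consumers (aggregation maps, partition unrolling, sorters) request memory from a common pool and spill when a request is denied. All names (SharedMemoryPool, tryAcquire, release) are invented for this example, not an actual Spark API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a memory pool shared by several consumers.
public class SharedMemoryPool {
    private final long capacityBytes;
    private long usedBytes = 0;
    private final Map<String, Long> perConsumer = new HashMap<>();

    public SharedMemoryPool(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    // Returns true if the consumer may grow by `bytes`; false means
    // "spill your in-memory state first, then retry".
    public synchronized boolean tryAcquire(String consumer, long bytes) {
        if (usedBytes + bytes > capacityBytes) return false;
        usedBytes += bytes;
        perConsumer.merge(consumer, bytes, Long::sum);
        return true;
    }

    // Called after a consumer spills its in-memory state to disk.
    public synchronized void release(String consumer) {
        usedBytes -= perConsumer.getOrDefault(consumer, 0L);
        perConsumer.remove(consumer);
    }

    public synchronized long used() { return usedBytes; }
}
```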


> Support external sorting for RDD#sortByKey()
> --------------------------------------------
>
>                 Key: SPARK-983
>                 URL: https://issues.apache.org/jira/browse/SPARK-983
>             Project: Spark
>          Issue Type: New Feature
>    Affects Versions: 0.9.0
>            Reporter: Reynold Xin
>
> Currently, RDD#sortByKey() is implemented with a mapPartitions that creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
> where we fall back to disk if we detect memory pressure.
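The fall-back the description asks for amounts to an external merge sort: buffer records, spill sorted runs when memory pressure is detected, then merge the runs. A minimal sketch, with two simplifications for brevity: runs are kept in memory rather than written to temp files, and a fixed record-count threshold stands in for a real memory-pressure signal. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of external sorting with sorted spilled runs
// merged by a k-way heap merge.
public class ExternalSorter {
    private final int maxBufferSize; // stand-in for a memory threshold
    private final List<Integer> buffer = new ArrayList<>();
    private final List<List<Integer>> spilledRuns = new ArrayList<>();

    public ExternalSorter(int maxBufferSize) {
        this.maxBufferSize = maxBufferSize;
    }

    public void insert(int key) {
        buffer.add(key);
        if (buffer.size() >= maxBufferSize) spill();
    }

    private void spill() {
        List<Integer> run = new ArrayList<>(buffer);
        Collections.sort(run); // each spilled run is sorted
        spilledRuns.add(run);
        buffer.clear();
    }

    // K-way merge of all sorted runs; each heap entry is
    // {value, runIndex, positionInRun}.
    public List<Integer> sorted() {
        if (!buffer.isEmpty()) spill();
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt(e -> e[0]));
        for (int r = 0; r < spilledRuns.size(); r++) {
            if (!spilledRuns.get(r).isEmpty())
                heap.add(new int[]{spilledRuns.get(r).get(0), r, 0});
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(top[0]);
            int next = top[2] + 1;
            List<Integer> run = spilledRuns.get(top[1]);
            if (next < run.size())
                heap.add(new int[]{run.get(next), top[1], next});
        }
        return out;
    }
}
```

A real implementation would serialize each run to disk and stream the merge, but the spill-then-merge structure is the same.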



--
This message was sent by Atlassian JIRA
(v6.2#6252)
