[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()

Aaron Davidson (JIRA) Thu, 12 Jun 2014 10:53:53 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029498#comment-14029498
 ]


Aaron Davidson commented on SPARK-983:
--------------------------------------

The idea behind SizeTrackingAppendOnlyMap is that we can estimate the size of an 
object with SizeEstimator, but doing so can be relatively costly. Rather than 
running the estimation on every element added to the map, we amortize the cost 
by sampling exponentially less often as the map grows (similar to how an 
ArrayList amortizes the cost of growing its backing array).
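
To make the amortization concrete, here is a minimal sketch in Python. It is not Spark's implementation: the class name, the `growth_rate` parameter, and the use of `sys.getsizeof` as a stand-in for SizeEstimator are all assumptions made for illustration. The collection is fully re-measured only when its length reaches the next sample point, and the size between samples is extrapolated from the last two measurements.

```python
import math
import sys


class SampledSizeTracker:
    """Append-only buffer that re-measures its size at exponentially spaced points.

    Hypothetical illustration only; this is not Spark's SizeTrackingAppendOnlyMap.
    """

    def __init__(self, estimate_size=sys.getsizeof, growth_rate=2.0):
        self.estimate_size = estimate_size  # stand-in for a costly size estimator
        self.growth_rate = growth_rate      # assumed spacing between samples (> 1)
        self.items = []
        self.num_samples = 0                # how many full measurements were taken
        self.next_sample_at = 1             # length at which to measure next
        self.last_total = 0                 # measured size at the last sample
        self.last_count = 0                 # length at the last sample
        self.bytes_per_item = 0.0           # growth rate seen between last two samples

    def append(self, item):
        self.items.append(item)
        if len(self.items) == self.next_sample_at:
            self._take_sample()

    def _take_sample(self):
        # Full measurement: cost is proportional to the current length.
        n = len(self.items)
        total = sum(self.estimate_size(x) for x in self.items)
        self.num_samples += 1
        self.bytes_per_item = (total - self.last_total) / (n - self.last_count)
        self.last_total, self.last_count = total, n
        # Space the next sample geometrically further out.
        self.next_sample_at = math.ceil(n * self.growth_rate)

    def estimated_size(self):
        # Extrapolate from the last measurement instead of measuring again.
        added_since_sample = len(self.items) - self.last_count
        return self.last_total + self.bytes_per_item * added_since_sample
```

With a growth rate of 2, samples happen at lengths 1, 2, 4, 8, and so on, so the total measuring work stays proportional to the number of elements added. The extrapolation is only as good as the assumption that recent elements resemble upcoming ones.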

If you expect your sublists to be relatively large, it may be perfectly fine to 
just call SizeEstimator on each one.

> Support external sorting for RDD#sortByKey()
> --------------------------------------------
>
>                 Key: SPARK-983
>                 URL: https://issues.apache.org/jira/browse/SPARK-983
>             Project: Spark
>          Issue Type: New Feature
>    Affects Versions: 0.9.0
>            Reporter: Reynold Xin
>            Assignee: Madhu Siddalingaiah
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.
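
The fall-back-to-disk behavior requested in the description can be sketched as a classic external merge sort. This is a simplified Python illustration, not the approach taken in Spark: the function names are hypothetical, and a fixed record count (`max_in_memory`) stands in for real memory-pressure detection. Sorted runs are spilled to temporary files and then merged lazily.

```python
import heapq
import pickle
import tempfile


def _spill(sorted_run):
    """Write one sorted run to a temporary file and return the rewound file."""
    f = tempfile.TemporaryFile()
    for record in sorted_run:
        pickle.dump(record, f)
    f.seek(0)
    return f


def _read_run(f):
    """Lazily yield the records of a spilled run in their stored order."""
    try:
        while True:
            yield pickle.load(f)
    except EOFError:
        f.close()


def external_sort_by_key(records, max_in_memory=10_000):
    """Return an iterator over (key, value) pairs sorted by key.

    The in-memory buffer never holds more than max_in_memory pairs; the
    threshold is an assumed stand-in for detecting memory pressure.
    """
    runs, buffer = [], []
    for record in records:
        buffer.append(record)
        if len(buffer) >= max_in_memory:
            buffer.sort(key=lambda kv: kv[0])
            runs.append(_spill(buffer))
            buffer = []
    buffer.sort(key=lambda kv: kv[0])
    # Merge the spilled runs and the in-memory remainder without loading them all.
    streams = [_read_run(f) for f in runs] + [iter(buffer)]
    return heapq.merge(*streams, key=lambda kv: kv[0])
```

A size tracker could replace the fixed count, triggering a spill when the estimated buffer size crosses a byte threshold rather than a record threshold.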



--
This message was sent by Atlassian JIRA
(v6.2#6252)
