[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029498#comment-14029498 ]
Aaron Davidson commented on SPARK-983: -------------------------------------- The idea for SizeTrackingAppendOnlyMap is that we can estimate the size of an object with SizeEstimator, but doing so can be relatively costly. Rather than running this estimation on every element added to the map, we amortize the cost by sampling exponentially less often (similar to amortization of adding to an ArrayList). If you expect your sublists to be relatively large, it may be perfectly fine to just call SizeEstimator on each one. > Support external sorting for RDD#sortByKey() > -------------------------------------------- > > Key: SPARK-983 > URL: https://issues.apache.org/jira/browse/SPARK-983 > Project: Spark > Issue Type: New Feature > Affects Versions: 0.9.0 > Reporter: Reynold Xin > Assignee: Madhu Siddalingaiah > > Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a > buffer to hold the entire partition, then sorts it. This will cause an OOM if > an entire partition cannot fit in memory, which is especially problematic for > skewed data. Rather than OOMing, the behavior should be similar to the > [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala], > where we fallback to disk if we detect memory pressure. -- This message was sent by Atlassian JIRA (v6.2#6252)