[ https://issues.apache.org/jira/browse/SPARK-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092826#comment-14092826 ]
Apache Spark commented on SPARK-1777:
-------------------------------------

User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/1892

> Pass "cached" blocks directly to disk if memory is not large enough
> -------------------------------------------------------------------
>
>                 Key: SPARK-1777
>                 URL: https://issues.apache.org/jira/browse/SPARK-1777
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Andrew Or
>            Priority: Critical
>             Fix For: 1.1.0
>
>         Attachments: spark-1777-design-doc.pdf
>
>
> Currently in Spark we entirely unroll a partition and then check whether it
> will cause us to exceed the storage limit. This has an obvious problem - if
> the partition itself is enough to push us over the storage limit (and
> eventually over the JVM heap), it will cause an OOM.
>
> This can happen in cases where a single partition is very large or when
> someone is running examples locally with a small heap.
>
> https://github.com/apache/spark/blob/f6ff2a61d00d12481bfb211ae13d6992daacdcc2/core/src/main/scala/org/apache/spark/CacheManager.scala#L148
>
> We should think a bit about the most elegant way to fix this - it shares some
> similarities with the external aggregation code.
>
> A simple idea is to periodically check the size of the buffer as we are
> unrolling and see if we are over the memory limit. If we are, we could prepend
> the existing buffer to the iterator and write that entire thing out to disk.
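For illustration, here is a minimal, self-contained Scala sketch of the idea described in the issue above: estimate the unroll buffer's size every few elements, and if it exceeds the memory limit, hand back the partial buffer prepended to the rest of the iterator so the caller can stream the whole partition to disk. This is not the actual CacheManager/MemoryStore code; the names unrollSafely, estimateSize, and memoryLimitBytes, the fixed 64 MB limit, and the Array[Byte] element type are all illustrative assumptions (Spark itself would use SizeEstimator and per-task memory accounting).

{code:scala}
import scala.collection.mutable.ArrayBuffer

object UnrollSketch {

  // Hypothetical stand-in for the block store's memory limit; in Spark this
  // would come from the MemoryStore, not a fixed constant.
  val memoryLimitBytes: Long = 64L * 1024 * 1024

  // Very rough per-element size estimate; Spark would use SizeEstimator.
  def estimateSize(values: ArrayBuffer[Array[Byte]]): Long =
    values.map(_.length.toLong).sum

  /**
   * Unroll an iterator, checking the estimated size every `checkInterval`
   * elements. If the buffer stays under the limit, the fully unrolled values
   * are returned (Left) and can be cached in memory. If the limit is
   * exceeded, the partially unrolled values are prepended to the remaining
   * iterator (Right) so the caller can write the entire partition to disk
   * without ever materializing it in memory.
   */
  def unrollSafely(
      iter: Iterator[Array[Byte]],
      checkInterval: Int = 16): Either[ArrayBuffer[Array[Byte]], Iterator[Array[Byte]]] = {
    val buffer = new ArrayBuffer[Array[Byte]]
    var count = 0
    while (iter.hasNext) {
      buffer += iter.next()
      count += 1
      if (count % checkInterval == 0 && estimateSize(buffer) > memoryLimitBytes) {
        // Over the limit: hand back what we have plus the rest of the input.
        return Right(buffer.iterator ++ iter)
      }
    }
    Left(buffer) // Fit in memory: safe to cache the unrolled values.
  }

  def main(args: Array[String]): Unit = {
    // Example: 100 records of 1 MB each exceed the 64 MB limit part-way through.
    val records = Iterator.fill(100)(new Array[Byte](1024 * 1024))
    unrollSafely(records) match {
      case Left(values) => println(s"Fits in memory: ${values.length} records unrolled")
      case Right(rest)  => println(s"Over the limit: ${rest.size} records to stream to disk")
    }
  }
}
{code}

A caller that receives the Right case would pass the combined iterator to the disk store rather than caching the block in memory, which avoids the OOM described above.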