[jira] [Commented] (SPARK-1777) Pass "cached" blocks directly to disk if memory is not large enough

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092826#comment-14092826
 ] 

Apache Spark commented on SPARK-1777:
-

User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/1892

> Pass "cached" blocks directly to disk if memory is not large enough
> ---
>
> Key: SPARK-1777
> URL: https://issues.apache.org/jira/browse/SPARK-1777
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.1.0
>
> Attachments: spark-1777-design-doc.pdf
>
>
> Currently in Spark we entirely unroll a partition and then check whether it 
> will cause us to exceed the storage limit. This has an obvious problem - if 
> the partition itself is enough to push us over the storage limit (and 
> eventually over the JVM heap), it will cause an OOM.
> This can happen in cases where a single partition is very large or when 
> someone is running examples locally with a small heap.
> https://github.com/apache/spark/blob/f6ff2a61d00d12481bfb211ae13d6992daacdcc2/core/src/main/scala/org/apache/spark/CacheManager.scala#L148
> We should think a bit about the most elegant way to fix this - it shares some 
> similarities with the external aggregation code.
> A simple idea is to periodically check the size of the buffer as we are 
> unrolling and see if we are over the memory limit. If we are, we could prepend 
> the existing buffer to the iterator and write that entire thing out to disk.
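
A rough sketch of that idea (not the actual CacheManager/MemoryStore code; the names maxUnrollBytes, estimateSize, and the per-element size guess are assumptions made purely for illustration):

{code}
import scala.collection.mutable.ArrayBuffer

object UnrollSketch {
  // Hypothetical per-task limit on how many bytes we may use while unrolling.
  val maxUnrollBytes: Long = 64L * 1024 * 1024

  // Crude size estimate for the buffer; a real implementation would use
  // something like SizeEstimator rather than a fixed per-element guess.
  def estimateSize(buffer: ArrayBuffer[Any]): Long =
    buffer.length.toLong * 64L

  // Returns either the fully unrolled values (they fit under the limit), or
  // an iterator over the partially unrolled buffer followed by the remaining
  // values, which the caller can stream straight to disk.
  def unrollSafely(values: Iterator[Any]): Either[ArrayBuffer[Any], Iterator[Any]] = {
    val buffer = new ArrayBuffer[Any]
    val checkInterval = 16  // check periodically, not on every element
    var count = 0L
    while (values.hasNext) {
      buffer += values.next()
      count += 1
      if (count % checkInterval == 0 && estimateSize(buffer) > maxUnrollBytes) {
        // Over the limit: prepend what has been unrolled so far to the rest
        // of the iterator instead of continuing to grow the buffer.
        return Right(buffer.iterator ++ values)
      }
    }
    Left(buffer)
  }
}
{code}

Checking only every few elements keeps the size-estimation overhead low while still bounding how far past the limit the buffer can grow. The real change is covered by the attached design doc and the pull requests linked in the comments; this sketch is only meant to make the description concrete.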



--
This message was sent by Atlassian JIRA
(v6.2#6252)




[jira] [Commented] (SPARK-1777) Pass "cached" blocks directly to disk if memory is not large enough

2014-06-27 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046388#comment-14046388
 ] 

Andrew Or commented on SPARK-1777:
--

An easy way to reproduce this: run the Spark shell in local mode with default settings

sc.parallelize(1 to 20 * 1000 * 1000, 1).persist().count()

It's trying to unroll the entire partition to check if it fits in the cache, 
but by then it's too late.
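
In other words (a simplified illustration of the failure mode, not the code at the linked line; sizeFits is a hypothetical stand-in for the storage-limit check):

{code}
// Fully materialize the partition first, then check whether it fits.
// For a single huge partition, the toVector call can already blow the heap,
// so the size check never gets a chance to run.
def cacheNaively[T](partition: Iterator[T], sizeFits: Seq[T] => Boolean): Option[Seq[T]] = {
  val unrolled = partition.toVector
  if (sizeFits(unrolled)) Some(unrolled) else None
}
{code}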



> Pass "cached" blocks directly to disk if memory is not large enough
> ---
>
> Key: SPARK-1777
> URL: https://issues.apache.org/jira/browse/SPARK-1777
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Fix For: 1.1.0
>
> Attachments: spark-1777-design-doc.pdf
>
>
> Currently in Spark we entirely unroll a partition and then check whether it 
> will cause us to exceed the storage limit. This has an obvious problem - if 
> the partition itself is enough to push us over the storage limit (and 
> eventually over the JVM heap), it will cause an OOM.
> This can happen in cases where a single partition is very large or when 
> someone is running examples locally with a small heap.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/CacheManager.scala#L106
> We should think a bit about the most elegant way to fix this - it shares some 
> similarities with the external aggregation code.
> A simple idea is to periodically check the size of the buffer as we are 
> unrolling and see if we are over the memory limit. If we are we could prepend 
> the existing buffer to the iterator and write that entire thing out to disk.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1777) Pass "cached" blocks directly to disk if memory is not large enough

2014-06-27 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046392#comment-14046392
 ] 

Andrew Or commented on SPARK-1777:
--

Fix: https://github.com/apache/spark/pull/1165

> Pass "cached" blocks directly to disk if memory is not large enough
> ---
>
> Key: SPARK-1777
> URL: https://issues.apache.org/jira/browse/SPARK-1777
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Fix For: 1.1.0
>
> Attachments: spark-1777-design-doc.pdf
>
>
> Currently in Spark we entirely unroll a partition and then check whether it 
> will cause us to exceed the storage limit. This has an obvious problem - if 
> the partition itself is enough to push us over the storage limit (and 
> eventually over the JVM heap), it will cause an OOM.
> This can happen in cases where a single partition is very large or when 
> someone is running examples locally with a small heap.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/CacheManager.scala#L106
> We should think a bit about the most elegant way to fix this - it shares some 
> similarities with the external aggregation code.
> A simple idea is to periodically check the size of the buffer as we are 
> unrolling and see if we are over the memory limit. If we are we could prepend 
> the existing buffer to the iterator and write that entire thing out to disk.



--
This message was sent by Atlassian JIRA
(v6.2#6252)