Isn't this your worker running out of memory for computations, rather than for caching RDDs? That is, it has enough memory as long as you aren't actually using much of the heap for caching, but once the cache takes its share, computation runs out of memory. If that's right (and I'm not sure I have this straight), the answer is to tell Spark to use less memory for caching.
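If that is what's happening, the knob to try on 1.0.x is spark.storage.memoryFraction, which defaults to 0.6. A minimal sketch; the 0.3 value and the app name are just illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // Reserve less of the executor heap for cached RDD blocks (default is
    // 0.6), leaving more headroom for task computation.
    val conf = new SparkConf()
      .setAppName("cache-tuning")                  // hypothetical app name
      .set("spark.storage.memoryFraction", "0.3")  // illustrative value
    val sc = new SparkContext(conf)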
On Fri, Aug 1, 2014 at 5:24 PM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> [Forking this thread.]
>
> According to the Spark Programming Guide, persisting RDDs with MEMORY_ONLY
> should not choke if the RDD cannot be held entirely in memory:
>
>> If the RDD does not fit in memory, some partitions will not be cached and
>> will be recomputed on the fly each time they're needed. This is the default
>> level.
>
> What I’m seeing per the discussion below is that when I try to cache more
> data than the cluster can hold in memory, I get:
>
> 14/08/01 15:41:23 WARN TaskSetManager: Loss was due to
> java.lang.OutOfMemoryError
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.util.Arrays.copyOfRange(Arrays.java:2694)
>         at java.lang.String.<init>(String.java:203)
>         at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:561)
>         at java.nio.CharBuffer.toString(CharBuffer.java:1201)
>         at org.apache.hadoop.io.Text.decode(Text.java:350)
>         at org.apache.hadoop.io.Text.decode(Text.java:327)
>         at org.apache.hadoop.io.Text.toString(Text.java:254)
>         at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:458)
>         at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:458)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>         at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> Trying MEMORY_AND_DISK yields the same error.
>
> So what's the deal? I'm running 1.0.1 on EC2.
>
> Nick
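For anyone trying to reproduce this, the pattern described above boils down to roughly the following (a reconstruction, not the exact code; the input path is a placeholder and an existing SparkContext sc is assumed):

    import org.apache.spark.storage.StorageLevel

    // Cache a dataset larger than the cluster's aggregate storage memory.
    val lines = sc.textFile("hdfs:///path/to/large/input")  // placeholder path
    lines.persist(StorageLevel.MEMORY_ONLY)  // same OOM reported with MEMORY_AND_DISK
    lines.count()                            // materializes the cache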