Isn't this your worker running out of memory for computations, rather than for caching RDDs? That is, it has enough memory as long as you aren't actually using much of the heap for caching, but once the cache takes its share, computation runs out of memory. If that's right (and I'm not sure I have this straight), the answer is to tell Spark to use less memory for caching.
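If that is what's happening, the knob to try on 1.0.x is spark.storage.memoryFraction, which defaults to 0.6. A minimal sketch; the 0.3 value and the app name are just illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // Reserve less of the executor heap for cached RDD blocks (default is
    // 0.6), leaving more headroom for task computation.
    val conf = new SparkConf()
      .setAppName("cache-tuning")                  // hypothetical app name
      .set("spark.storage.memoryFraction", "0.3")  // illustrative value
    val sc = new SparkContext(conf)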
On Fri, Aug 1, 2014 at 5:24 PM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> [Forking this thread.]
>
> According to the Spark Programming Guide, persisting RDDs with MEMORY_ONLY
> should not choke if the RDD cannot be held entirely in memory:
>
>> If the RDD does not fit in memory, some partitions will not be cached and
>> will be recomputed on the fly each time they're needed. This is the default
>> level.
>
> What I’m seeing per the discussion below is that when I try to cache more
> data than the cluster can hold in memory, I get:
>
> 14/08/01 15:41:23 WARN TaskSetManager: Loss was due to
> java.lang.OutOfMemoryError
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.util.Arrays.copyOfRange(Arrays.java:2694)
>         at java.lang.String.<init>(String.java:203)
>         at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:561)
>         at java.nio.CharBuffer.toString(CharBuffer.java:1201)
>         at org.apache.hadoop.io.Text.decode(Text.java:350)
>         at org.apache.hadoop.io.Text.decode(Text.java:327)
>         at org.apache.hadoop.io.Text.toString(Text.java:254)
>         at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:458)
>         at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:458)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>         at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>         at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> Trying MEMORY_AND_DISK yields the same error.
>
> So what's the deal? I'm running 1.0.1 on EC2.
>
> Nick
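For anyone trying to reproduce this, the pattern described above boils down to roughly the following (a reconstruction, not the exact code; the input path is a placeholder and an existing SparkContext sc is assumed):

    import org.apache.spark.storage.StorageLevel

    // Cache a dataset larger than the cluster's aggregate storage memory.
    val lines = sc.textFile("hdfs:///path/to/large/input")  // placeholder path
    lines.persist(StorageLevel.MEMORY_ONLY)  // same OOM reported with MEMORY_AND_DISK
    lines.count()                            // materializes the cache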