I have about 500 MB of data and I'm trying to process it on a single `local` instance. I'm getting a `java.lang.OutOfMemoryError`. Stack trace at the end.
Spark 1.1.1. My JVM has `-Xmx2g`, and I've set:

```
spark.driver.memory = 1000M
spark.executor.memory = 1000M
spark.kryoserializer.buffer.mb = 256
spark.kryoserializer.buffer.max.mb = 256
```

The objects I'm dealing with are well constrained: each is no more than 500 bytes at the very most. I ran into problems earlier with the Kryo buffer being too small, but I think 256 MB should do the job. The docs say "This must be larger than any object you attempt to serialize". No danger of that.

My input is a single file (each line is about 500 bytes on average). I'm performing various `filter`, `map`, `flatMap`, `groupByKey` and `reduceByKey` transformations. The only 'actions' I'm performing are `foreach`s, which insert values into a database. On input, I parse the lines and then persist the result with `DISK_ONLY`. I then foreach over the keys of the key-value RDDs, and then over each key's values. (Simplified sketches of my setup and the pipeline shape follow the stack trace below.)

The docs say that `groupByKey` returns `(K, Iterable<V>)`, so the values (which can be large) shouldn't be serialized as a single list. So I don't think I should be loading anything larger than 256 MB at once.

My code works on small sample toy data, and I'm now trying it out on a bit more. As I understand it, the way Spark partitions data means that (in most cases) any job that will run on a cluster will also run on a single instance, given enough time. I think I've given enough memory to cover my serialization needs as I understand them. Have I misunderstood?

Joe

Stack trace:

```
INFO  org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 30.0 (TID 116, localhost, PROCESS_LOCAL, 993 bytes)
INFO  org.apache.spark.executor.Executor - Running task 0.0 in stage 30.0 (TID 116)
...
ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 30.0 (TID 116)
java.lang.OutOfMemoryError: Java heap space
    at com.esotericsoftware.kryo.io.Output.<init>(Output.java:35)
    at org.apache.spark.serializer.KryoSerializer.newKryoOutput(KryoSerializer.scala:58)
    at org.apache.spark.serializer.KryoSerializerInstance.output$lzycompute(KryoSerializer.scala:151)
    at org.apache.spark.serializer.KryoSerializerInstance.output(KryoSerializer.scala:151)
    at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:155)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:188)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
...
WARN  org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 30.0 (TID 116, localhost): java.lang.OutOfMemoryError: Java heap space
    com.esotericsoftware.kryo.io.Output.<init>(Output.java:35)
    org.apache.spark.serializer.KryoSerializer.newKryoOutput(KryoSerializer.scala:58)
    org.apache.spark.serializer.KryoSerializerInstance.output$lzycompute(KryoSerializer.scala:151)
    org.apache.spark.serializer.KryoSerializerInstance.output(KryoSerializer.scala:151)
    org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:155)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:188)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:745)
```
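In case it helps, this is roughly how I build the context. It's a simplified sketch: the app name is a placeholder, I've included the `spark.serializer` line because I'm serializing with Kryo (as the trace shows), and in my real setup the memory settings may come from the command line rather than being set programmatically like this.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Simplified context setup; "oom-repro" is a placeholder app name.
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("oom-repro")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "256")
  .set("spark.kryoserializer.buffer.max.mb", "256")

val sc = new SparkContext(conf)
```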
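And here's a stripped-down sketch of the pipeline shape. `MyRecord`, `parse`, and `insertIntoDb` are hypothetical stand-ins for my real parsing and database code, the tab-delimited format is purely illustrative, and I've left out the `flatMap`/`reduceByKey` steps:

```scala
import org.apache.spark.SparkContext._   // pair-RDD functions (needed on Spark 1.x)
import org.apache.spark.storage.StorageLevel

// Stand-in for the real record type; each instance is <= ~500 bytes.
case class MyRecord(key: String, value: String)

// Stand-in for the real line parsing (illustrative tab-delimited format).
def parse(line: String): MyRecord = {
  val Array(k, v) = line.split("\t", 2)
  MyRecord(k, v)
}

// Stand-in for the real database insert.
def insertIntoDb(key: String, value: String): Unit = ()

// `sc` is the context from the sketch above.
val parsed = sc.textFile("input.txt")    // one file, ~500-byte lines
  .map(parse)
  .persist(StorageLevel.DISK_ONLY)       // parsed input persisted to disk

val grouped = parsed
  .filter(_.value.nonEmpty)
  .map(r => (r.key, r.value))
  .groupByKey()                          // RDD[(K, Iterable[V])]

// The only actions are foreachs: over the keys, then over each key's values.
grouped.foreach { case (key, values) =>
  values.foreach(v => insertIntoDb(key, v))
}
```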