I have about 500 MB of data and I'm trying to process it on a single
`local` instance. I'm getting a java.lang.OutOfMemoryError; the stack
trace is at the end.

Spark 1.1.1
My JVM is started with -Xmx2g

spark.driver.memory = 1000M
spark.executor.memory = 1000M
spark.kryoserializer.buffer.mb = 256
spark.kryoserializer.buffer.max.mb = 256
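
For reference, this is roughly how the settings above get applied (a
simplified sketch only: the master and app name are placeholders, and I'm
showing the SparkConf form, though the equivalent could go in
spark-defaults.conf):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("local")                  // single local instance
      .setAppName("my-job")                // placeholder name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.driver.memory", "1000m")
      .set("spark.executor.memory", "1000m")
      .set("spark.kryoserializer.buffer.mb", "256")
      .set("spark.kryoserializer.buffer.max.mb", "256")
    val sc = new SparkContext(conf)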

The objects I'm dealing with are well constrained: each is at most 500
bytes. I ran into problems with the Kryo buffer being too small, but I
think 256 MB should do the job. The docs say "This must be larger than any
object you attempt to serialize". No danger of that.

My input is a single file (each line is about 500 bytes on average). I'm
performing various filter, map, flatMap, groupByKey and reduceByKey
transformations (sketched below). The only 'action' I'm performing is
foreach, which inserts values into a database.

On input, I'm parsing the lines and then persisting with DISK_ONLY.
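
In outline, the pipeline looks something like this (a sketch, not my
actual code: the record type, parser and transforms are stand-ins, and the
real job also has the flatMap and reduceByKey steps mentioned above):

    import org.apache.spark.SparkContext._     // pair-RDD implicits (needed on 1.1)
    import org.apache.spark.storage.StorageLevel

    // Hypothetical record type and parser standing in for my real ones.
    case class Record(key: String, value: String)
    def parseLine(line: String): Record = {
      val parts = line.split("\t", 2)
      Record(parts(0), parts(1))
    }

    val parsed = sc.textFile("input.txt")      // lines average ~500 bytes
      .map(parseLine)
      .persist(StorageLevel.DISK_ONLY)         // parsed records persisted to disk

    val grouped = parsed
      .filter(_.value.nonEmpty)                // stand-in for my real transforms
      .map(rec => (rec.key, rec.value))
      .groupByKey()                            // RDD[(String, Iterable[String])]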

I'm foreaching over the keys and then foreaching over the values of the
key-value RDDs. The docs say that groupByKey returns (K, Iterable<V>), so
the values (which can be large) shouldn't need to be serialized as a
single list.
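
Concretely, the action stage is shaped like this (again a sketch;
insertIntoDatabase stands in for my real insert):

    // Placeholder for my real DB insert.
    def insertIntoDatabase(key: String, value: String): Unit = ???

    grouped.foreach { case (key, values) =>    // values: Iterable[String]
      values.foreach(v => insertIntoDatabase(key, v))
    }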

So I don't think I should be loading anything larger than 256 MB at once.

My code works on small sample toy data and I'm now trying it on a bit
more. As I understand it, the way Spark partitions data means that (in
most cases) any job that will run on a cluster will also run on a single
instance, given enough time.

I think I've given enough memory to cover my serialization needs as I
understand them. Have I misunderstood?

Joe

Stack trace:

INFO  org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 30.0 (TID 116, localhost, PROCESS_LOCAL, 993 bytes)
INFO  org.apache.spark.executor.Executor - Running task 0.0 in stage 30.0 (TID 116)

...

ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 30.0 (TID 116)
java.lang.OutOfMemoryError: Java heap space
        at com.esotericsoftware.kryo.io.Output.<init>(Output.java:35)
        at org.apache.spark.serializer.KryoSerializer.newKryoOutput(KryoSerializer.scala:58)
        at org.apache.spark.serializer.KryoSerializerInstance.output$lzycompute(KryoSerializer.scala:151)
        at org.apache.spark.serializer.KryoSerializerInstance.output(KryoSerializer.scala:151)
        at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:155)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:188)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

...

WARN  org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 30.0 (TID 116, localhost): java.lang.OutOfMemoryError: Java heap space
        com.esotericsoftware.kryo.io.Output.<init>(Output.java:35)
        org.apache.spark.serializer.KryoSerializer.newKryoOutput(KryoSerializer.scala:58)
        org.apache.spark.serializer.KryoSerializerInstance.output$lzycompute(KryoSerializer.scala:151)
        org.apache.spark.serializer.KryoSerializerInstance.output(KryoSerializer.scala:151)
        org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:155)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:188)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        java.lang.Thread.run(Thread.java:745)
