Hi,

I am trying to evaluate the performance of Spark with respect to various memory settings. What makes this harder is that I'm new to Python, but the problem at hand doesn't seem to stem from that.

I'm running a wordcount script [1] with different amounts of input data. With a 1g input there is always an OutOfMemoryError at the end of the reduce tasks [2], while 100m of data causes no problem. Spark is v1.2.1 (I see the same problem with v1.3) and it runs on a VM with Ubuntu 14.04, 8G RAM and 4 VCPUs. (If anything else is of interest, please ask.)

http://spark.apache.org/docs/1.2.1/tuning.html#memory-usage-of-reduce-tasks suggests increasing the parallelism, which I've tried (even with over 4000 tasks), but nothing changed. Other attempts with spark.executor.memory, spark.python.worker.memory and extraJavaOptions with -Xmx4g (see code below) didn't solve the problem either.
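
To make the parallelism attempt concrete, here is a sketch of the kind of change I mean (the partition counts and input path are placeholders, not the exact values I tested); it uses textFile's minPartitions and reduceByKey's numPartitions arguments instead of spark.default.parallelism:

from operator import add
from pyspark import SparkContext

# Sketch only: partition counts and path are placeholders.
sc = SparkContext("local", "ParallelismSketch")

# More input partitions -> more, smaller map tasks.
lines = sc.textFile("input.txt", minPartitions=120)

# More reduce partitions -> smaller per-task shuffle state.
counts = (lines.flatMap(lambda x: x.split(' '))
               .map(lambda x: (x, 1))
               .reduceByKey(add, numPartitions=120))

print(counts.count())
sc.stop()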

What do you suggest to keep the Java heap from filling up completely?

Thanks
Balazs


[1] Wordcount script

import sys
import time
from operator import add
from pyspark import SparkContext, SparkConf
from signal import signal, SIGPIPE, SIG_DFL

def encode(text):
    """
    For printing unicode characters to the console.
    """
    return text.encode('utf-8')

if __name__ == "__main__":
    start_time = time.time()

    if len(sys.argv) != 2:
        print >> sys.stderr, "Usage: wordcount <file>"
        exit(-1)
    conf = (SparkConf()
             .setMaster("local")
             .setAppName("PythonWordCount")
             # JVM heap for executors (in local mode the executor lives in the driver JVM)
             .set("spark.executor.memory", "6g")
             # memory per Python worker process before it spills to disk
             .set("spark.python.worker.memory", "6g")
             # default number of partitions for shuffles such as reduceByKey
             .set("spark.default.parallelism", "120")
             # extra JVM options for the driver
             .set("spark.driver.extraJavaOptions", "-Xmx4g"))
    sc = SparkContext(conf=conf)
    lines = sc.textFile(sys.argv[1], 1)
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    # output would take too long and the important thing is the processing time
    # for (word, count) in output:
    #     print encode("%s: %i" % (word, count))
    # print("%f seconds" % (time.time() - start_time))

    sc.stop()

    print("%f seconds" % (time.time() - start_time))

[2] OutOfMemoryError at reduce tasks
...
15/03/19 07:58:52 INFO ShuffleBlockFetcherIterator: Getting 30 non-empty blocks out of 30 blocks
15/03/19 07:58:52 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/03/19 07:58:52 INFO TaskSetManager: Finished task 99.0 in stage 1.0 (TID 129) in 1096 ms on localhost (100/120)
15/03/19 07:58:52 INFO PythonRDD: Times: total = 351, boot = -530, init = 534, finish = 347
15/03/19 07:58:52 ERROR Executor: Exception in task 100.0 in stage 1.0 (TID 130)
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1852)
    at java.io.ObjectOutputStream.write(ObjectOutputStream.java:708)
    at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:164)
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$writeExternal$1.apply$mcV$sp(TaskResult.scala:48)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1137)
    at org.apache.spark.scheduler.DirectTaskResult.writeExternal(TaskResult.scala:45)
    at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1458)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:226)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
15/03/19 07:58:52 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-3,5,main]
...
