I'm trying to perform operations on a large RDD that ends up being about 1.3
GB in memory once loaded. It's cached in memory during the first operation,
but when another task that uses the RDD begins, I get this error saying the
RDD was lost (a rough sketch of my code follows the log):

14/06/30 09:48:17 INFO TaskSetManager: Serialized task 1.0:4 as 8245 bytes in 0 ms
14/06/30 09:48:17 WARN TaskSetManager: Lost TID 15611 (task 1.0:3)
14/06/30 09:48:17 WARN TaskSetManager: Loss was due to org.apache.spark.api.python.PythonException
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/me/Desktop/spark-1.0.0/python/pyspark/worker.py", line 73, in main
    command = pickleSer._read_with_length(infile)
  File "/Users/me/Desktop/spark-1.0.0/python/pyspark/serializers.py", line 142, in _read_with_length
    length = read_int(stream)
  File "/Users/me/Desktop/spark-1.0.0/python/pyspark/serializers.py", line 337, in read_int
    raise EOFError
EOFError

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:145)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
14/06/30 09:48:18 INFO AppClient$ClientActor: Executor updated: app-20140630090515-0000/0 is now FAILED (Command exited with code 52)
14/06/30 09:48:18 INFO SparkDeploySchedulerBackend: Executor app-20140630090515-0000/0 removed: Command exited with code 52
14/06/30 09:48:18 INFO SparkDeploySchedulerBackend: Executor 0 disconnected, so removing it
14/06/30 09:48:18 ERROR TaskSchedulerImpl: Lost executor 0 on localhost: OutOfMemoryError
14/06/30 09:48:18 INFO TaskSetManager: Re-queueing tasks for 0 from TaskSet 1.0
14/06/30 09:48:18 WARN TaskSetManager: Lost TID 15610 (task 1.0:2)
14/06/30 09:48:18 WARN TaskSetManager: Lost TID 15609 (task 1.0:1)
14/06/30 09:48:18 WARN TaskSetManager: Lost TID 15612 (task 1.0:4)
14/06/30 09:48:18 WARN TaskSetManager: Lost TID 15608 (task 1.0:0)
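
For context, the loading and caching part of the job looks roughly like
this (just a sketch -- the app name, path, and loading code are
placeholders, not my exact code):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("term-weighting")                # placeholder app name
        .set("spark.executor.memory", "6g"))         # executor really has 6 GB
sc = SparkContext(conf=conf)

# One element per document; in the real job each document ends up in its
# own partition, so there are several thousand partitions in total.
docs = sc.wholeTextFiles("/path/to/documents").values()

docs.cache()   # cached in memory, about 1.3 GB once loaded
docs.count()   # the first action succeeds and populates the cache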


The operation it fails on is a reduceByKey(), and the RDD going into it is
split into several thousand partitions (I'm doing term weighting, which
initially requires a separate partition for each document). The executor has
6 GB of memory, so I'm not sure it's really a memory problem, despite the
OutOfMemoryError reported near the end of the log. The serializer EOFError is
the part that really confuses me, and I can't find any references to this
particular error with Spark anywhere.
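
Continuing the sketch above, the stage that dies looks roughly like this
(compute_weights is a stand-in for my actual term-weighting function):

# Placeholder for the real term-weighting logic: turn one document into
# (term, weight) pairs.
def compute_weights(doc):
    return [(term, 1.0) for term in doc.split()]

weighted = docs.flatMap(compute_weights)

# This reduceByKey() is where the tasks fail with the EOFError /
# OutOfMemoryError shown in the log above.
totals = weighted.reduceByKey(lambda a, b: a + b)
totals.count()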

Does anyone have a clue as to what the actual error might be here, and what
a possible solution would be?


