Spark 1.3.0: ExecutorLostFailure depending on input file size

Wyss Michael (wysm) Thu, 13 Aug 2015 11:09:50 -0700

Hi

I've been at this problem for a few days now and wasn't able to solve it.
I'm hoping that I'm missing something that you don't!
I'm trying to run a simple python application on a 2-node-cluster I set up
in standalone mode. A master and a worker, whereas the master also takes on
the role of a worker.
In the following code I'm trying to count the number of cakes occurring in a
500MB text file and it fails with a ExecutorLostFailure.
Interestingly the application runs through if I take a 100MB input file.


I used the package version of CDH5.4.4 with YARN and I'm running Spark
1.3.0.
Each node has 8GB of memory and these are some of my configurations:
- executor memory: 4g
- driver memory: 2g
- number of cores per worker: 1
- serializer: Kryo


SimpleApp.py:
------------------------------------------------------------------------
from pyspark import SparkContext, SparkConf

sc = SparkContext(appName="Simple App")
logFile = "/user/ubuntu/largeTextFile500m.txt"
logData = sc.textFile(logFile)
cakes = logData.filter(lambda s: "cake" in s).count()
print "Number of cakes: %i" % cakes
sc.stop()
------------------------------------------------------------------------


Starting application:
------------------------------------------------------------------------
spark-submit --master spark://master:7077 /home/ubuntu/SimpleApp.py
------------------------------------------------------------------------


Excerpts from the log:
------------------------------------------------------------------------
...
15/08/13 09:04:59 WARN ThreadLocalRandom: Failed to generate a seed from
SecureRandom within 3 seconds. Not enough entrophy?
...
15/08/13 09:05:09 ERROR TaskSchedulerImpl: Lost executor 1 on master: remote
Akka client disassociated
15/08/13 09:05:09 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet
0.0
15/08/13 09:05:09 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
master): ExecutorLostFailure (executor 1 lost)
...
15/08/13 09:05:09 ERROR SparkDeploySchedulerBackend: Asked to remove
non-existent executor 1
...
15/08/13 09:05:13 ERROR TaskSchedulerImpl: Lost executor 0 on worker: remote
Akka client disassociated
15/08/13 09:05:13 INFO TaskSetManager: Re-queueing tasks for 0 from TaskSet
0.0
15/08/13 09:05:13 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 5,
worker): ExecutorLostFailure (executor 0 lost)
...
15/08/13 09:05:13 ERROR SparkDeploySchedulerBackend: Asked to remove
non-existent executor 0
...
15/08/13 09:05:21 ERROR TaskSchedulerImpl: Lost executor 2 on master: remote
Akka client disassociated
15/08/13 09:05:21 INFO TaskSetManager: Re-queueing tasks for 2 from TaskSet
0.0
15/08/13 09:05:21 WARN TaskSetManager: Lost task 0.2 in stage 0.0 (TID 6,
master): ExecutorLostFailure (executor 2 lost)
...
15/08/13 09:05:21 ERROR SparkDeploySchedulerBackend: Asked to remove
non-existent executor 2
...
15/08/13 09:05:29 ERROR TaskSchedulerImpl: Lost executor 3 on worker: remote
Akka client disassociated
15/08/13 09:05:29 INFO TaskSetManager: Re-queueing tasks for 3 from TaskSet
0.0
15/08/13 09:05:29 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 7,
worker): ExecutorLostFailure (executor 3 lost)
...
15/08/13 09:05:29 ERROR SparkDeploySchedulerBackend: Asked to remove
non-existent executor 3
...
15/08/13 09:05:29 INFO DAGScheduler: Job 0 failed: count at
/home/ubuntu/SimpleApp.py:6, took 28.156765 s
Traceback (most recent call last):
  File "/home/ubuntu/SimpleApp.py", line 6, in <module>
    cakes = logData.filter(lambda s: "cake" in s).count()
  File "/usr/lib/spark/python/pyspark/rdd.py", line 933, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/usr/lib/spark/python/pyspark/rdd.py", line 924, in sum
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 740, in reduce
    vals = self.mapPartitions(func).collect()
  File "/usr/lib/spark/python/pyspark/rdd.py", line 701, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File
"/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line
538, in __call__
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError15/08/13 09:05:29 INFO DAGScheduler: Executor
lost: 3 (epoch 3)
15/08/13 09:05:29 INFO BlockManagerMasterActor: Trying to remove executor 3
from BlockManagerMaster.
15/08/13 09:05:29 INFO AppClient$ClientActor: Executor updated:
app-20150813090456-0000/5 is now RUNNING
15/08/13 09:05:29 INFO BlockManagerMasterActor: Removing block manager
BlockManagerId(3, worker, 4075)
: An error occurred while calling o41.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
(TID 7, worker): ExecutorLostFailure (executor 3 lost)
Driver stacktrace:
        at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
        at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at scala.Option.foreach(Option.scala:236)
        at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
        at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
        at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

15/08/13 09:05:29 INFO BlockManagerMaster: Removed 3 successfully in
removeExecutor
15/08/13 09:05:29 INFO AppClient$ClientActor: Executor updated:
app-20150813090456-0000/5 is now LOADING


15/08/12 15:23:28 DEBUG DFSClient: DFSClient seqno: 20 status: SUCCESS
status: SUCCESS downstreamAckTimeNanos: 857203
    numAs = logData.filter(lambda s: "cake" in s).count()
  File "/usr/lib/spark/python/pyspark/rdd.py", line 933, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/usr/lib/spark/python/pyspark/rdd.py", line 924, in sum
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 740, in reduce
    vals = self.mapPartitions(func).collect()
  File "/usr/lib/spark/python/pyspark/rdd.py", line 701, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File
"/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line
538, in __call__
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o43.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
(TID 4, master): ExecutorLostFailure (executor 4 lost)
Driver stacktrace:
        at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
        at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at scala.Option.foreach(Option.scala:236)
        at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
        at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
        at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
------------------------------------------------------------------------

Any suggestions?

Spark 1.3.0: ExecutorLostFailure depending on input file size

Reply via email to