I used to hit this issue, but setting the following confs while creating the SparkContext seems to work:

      .set("spark.rdd.compress","true")
      .set("spark.storage.memoryFraction","1")
      .set("spark.core.connection.ack.wait.timeout","6000")
      .set("spark.akka.frameSize","50")

Most likely, one of the executors is stuck in a GC pause, and meanwhile the master thinks it is dead and throws a timeout or cancelled-key exception.

Thanks
Best Regards

On Sat, Nov 8, 2014 at 2:58 PM, <jan.zi...@centrum.cz> wrote:

> Hi,
>
> I am getting ExecutorLostFailure when I run spark on YARN and in map I
> perform very long tasks (couple of hours). Error Log is below.
>
> Do you know if it is possible to set something to make it possible for
> Spark to perform these very long running jobs in map?
>
> Thank you very much for any advice.
>
> Best regards,
> Jan
>
> Spark log:
> 4533,931: [GC 394578K->20882K(1472000K), 0,0226470 secs]
> Traceback (most recent call last):
>   File "/home/hadoop/spark_stuff/spark_lda.py", line 112, in <module>
>     models.saveAsTextFile(sys.argv[1])
>   File "/home/hadoop/spark/python/pyspark/rdd.py", line 1324, in saveAsTextFile
>     keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
>   File "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>   File "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o36.saveAsTextFile.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 28 in stage 0.0 failed 4 times, most recent failure: Lost task 28.3 in stage 0.0 (TID 41, ip-172-16-1-90.us-west-2.compute.internal): ExecutorLostFailure (executor lost)
> Driver stacktrace:
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
>     at scala.Option.foreach(Option.scala:236)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
Yarn log: > 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-152.us-west-2.compute.internal:41091 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-152.us-west-2.compute.internal:39160 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-152.us-west-2.compute.internal:45058 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-241.us-west-2.compute.internal:54111 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-238.us-west-2.compute.internal:45772 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-241.us-west-2.compute.internal:59509 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-238.us-west-2.compute.internal:35720 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:21:11 INFO network.ConnectionManager: Removing > SendingConnection to > ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,59509) > 14/11/08 08:21:11 INFO network.ConnectionManager: Removing > ReceivingConnection to > ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,59509) > 14/11/08 08:21:11 ERROR network.ConnectionManager: Corresponding > SendingConnection to > ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,59509) not > found > 14/11/08 08:21:11 INFO cluster.YarnClientSchedulerBackend: Executor 10 > disconnected, so removing it > 14/11/08 08:21:11 ERROR cluster.YarnClientClusterScheduler: Lost executor > 10 on ip-172-16-1-241.us-west-2.compute.internal: remote Akka client > disassociated > 
14/11/08 08:21:11 INFO scheduler.TaskSetManager: Re-queueing tasks for 10 > from TaskSet 0.0 > 14/11/08 08:21:11 WARN scheduler.TaskSetManager: Lost task 28.0 in stage > 0.0 (TID 28, ip-172-16-1-241.us-west-2.compute.internal): > ExecutorLostFailure (executor lost) > 14/11/08 08:21:11 INFO scheduler.DAGScheduler: Executor lost: 10 (epoch 0) > 14/11/08 08:21:11 INFO storage.BlockManagerMasterActor: Trying to remove > executor 10 from BlockManagerMaster. > 14/11/08 08:21:11 INFO storage.BlockManagerMaster: Removed 10 successfully > in removeExecutor > 14/11/08 08:21:20 INFO network.ConnectionManager: Removing > ReceivingConnection to > ConnectionManagerId(ip-172-16-1-194.us-west-2.compute.internal,45823) > 14/11/08 08:21:20 INFO network.ConnectionManager: Removing > SendingConnection to > ConnectionManagerId(ip-172-16-1-194.us-west-2.compute.internal,45823) > 14/11/08 08:21:20 INFO network.ConnectionManager: Removing > SendingConnection to > ConnectionManagerId(ip-172-16-1-194.us-west-2.compute.internal,45823) > 14/11/08 08:21:20 INFO cluster.YarnClientSchedulerBackend: Executor 5 > disconnected, so removing it > 14/11/08 08:21:20 ERROR cluster.YarnClientClusterScheduler: Lost executor > 5 on ip-172-16-1-194.us-west-2.compute.internal: remote Akka client > disassociated > 14/11/08 08:21:20 INFO scheduler.TaskSetManager: Re-queueing tasks for 5 > from TaskSet 0.0 > 14/11/08 08:21:20 WARN scheduler.TaskSetManager: Lost task 21.0 in stage > 0.0 (TID 21, ip-172-16-1-194.us-west-2.compute.internal): > ExecutorLostFailure (executor lost) > 14/11/08 08:21:20 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 1) > 14/11/08 08:21:20 INFO network.ConnectionManager: key already cancelled ? 
> sun.nio.ch.SelectionKeyImpl@3bb633cd > java.nio.channels.CancelledKeyException > at > sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73) > at > sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77) > at > org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:289) > at > org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139) > 14/11/08 08:21:20 INFO storage.BlockManagerMasterActor: Trying to remove > executor 5 from BlockManagerMaster. > 14/11/08 08:21:20 INFO storage.BlockManagerMaster: Removed 5 successfully > in removeExecutor > 14/11/08 08:21:21 INFO network.ConnectionManager: Removing > ReceivingConnection to > ConnectionManagerId(ip-172-16-1-92.us-west-2.compute.internal,50928) > 14/11/08 08:21:21 INFO network.ConnectionManager: Removing > SendingConnection to > ConnectionManagerId(ip-172-16-1-92.us-west-2.compute.internal,50928) > 14/11/08 08:21:21 INFO network.ConnectionManager: Removing > SendingConnection to > ConnectionManagerId(ip-172-16-1-92.us-west-2.compute.internal,50928) > 14/11/08 08:21:21 INFO cluster.YarnClientSchedulerBackend: Executor 27 > disconnected, so removing it > 14/11/08 08:21:21 ERROR cluster.YarnClientClusterScheduler: Lost executor > 27 on ip-172-16-1-92.us-west-2.compute.internal: remote Akka client > disassociated > 14/11/08 08:21:21 INFO scheduler.TaskSetManager: Re-queueing tasks for 27 > from TaskSet 0.0 > 14/11/08 08:21:21 WARN scheduler.TaskSetManager: Lost task 27.0 in stage > 0.0 (TID 27, ip-172-16-1-92.us-west-2.compute.internal): > ExecutorLostFailure (executor lost) > 14/11/08 08:21:21 INFO scheduler.DAGScheduler: Executor lost: 27 (epoch 2) > 14/11/08 08:21:21 INFO storage.BlockManagerMasterActor: Trying to remove > executor 27 from BlockManagerMaster. 
> 14/11/08 08:21:21 INFO storage.BlockManagerMaster: Removed 27 successfully > in removeExecutor > 14/11/08 08:21:21 INFO network.ConnectionManager: Removing > ReceivingConnection to > ConnectionManagerId(ip-172-16-1-152.us-west-2.compute.internal,41091) > 14/11/08 08:21:21 INFO network.ConnectionManager: Removing > SendingConnection to > ConnectionManagerId(ip-172-16-1-152.us-west-2.compute.internal,41091) > 14/11/08 08:21:21 INFO network.ConnectionManager: Removing > SendingConnection to > ConnectionManagerId(ip-172-16-1-152.us-west-2.compute.internal,41091) > 14/11/08 08:21:21 INFO cluster.YarnClientSchedulerBackend: Executor 20 > disconnected, so removing it > 14/11/08 08:21:21 ERROR cluster.YarnClientClusterScheduler: Lost executor > 20 on ip-172-16-1-152.us-west-2.compute.internal: remote Akka client > disassociated > 14/11/08 08:21:21 INFO scheduler.TaskSetManager: Re-queueing tasks for 20 > from TaskSet 0.0 > 14/11/08 08:21:21 WARN scheduler.TaskSetManager: Lost task 29.0 in stage > 0.0 (TID 29, ip-172-16-1-152.us-west-2.compute.internal): > ExecutorLostFailure (executor lost) > 14/11/08 08:21:21 INFO scheduler.DAGScheduler: Executor lost: 20 (epoch 3) > 14/11/08 08:21:21 INFO storage.BlockManagerMasterActor: Trying to remove > executor 20 from BlockManagerMaster. 
> 14/11/08 08:21:21 INFO storage.BlockManagerMaster: Removed 20 successfully > in removeExecutor > 14/11/08 08:21:26 INFO network.ConnectionManager: Removing > ReceivingConnection to > ConnectionManagerId(ip-172-16-1-23.us-west-2.compute.internal,51269) > 14/11/08 08:21:26 INFO network.ConnectionManager: Removing > SendingConnection to > ConnectionManagerId(ip-172-16-1-23.us-west-2.compute.internal,51269) > 14/11/08 08:21:26 INFO network.ConnectionManager: Removing > SendingConnection to > ConnectionManagerId(ip-172-16-1-23.us-west-2.compute.internal,51269) > 14/11/08 08:21:26 INFO cluster.YarnClientSchedulerBackend: Executor 6 > disconnected, so removing it > 14/11/08 08:21:26 ERROR cluster.YarnClientClusterScheduler: Lost executor > 6 on ip-172-16-1-23.us-west-2.compute.internal: remote Akka client > disassociated > 14/11/08 08:21:26 INFO scheduler.TaskSetManager: Re-queueing tasks for 6 > from TaskSet 0.0 > 14/11/08 08:21:26 WARN scheduler.TaskSetManager: Lost task 24.0 in stage > 0.0 (TID 24, ip-172-16-1-23.us-west-2.compute.internal): > ExecutorLostFailure (executor lost) > 14/11/08 08:21:26 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 4) > 14/11/08 08:21:26 INFO storage.BlockManagerMasterActor: Trying to remove > executor 6 from BlockManagerMaster. 
> 14/11/08 08:21:26 INFO storage.BlockManagerMaster: Removed 6 successfully > in removeExecutor > 14/11/08 08:21:26 INFO network.ConnectionManager: Removing > SendingConnection to > ConnectionManagerId(ip-172-16-1-90.us-west-2.compute.internal,46792) > 14/11/08 08:21:26 INFO network.ConnectionManager: Removing > ReceivingConnection to > ConnectionManagerId(ip-172-16-1-90.us-west-2.compute.internal,46792) > 14/11/08 08:21:26 ERROR network.ConnectionManager: Corresponding > SendingConnection to > ConnectionManagerId(ip-172-16-1-90.us-west-2.compute.internal,46792) not > found > 14/11/08 08:21:26 INFO cluster.YarnClientSchedulerBackend: Executor 21 > disconnected, so removing it > 14/11/08 08:21:26 ERROR cluster.YarnClientClusterScheduler: Lost executor > 21 on ip-172-16-1-90.us-west-2.compute.internal: remote Akka client > disassociated > 14/11/08 08:21:26 INFO scheduler.TaskSetManager: Re-queueing tasks for 21 > from TaskSet 0.0 > 14/11/08 08:21:26 WARN scheduler.TaskSetManager: Lost task 25.0 in stage > 0.0 (TID 25, ip-172-16-1-90.us-west-2.compute.internal): > ExecutorLostFailure (executor lost) > 14/11/08 08:21:26 INFO scheduler.DAGScheduler: Executor lost: 21 (epoch 5) > 14/11/08 08:21:26 INFO storage.BlockManagerMasterActor: Trying to remove > executor 21 from BlockManagerMaster. 
> 14/11/08 08:21:26 INFO storage.BlockManagerMaster: Removed 21 successfully > in removeExecutor > 14/11/08 08:21:29 INFO cluster.YarnClientSchedulerBackend: Executor 18 > disconnected, so removing it > 14/11/08 08:21:29 INFO network.ConnectionManager: Removing > ReceivingConnection to > ConnectionManagerId(ip-172-16-1-222.us-west-2.compute.internal,43883) > 14/11/08 08:21:29 INFO network.ConnectionManager: Removing > SendingConnection to > ConnectionManagerId(ip-172-16-1-222.us-west-2.compute.internal,43883) > 14/11/08 08:21:29 INFO network.ConnectionManager: Removing > SendingConnection to > ConnectionManagerId(ip-172-16-1-222.us-west-2.compute.internal,43883) > 14/11/08 08:21:29 ERROR cluster.YarnClientClusterScheduler: Lost executor > 18 on ip-172-16-1-222.us-west-2.compute.internal: remote Akka client > disassociated > 14/11/08 08:21:29 INFO scheduler.TaskSetManager: Re-queueing tasks for 18 > from TaskSet 0.0 > 14/11/08 08:21:29 WARN scheduler.TaskSetManager: Lost task 26.0 in stage > 0.0 (TID 26, ip-172-16-1-222.us-west-2.compute.internal): > ExecutorLostFailure (executor lost) > 14/11/08 08:21:29 INFO scheduler.DAGScheduler: Executor lost: 18 (epoch 6) > 14/11/08 08:21:29 INFO storage.BlockManagerMasterActor: Trying to remove > executor 18 from BlockManagerMaster. 
> 14/11/08 08:21:29 INFO storage.BlockManagerMaster: Removed 18 successfully > in removeExecutor > 14/11/08 08:21:30 INFO cluster.YarnClientSchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-16-1-194.us-west-2.compute.internal:50858/user/Executor#935992941] > with ID 31 > 14/11/08 08:21:30 INFO scheduler.TaskSetManager: Starting task 26.1 in > stage 0.0 (TID 30, ip-172-16-1-194.us-west-2.compute.internal, > PROCESS_LOCAL, 1122 bytes) > 14/11/08 08:21:30 INFO storage.BlockManagerMasterActor: Registering block > manager ip-172-16-1-194.us-west-2.compute.internal:44263 with 776.3 MB RAM > 14/11/08 08:21:31 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-194.us-west-2.compute.internal:44263 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:21:33 INFO network.ConnectionManager: Removing > SendingConnection to > ConnectionManagerId(ip-172-16-1-222.us-west-2.compute.internal,40102) > 14/11/08 08:21:33 INFO network.ConnectionManager: Removing > ReceivingConnection to > ConnectionManagerId(ip-172-16-1-222.us-west-2.compute.internal,40102) > 14/11/08 08:21:33 ERROR network.ConnectionManager: Corresponding > SendingConnection to > ConnectionManagerId(ip-172-16-1-222.us-west-2.compute.internal,40102) not > found > 14/11/08 08:21:33 INFO cluster.YarnClientSchedulerBackend: Executor 26 > disconnected, so removing it > 14/11/08 08:21:33 ERROR cluster.YarnClientClusterScheduler: Lost executor > 26 on ip-172-16-1-222.us-west-2.compute.internal: remote Akka client > disassociated > 14/11/08 08:21:33 INFO scheduler.TaskSetManager: Re-queueing tasks for 26 > from TaskSet 0.0 > 14/11/08 08:21:33 WARN scheduler.TaskSetManager: Lost task 23.0 in stage > 0.0 (TID 23, ip-172-16-1-222.us-west-2.compute.internal): > ExecutorLostFailure (executor lost) > 14/11/08 08:21:33 INFO scheduler.DAGScheduler: Executor lost: 26 (epoch 7) > 14/11/08 08:21:33 INFO storage.BlockManagerMasterActor: Trying to remove > executor 26 from 
BlockManagerMaster. > 14/11/08 08:21:33 INFO storage.BlockManagerMaster: Removed 26 successfully > in removeExecutor > 14/11/08 08:21:36 INFO network.ConnectionManager: Removing > ReceivingConnection to > ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,43310) > 14/11/08 08:21:36 INFO network.ConnectionManager: Removing > SendingConnection to > ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,43310) > 14/11/08 08:21:36 INFO cluster.YarnClientSchedulerBackend: Executor 1 > disconnected, so removing it > 14/11/08 08:21:36 ERROR cluster.YarnClientClusterScheduler: Lost executor > 1 on ip-172-16-1-241.us-west-2.compute.internal: remote Akka client > disassociated > 14/11/08 08:21:36 INFO scheduler.TaskSetManager: Re-queueing tasks for 1 > from TaskSet 0.0 > 14/11/08 08:21:36 WARN scheduler.TaskSetManager: Lost task 22.0 in stage > 0.0 (TID 22, ip-172-16-1-241.us-west-2.compute.internal): > ExecutorLostFailure (executor lost) > 14/11/08 08:21:36 ERROR network.SendingConnection: Exception while reading > SendingConnection to > ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,43310) > java.nio.channels.ClosedChannelException > at > sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295) > at > org.apache.spark.network.SendingConnection.read(Connection.scala:390) > at > org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 14/11/08 08:21:36 INFO scheduler.DAGScheduler: Executor lost: 1 (epoch 8) > 14/11/08 08:21:36 INFO storage.BlockManagerMasterActor: Trying to remove > executor 1 from BlockManagerMaster. 
> 14/11/08 08:21:36 INFO storage.BlockManagerMaster: Removed 1 successfully > in removeExecutor > 14/11/08 08:21:36 INFO network.ConnectionManager: Handling connection > error on connection to > ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,43310) > 14/11/08 08:21:36 INFO network.ConnectionManager: Removing > SendingConnection to > ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,43310) > 14/11/08 08:21:36 INFO network.ConnectionManager: Removing > SendingConnection to > ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,43310) > 14/11/08 08:21:40 INFO cluster.YarnClientSchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-16-1-194.us-west-2.compute.internal:58099/user/Executor#-112835629] > with ID 34 > 14/11/08 08:21:40 INFO scheduler.TaskSetManager: Starting task 22.1 in > stage 0.0 (TID 31, ip-172-16-1-194.us-west-2.compute.internal, > PROCESS_LOCAL, 1122 bytes) > 14/11/08 08:21:41 INFO storage.BlockManagerMasterActor: Registering block > manager ip-172-16-1-194.us-west-2.compute.internal:41093 with 776.3 MB RAM > 14/11/08 08:21:41 INFO cluster.YarnClientSchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-16-1-228.us-west-2.compute.internal:36136/user/Executor#318736262] > with ID 32 > 14/11/08 08:21:41 INFO scheduler.TaskSetManager: Starting task 23.1 in > stage 0.0 (TID 32, ip-172-16-1-228.us-west-2.compute.internal, > PROCESS_LOCAL, 1122 bytes) > 14/11/08 08:21:41 INFO cluster.YarnClientSchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-16-1-90.us-west-2.compute.internal:33130/user/Executor#1744030597] > with ID 33 > 14/11/08 08:21:41 INFO scheduler.TaskSetManager: Starting task 25.1 in > stage 0.0 (TID 33, ip-172-16-1-90.us-west-2.compute.internal, > PROCESS_LOCAL, 1122 bytes) > 14/11/08 08:21:41 INFO cluster.YarnClientSchedulerBackend: Registered > executor: > 
Actor[akka.tcp://sparkexecu...@ip-172-16-1-92.us-west-2.compute.internal:55503/user/Executor#574084779] > with ID 35 > 14/11/08 08:21:41 INFO scheduler.TaskSetManager: Starting task 24.1 in > stage 0.0 (TID 34, ip-172-16-1-92.us-west-2.compute.internal, > PROCESS_LOCAL, 1122 bytes) > 14/11/08 08:21:42 INFO storage.BlockManagerMasterActor: Registering block > manager ip-172-16-1-228.us-west-2.compute.internal:40128 with 776.3 MB RAM > 14/11/08 08:21:42 INFO storage.BlockManagerMasterActor: Registering block > manager ip-172-16-1-90.us-west-2.compute.internal:32839 with 776.3 MB RAM > 14/11/08 08:21:42 INFO storage.BlockManagerMasterActor: Registering block > manager ip-172-16-1-92.us-west-2.compute.internal:58081 with 776.3 MB RAM > 14/11/08 08:21:42 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-194.us-west-2.compute.internal:41093 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:21:43 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-228.us-west-2.compute.internal:40128 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:21:43 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-92.us-west-2.compute.internal:58081 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:21:43 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-90.us-west-2.compute.internal:32839 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:21:43 INFO cluster.YarnClientSchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-16-1-152.us-west-2.compute.internal:34268/user/Executor#-937582169] > with ID 36 > 14/11/08 08:21:43 INFO scheduler.TaskSetManager: Starting task 29.1 in > stage 0.0 (TID 35, ip-172-16-1-152.us-west-2.compute.internal, > PROCESS_LOCAL, 1122 bytes) > 14/11/08 08:21:44 INFO storage.BlockManagerMasterActor: Registering block > manager ip-172-16-1-152.us-west-2.compute.internal:52550 with 776.3 MB RAM > 14/11/08 08:21:45 INFO 
storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-152.us-west-2.compute.internal:52550 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:21:46 INFO cluster.YarnClientSchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-16-1-90.us-west-2.compute.internal:34555/user/Executor#-94727554] > with ID 37 > 14/11/08 08:21:46 INFO scheduler.TaskSetManager: Starting task 27.1 in > stage 0.0 (TID 36, ip-172-16-1-90.us-west-2.compute.internal, > PROCESS_LOCAL, 1122 bytes) > 14/11/08 08:21:46 INFO cluster.YarnClientSchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-16-1-228.us-west-2.compute.internal:34471/user/Executor#1412546630] > with ID 38 > 14/11/08 08:21:46 INFO scheduler.TaskSetManager: Starting task 21.1 in > stage 0.0 (TID 37, ip-172-16-1-228.us-west-2.compute.internal, > PROCESS_LOCAL, 1122 bytes) > 14/11/08 08:21:47 INFO storage.BlockManagerMasterActor: Registering block > manager ip-172-16-1-90.us-west-2.compute.internal:46194 with 776.3 MB RAM > 14/11/08 08:21:47 INFO storage.BlockManagerMasterActor: Registering block > manager ip-172-16-1-228.us-west-2.compute.internal:42275 with 776.3 MB RAM > 14/11/08 08:21:48 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-90.us-west-2.compute.internal:46194 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:21:48 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-228.us-west-2.compute.internal:42275 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:21:50 INFO cluster.YarnClientSchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-16-1-23.us-west-2.compute.internal:37122/user/Executor#1404320204] > with ID 39 > 14/11/08 08:21:51 INFO scheduler.TaskSetManager: Starting task 28.1 in > stage 0.0 (TID 38, ip-172-16-1-23.us-west-2.compute.internal, > PROCESS_LOCAL, 1122 bytes) > 14/11/08 08:21:51 INFO storage.BlockManagerMasterActor: Registering block > 
manager ip-172-16-1-23.us-west-2.compute.internal:33106 with 776.3 MB RAM > 14/11/08 08:21:52 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-23.us-west-2.compute.internal:33106 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:22:36 INFO cluster.YarnClientSchedulerBackend: Executor 39 > disconnected, so removing it > 14/11/08 08:22:36 ERROR cluster.YarnClientClusterScheduler: Lost executor > 39 on ip-172-16-1-23.us-west-2.compute.internal: remote Akka client > disassociated > 14/11/08 08:22:36 INFO scheduler.TaskSetManager: Re-queueing tasks for 39 > from TaskSet 0.0 > 14/11/08 08:22:36 WARN scheduler.TaskSetManager: Lost task 28.1 in stage > 0.0 (TID 38, ip-172-16-1-23.us-west-2.compute.internal): > ExecutorLostFailure (executor lost) > 14/11/08 08:22:36 INFO scheduler.DAGScheduler: Executor lost: 39 (epoch 9) > 14/11/08 08:22:36 INFO storage.BlockManagerMasterActor: Trying to remove > executor 39 from BlockManagerMaster. > 14/11/08 08:22:36 INFO storage.BlockManagerMaster: Removed 39 successfully > in removeExecutor > 14/11/08 08:22:57 INFO cluster.YarnClientSchedulerBackend: Executor 36 > disconnected, so removing it > 14/11/08 08:22:57 ERROR cluster.YarnClientClusterScheduler: Lost executor > 36 on ip-172-16-1-152.us-west-2.compute.internal: remote Akka client > disassociated > 14/11/08 08:22:57 INFO scheduler.TaskSetManager: Re-queueing tasks for 36 > from TaskSet 0.0 > 14/11/08 08:22:57 WARN scheduler.TaskSetManager: Lost task 29.1 in stage > 0.0 (TID 35, ip-172-16-1-152.us-west-2.compute.internal): > ExecutorLostFailure (executor lost) > 14/11/08 08:22:57 INFO scheduler.DAGScheduler: Executor lost: 36 (epoch 10) > 14/11/08 08:22:57 INFO storage.BlockManagerMasterActor: Trying to remove > executor 36 from BlockManagerMaster. 
> 14/11/08 08:22:57 INFO storage.BlockManagerMaster: Removed 36 successfully > in removeExecutor > 14/11/08 08:23:00 INFO cluster.YarnClientSchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-16-1-90.us-west-2.compute.internal:48033/user/Executor#-1088273404] > with ID 40 > 14/11/08 08:23:00 INFO scheduler.TaskSetManager: Starting task 29.2 in > stage 0.0 (TID 39, ip-172-16-1-90.us-west-2.compute.internal, > PROCESS_LOCAL, 1122 bytes) > 14/11/08 08:23:01 INFO storage.BlockManagerMasterActor: Registering block > manager ip-172-16-1-90.us-west-2.compute.internal:39067 with 776.3 MB RAM > 14/11/08 08:23:03 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-90.us-west-2.compute.internal:39067 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:23:15 INFO cluster.YarnClientSchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-16-1-23.us-west-2.compute.internal:48860/user/Executor#-369895446] > with ID 41 > 14/11/08 08:23:15 INFO scheduler.TaskSetManager: Starting task 28.2 in > stage 0.0 (TID 40, ip-172-16-1-23.us-west-2.compute.internal, > PROCESS_LOCAL, 1122 bytes) > 14/11/08 08:23:16 INFO storage.BlockManagerMasterActor: Registering block > manager ip-172-16-1-23.us-west-2.compute.internal:38093 with 776.3 MB RAM > 14/11/08 08:23:17 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-23.us-west-2.compute.internal:38093 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:23:32 INFO cluster.YarnClientSchedulerBackend: Executor 34 > disconnected, so removing it > 14/11/08 08:23:32 ERROR cluster.YarnClientClusterScheduler: Lost executor > 34 on ip-172-16-1-194.us-west-2.compute.internal: remote Akka client > disassociated > 14/11/08 08:23:32 INFO scheduler.TaskSetManager: Re-queueing tasks for 34 > from TaskSet 0.0 > 14/11/08 08:23:32 WARN scheduler.TaskSetManager: Lost task 22.1 in stage > 0.0 (TID 31, ip-172-16-1-194.us-west-2.compute.internal): > 
ExecutorLostFailure (executor lost) > 14/11/08 08:23:32 INFO scheduler.DAGScheduler: Executor lost: 34 (epoch 11) > 14/11/08 08:23:32 INFO storage.BlockManagerMasterActor: Trying to remove > executor 34 from BlockManagerMaster. > 14/11/08 08:23:32 INFO storage.BlockManagerMaster: Removed 34 successfully > in removeExecutor > 14/11/08 08:23:53 INFO cluster.YarnClientSchedulerBackend: Executor 41 > disconnected, so removing it > 14/11/08 08:23:53 ERROR cluster.YarnClientClusterScheduler: Lost executor > 41 on ip-172-16-1-23.us-west-2.compute.internal: remote Akka client > disassociated > 14/11/08 08:23:53 INFO scheduler.TaskSetManager: Re-queueing tasks for 41 > from TaskSet 0.0 > 14/11/08 08:23:53 WARN scheduler.TaskSetManager: Lost task 28.2 in stage > 0.0 (TID 40, ip-172-16-1-23.us-west-2.compute.internal): > ExecutorLostFailure (executor lost) > 14/11/08 08:23:53 INFO scheduler.DAGScheduler: Executor lost: 41 (epoch 12) > 14/11/08 08:23:53 INFO storage.BlockManagerMasterActor: Trying to remove > executor 41 from BlockManagerMaster. 
> 14/11/08 08:23:53 INFO storage.BlockManagerMaster: Removed 41 successfully > in removeExecutor > 14/11/08 08:23:57 INFO cluster.YarnClientSchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-16-1-90.us-west-2.compute.internal:58017/user/Executor#2094507560] > with ID 42 > 14/11/08 08:23:57 INFO scheduler.TaskSetManager: Starting task 28.3 in > stage 0.0 (TID 41, ip-172-16-1-90.us-west-2.compute.internal, > PROCESS_LOCAL, 1122 bytes) > 14/11/08 08:23:58 INFO storage.BlockManagerMasterActor: Registering block > manager ip-172-16-1-90.us-west-2.compute.internal:41182 with 776.3 MB RAM > 14/11/08 08:24:00 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-90.us-west-2.compute.internal:41182 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:24:04 INFO cluster.YarnClientSchedulerBackend: Executor 35 > disconnected, so removing it > 14/11/08 08:24:04 ERROR cluster.YarnClientClusterScheduler: Lost executor > 35 on ip-172-16-1-92.us-west-2.compute.internal: remote Akka client > disassociated > 14/11/08 08:24:04 INFO scheduler.TaskSetManager: Re-queueing tasks for 35 > from TaskSet 0.0 > 14/11/08 08:24:04 WARN scheduler.TaskSetManager: Lost task 24.1 in stage > 0.0 (TID 34, ip-172-16-1-92.us-west-2.compute.internal): > ExecutorLostFailure (executor lost) > 14/11/08 08:24:04 INFO scheduler.DAGScheduler: Executor lost: 35 (epoch 13) > 14/11/08 08:24:04 INFO storage.BlockManagerMasterActor: Trying to remove > executor 35 from BlockManagerMaster. 
> 14/11/08 08:24:04 INFO storage.BlockManagerMaster: Removed 35 successfully > in removeExecutor > 14/11/08 08:24:17 INFO cluster.YarnClientSchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-16-1-90.us-west-2.compute.internal:36395/user/Executor#-1907878650] > with ID 43 > 14/11/08 08:24:17 INFO scheduler.TaskSetManager: Starting task 24.2 in > stage 0.0 (TID 42, ip-172-16-1-90.us-west-2.compute.internal, > PROCESS_LOCAL, 1122 bytes) > 14/11/08 08:24:18 INFO storage.BlockManagerMasterActor: Registering block > manager ip-172-16-1-90.us-west-2.compute.internal:46948 with 776.3 MB RAM > 14/11/08 08:24:20 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-90.us-west-2.compute.internal:46948 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:24:21 INFO cluster.YarnClientSchedulerBackend: Executor 40 > disconnected, so removing it > 14/11/08 08:24:21 ERROR cluster.YarnClientClusterScheduler: Lost executor > 40 on ip-172-16-1-90.us-west-2.compute.internal: remote Akka client > disassociated > 14/11/08 08:24:21 INFO scheduler.TaskSetManager: Re-queueing tasks for 40 > from TaskSet 0.0 > 14/11/08 08:24:21 WARN scheduler.TaskSetManager: Lost task 29.2 in stage > 0.0 (TID 39, ip-172-16-1-90.us-west-2.compute.internal): > ExecutorLostFailure (executor lost) > 14/11/08 08:24:21 INFO scheduler.DAGScheduler: Executor lost: 40 (epoch 14) > 14/11/08 08:24:21 INFO storage.BlockManagerMasterActor: Trying to remove > executor 40 from BlockManagerMaster. 
> 14/11/08 08:24:21 INFO storage.BlockManagerMaster: Removed 40 successfully > in removeExecutor > 14/11/08 08:24:31 INFO cluster.YarnClientSchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-16-1-90.us-west-2.compute.internal:34467/user/Executor#-1100688472] > with ID 44 > 14/11/08 08:24:31 INFO scheduler.TaskSetManager: Starting task 29.3 in > stage 0.0 (TID 43, ip-172-16-1-90.us-west-2.compute.internal, > PROCESS_LOCAL, 1122 bytes) > 14/11/08 08:24:32 INFO storage.BlockManagerMasterActor: Registering block > manager ip-172-16-1-90.us-west-2.compute.internal:40126 with 776.3 MB RAM > 14/11/08 08:24:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-90.us-west-2.compute.internal:40126 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:24:48 INFO cluster.YarnClientSchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-16-1-90.us-west-2.compute.internal:53257/user/Executor#-745380917] > with ID 45 > 14/11/08 08:24:48 INFO scheduler.TaskSetManager: Starting task 22.2 in > stage 0.0 (TID 44, ip-172-16-1-90.us-west-2.compute.internal, > PROCESS_LOCAL, 1122 bytes) > 14/11/08 08:24:49 INFO storage.BlockManagerMasterActor: Registering block > manager ip-172-16-1-90.us-west-2.compute.internal:46252 with 776.3 MB RAM > 14/11/08 08:24:52 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 > in memory on ip-172-16-1-90.us-west-2.compute.internal:46252 (size: 596.9 > KB, free: 775.7 MB) > 14/11/08 08:25:16 INFO cluster.YarnClientSchedulerBackend: Executor 38 > disconnected, so removing it > 14/11/08 08:25:16 ERROR cluster.YarnClientClusterScheduler: Lost executor > 38 on ip-172-16-1-228.us-west-2.compute.internal: remote Akka client > disassociated > 14/11/08 08:25:16 INFO scheduler.TaskSetManager: Re-queueing tasks for 38 > from TaskSet 0.0 > 14/11/08 08:25:16 WARN scheduler.TaskSetManager: Lost task 21.1 in stage > 0.0 (TID 37, ip-172-16-1-228.us-west-2.compute.internal): > 
ExecutorLostFailure (executor lost) > 14/11/08 08:25:16 INFO scheduler.DAGScheduler: Executor lost: 38 (epoch 15) > 14/11/08 08:25:16 INFO storage.BlockManagerMasterActor: Trying to remove > executor 38 from BlockManagerMaster. > 14/11/08 08:25:16 INFO storage.BlockManagerMaster: Removed 38 successfully > in removeExecutor > 14/11/08 08:25:37 INFO cluster.YarnClientSchedulerBackend: Executor 42 > disconnected, so removing it > 14/11/08 08:25:37 ERROR cluster.YarnClientClusterScheduler: Lost executor > 42 on ip-172-16-1-90.us-west-2.compute.internal: remote Akka client > disassociated > 14/11/08 08:25:37 INFO scheduler.TaskSetManager: Re-queueing tasks for 42 > from TaskSet 0.0 > 14/11/08 08:25:37 WARN scheduler.TaskSetManager: Lost task 28.3 in stage > 0.0 (TID 41, ip-172-16-1-90.us-west-2.compute.internal): > ExecutorLostFailure (executor lost) > 14/11/08 08:25:37 ERROR scheduler.TaskSetManager: Task 28 in stage 0.0 > failed 4 times; aborting job > 14/11/08 08:25:37 INFO cluster.YarnClientClusterScheduler: Cancelling > stage 0 > 14/11/08 08:25:37 INFO cluster.YarnClientClusterScheduler: Stage 0 was > cancelled > 14/11/08 08:25:37 INFO scheduler.DAGScheduler: Failed to run > saveAsTextFile at NativeMethodAccessorImpl.java:-2 > 14/11/08 08:25:37 INFO scheduler.DAGScheduler: Executor lost: 42 (epoch 16) > 14/11/08 08:25:37 INFO storage.BlockManagerMasterActor: Trying to remove > executor 42 from BlockManagerMaster. > 14/11/08 08:25:37 INFO storage.BlockManagerMaster: Removed 42 successfully > in removeExecutor
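
For reference, the four settings from the top of this reply can be collected in one place before the context is created. What follows is only a minimal sketch, assuming PySpark on a Spark 1.x release like the one in the logs. The names SUGGESTED_SETTINGS, merged_settings and make_spark_conf are made up for illustration and are not part of Spark. Newer Spark releases may no longer recognise some of these properties (spark.akka.frameSize and spark.storage.memoryFraction in particular), so check the configuration page for the version in use.

```python
# Property names and values are the ones quoted in this reply (Spark 1.x era).
# They are reported to have worked for one setup, not a general recommendation.
SUGGESTED_SETTINGS = {
    "spark.rdd.compress": "true",
    "spark.storage.memoryFraction": "1",
    "spark.core.connection.ack.wait.timeout": "6000",
    "spark.akka.frameSize": "50",
}


def merged_settings(overrides=None):
    """Return a copy of the suggested settings with optional overrides applied."""
    settings = dict(SUGGESTED_SETTINGS)
    settings.update(overrides or {})
    return settings


def make_spark_conf(app_name, overrides=None):
    """Build a SparkConf carrying the merged settings (hypothetical helper)."""
    # Imported lazily so the settings can be inspected without pyspark installed.
    from pyspark import SparkConf

    conf = SparkConf().setAppName(app_name)
    # SparkConf.setAll takes a list of (key, value) pairs.
    conf.setAll(list(merged_settings(overrides).items()))
    return conf
```

Usage would look like sc = SparkContext(conf=make_spark_conf("spark_lda")), where the application name is just a placeholder. Keeping the values in a dict makes it easy to change one of them, for example the ack timeout, without touching the rest.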