I used to hit this issue, but setting the following configuration options
when creating the SparkContext seems to work:

      .set("spark.rdd.compress","true")
      .set("spark.storage.memoryFraction","1")
      .set("spark.core.connection.ack.wait.timeout","6000")
      .set("spark.akka.frameSize","50")
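
For a PySpark job like the one in the quoted traceback, a minimal sketch of
applying these settings could look like the following. The helper names
(apply_settings, make_context) and the app name are made up for illustration,
and the values are simply the ones listed above, not tuned recommendations.

```python
# Settings copied verbatim from the list above; adjust them for your cluster.
LONG_TASK_SETTINGS = [
    ("spark.rdd.compress", "true"),
    ("spark.storage.memoryFraction", "1"),
    ("spark.core.connection.ack.wait.timeout", "6000"),
    ("spark.akka.frameSize", "50"),
]


def apply_settings(conf, settings=LONG_TASK_SETTINGS):
    """Call conf.set(key, value) for every pair and return the same conf.

    Works with any object that exposes set(key, value), such as
    pyspark.SparkConf.
    """
    for key, value in settings:
        conf.set(key, value)
    return conf


def make_context(app_name="long-running-map"):
    """Build a SparkContext with the settings applied (hypothetical helper).

    PySpark is imported lazily so the rest of this sketch can be read and
    exercised without a Spark installation.
    """
    from pyspark import SparkConf, SparkContext

    conf = apply_settings(SparkConf().setAppName(app_name))
    return SparkContext(conf=conf)
```

The settings have to be in place before the SparkContext is created, which is
why the sketch builds the conf first and only then constructs the context.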

Most likely, one of the executors is stuck in a GC pause, and meanwhile the
master thinks it is dead and throws a timeout/cancelled-key exception.

Thanks
Best Regards

On Sat, Nov 8, 2014 at 2:58 PM, <jan.zi...@centrum.cz> wrote:

> Hi,
>
> I am getting ExecutorLostFailure when I run spark on YARN and in map I
> perform very long tasks (couple of hours). Error Log is below.
>
> Do you know if it is possible to set something to make it possible for
> Spark to perform these very long running jobs in map?
>
> Thank you very much for any advice.
>
> Best regards,
> Jan
>
> Spark log:
> 4533,931: [GC 394578K->20882K(1472000K), 0,0226470 secs]
> Traceback (most recent call last):
>   File "/home/hadoop/spark_stuff/spark_lda.py", line 112, in <module>
>     models.saveAsTextFile(sys.argv[1])
>   File "/home/hadoop/spark/python/pyspark/rdd.py", line 1324, in
> saveAsTextFile
>     keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
>   File
> "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>   File
> "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
> 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o36.saveAsTextFile.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 28 in stage 0.0 failed 4 times, most recent failure: Lost task 28.3 in
> stage 0.0 (TID 41, ip-172-16-1-90.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> Driver stacktrace:
>         at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
>         at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
>         at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
>         at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
>         at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
>         at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
>         at scala.Option.foreach(Option.scala:236)
>         at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
>         at
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
> Yarn log:
> 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-152.us-west-2.compute.internal:41091 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-152.us-west-2.compute.internal:39160 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-152.us-west-2.compute.internal:45058 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-241.us-west-2.compute.internal:54111 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-238.us-west-2.compute.internal:45772 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-241.us-west-2.compute.internal:59509 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-238.us-west-2.compute.internal:35720 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:21:11 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,59509)
> 14/11/08 08:21:11 INFO network.ConnectionManager: Removing
> ReceivingConnection to
> ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,59509)
> 14/11/08 08:21:11 ERROR network.ConnectionManager: Corresponding
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,59509) not
> found
> 14/11/08 08:21:11 INFO cluster.YarnClientSchedulerBackend: Executor 10
> disconnected, so removing it
> 14/11/08 08:21:11 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 10 on ip-172-16-1-241.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:21:11 INFO scheduler.TaskSetManager: Re-queueing tasks for 10
> from TaskSet 0.0
> 14/11/08 08:21:11 WARN scheduler.TaskSetManager: Lost task 28.0 in stage
> 0.0 (TID 28, ip-172-16-1-241.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:21:11 INFO scheduler.DAGScheduler: Executor lost: 10 (epoch 0)
> 14/11/08 08:21:11 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 10 from BlockManagerMaster.
> 14/11/08 08:21:11 INFO storage.BlockManagerMaster: Removed 10 successfully
> in removeExecutor
> 14/11/08 08:21:20 INFO network.ConnectionManager: Removing
> ReceivingConnection to
> ConnectionManagerId(ip-172-16-1-194.us-west-2.compute.internal,45823)
> 14/11/08 08:21:20 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-194.us-west-2.compute.internal,45823)
> 14/11/08 08:21:20 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-194.us-west-2.compute.internal,45823)
> 14/11/08 08:21:20 INFO cluster.YarnClientSchedulerBackend: Executor 5
> disconnected, so removing it
> 14/11/08 08:21:20 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 5 on ip-172-16-1-194.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:21:20 INFO scheduler.TaskSetManager: Re-queueing tasks for 5
> from TaskSet 0.0
> 14/11/08 08:21:20 WARN scheduler.TaskSetManager: Lost task 21.0 in stage
> 0.0 (TID 21, ip-172-16-1-194.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:21:20 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 1)
> 14/11/08 08:21:20 INFO network.ConnectionManager: key already cancelled ?
> sun.nio.ch.SelectionKeyImpl@3bb633cd
> java.nio.channels.CancelledKeyException
>         at
> sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>         at
> sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>         at
> org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:289)
>         at
> org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
> 14/11/08 08:21:20 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 5 from BlockManagerMaster.
> 14/11/08 08:21:20 INFO storage.BlockManagerMaster: Removed 5 successfully
> in removeExecutor
> 14/11/08 08:21:21 INFO network.ConnectionManager: Removing
> ReceivingConnection to
> ConnectionManagerId(ip-172-16-1-92.us-west-2.compute.internal,50928)
> 14/11/08 08:21:21 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-92.us-west-2.compute.internal,50928)
> 14/11/08 08:21:21 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-92.us-west-2.compute.internal,50928)
> 14/11/08 08:21:21 INFO cluster.YarnClientSchedulerBackend: Executor 27
> disconnected, so removing it
> 14/11/08 08:21:21 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 27 on ip-172-16-1-92.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:21:21 INFO scheduler.TaskSetManager: Re-queueing tasks for 27
> from TaskSet 0.0
> 14/11/08 08:21:21 WARN scheduler.TaskSetManager: Lost task 27.0 in stage
> 0.0 (TID 27, ip-172-16-1-92.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:21:21 INFO scheduler.DAGScheduler: Executor lost: 27 (epoch 2)
> 14/11/08 08:21:21 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 27 from BlockManagerMaster.
> 14/11/08 08:21:21 INFO storage.BlockManagerMaster: Removed 27 successfully
> in removeExecutor
> 14/11/08 08:21:21 INFO network.ConnectionManager: Removing
> ReceivingConnection to
> ConnectionManagerId(ip-172-16-1-152.us-west-2.compute.internal,41091)
> 14/11/08 08:21:21 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-152.us-west-2.compute.internal,41091)
> 14/11/08 08:21:21 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-152.us-west-2.compute.internal,41091)
> 14/11/08 08:21:21 INFO cluster.YarnClientSchedulerBackend: Executor 20
> disconnected, so removing it
> 14/11/08 08:21:21 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 20 on ip-172-16-1-152.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:21:21 INFO scheduler.TaskSetManager: Re-queueing tasks for 20
> from TaskSet 0.0
> 14/11/08 08:21:21 WARN scheduler.TaskSetManager: Lost task 29.0 in stage
> 0.0 (TID 29, ip-172-16-1-152.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:21:21 INFO scheduler.DAGScheduler: Executor lost: 20 (epoch 3)
> 14/11/08 08:21:21 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 20 from BlockManagerMaster.
> 14/11/08 08:21:21 INFO storage.BlockManagerMaster: Removed 20 successfully
> in removeExecutor
> 14/11/08 08:21:26 INFO network.ConnectionManager: Removing
> ReceivingConnection to
> ConnectionManagerId(ip-172-16-1-23.us-west-2.compute.internal,51269)
> 14/11/08 08:21:26 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-23.us-west-2.compute.internal,51269)
> 14/11/08 08:21:26 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-23.us-west-2.compute.internal,51269)
> 14/11/08 08:21:26 INFO cluster.YarnClientSchedulerBackend: Executor 6
> disconnected, so removing it
> 14/11/08 08:21:26 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 6 on ip-172-16-1-23.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:21:26 INFO scheduler.TaskSetManager: Re-queueing tasks for 6
> from TaskSet 0.0
> 14/11/08 08:21:26 WARN scheduler.TaskSetManager: Lost task 24.0 in stage
> 0.0 (TID 24, ip-172-16-1-23.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:21:26 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 4)
> 14/11/08 08:21:26 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 6 from BlockManagerMaster.
> 14/11/08 08:21:26 INFO storage.BlockManagerMaster: Removed 6 successfully
> in removeExecutor
> 14/11/08 08:21:26 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-90.us-west-2.compute.internal,46792)
> 14/11/08 08:21:26 INFO network.ConnectionManager: Removing
> ReceivingConnection to
> ConnectionManagerId(ip-172-16-1-90.us-west-2.compute.internal,46792)
> 14/11/08 08:21:26 ERROR network.ConnectionManager: Corresponding
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-90.us-west-2.compute.internal,46792) not
> found
> 14/11/08 08:21:26 INFO cluster.YarnClientSchedulerBackend: Executor 21
> disconnected, so removing it
> 14/11/08 08:21:26 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 21 on ip-172-16-1-90.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:21:26 INFO scheduler.TaskSetManager: Re-queueing tasks for 21
> from TaskSet 0.0
> 14/11/08 08:21:26 WARN scheduler.TaskSetManager: Lost task 25.0 in stage
> 0.0 (TID 25, ip-172-16-1-90.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:21:26 INFO scheduler.DAGScheduler: Executor lost: 21 (epoch 5)
> 14/11/08 08:21:26 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 21 from BlockManagerMaster.
> 14/11/08 08:21:26 INFO storage.BlockManagerMaster: Removed 21 successfully
> in removeExecutor
> 14/11/08 08:21:29 INFO cluster.YarnClientSchedulerBackend: Executor 18
> disconnected, so removing it
> 14/11/08 08:21:29 INFO network.ConnectionManager: Removing
> ReceivingConnection to
> ConnectionManagerId(ip-172-16-1-222.us-west-2.compute.internal,43883)
> 14/11/08 08:21:29 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-222.us-west-2.compute.internal,43883)
> 14/11/08 08:21:29 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-222.us-west-2.compute.internal,43883)
> 14/11/08 08:21:29 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 18 on ip-172-16-1-222.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:21:29 INFO scheduler.TaskSetManager: Re-queueing tasks for 18
> from TaskSet 0.0
> 14/11/08 08:21:29 WARN scheduler.TaskSetManager: Lost task 26.0 in stage
> 0.0 (TID 26, ip-172-16-1-222.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:21:29 INFO scheduler.DAGScheduler: Executor lost: 18 (epoch 6)
> 14/11/08 08:21:29 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 18 from BlockManagerMaster.
> 14/11/08 08:21:29 INFO storage.BlockManagerMaster: Removed 18 successfully
> in removeExecutor
> 14/11/08 08:21:30 INFO cluster.YarnClientSchedulerBackend: Registered
> executor:
> Actor[akka.tcp://sparkexecu...@ip-172-16-1-194.us-west-2.compute.internal:50858/user/Executor#935992941]
> with ID 31
> 14/11/08 08:21:30 INFO scheduler.TaskSetManager: Starting task 26.1 in
> stage 0.0 (TID 30, ip-172-16-1-194.us-west-2.compute.internal,
> PROCESS_LOCAL, 1122 bytes)
> 14/11/08 08:21:30 INFO storage.BlockManagerMasterActor: Registering block
> manager ip-172-16-1-194.us-west-2.compute.internal:44263 with 776.3 MB RAM
> 14/11/08 08:21:31 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-194.us-west-2.compute.internal:44263 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:21:33 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-222.us-west-2.compute.internal,40102)
> 14/11/08 08:21:33 INFO network.ConnectionManager: Removing
> ReceivingConnection to
> ConnectionManagerId(ip-172-16-1-222.us-west-2.compute.internal,40102)
> 14/11/08 08:21:33 ERROR network.ConnectionManager: Corresponding
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-222.us-west-2.compute.internal,40102) not
> found
> 14/11/08 08:21:33 INFO cluster.YarnClientSchedulerBackend: Executor 26
> disconnected, so removing it
> 14/11/08 08:21:33 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 26 on ip-172-16-1-222.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:21:33 INFO scheduler.TaskSetManager: Re-queueing tasks for 26
> from TaskSet 0.0
> 14/11/08 08:21:33 WARN scheduler.TaskSetManager: Lost task 23.0 in stage
> 0.0 (TID 23, ip-172-16-1-222.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:21:33 INFO scheduler.DAGScheduler: Executor lost: 26 (epoch 7)
> 14/11/08 08:21:33 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 26 from BlockManagerMaster.
> 14/11/08 08:21:33 INFO storage.BlockManagerMaster: Removed 26 successfully
> in removeExecutor
> 14/11/08 08:21:36 INFO network.ConnectionManager: Removing
> ReceivingConnection to
> ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,43310)
> 14/11/08 08:21:36 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,43310)
> 14/11/08 08:21:36 INFO cluster.YarnClientSchedulerBackend: Executor 1
> disconnected, so removing it
> 14/11/08 08:21:36 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 1 on ip-172-16-1-241.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:21:36 INFO scheduler.TaskSetManager: Re-queueing tasks for 1
> from TaskSet 0.0
> 14/11/08 08:21:36 WARN scheduler.TaskSetManager: Lost task 22.0 in stage
> 0.0 (TID 22, ip-172-16-1-241.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:21:36 ERROR network.SendingConnection: Exception while reading
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,43310)
> java.nio.channels.ClosedChannelException
>         at
> sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
>         at
> org.apache.spark.network.SendingConnection.read(Connection.scala:390)
>         at
> org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> 14/11/08 08:21:36 INFO scheduler.DAGScheduler: Executor lost: 1 (epoch 8)
> 14/11/08 08:21:36 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 1 from BlockManagerMaster.
> 14/11/08 08:21:36 INFO storage.BlockManagerMaster: Removed 1 successfully
> in removeExecutor
> 14/11/08 08:21:36 INFO network.ConnectionManager: Handling connection
> error on connection to
> ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,43310)
> 14/11/08 08:21:36 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,43310)
> 14/11/08 08:21:36 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionManagerId(ip-172-16-1-241.us-west-2.compute.internal,43310)
> 14/11/08 08:21:40 INFO cluster.YarnClientSchedulerBackend: Registered
> executor:
> Actor[akka.tcp://sparkexecu...@ip-172-16-1-194.us-west-2.compute.internal:58099/user/Executor#-112835629]
> with ID 34
> 14/11/08 08:21:40 INFO scheduler.TaskSetManager: Starting task 22.1 in
> stage 0.0 (TID 31, ip-172-16-1-194.us-west-2.compute.internal,
> PROCESS_LOCAL, 1122 bytes)
> 14/11/08 08:21:41 INFO storage.BlockManagerMasterActor: Registering block
> manager ip-172-16-1-194.us-west-2.compute.internal:41093 with 776.3 MB RAM
> 14/11/08 08:21:41 INFO cluster.YarnClientSchedulerBackend: Registered
> executor:
> Actor[akka.tcp://sparkexecu...@ip-172-16-1-228.us-west-2.compute.internal:36136/user/Executor#318736262]
> with ID 32
> 14/11/08 08:21:41 INFO scheduler.TaskSetManager: Starting task 23.1 in
> stage 0.0 (TID 32, ip-172-16-1-228.us-west-2.compute.internal,
> PROCESS_LOCAL, 1122 bytes)
> 14/11/08 08:21:41 INFO cluster.YarnClientSchedulerBackend: Registered
> executor:
> Actor[akka.tcp://sparkexecu...@ip-172-16-1-90.us-west-2.compute.internal:33130/user/Executor#1744030597]
> with ID 33
> 14/11/08 08:21:41 INFO scheduler.TaskSetManager: Starting task 25.1 in
> stage 0.0 (TID 33, ip-172-16-1-90.us-west-2.compute.internal,
> PROCESS_LOCAL, 1122 bytes)
> 14/11/08 08:21:41 INFO cluster.YarnClientSchedulerBackend: Registered
> executor:
> Actor[akka.tcp://sparkexecu...@ip-172-16-1-92.us-west-2.compute.internal:55503/user/Executor#574084779]
> with ID 35
> 14/11/08 08:21:41 INFO scheduler.TaskSetManager: Starting task 24.1 in
> stage 0.0 (TID 34, ip-172-16-1-92.us-west-2.compute.internal,
> PROCESS_LOCAL, 1122 bytes)
> 14/11/08 08:21:42 INFO storage.BlockManagerMasterActor: Registering block
> manager ip-172-16-1-228.us-west-2.compute.internal:40128 with 776.3 MB RAM
> 14/11/08 08:21:42 INFO storage.BlockManagerMasterActor: Registering block
> manager ip-172-16-1-90.us-west-2.compute.internal:32839 with 776.3 MB RAM
> 14/11/08 08:21:42 INFO storage.BlockManagerMasterActor: Registering block
> manager ip-172-16-1-92.us-west-2.compute.internal:58081 with 776.3 MB RAM
> 14/11/08 08:21:42 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-194.us-west-2.compute.internal:41093 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:21:43 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-228.us-west-2.compute.internal:40128 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:21:43 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-92.us-west-2.compute.internal:58081 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:21:43 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-90.us-west-2.compute.internal:32839 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:21:43 INFO cluster.YarnClientSchedulerBackend: Registered
> executor:
> Actor[akka.tcp://sparkexecu...@ip-172-16-1-152.us-west-2.compute.internal:34268/user/Executor#-937582169]
> with ID 36
> 14/11/08 08:21:43 INFO scheduler.TaskSetManager: Starting task 29.1 in
> stage 0.0 (TID 35, ip-172-16-1-152.us-west-2.compute.internal,
> PROCESS_LOCAL, 1122 bytes)
> 14/11/08 08:21:44 INFO storage.BlockManagerMasterActor: Registering block
> manager ip-172-16-1-152.us-west-2.compute.internal:52550 with 776.3 MB RAM
> 14/11/08 08:21:45 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-152.us-west-2.compute.internal:52550 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:21:46 INFO cluster.YarnClientSchedulerBackend: Registered
> executor:
> Actor[akka.tcp://sparkexecu...@ip-172-16-1-90.us-west-2.compute.internal:34555/user/Executor#-94727554]
> with ID 37
> 14/11/08 08:21:46 INFO scheduler.TaskSetManager: Starting task 27.1 in
> stage 0.0 (TID 36, ip-172-16-1-90.us-west-2.compute.internal,
> PROCESS_LOCAL, 1122 bytes)
> 14/11/08 08:21:46 INFO cluster.YarnClientSchedulerBackend: Registered
> executor:
> Actor[akka.tcp://sparkexecu...@ip-172-16-1-228.us-west-2.compute.internal:34471/user/Executor#1412546630]
> with ID 38
> 14/11/08 08:21:46 INFO scheduler.TaskSetManager: Starting task 21.1 in
> stage 0.0 (TID 37, ip-172-16-1-228.us-west-2.compute.internal,
> PROCESS_LOCAL, 1122 bytes)
> 14/11/08 08:21:47 INFO storage.BlockManagerMasterActor: Registering block
> manager ip-172-16-1-90.us-west-2.compute.internal:46194 with 776.3 MB RAM
> 14/11/08 08:21:47 INFO storage.BlockManagerMasterActor: Registering block
> manager ip-172-16-1-228.us-west-2.compute.internal:42275 with 776.3 MB RAM
> 14/11/08 08:21:48 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-90.us-west-2.compute.internal:46194 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:21:48 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-228.us-west-2.compute.internal:42275 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:21:50 INFO cluster.YarnClientSchedulerBackend: Registered
> executor:
> Actor[akka.tcp://sparkexecu...@ip-172-16-1-23.us-west-2.compute.internal:37122/user/Executor#1404320204]
> with ID 39
> 14/11/08 08:21:51 INFO scheduler.TaskSetManager: Starting task 28.1 in
> stage 0.0 (TID 38, ip-172-16-1-23.us-west-2.compute.internal,
> PROCESS_LOCAL, 1122 bytes)
> 14/11/08 08:21:51 INFO storage.BlockManagerMasterActor: Registering block
> manager ip-172-16-1-23.us-west-2.compute.internal:33106 with 776.3 MB RAM
> 14/11/08 08:21:52 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-23.us-west-2.compute.internal:33106 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:22:36 INFO cluster.YarnClientSchedulerBackend: Executor 39
> disconnected, so removing it
> 14/11/08 08:22:36 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 39 on ip-172-16-1-23.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:22:36 INFO scheduler.TaskSetManager: Re-queueing tasks for 39
> from TaskSet 0.0
> 14/11/08 08:22:36 WARN scheduler.TaskSetManager: Lost task 28.1 in stage
> 0.0 (TID 38, ip-172-16-1-23.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:22:36 INFO scheduler.DAGScheduler: Executor lost: 39 (epoch 9)
> 14/11/08 08:22:36 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 39 from BlockManagerMaster.
> 14/11/08 08:22:36 INFO storage.BlockManagerMaster: Removed 39 successfully
> in removeExecutor
> 14/11/08 08:22:57 INFO cluster.YarnClientSchedulerBackend: Executor 36
> disconnected, so removing it
> 14/11/08 08:22:57 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 36 on ip-172-16-1-152.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:22:57 INFO scheduler.TaskSetManager: Re-queueing tasks for 36
> from TaskSet 0.0
> 14/11/08 08:22:57 WARN scheduler.TaskSetManager: Lost task 29.1 in stage
> 0.0 (TID 35, ip-172-16-1-152.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:22:57 INFO scheduler.DAGScheduler: Executor lost: 36 (epoch 10)
> 14/11/08 08:22:57 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 36 from BlockManagerMaster.
> 14/11/08 08:22:57 INFO storage.BlockManagerMaster: Removed 36 successfully
> in removeExecutor
> 14/11/08 08:23:00 INFO cluster.YarnClientSchedulerBackend: Registered
> executor:
> Actor[akka.tcp://sparkexecu...@ip-172-16-1-90.us-west-2.compute.internal:48033/user/Executor#-1088273404]
> with ID 40
> 14/11/08 08:23:00 INFO scheduler.TaskSetManager: Starting task 29.2 in
> stage 0.0 (TID 39, ip-172-16-1-90.us-west-2.compute.internal,
> PROCESS_LOCAL, 1122 bytes)
> 14/11/08 08:23:01 INFO storage.BlockManagerMasterActor: Registering block
> manager ip-172-16-1-90.us-west-2.compute.internal:39067 with 776.3 MB RAM
> 14/11/08 08:23:03 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-90.us-west-2.compute.internal:39067 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:23:15 INFO cluster.YarnClientSchedulerBackend: Registered
> executor:
> Actor[akka.tcp://sparkexecu...@ip-172-16-1-23.us-west-2.compute.internal:48860/user/Executor#-369895446]
> with ID 41
> 14/11/08 08:23:15 INFO scheduler.TaskSetManager: Starting task 28.2 in
> stage 0.0 (TID 40, ip-172-16-1-23.us-west-2.compute.internal,
> PROCESS_LOCAL, 1122 bytes)
> 14/11/08 08:23:16 INFO storage.BlockManagerMasterActor: Registering block
> manager ip-172-16-1-23.us-west-2.compute.internal:38093 with 776.3 MB RAM
> 14/11/08 08:23:17 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-23.us-west-2.compute.internal:38093 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:23:32 INFO cluster.YarnClientSchedulerBackend: Executor 34
> disconnected, so removing it
> 14/11/08 08:23:32 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 34 on ip-172-16-1-194.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:23:32 INFO scheduler.TaskSetManager: Re-queueing tasks for 34
> from TaskSet 0.0
> 14/11/08 08:23:32 WARN scheduler.TaskSetManager: Lost task 22.1 in stage
> 0.0 (TID 31, ip-172-16-1-194.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:23:32 INFO scheduler.DAGScheduler: Executor lost: 34 (epoch 11)
> 14/11/08 08:23:32 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 34 from BlockManagerMaster.
> 14/11/08 08:23:32 INFO storage.BlockManagerMaster: Removed 34 successfully
> in removeExecutor
> 14/11/08 08:23:53 INFO cluster.YarnClientSchedulerBackend: Executor 41
> disconnected, so removing it
> 14/11/08 08:23:53 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 41 on ip-172-16-1-23.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:23:53 INFO scheduler.TaskSetManager: Re-queueing tasks for 41
> from TaskSet 0.0
> 14/11/08 08:23:53 WARN scheduler.TaskSetManager: Lost task 28.2 in stage
> 0.0 (TID 40, ip-172-16-1-23.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:23:53 INFO scheduler.DAGScheduler: Executor lost: 41 (epoch 12)
> 14/11/08 08:23:53 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 41 from BlockManagerMaster.
> 14/11/08 08:23:53 INFO storage.BlockManagerMaster: Removed 41 successfully
> in removeExecutor
> 14/11/08 08:23:57 INFO cluster.YarnClientSchedulerBackend: Registered
> executor:
> Actor[akka.tcp://sparkexecu...@ip-172-16-1-90.us-west-2.compute.internal:58017/user/Executor#2094507560]
> with ID 42
> 14/11/08 08:23:57 INFO scheduler.TaskSetManager: Starting task 28.3 in
> stage 0.0 (TID 41, ip-172-16-1-90.us-west-2.compute.internal,
> PROCESS_LOCAL, 1122 bytes)
> 14/11/08 08:23:58 INFO storage.BlockManagerMasterActor: Registering block
> manager ip-172-16-1-90.us-west-2.compute.internal:41182 with 776.3 MB RAM
> 14/11/08 08:24:00 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-90.us-west-2.compute.internal:41182 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:24:04 INFO cluster.YarnClientSchedulerBackend: Executor 35
> disconnected, so removing it
> 14/11/08 08:24:04 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 35 on ip-172-16-1-92.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:24:04 INFO scheduler.TaskSetManager: Re-queueing tasks for 35
> from TaskSet 0.0
> 14/11/08 08:24:04 WARN scheduler.TaskSetManager: Lost task 24.1 in stage
> 0.0 (TID 34, ip-172-16-1-92.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:24:04 INFO scheduler.DAGScheduler: Executor lost: 35 (epoch 13)
> 14/11/08 08:24:04 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 35 from BlockManagerMaster.
> 14/11/08 08:24:04 INFO storage.BlockManagerMaster: Removed 35 successfully
> in removeExecutor
> 14/11/08 08:24:17 INFO cluster.YarnClientSchedulerBackend: Registered
> executor:
> Actor[akka.tcp://sparkexecu...@ip-172-16-1-90.us-west-2.compute.internal:36395/user/Executor#-1907878650]
> with ID 43
> 14/11/08 08:24:17 INFO scheduler.TaskSetManager: Starting task 24.2 in
> stage 0.0 (TID 42, ip-172-16-1-90.us-west-2.compute.internal,
> PROCESS_LOCAL, 1122 bytes)
> 14/11/08 08:24:18 INFO storage.BlockManagerMasterActor: Registering block
> manager ip-172-16-1-90.us-west-2.compute.internal:46948 with 776.3 MB RAM
> 14/11/08 08:24:20 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-90.us-west-2.compute.internal:46948 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:24:21 INFO cluster.YarnClientSchedulerBackend: Executor 40
> disconnected, so removing it
> 14/11/08 08:24:21 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 40 on ip-172-16-1-90.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:24:21 INFO scheduler.TaskSetManager: Re-queueing tasks for 40
> from TaskSet 0.0
> 14/11/08 08:24:21 WARN scheduler.TaskSetManager: Lost task 29.2 in stage
> 0.0 (TID 39, ip-172-16-1-90.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:24:21 INFO scheduler.DAGScheduler: Executor lost: 40 (epoch 14)
> 14/11/08 08:24:21 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 40 from BlockManagerMaster.
> 14/11/08 08:24:21 INFO storage.BlockManagerMaster: Removed 40 successfully
> in removeExecutor
> 14/11/08 08:24:31 INFO cluster.YarnClientSchedulerBackend: Registered
> executor:
> Actor[akka.tcp://sparkexecu...@ip-172-16-1-90.us-west-2.compute.internal:34467/user/Executor#-1100688472]
> with ID 44
> 14/11/08 08:24:31 INFO scheduler.TaskSetManager: Starting task 29.3 in
> stage 0.0 (TID 43, ip-172-16-1-90.us-west-2.compute.internal,
> PROCESS_LOCAL, 1122 bytes)
> 14/11/08 08:24:32 INFO storage.BlockManagerMasterActor: Registering block
> manager ip-172-16-1-90.us-west-2.compute.internal:40126 with 776.3 MB RAM
> 14/11/08 08:24:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-90.us-west-2.compute.internal:40126 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:24:48 INFO cluster.YarnClientSchedulerBackend: Registered
> executor:
> Actor[akka.tcp://sparkexecu...@ip-172-16-1-90.us-west-2.compute.internal:53257/user/Executor#-745380917]
> with ID 45
> 14/11/08 08:24:48 INFO scheduler.TaskSetManager: Starting task 22.2 in
> stage 0.0 (TID 44, ip-172-16-1-90.us-west-2.compute.internal,
> PROCESS_LOCAL, 1122 bytes)
> 14/11/08 08:24:49 INFO storage.BlockManagerMasterActor: Registering block
> manager ip-172-16-1-90.us-west-2.compute.internal:46252 with 776.3 MB RAM
> 14/11/08 08:24:52 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-90.us-west-2.compute.internal:46252 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:25:16 INFO cluster.YarnClientSchedulerBackend: Executor 38
> disconnected, so removing it
> 14/11/08 08:25:16 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 38 on ip-172-16-1-228.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:25:16 INFO scheduler.TaskSetManager: Re-queueing tasks for 38
> from TaskSet 0.0
> 14/11/08 08:25:16 WARN scheduler.TaskSetManager: Lost task 21.1 in stage
> 0.0 (TID 37, ip-172-16-1-228.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:25:16 INFO scheduler.DAGScheduler: Executor lost: 38 (epoch 15)
> 14/11/08 08:25:16 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 38 from BlockManagerMaster.
> 14/11/08 08:25:16 INFO storage.BlockManagerMaster: Removed 38 successfully
> in removeExecutor
> 14/11/08 08:25:37 INFO cluster.YarnClientSchedulerBackend: Executor 42
> disconnected, so removing it
> 14/11/08 08:25:37 ERROR cluster.YarnClientClusterScheduler: Lost executor
> 42 on ip-172-16-1-90.us-west-2.compute.internal: remote Akka client
> disassociated
> 14/11/08 08:25:37 INFO scheduler.TaskSetManager: Re-queueing tasks for 42
> from TaskSet 0.0
> 14/11/08 08:25:37 WARN scheduler.TaskSetManager: Lost task 28.3 in stage
> 0.0 (TID 41, ip-172-16-1-90.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> 14/11/08 08:25:37 ERROR scheduler.TaskSetManager: Task 28 in stage 0.0
> failed 4 times; aborting job
> 14/11/08 08:25:37 INFO cluster.YarnClientClusterScheduler: Cancelling
> stage 0
> 14/11/08 08:25:37 INFO cluster.YarnClientClusterScheduler: Stage 0 was
> cancelled
> 14/11/08 08:25:37 INFO scheduler.DAGScheduler: Failed to run
> saveAsTextFile at NativeMethodAccessorImpl.java:-2
> 14/11/08 08:25:37 INFO scheduler.DAGScheduler: Executor lost: 42 (epoch 16)
> 14/11/08 08:25:37 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 42 from BlockManagerMaster.
> 14/11/08 08:25:37 INFO storage.BlockManagerMaster: Removed 42 successfully
> in removeExecutor
