many fetch failure in "BlockManager"

余根茂 Mon, 25 Aug 2014 00:04:39 -0700

*Hi all,*


*My job is CPU-intensive, and its resource configuration is 400 workers
x 1 core x 3G. There are many fetch failures, for example:*



14-08-23 08:34:52 WARN [Result resolver thread-3] TaskSetManager: Loss
was due to fetch failure from BlockManagerId(slave1:33500)

14-08-23 08:34:52 INFO [spark-akka.actor.default-dispatcher-37]
DAGScheduler: Marking Stage 4 (repartition at test.scala:97) for
resubmision due to a fetch failure

14-08-23 08:34:52 INFO [spark-akka.actor.default-dispatcher-37]
DAGScheduler: The failed fetch was from Stage 5 (repartition at
test.scala:82); marking it for resubmission

14-08-23 08:34:53 INFO [spark-akka.actor.default-dispatcher-71]
DAGScheduler: Resubmitting failed stages

14-08-23 08:35:06 WARN [Result resolver thread-2] TaskSetManager: Loss
was due to fetch failure from BlockManagerId(slave2:34792)

14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63]
DAGScheduler: Marking Stage 4 (repartition at test.scala:97) for
resubmision due to a fetch failure

14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63]
DAGScheduler: The failed fetch was from Stage 5 (repartition at
test.scala:82); marking it for resubmission

14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63]
DAGScheduler: Executor lost: 118 (epoch 3)

14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-38]
BlockManagerMasterActor: Trying to remove executor 118 from
BlockManagerMaster.

14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63]
BlockManagerMaster: Removed 118 successfully in removeExecutor

14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-43]
DAGScheduler: Resubmitting failed stages

*Stage 4 is marked for resubmission. After some time, block manager
slave1:33500 registers again:*

14-08-23 08:36:16 INFO [spark-akka.actor.default-dispatcher-58]
BlockManagerInfo: Registering block manager slave1:33500 with 1766.4
MB RAM

*Unfortunately, stage 4 is resubmitted again and again and keeps
hitting fetch failures. After 14-08-23 09:03:37 the master logs
nothing, and logging only resumes at 14-08-24 00:43:15:*

14-08-23 09:03:37 INFO [Result resolver thread-3]
YarnClusterScheduler: Removed TaskSet 4.0, whose tasks have all
completed, from pool

14-08-23 09:03:37 INFO [spark-akka.actor.default-dispatcher-28]
DAGScheduler: Marking Stage 4 (repartition at test.scala:97) for
resubmision due to a fetch failure

14-08-23 09:03:37 INFO [spark-akka.actor.default-dispatcher-28]
DAGScheduler: The failed fetch was from Stage 5 (repartition at
test.scala:82); marking it for resubmission

14-08-23 09:03:37 INFO [spark-akka.actor.default-dispatcher-71]
DAGScheduler: Resubmitting failed stages

14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Completed
container container_1400565786114_133451_01_000171 (state: COMPLETE,
exit status: -100)

14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Container
marked as failed: container_1400565786114_133451_01_000171

14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Completed
container container_1400565786114_133451_01_000172 (state: COMPLETE,
exit status: -100)

14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Container
marked as failed: container_1400565786114_133451_01_000172

14-08-24 00:43:20 INFO [Thread-854] ApplicationMaster: Allocating 2
containers to make up for (potentially) lost containers

14-08-24 00:43:20 INFO [Thread-854] YarnAllocationHandler: Will
Allocate 2 executor containers, each with 3456 memory

*Strangely, TaskSet 4.0 is removed because all of its tasks have
completed, even though Stage 4 was marked for resubmission. In the
executor logs there are many "java.net.ConnectException: Connection
timed out" errors, for example:*


14-08-23 08:19:14 WARN [pool-3-thread-1] SendingConnection: Error
finishing connection to java.net.ConnectException: Connection timed
out

         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
         at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:318)
         at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
         at java.lang.Thread.run(Thread.java:662)


*I often run into this kind of problem, i.e. BlockManager connection
failures. Spark cannot recover effectively, and the job either hangs
or fails outright.*


*Any suggestions? Also, is there any guidance on sizing a job's
resources with respect to computation, caching, shuffle, etc.?*
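For reference, the resource configuration described at the top of this
message (400 workers x 1 core x 3G) might correspond to a submit
command roughly like the sketch below. This is an illustration under
assumptions, not the actual command used: it assumes "3G" means
executor memory and that the job runs in yarn-cluster mode (suggested
by the YarnClusterScheduler lines in the log). The class name and jar
path are hypothetical placeholders.

```shell
# Sketch only: class name and jar path are hypothetical placeholders.
# Assumes "3G" is the per-executor heap size. The "3456 memory" figure
# in the YarnAllocationHandler log is consistent with 3072 MB plus a
# 384 MB per-container overhead, which fits this assumption.
spark-submit \
  --master yarn-cluster \
  --num-executors 400 \
  --executor-cores 1 \
  --executor-memory 3g \
  --class com.example.TestJob \
  /path/to/test-job.jar
```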


*Thank You!*
