*Hi all,*
*My job is CPU-intensive, and its resource configuration is 400 workers × 1 core × 3 GB. There are many fetch failures, like:*

14-08-23 08:34:52 WARN [Result resolver thread-3] TaskSetManager: Loss was due to fetch failure from BlockManagerId(slave1:33500)
14-08-23 08:34:52 INFO [spark-akka.actor.default-dispatcher-37] DAGScheduler: Marking Stage 4 (repartition at test.scala:97) for resubmision due to a fetch failure
14-08-23 08:34:52 INFO [spark-akka.actor.default-dispatcher-37] DAGScheduler: The failed fetch was from Stage 5 (repartition at test.scala:82); marking it for resubmission
14-08-23 08:34:53 INFO [spark-akka.actor.default-dispatcher-71] DAGScheduler: Resubmitting failed stages
14-08-23 08:35:06 WARN [Result resolver thread-2] TaskSetManager: Loss was due to fetch failure from BlockManagerId(slave2:34792)
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63] DAGScheduler: Marking Stage 4 (repartition at test.scala:97) for resubmision due to a fetch failure
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63] DAGScheduler: The failed fetch was from Stage 5 (repartition at test.scala:82); marking it for resubmission
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63] DAGScheduler: Executor lost: 118 (epoch 3)
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-38] BlockManagerMasterActor: Trying to remove executor 118 from BlockManagerMaster.
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-63] BlockManagerMaster: Removed 118 successfully in removeExecutor
14-08-23 08:35:06 INFO [spark-akka.actor.default-dispatcher-43] DAGScheduler: Resubmitting failed stages

*Stage 4 is marked for resubmission. After a period of time, block manager slave1:33500 registers again:*

14-08-23 08:36:16 INFO [spark-akka.actor.default-dispatcher-58] BlockManagerInfo: Registering block manager slave1:33500 with 1766.4 MB RAM

*Unfortunately, stage 4 is resubmitted again and again, and hits many more fetch failures.
After 14-08-23 09:03:37 there is no log output in the master at all, until logging resumes at 14-08-24 00:43:15:*

14-08-23 09:03:37 INFO [Result resolver thread-3] YarnClusterScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool
14-08-23 09:03:37 INFO [spark-akka.actor.default-dispatcher-28] DAGScheduler: Marking Stage 4 (repartition at test.scala:97) for resubmision due to a fetch failure
14-08-23 09:03:37 INFO [spark-akka.actor.default-dispatcher-28] DAGScheduler: The failed fetch was from Stage 5 (repartition at test.scala:82); marking it for resubmission
14-08-23 09:03:37 INFO [spark-akka.actor.default-dispatcher-71] DAGScheduler: Resubmitting failed stages
14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Completed container container_1400565786114_133451_01_000171 (state: COMPLETE, exit status: -100)
14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Container marked as failed: container_1400565786114_133451_01_000171
14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Completed container container_1400565786114_133451_01_000172 (state: COMPLETE, exit status: -100)
14-08-24 00:43:15 INFO [Thread-854] YarnAllocationHandler: Container marked as failed: container_1400565786114_133451_01_000172
14-08-24 00:43:20 INFO [Thread-854] ApplicationMaster: Allocating 2 containers to make up for (potentially) lost containers
14-08-24 00:43:20 INFO [Thread-854] YarnAllocationHandler: Will Allocate 2 executor containers, each with 3456 memory

*Strangely, TaskSet 4.0 is removed because its tasks have all completed, even though Stage 4 was marked for resubmission.
In the executors there are many "java.net.ConnectException: Connection timed out" errors, like:*

14-08-23 08:19:14 WARN [pool-3-thread-1] SendingConnection: Error finishing connection to
java.net.ConnectException: Connection timed out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
    at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:318)
    at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

*I often run into problems like this, i.e. BlockManager connection failures. Spark cannot recover effectively, and the job either hangs or fails outright.*

*Any suggestions? And are there any guides on sizing resources for a job in terms of computation, cache, shuffle, etc.?*

*Thank you!*
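
*P.S. For anyone triaging a similar log: below is a minimal sketch for counting fetch failures per block manager, to see whether a few hosts account for most of them. It is not part of my job. The function name `count_fetch_failures` and the `sample` list are hypothetical, and the pattern assumes the WARN line format shown above.*

```python
import re
from collections import Counter

# Assumed line format (taken from the driver log excerpt above):
# "... TaskSetManager: Loss was due to fetch failure from BlockManagerId(slave1:33500)"
FETCH_FAILURE = re.compile(
    r"Loss was due to fetch failure from BlockManagerId\(([^)]+)\)"
)


def count_fetch_failures(lines):
    """Return a Counter mapping block manager id -> number of fetch failures."""
    counts = Counter()
    for line in lines:
        match = FETCH_FAILURE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample: three lines copied from the excerpt above.
sample = [
    "14-08-23 08:34:52 WARN [Result resolver thread-3] TaskSetManager: "
    "Loss was due to fetch failure from BlockManagerId(slave1:33500)",
    "14-08-23 08:34:53 INFO [spark-akka.actor.default-dispatcher-71] "
    "DAGScheduler: Resubmitting failed stages",
    "14-08-23 08:35:06 WARN [Result resolver thread-2] TaskSetManager: "
    "Loss was due to fetch failure from BlockManagerId(slave2:34792)",
]

# Print block managers ordered by failure count, highest first.
for block_manager, n in count_fetch_failures(sample).most_common():
    print(block_manager, n)
```

*On a real log, the same function can be fed an open file object, since it only iterates over lines.*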