You should check the executor logs to find out why it failed. There may be more explanation there.
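As a rough starting point, the HeartbeatReceiver warnings in the driver log quoted below name the executors that went silent, so those are the executor logs worth reading first. On a cluster with YARN log aggregation enabled, `yarn logs -applicationId <application ID>` is the usual way to fetch the container logs. The sketch below only pulls the executor IDs and silence durations out of a pasted log excerpt; the function name `timed_out_executors` and the `SAMPLE` text are illustrative assumptions, not part of Spark or Hive.

```python
import re

# Matches the driver-side HeartbeatReceiver warning seen in the quoted log.
HEARTBEAT_RE = re.compile(
    r"Removing executor (\d+) with no recent heartbeats: "
    r"(\d+) ms exceeds timeout (\d+) ms"
)


def timed_out_executors(log_text):
    """Return (executor_id, silent_ms, timeout_ms) tuples found in log_text.

    Hypothetical helper: it tolerates mail quote markers ("> ") and messages
    wrapped across several lines, as in the excerpt quoted in this thread.
    """
    # Drop a leading "> " quote marker from each line, if present.
    lines = [re.sub(r"^\s*>\s?", "", line) for line in log_text.splitlines()]
    # Collapse all whitespace so a wrapped message becomes one line.
    flat = " ".join(" ".join(lines).split())
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in HEARTBEAT_RE.finditer(flat)
    ]


# Two warnings copied from the quoted log, one of them still wrapped and quoted.
SAMPLE = """\
> 16/01/11 12:55:20 WARN spark.HeartbeatReceiver: Removing executor 2 with
> no recent heartbeats: 604535 ms exceeds timeout 600000 ms
16/01/11 12:55:20 WARN spark.HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 609228 ms exceeds timeout 600000 ms
"""

if __name__ == "__main__":
    for executor_id, silent_ms, timeout_ms in timed_out_executors(SAMPLE):
        print(f"executor {executor_id}: silent for {silent_ms} ms "
              f"(timeout {timeout_ms} ms)")
```

This only tells you which executors to look at; the reason they stopped sending heartbeats has to come from the executor logs themselves.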
--Xuefu

On Sun, Jan 10, 2016 at 11:21 PM, Jone Zhang <joyoungzh...@gmail.com> wrote:

> *I have submitted an application many times.*
> *Most of the applications run correctly. See attachment 1.*
> *But one of them breaks unexpectedly. See attachments 2.1 and 2.2.*
>
> *Why does a task with a small data size run so long? I can't find any helpful information in the yarn logs.*
>
> *Part of the log information is as follows:*
> 16/01/11 12:45:19 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
> 16/01/11 12:45:19 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, 10.226.148.160, 44366)
> 16/01/11 12:45:19 INFO storage.BlockManagerMaster: Removed 1 successfully in removeExecutor
> 16/01/11 12:50:32 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on 10.219.58.123:39594 in memory (size: 92.2 KB, free: 441.4 MB)
> 16/01/11 12:55:20 WARN spark.HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 604535 ms exceeds timeout 600000 ms
> 16/01/11 12:55:20 ERROR cluster.YarnClusterScheduler: Lost an executor 2 (already removed): Executor heartbeat timed out after 604535 ms
> 16/01/11 12:55:20 WARN spark.HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 609228 ms exceeds timeout 600000 ms
> 16/01/11 12:55:20 ERROR cluster.YarnClusterScheduler: Lost an executor 1 (already removed): Executor heartbeat timed out after 609228 ms
> 16/01/11 12:55:20 WARN spark.HeartbeatReceiver: Removing executor 4 with no recent heartbeats: 615098 ms exceeds timeout 600000 ms
> 16/01/11 12:55:20 ERROR cluster.YarnClusterScheduler: Lost an executor 4 (already removed): Executor heartbeat timed out after 615098 ms
> 16/01/11 12:55:20 WARN spark.HeartbeatReceiver: Removing executor 3 with no recent heartbeats: 616730 ms exceeds timeout 600000 ms
> 16/01/11 12:55:20 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 2
> 16/01/11 12:55:20 ERROR cluster.YarnClusterScheduler: Lost an executor 3 (already removed): Executor heartbeat timed out after 616730 ms
> 16/01/11 12:55:20 WARN cluster.YarnClusterSchedulerBackend: Executor to kill 2 does not exist!
> 16/01/11 12:55:20 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) .
> 16/01/11 12:55:20 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 1
> 16/01/11 12:55:20 WARN cluster.YarnClusterSchedulerBackend: Executor to kill 1 does not exist!
> 16/01/11 12:55:20 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) .
> 16/01/11 12:55:20 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 4
> 16/01/11 12:55:20 WARN cluster.YarnClusterSchedulerBackend: Executor to kill 4 does not exist!
> 16/01/11 12:55:20 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) .
> 16/01/11 12:55:20 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 3
> 16/01/11 12:55:20 WARN cluster.YarnClusterSchedulerBackend: Executor to kill 3 does not exist!
> 16/01/11 12:55:20 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) .
> 16/01/11 14:29:55 WARN client.RemoteDriver: Shutting down driver because RPC channel was closed.
> 16/01/11 14:29:55 INFO client.RemoteDriver: Shutting down remote driver.
> 16/01/11 14:29:55 INFO scheduler.DAGScheduler: Asked to cancel job 1
> 16/01/11 14:29:55 INFO client.RemoteDriver: Failed to run job 2fbbb881-988b-4454-ad9e-a20783aaf38e
> java.lang.InterruptedException
>     at java.lang.Object.wait(Native Method)
>     at java.lang.Object.wait(Object.java:503)
>     at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:371)
>     at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:335)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> 16/01/11 14:29:55 INFO cluster.YarnClusterScheduler: Cancelling stage 2
> 16/01/11 14:29:55 INFO cluster.YarnClusterScheduler: Removed TaskSet 2.0, whose tasks have all completed, from pool
> 16/01/11 14:29:55 INFO cluster.YarnClusterScheduler: Stage 2 was cancelled
> 16/01/11 14:29:55 INFO scheduler.DAGScheduler: ShuffleMapStage 2 (mapPartitionsToPair at MapTran.java:31) failed in 6278.824 s
> 16/01/11 14:29:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
> 16/01/11 14:29:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
> 16/01/11 14:29:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
> 16/01/11 14:29:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
> 16/01/11 14:29:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
>
>
> *Best wishes.*
> *Thanks.*