[ https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Or closed SPARK-8374.
----------------------------
    Resolution: Duplicate

> Job frequently hangs after YARN preemption
> ------------------------------------------
>
>                 Key: SPARK-8374
>                 URL: https://issues.apache.org/jira/browse/SPARK-8374
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.4.0
>        Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
>            Reporter: Shay Rojansky
>            Priority: Critical
>
> After upgrading to Spark 1.4.0, jobs that are preempted very frequently do
> not reacquire executors and therefore hang. To reproduce:
> 1. Run Spark job A, which acquires all grid resources.
> 2. Run Spark job B in a higher-priority queue, which acquires all grid
> resources; job A is fully preempted.
> 3. Kill job B, releasing all resources.
> 4. At this point job A should reacquire all grid resources, but occasionally
> it doesn't. Repeating the preemption scenario triggers the bad behavior
> within a few attempts.
> (See logs at bottom.)
> Note that SPARK-7451 was supposed to fix some Spark-on-YARN preemption
> issues; the work there may be related to this new issue.
> Preemption behaves considerably worse in 1.4.0 than in 1.3.1 (we've
> downgraded to 1.3.1 solely because of this issue).
>
> Logs
> ----
> When job B (the preemptor) first acquires an application master, the
> following is logged by job A (the preemptee):
> {noformat}
> ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc client disassociated
> INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
> WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
> INFO DAGScheduler: Executor lost: 447 (epoch 0)
> INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from BlockManagerMaster.
> INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, g023.grid.eaglerd.local, 41406)
> INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
> {noformat}
> (It's strange that errors/warnings are logged for an ordinary preemption.)
> Later, when job B's AM starts requesting its resources, job A logs many of
> the following:
> {noformat}
> ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc client disassociated
> INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
> WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
> WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> {noformat}
> Finally, when I kill job B, job A emits many of the following:
> {noformat}
> INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
> WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
> {noformat}
> And after some time:
> {noformat}
> WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 165964 ms exceeds timeout 120000 ms
> ERROR YarnScheduler: Lost an executor 466 (already removed): Executor heartbeat timed out after 165964 ms
> {noformat}
> At this point the job never requests/acquires more resources and hangs.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
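The reproduction steps in the report can be sketched as shell commands against a YARN cluster with preemption enabled. This is a minimal sketch, not taken from the ticket: the queue names (`default`, `high-priority`), the examples jar path, and the use of SparkPi as a long-running filler job are all assumptions; substitute your cluster's capacity-scheduler queues and any job large enough to occupy the grid.

```shell
# Job A: submitted to a low-priority queue, sized to grab all grid resources.
# (Queue names and the examples jar path are assumptions for illustration.)
spark-submit --master yarn --deploy-mode client \
  --queue default \
  --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME/lib/spark-examples-1.4.0-hadoop2.6.0.jar" 100000 &

# Job B: submitted to a higher-priority queue; the YARN scheduler
# preempts job A's executors to satisfy job B.
spark-submit --master yarn --deploy-mode client \
  --queue high-priority \
  --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME/lib/spark-examples-1.4.0-hadoop2.6.0.jar" 100000

# Kill job B, releasing all resources. Per the report, job A should now
# reacquire executors but occasionally hangs instead.
yarn application -kill <application_B_id>
```

Repeating the submit/preempt/kill cycle a few times is what the reporter describes as triggering the hang.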