Shay Rojansky created SPARK-8374:
------------------------------------

             Summary: Job frequently hangs after YARN preemption
                 Key: SPARK-8374
                 URL: https://issues.apache.org/jira/browse/SPARK-8374
             Project: Spark
          Issue Type: Bug
          Components: YARN
    Affects Versions: 1.4.0
         Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
            Reporter: Shay Rojansky
            Priority: Critical


After upgrading to Spark 1.4.0, jobs that get preempted frequently fail to 
reacquire executors and therefore hang. To reproduce (a minimal job sketch 
follows the steps):

1. I run Spark job A, which acquires all grid resources.
2. I run Spark job B in a higher-priority queue, which acquires all grid 
resources. Job A is fully preempted.
3. I kill job B, releasing all resources.
4. Job A should at this point reacquire all grid resources, but occasionally 
it doesn't. Repeating the preemption scenario makes the bad behavior occur 
within a few attempts.
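
For illustration only, a minimal sketch of the kind of job involved (this is 
not our actual workload; the queue name, app name, and sleep-based tasks are 
placeholders). Both job A and job B can be of this shape, submitted via 
spark-submit --master yarn-client to their respective YARN queues:

{noformat}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical minimal job that occupies grid resources; queue name,
// app name and workload are placeholder assumptions for illustration.
object OccupyGrid {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("preemption-repro")
      // Job A targets the low-priority queue, job B the higher-priority one
      // (queue names are site-specific assumptions).
      .set("spark.yarn.queue", args.headOption.getOrElse("default"))
    val sc = new SparkContext(conf)
    // Long-running tasks keep every acquired executor busy, so job B's
    // resource requests force YARN to preempt job A's containers.
    sc.parallelize(1 to 10000, numSlices = 10000)
      .map { i => Thread.sleep(60000); i }
      .count()
    sc.stop()
  }
}
{noformat}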

(see logs at bottom).

Note that SPARK-7451 was supposed to fix some Spark YARN preemption issues; 
the work there may be related to this new issue.

Preemption handling in 1.4.0 is considerably worse than in 1.3.1 (we've 
downgraded to 1.3.1 just because of this issue).

Logs
------
When job B (the preemptor) first acquires an application master, the following 
is logged by job A (the preemptee):

{noformat}
ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc client disassociated
INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
INFO DAGScheduler: Executor lost: 447 (epoch 0)
INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from BlockManagerMaster.
INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, g023.grid.eaglerd.local, 41406)
INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
{noformat}

(It seems strange that errors/warnings are logged for an ordinary preemption.)

Later, when job B's AM starts requesting its resources, I get lots of the 
following in job A:

{noformat}
ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc client disassociated
INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
{noformat}

Finally, when I kill job B, job A emits lots of the following:

{noformat}
INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
{noformat}

And then, after some time:

{noformat}
WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 165964 ms exceeds timeout 120000 ms
ERROR YarnScheduler: Lost an executor 466 (already removed): Executor heartbeat timed out after 165964 ms
{noformat}

At this point the job never requests/acquires more resources and hangs.


