[
https://issues.apache.org/jira/browse/SPARK-22620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-22620:
---------------------------------
Labels: bulk-closed (was: )
> Deadlock in blacklisting when executor is removed due to a heartbeat timeout
> ----------------------------------------------------------------------------
>
> Key: SPARK-22620
> URL: https://issues.apache.org/jira/browse/SPARK-22620
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 2.2.0
> Reporter: Thomas Graves
> Priority: Critical
> Labels: bulk-closed
>
> I ran into a deadlock with blacklisting where the application was stuck in a
> state where it thought it had executors to schedule on, but it really didn't
> because those executors were dead. All live executors were blacklisted, so it
> couldn't run any tasks.
> Note that for this to happen the application has to be in a state where many
> tasks are failing (at least enough to cause all executors to be blacklisted)
> and an executor has to hit a heartbeat timeout.
> Small disclaimer: I ran into this on Spark 2.1 with the blacklisting feature
> backported. Looking at the code on the master branch, I don't see it fixed
> there either.
> Normally TaskSetManager.abortIfCompletelyBlacklisted prevents the app from
> becoming deadlocked when no executors are available. Unfortunately it did not
> take effect here because a dead executor was still in
> TaskSchedulerImpl.hostToExecutors, and since that host was lost via a
> heartbeat timeout rather than being blacklisted, the scheduler thought it
> still had an executor to run on.
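> To illustrate, here is a minimal, hypothetical sketch (not the actual
> TaskSetManager code; the map contents and helper names are made up) of why
> the abort check is fooled: as long as hostToExecutors still lists a
> non-blacklisted executor id, even a dead one, the "completely blacklisted"
> condition never holds.
> {code:scala}
> object AbortCheckSketch {
>   // Hypothetical, simplified stand-ins for TaskSchedulerImpl state.
>   // Executor 137 is actually dead (heartbeat timeout) but is still
>   // present in hostToExecutors because of the race described below.
>   val hostToExecutors: Map[String, Set[String]] =
>     Map("foo1.com" -> Set("137"), "foo3.com" -> Set("8"))
> 
>   // All live executors have been blacklisted by task failures.
>   val blacklistedExecutors: Set[String] = Set("8")
> 
>   // Roughly what abortIfCompletelyBlacklisted decides: abort only if every
>   // executor the scheduler knows about is blacklisted.
>   def shouldAbort: Boolean =
>     hostToExecutors.values.flatten.forall(blacklistedExecutors.contains)
> 
>   def main(args: Array[String]): Unit =
>     // Prints "false": dead executor 137 keeps the check from firing,
>     // so the app hangs instead of aborting the task set.
>     println(shouldAbort)
> }
> {code}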
> Now how it got into this situation:
> The HeartbeatReceiver timed the executor out because no heartbeat had been
> received. This called TaskSchedulerImpl.executorLost, which called
> TaskSchedulerImpl.removeExecutor, removing the executor from
> executorIdToRunningTaskIds and hostToExecutors. It does other handling there
> and sends a CompletionEvent via DAGScheduler.taskEnded.
> TaskSchedulerImpl.removeExecutor returns and TaskSchedulerImpl.executorLost
> calls into DAGScheduler.executorLost and then
> CoarseGrainedSchedulerBackend.reviveOffers, both of which send messages
> asynchronously. That returns, and HeartbeatReceiver then calls
> sc.killAndReplaceExecutor asynchronously.
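> As a rough, hypothetical model of that threading (the pools and method
> bodies below are invented for illustration, not Spark's actual dispatcher
> code), the important point is that the revive-offers handling and the kill
> bookkeeping end up on different threads, so nothing orders them relative to
> each other:
> {code:scala}
> import java.util.concurrent.Executors
> import scala.concurrent.{ExecutionContext, Future}
> 
> object AsyncHandOffSketch {
>   // Two independent thread pools standing in for the rpc dispatcher and the
>   // "kill-executor-thread" seen in the logs below.
>   private val dispatcherPool = Executors.newSingleThreadExecutor()
>   private val killPool       = Executors.newSingleThreadExecutor()
>   private val dispatcher = ExecutionContext.fromExecutor(dispatcherPool)
>   private val killThread = ExecutionContext.fromExecutor(killPool)
> 
>   def main(args: Array[String]): Unit = {
>     // HeartbeatReceiver's handler: synchronous removal, then two
>     // asynchronous follow-ups that nothing orders relative to each other.
>     println("removeExecutor: dropped 137 from hostToExecutors")
>     Future { println("ReviveOffers handled -> makeOffers") }(dispatcher)
>     Future { println("killAndReplaceExecutor -> pending-to-remove += 137") }(killThread)
>     Thread.sleep(500)          // give both follow-ups time to run
>     dispatcherPool.shutdown(); killPool.shutdown()
>   }
> }
> {code}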
> Now there are a bunch of messages in flight being processed by other threads;
> you can see the order in which things ran in the logs below. In this
> situation, CoarseGrainedSchedulerBackend.makeOffers is called before
> sc.killAndReplaceExecutor can add the executor to executorsPendingToRemove.
> CoarseGrainedSchedulerBackend.makeOffers uses
> CoarseGrainedSchedulerBackend.executorDataMap and executorsPendingToRemove to
> check whether the executor exists and is alive. At this point it is still in
> executorDataMap and not pending removal, so makeOffers calls into
> TaskSchedulerImpl.resourceOffers, which adds the executor back into
> executorIdToRunningTaskIds and hostToExecutors. Nothing after this removes it
> from hostToExecutors, so the TaskSetManager.abortIfCompletelyBlacklisted
> routine always sees an executor that is actually dead.
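> A condensed, hypothetical sketch of the losing interleaving (again using
> simplified maps rather than the real scheduler classes): if the
> makeOffers-style liveness check runs before the kill path has recorded the
> executor in executorsPendingToRemove, the resourceOffers-style bookkeeping
> puts the dead executor straight back.
> {code:scala}
> import scala.collection.mutable
> 
> object ReAddRaceSketch {
>   // Hypothetical, simplified state mirroring the maps named above.
>   val executorDataMap          = mutable.Map("137" -> "foo1.com") // never cleaned up
>   val executorsPendingToRemove = mutable.Set.empty[String]
>   val hostToExecutors          = mutable.Map.empty[String, mutable.Set[String]]
> 
>   // What the heartbeat-timeout removal did: hostToExecutors is cleared,
>   // but executorDataMap is not (the second bug described below).
>   def heartbeatTimeout(execId: String): Unit = hostToExecutors.clear()
> 
>   // Roughly the makeOffers check: offer resources for every executor that is
>   // registered and not pending removal, re-registering its host.
>   def makeOffers(): Unit =
>     for ((execId, host) <- executorDataMap if !executorsPendingToRemove(execId))
>       hostToExecutors.getOrElseUpdate(host, mutable.Set.empty) += execId
> 
>   def main(args: Array[String]): Unit = {
>     heartbeatTimeout("137")
>     makeOffers()                          // runs first in the bad interleaving
>     executorsPendingToRemove += "137"     // killAndReplaceExecutor arrives too late
>     println(hostToExecutors)              // Map(foo1.com -> Set(137)): dead executor is back
>   }
> }
> {code}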
> While looking at this I also found a bug where the executor is not removed
> from CoarseGrainedSchedulerBackend.executorDataMap either. Normally the call
> to CoarseGrainedSchedulerBackend.removeExecutor would remove it, but since
> this was a heartbeat timeout that is never called. The only way that would
> have helped in this situation is if the executor had been removed from
> executorDataMap before the call into CoarseGrainedSchedulerBackend.makeOffers
> -> TaskSchedulerImpl.resourceOffers, in which case it would never have been
> added back into executorIdToRunningTaskIds and hostToExecutors. Note that
> other stats like totalCoreCount and totalRegisteredExecutors will also be off
> since removeExecutor is not called.
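> Building on the sketch above (still hypothetical names, not a proposed
> patch), if the heartbeat-timeout path also dropped the entry from
> executorDataMap before any offer could run, the re-add could not happen; it
> also shows why counters that are only adjusted in removeExecutor drift when
> that method is never called.
> {code:scala}
> import scala.collection.mutable
> 
> object StaleBackendStateSketch {
>   // Hypothetical backend state; the real fields live in
>   // CoarseGrainedSchedulerBackend and are only adjusted in removeExecutor.
>   val executorDataMap          = mutable.Map("137" -> 10)   // executor id -> cores
>   var totalRegisteredExecutors = 1
>   var totalCoreCount           = 10
> 
>   // Today's heartbeat-timeout path never calls the backend's removeExecutor,
>   // so nothing here is cleaned up and the stats keep counting executor 137.
>   def heartbeatTimeoutToday(execId: String): Unit = ()
> 
>   // If a backend-side cleanup ran before any offer was made, a later
>   // makeOffers over executorDataMap could not re-register the dead executor.
>   def backendCleanup(execId: String): Unit =
>     executorDataMap.remove(execId).foreach { cores =>
>       totalRegisteredExecutors -= 1
>       totalCoreCount           -= cores
>     }
> 
>   def main(args: Array[String]): Unit = {
>     heartbeatTimeoutToday("137")
>     println(s"stale: $executorDataMap executors=$totalRegisteredExecutors cores=$totalCoreCount")
>     backendCleanup("137")
>     println(s"clean: $executorDataMap executors=$totalRegisteredExecutors cores=$totalCoreCount")
>   }
> }
> {code}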
> Below are the logs that show the race. Notice how "Requesting to kill
> executor(s) 137" (from CoarseGrainedSchedulerBackend.killExecutors) comes in
> the middle of the removal, but it doesn't actually get as far as adding the
> executor to executorsPendingToRemove ("Actual list of executor(s) to be
> killed is 137") until after the executor was added back to hostToExecutors
> ("Host added was in lost list earlier: foo1.com").
> 2017-11-14 15:56:44,376 [task-result-getter-0] INFO
> org.apache.spark.scheduler.TaskSetManager - Finished task 3203.0 in stage
> 7.0 (TID 11011) in 1396594 ms on foo3.com (executor 8) (2971/4004)
> 2017-11-14 15:56:50,729 [dispatcher-event-loop-23] WARN
> org.apache.spark.HeartbeatReceiver - Removing executor 137 with no recent
> heartbeats: 639438 ms exceeds timeout 600000 ms
> 2017-11-14 15:56:50,731 [dispatcher-event-loop-23] ERROR
> org.apache.spark.scheduler.cluster.YarnClusterScheduler - Lost executor 137
> on foo1.com: Executor heartbeat timed out after 639438 ms
> 2017-11-14 15:56:50,773 [dispatcher-event-loop-23] WARN
> org.apache.spark.scheduler.TaskSetManager - Lost task 2696.0 in stage 7.0
> (TID 11061, foo1.com, executor 137): ExecutorLostFailure (executor 137 exited
> caused by one of the running tasks) Reason: Executor heartbeat timed out
> after 639438 ms
> 2017-11-14 15:56:50,778 [dag-scheduler-event-loop] INFO
> org.apache.spark.scheduler.DAGScheduler - Executor lost: 137 (epoch 0)
> 2017-11-14 15:56:50,779 [dispatcher-event-loop-20] INFO
> org.apache.spark.storage.BlockManagerMasterEndpoint - Trying to remove
> executor 137 from BlockManagerMaster.
> 2017-11-14 15:56:50,779 [kill-executor-thread] INFO
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend - Requesting
> to kill executor(s) 137
> 2017-11-14 15:56:50,780 [dispatcher-event-loop-20] INFO
> org.apache.spark.storage.BlockManagerMasterEndpoint - Removing block manager
> BlockManagerId(137, foo1.com, 42569, None)
> 2017-11-14 15:56:50,780 [dag-scheduler-event-loop] INFO
> org.apache.spark.storage.BlockManagerMaster - Removed 137 successfully in
> removeExecutor
> 2017-11-14 15:56:50,781 [dag-scheduler-event-loop] INFO
> org.apache.spark.scheduler.DAGScheduler - Host added was in lost list
> earlier: foo1.com
> 2017-11-14 15:56:50,782 [kill-executor-thread] INFO
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend - Actual list
> of executor(s) to be killed is 137
> 2017-11-14 15:56:50,827 [dispatcher-event-loop-6] INFO
> org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint - Driver requested
> to kill executor(s) 137.
> 2017-11-14 15:56:51,616 [Reporter] INFO
> org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor
> container(s), each with 10 core(s) and 15360 MB memory (including 8192 MB of
> overhead)
> Note there may be another bug here with speculation as well, because some
> time later I see the same host in a log message that makes it sound like the
> scheduler thinks the task is still running on that dead host:
> 2017-11-14 16:46:40,970 [task-scheduler-speculation] INFO
> org.apache.spark.scheduler.TaskSetManager - Marking task 2696 in stage 7.0
> (on gsbl560n05.blue.ygrid.yahoo.com) as speculatable because it ran more than
> 2254168 ms
> We can split this off into another JIRA if we find it's actually a bug. It
> looks like the issue there is that TaskSetManager.copiesRunning is not
> decremented because we are using the external shuffle service, but that needs
> further investigation.
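> If it helps with that investigation, here is a toy model (invented names and
> a threshold taken from the log line above, not the real speculation code) of
> why a copiesRunning value that is never decremented would keep a task whose
> only copy was on the dead executor looking "still running" and speculatable:
> {code:scala}
> object SpeculationSketch {
>   // Hypothetical per-task state, loosely mirroring TaskSetManager fields.
>   case class TaskState(index: Int, copiesRunning: Int, successful: Boolean, runtimeMs: Long)
> 
>   val speculationThresholdMs = 2254168L   // threshold seen in the log message above
> 
>   // Roughly: a task is speculatable if it is still counted as running,
>   // has not succeeded, and has been running longer than the threshold.
>   def speculatable(t: TaskState): Boolean =
>     t.copiesRunning > 0 && !t.successful && t.runtimeMs > speculationThresholdMs
> 
>   def main(args: Array[String]): Unit = {
>     // Task 2696's only copy was on dead executor 137; if copiesRunning was
>     // never decremented on the executor loss, it still looks speculatable.
>     val task2696 = TaskState(index = 2696, copiesRunning = 1, successful = false, runtimeMs = 3000000L)
>     println(speculatable(task2696))   // true -> "Marking task 2696 ... as speculatable"
>   }
> }
> {code}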