Sumeet created SPARK-35011:
------------------------------

             Summary: Avoid Block Manager registrations when StopExecutor msg 
is in-flight.
                 Key: SPARK-35011
                 URL: https://issues.apache.org/jira/browse/SPARK-35011
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.1.1, 3.2.0
            Reporter: Sumeet


*Note:* This is a follow-up to SPARK-34949. Even after the heartbeat fix, the 
driver reports dead executors as alive.



*Problem:*

I was testing Dynamic Allocation on K8s with about 300 executors. When the 
executors were torn down due to 
"spark.dynamicAllocation.executorIdleTimeout", all the executor pods were 
removed from K8s. However, the "Executors" tab in the Spark UI still listed 
some executors as alive.

[spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
 also returned a value greater than 1. 

 

*Cause:*

 
 * "CoarseGrainedSchedulerBackend" issues an async "StopExecutor" on the 
executorEndpoint.
 * "CoarseGrainedSchedulerBackend" removes that executor from the Driver's 
internal data structures and publishes "SparkListenerExecutorRemoved" on the 
"listenerBus".
 * The Executor has not yet processed "StopExecutor" from the Driver.
 * The Driver receives a heartbeat from the Executor. Since it cannot find the 
"executorId" in its data structures, it responds with 
"HeartbeatResponse(reregisterBlockManager = true)".
 * "BlockManager" on the Executor re-registers with the "BlockManagerMaster", 
and "SparkListenerBlockManagerAdded" is published on the "listenerBus".
 * The Executor starts processing "StopExecutor" and exits.
 * "AppStatusListener" picks up the "SparkListenerBlockManagerAdded" event and 
updates "AppStatusStore".
 * "statusTracker.getExecutorInfos" consults "AppStatusStore" to get the list 
of executors, which returns the dead executor as alive.
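
The ordering above can be reproduced with a small toy model. This is only an illustrative sketch in Python, not Spark code: "ToyDriver" and all of its methods are hypothetical names, and the dictionary merely stands in for "AppStatusStore".

```python
# Toy model of the driver-side bookkeeping; NOT Spark code.
# All class and method names here are hypothetical.
class ToyDriver:
    def __init__(self):
        self.scheduler_executors = set()  # scheduler backend's view
        self.status_store = {}            # executor id -> active flag (stand-in for AppStatusStore)

    def register_executor(self, executor_id):
        self.scheduler_executors.add(executor_id)
        self.status_store[executor_id] = True

    def remove_executor(self, executor_id):
        # StopExecutor is sent asynchronously (not modelled); the driver then
        # forgets the executor and publishes an "executor removed" event.
        self.scheduler_executors.discard(executor_id)
        self.status_store[executor_id] = False

    def heartbeat(self, executor_id):
        # Returns True when the executor is unknown, i.e. the driver asks the
        # executor to re-register its BlockManager.
        return executor_id not in self.scheduler_executors

    def register_block_manager(self, executor_id):
        # The "block manager added" event marks the executor as active again.
        self.status_store[executor_id] = True

    def live_executors(self):
        return sorted(e for e, active in self.status_store.items() if active)


driver = ToyDriver()
driver.register_executor("1")
driver.remove_executor("1")          # StopExecutor is still in flight
if driver.heartbeat("1"):            # executor has not processed StopExecutor yet
    driver.register_block_manager("1")
# The executor now processes StopExecutor and exits; no further event arrives.
stale = driver.live_executors()
print(stale)  # ['1'] : the dead executor is still listed as alive
```

In this model nothing ever clears the flag set by the late re-registration, which mirrors the stale entry observed in the UI.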

 

*Proposed Solution:*

Maintain a cache of recently removed executors on the Driver. During 
registration in BlockManagerMasterEndpoint, if the BlockManager belongs to a 
recently removed executor, return None to indicate that the registration is 
ignored because the executor will be shutting down soon.

On BlockManagerHeartbeat, if the BlockManager belongs to a recently removed 
executor, return true to indicate that the driver knows about it, thereby 
preventing re-registration.
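
A minimal sketch of the proposed behaviour, again as a Python toy model and not the actual Spark change. "RecentlyRemovedExecutors" and "ToyBlockManagerMaster" are hypothetical names, and the 60-second expiry is an assumption made only to give "recently" a concrete meaning.

```python
import time


class RecentlyRemovedExecutors:
    """Hypothetical time-bounded cache of executor ids removed by the driver."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds  # assumed expiry; the proposal does not fix a value
        self._clock = clock
        self._removed_at = {}

    def add(self, executor_id):
        self._removed_at[executor_id] = self._clock()

    def __contains__(self, executor_id):
        removed_at = self._removed_at.get(executor_id)
        if removed_at is None:
            return False
        if self._clock() - removed_at > self._ttl:
            del self._removed_at[executor_id]  # entry expired, forget it
            return False
        return True


class ToyBlockManagerMaster:
    """Toy stand-in for the registration and heartbeat handling on the driver."""

    def __init__(self, recently_removed):
        self._recently_removed = recently_removed
        self._block_managers = set()

    def register(self, executor_id):
        if executor_id in self._recently_removed:
            return None  # registration ignored: executor is about to shut down
        self._block_managers.add(executor_id)
        return executor_id

    def heartbeat(self, executor_id):
        if executor_id in self._recently_removed:
            return True  # pretend the block manager is known, so no re-registration
        return executor_id in self._block_managers


now = [0.0]  # fake clock so the example is deterministic
removed = RecentlyRemovedExecutors(ttl_seconds=60.0, clock=lambda: now[0])
master = ToyBlockManagerMaster(removed)

removed.add("1")                 # driver has just removed executor 1
ignored = master.register("1")   # None: late registration is dropped
known = master.heartbeat("1")    # True: executor is not asked to re-register
now[0] = 61.0                    # the cache entry has expired
accepted = master.register("1")  # "1": registrations are accepted again
```

Bounding the cache by time keeps it from growing without limit when many executors are removed under Dynamic Allocation; a size-bounded cache would be another option.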



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
