[ https://issues.apache.org/jira/browse/SPARK-34949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-34949:
------------------------------------

    Assignee: Apache Spark

> Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-34949
>                 URL: https://issues.apache.org/jira/browse/SPARK-34949
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0
>         Environment: Resource Manager: K8s
>            Reporter: Sumeet
>            Assignee: Apache Spark
>            Priority: Major
>              Labels: Executor, heartbeat
>
> *Problem:*
> I was testing Dynamic Allocation on K8s with about 300 executors. When the
> executors were torn down due to "spark.dynamicAllocation.executorIdleTimeout",
> I noticed that all the executor pods had been removed from K8s. However, under
> the "Executors" tab in SparkUI, I could still see some executors listed as alive.
> [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
> also returned a value greater than 1.
>
> *Cause:*
> * "CoarseGrainedSchedulerBackend" issues RemoveExecutor on an
> "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the
> "listenerBus"
> * "CoarseGrainedExecutorBackend" starts the executor shutdown
> * "HeartbeatReceiver" picks up the "SparkListenerExecutorRemoved" event and
> removes the executor from "executorLastSeen"
> * In the meantime, the executor reports a Heartbeat. Now "HeartbeatReceiver"
> cannot find the "executorId" in "executorLastSeen" and hence responds with
> "HeartbeatResponse(reregisterBlockManager = true)"
> * The Executor now calls "env.blockManager.reregister()" and reregisters
> itself, creating an inconsistency
>
> *Proposed Solution:*
> The "reportHeartBeat" method is not aware that the Executor is shutting down;
> it should check "executorShutdown" before reregistering.
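The sequence under *Cause* and the guard under *Proposed Solution* can be sketched with a small stand-alone model. This is not Spark's code: "ToyHeartbeatReceiver", "ToyExecutor" and "reregisterCount" are hypothetical names made up for illustration, "HeartbeatResponse" is reduced to the single field quoted in the report, and modelling "executorShutdown" as an AtomicBoolean is an assumption.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.collection.mutable

// Simplified stand-in for the response quoted in the report (one field only).
final case class HeartbeatResponse(reregisterBlockManager: Boolean)

// Hypothetical toy model of the driver-side bookkeeping; not Spark's class.
final class ToyHeartbeatReceiver {
  private val executorLastSeen = mutable.Map.empty[String, Long]

  def register(executorId: String): Unit =
    executorLastSeen(executorId) = System.currentTimeMillis()

  // Models the reaction to SparkListenerExecutorRemoved described in the report.
  def onExecutorRemoved(executorId: String): Unit =
    executorLastSeen.remove(executorId)

  // An executor missing from executorLastSeen is told to re-register.
  def heartbeat(executorId: String): HeartbeatResponse =
    if (executorLastSeen.contains(executorId)) {
      executorLastSeen(executorId) = System.currentTimeMillis()
      HeartbeatResponse(reregisterBlockManager = false)
    } else {
      HeartbeatResponse(reregisterBlockManager = true)
    }
}

// Hypothetical toy model of the executor side; not Spark's Executor.
final class ToyExecutor(executorId: String, receiver: ToyHeartbeatReceiver) {
  // Assumed to be an AtomicBoolean flag set once shutdown begins.
  val executorShutdown = new AtomicBoolean(false)

  // Counts calls that would have gone to env.blockManager.reregister().
  var reregisterCount = 0

  def stop(): Unit = executorShutdown.set(true)

  def reportHeartBeat(): Unit = {
    val response = receiver.heartbeat(executorId)
    // Proposed guard: skip re-registration once shutdown has started.
    if (response.reregisterBlockManager && !executorShutdown.get()) {
      reregisterCount += 1
    }
  }
}
```

In this model, an executor that is removed on the driver side and then stopped leaves "reregisterCount" at 0 after a late heartbeat, while an executor that is unknown to the receiver but still running re-registers as before.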