[jira] [Updated] (SPARK-35011) Avoid Block Manager registerations when StopExecutor msg is in-flight.
     [ https://issues.apache.org/jira/browse/SPARK-35011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-35011:
----------------------------------
    Fix Version/s:     (was: 3.0.4)
                       (was: 3.1.3)
                       (was: 3.2.0)

> Avoid Block Manager registerations when StopExecutor msg is in-flight.
> ----------------------------------------------------------------------
>
>                 Key: SPARK-35011
>                 URL: https://issues.apache.org/jira/browse/SPARK-35011
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.1, 3.2.0
>            Reporter: Sumeet
>            Priority: Major
>              Labels: BlockManager, core
>
> *Note:* This is a follow-up to SPARK-34949. Even after the heartbeat fix,
> the driver reports dead executors as alive.
>
> *Problem:*
> I was testing Dynamic Allocation on K8s with about 300 executors. When the
> executors were torn down due to
> "spark.dynamicAllocation.executorIdleTimeout", all the executor pods were
> removed from K8s, but the "Executors" tab in the Spark UI still listed some
> executors as alive.
> [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
> also returned a value greater than 1.
>
> *Cause:*
> * "CoarseGrainedSchedulerBackend" issues an async "StopExecutor" on
> executorEndpoint.
> * "CoarseGrainedSchedulerBackend" removes that executor from the Driver's
> internal data structures and publishes "SparkListenerExecutorRemoved" on the
> "listenerBus".
> * The Executor has still not processed "StopExecutor" from the Driver.
> * The Driver receives a heartbeat from the Executor. Since it cannot find
> the "executorId" in its data structures, it responds with
> "HeartbeatResponse(reregisterBlockManager = true)".
> * The "BlockManager" on the Executor re-registers with the
> "BlockManagerMaster", and "SparkListenerBlockManagerAdded" is published on
> the "listenerBus".
> * The Executor starts processing the "StopExecutor" and exits.
> * "AppStatusListener" picks up the "SparkListenerBlockManagerAdded" event
> and updates "AppStatusStore".
> * "statusTracker.getExecutorInfos" reads "AppStatusStore" to get the list
> of executors, which returns the dead executor as alive.
>
> *Proposed Solution:*
> Maintain a cache of recently removed executors on the Driver. During
> registration in BlockManagerMasterEndpoint, if the BlockManager belongs to a
> recently removed executor, return None, indicating that the registration is
> ignored because the executor will be shutting down soon.
> On BlockManagerHeartbeat, if the BlockManager belongs to a recently removed
> executor, return true, indicating that the driver knows about it, thereby
> preventing re-registration.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
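
The sequence in the Cause list, and the effect of the proposed guard, can be replayed in a small deterministic model. This is only a sketch in Python, not Spark code: `ToyDriver`, `run_race`, and the field names are hypothetical, and the three sets stand in for the scheduler backend, the BlockManagerMaster, and the AppStatusStore respectively. In this model `heartbeat` returns the re-register flag directly (the inverse of the "driver knows about it" boolean mentioned for BlockManagerHeartbeat).

```python
from dataclasses import dataclass, field


@dataclass
class ToyDriver:
    """Toy stand-in for driver-side bookkeeping; not Spark's real classes."""
    use_removed_cache: bool = False
    executors: set = field(default_factory=set)         # scheduler's view
    block_managers: set = field(default_factory=set)    # BlockManagerMaster's view
    ui_alive: set = field(default_factory=set)          # what the status store reports
    recently_removed: set = field(default_factory=set)  # the proposed cache

    def register_executor(self, exec_id):
        self.executors.add(exec_id)
        self.register_block_manager(exec_id)

    def remove_executor(self, exec_id):
        # StopExecutor is sent asynchronously; driver state is cleaned up at once.
        self.executors.discard(exec_id)
        self.block_managers.discard(exec_id)
        self.ui_alive.discard(exec_id)  # models SparkListenerExecutorRemoved
        if self.use_removed_cache:
            self.recently_removed.add(exec_id)

    def heartbeat(self, exec_id):
        """Return True when the executor is told to re-register its BlockManager."""
        if exec_id in self.block_managers:
            return False
        if exec_id in self.recently_removed:
            return False  # treated as known, so no re-registration is requested
        return True

    def register_block_manager(self, exec_id):
        """Return the id on success, or None when the registration is ignored."""
        if exec_id in self.recently_removed:
            return None
        self.block_managers.add(exec_id)
        self.ui_alive.add(exec_id)  # models SparkListenerBlockManagerAdded
        return exec_id


def run_race(driver):
    driver.register_executor("exec-1")
    driver.remove_executor("exec-1")   # StopExecutor is still in flight
    if driver.heartbeat("exec-1"):     # late heartbeat from the dying executor
        driver.register_block_manager("exec-1")
    # The executor now processes StopExecutor and exits; nothing cleans up after it.
    return driver.ui_alive


print(run_race(ToyDriver(use_removed_cache=False)))  # {'exec-1'}
print(run_race(ToyDriver(use_removed_cache=True)))   # set()
```

Without the cache, the late heartbeat leaves `exec-1` listed as alive after it has exited; with the cache, both the heartbeat path and the registration path refuse to resurrect it.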
[jira] [Updated] (SPARK-35011) Avoid Block Manager registerations when StopExecutor msg is in-flight.
     [ https://issues.apache.org/jira/browse/SPARK-35011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-35011:
----------------------------------
    Fix Version/s: 3.2.0

> Avoid Block Manager registerations when StopExecutor msg is in-flight.
> ----------------------------------------------------------------------
>
>                 Key: SPARK-35011
>                 URL: https://issues.apache.org/jira/browse/SPARK-35011
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.1, 3.2.0
>            Reporter: Sumeet
>            Assignee: Sumeet
>            Priority: Major
>              Labels: BlockManager, core
>             Fix For: 3.2.0, 3.1.3, 3.0.4
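
The proposed solution in the issue description speaks of a cache of "recently removed" executors, which suggests that entries should eventually expire so the cache does not grow without bound. The issue text does not say how expiry would work, so the following is a minimal Python sketch under that assumption: `RecentlyRemoved`, `FakeClock`, and the 60-second TTL are all hypothetical and are not taken from Spark.

```python
import time


class RecentlyRemoved:
    """Hypothetical time-bounded set of executor ids (not a Spark class)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested without sleeping
        self._removed_at = {}      # executor id -> time of removal

    def add(self, exec_id):
        self._removed_at[exec_id] = self._clock()

    def __contains__(self, exec_id):
        removed_at = self._removed_at.get(exec_id)
        if removed_at is None:
            return False
        if self._clock() - removed_at >= self._ttl:
            del self._removed_at[exec_id]  # expired: forget the executor
            return False
        return True


class FakeClock:
    """Manually advanced clock used only for the demonstration below."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
cache = RecentlyRemoved(ttl_seconds=60, clock=clock)
cache.add("exec-1")

clock.now = 30.0
print("exec-1" in cache)  # True: still inside the assumed 60 s window

clock.now = 61.0
print("exec-1" in cache)  # False: the entry has expired
```

The TTL would need to outlast the time a StopExecutor message can stay in flight; choosing that value is a tuning question that this sketch leaves open.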
[jira] [Updated] (SPARK-35011) Avoid Block Manager registerations when StopExecutor msg is in-flight.
     [ https://issues.apache.org/jira/browse/SPARK-35011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-35011:
----------------------------------
    Fix Version/s:     (was: 3.2.0)

> Avoid Block Manager registerations when StopExecutor msg is in-flight.
> ----------------------------------------------------------------------
>
>                 Key: SPARK-35011
>                 URL: https://issues.apache.org/jira/browse/SPARK-35011
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.1, 3.2.0
>            Reporter: Sumeet
>            Assignee: Sumeet
>            Priority: Major
>              Labels: BlockManager, core
>             Fix For: 3.1.3, 3.0.4
[jira] [Updated] (SPARK-35011) Avoid Block Manager registerations when StopExecutor msg is in-flight.
     [ https://issues.apache.org/jira/browse/SPARK-35011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sumeet updated SPARK-35011:
---------------------------
    Fix Version/s: 3.0.4

> Avoid Block Manager registerations when StopExecutor msg is in-flight.
> ----------------------------------------------------------------------
>
>                 Key: SPARK-35011
>                 URL: https://issues.apache.org/jira/browse/SPARK-35011
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.1, 3.2.0
>            Reporter: Sumeet
>            Assignee: Sumeet
>            Priority: Major
>              Labels: BlockManager, core
>             Fix For: 3.2.0, 3.1.3, 3.0.4
[jira] [Updated] (SPARK-35011) Avoid Block Manager registerations when StopExecutor msg is in-flight.
     [ https://issues.apache.org/jira/browse/SPARK-35011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-35011:
----------------------------------
    Fix Version/s: 3.1.3

> Avoid Block Manager registerations when StopExecutor msg is in-flight.
> ----------------------------------------------------------------------
>
>                 Key: SPARK-35011
>                 URL: https://issues.apache.org/jira/browse/SPARK-35011
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.1, 3.2.0
>            Reporter: Sumeet
>            Assignee: Sumeet
>            Priority: Major
>              Labels: BlockManager, core
>             Fix For: 3.2.0, 3.1.3
[jira] [Updated] (SPARK-35011) Avoid Block Manager registerations when StopExecutor msg is in-flight.
     [ https://issues.apache.org/jira/browse/SPARK-35011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sumeet updated SPARK-35011:
---------------------------
    Description: 
*Note:* This is a follow-up to SPARK-34949. Even after the heartbeat fix, the driver reports dead executors as alive.

*Problem:*
I was testing Dynamic Allocation on K8s with about 300 executors. When the executors were torn down due to "spark.dynamicAllocation.executorIdleTimeout", all the executor pods were removed from K8s, but the "Executors" tab in the Spark UI still listed some executors as alive.
[spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100] also returned a value greater than 1.

*Cause:*
* "CoarseGrainedSchedulerBackend" issues an async "StopExecutor" on executorEndpoint.
* "CoarseGrainedSchedulerBackend" removes that executor from the Driver's internal data structures and publishes "SparkListenerExecutorRemoved" on the "listenerBus".
* The Executor has still not processed "StopExecutor" from the Driver.
* The Driver receives a heartbeat from the Executor. Since it cannot find the "executorId" in its data structures, it responds with "HeartbeatResponse(reregisterBlockManager = true)".
* The "BlockManager" on the Executor re-registers with the "BlockManagerMaster", and "SparkListenerBlockManagerAdded" is published on the "listenerBus".
* The Executor starts processing the "StopExecutor" and exits.
* "AppStatusListener" picks up the "SparkListenerBlockManagerAdded" event and updates "AppStatusStore".
* "statusTracker.getExecutorInfos" reads "AppStatusStore" to get the list of executors, which returns the dead executor as alive.

*Proposed Solution:*
Maintain a cache of recently removed executors on the Driver. During registration in BlockManagerMasterEndpoint, if the BlockManager belongs to a recently removed executor, return None, indicating that the registration is ignored because the executor will be shutting down soon.
On BlockManagerHeartbeat, if the BlockManager belongs to a recently removed executor, return true, indicating that the driver knows about it, thereby preventing re-registration.

> Avoid Block Manager registerations when StopExecutor msg is in-flight.
> ----------------------------------------------------------------------
>
>                 Key: SPARK-35011
>                 URL: https://issues.apache.org/jira/browse/SPARK-35011
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0, 3.1.1
>            Reporter: Sumeet
>            Priority: Major
>              Labels: BlockManager, core
>
> *Note:* This is a follow-up on SPARK-34949, even