[jira] [Commented] (SPARK-34949) Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down

2021-08-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400577#comment-17400577
 ] 

Apache Spark commented on SPARK-34949:
--

User 'sumeetgajjar' has created a pull request for this issue:
https://github.com/apache/spark/pull/33770

> Executor.reportHeartBeat reregisters blockManager even when Executor is 
> shutting down
> -
>
> Key: SPARK-34949
> URL: https://issues.apache.org/jira/browse/SPARK-34949
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.1, 3.2.0
> Environment: Resource Manager: K8s
> Reporter: Sumeet
> Assignee: Sumeet
> Priority: Major
> Labels: Executor, heartbeat
> Fix For: 3.1.2, 3.2.0
>
>
> *Problem:*
> I was testing Dynamic Allocation on K8s with about 300 executors. When the 
> executors were torn down due to 
> "spark.dynamicAllocation.executorIdleTimeout", all the executor pods were 
> removed from K8s. However, the "Executors" tab in the Spark UI still listed 
> some executors as alive. 
> [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
>  also returned a value greater than 1 (the count includes the driver, so 1 
> was expected). 
>  
> *Cause:*
>  * "CoarseGrainedSchedulerBackend" issues "RemoveExecutor" on an 
> "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the 
> "listenerBus"
>  * "CoarseGrainedExecutorBackend" starts the executor shutdown
>  * "HeartbeatReceiver" picks up the "SparkListenerExecutorRemoved" event and 
> removes the executor from "executorLastSeen"
>  * In the meantime, the executor reports a heartbeat. "HeartbeatReceiver" 
> cannot find the "executorId" in "executorLastSeen" and hence responds with 
> "HeartbeatResponse(reregisterBlockManager = true)"
>  * The Executor now calls "env.blockManager.reregister()" and reregisters 
> itself, so the driver again lists an executor whose pod is already gone
>  
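The driver-side half of the sequence above can be sketched as a toy model. This is only an illustration in Python, not Spark's Scala code: the class `ToyHeartbeatReceiver`, its method names, and the executor id "exec-1" are assumptions made up for the example. Only the idea of a last-seen map and a response flag comes from the description.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatResponse:
    # Toy counterpart of HeartbeatResponse(reregisterBlockManager = ...).
    reregister_block_manager: bool


class ToyHeartbeatReceiver:
    """Hypothetical stand-in for the driver-side receiver; not Spark's class."""

    def __init__(self):
        # executor id -> time of the last heartbeat (models "executorLastSeen")
        self.executor_last_seen = {}

    def register(self, executor_id, now):
        self.executor_last_seen[executor_id] = now

    def on_executor_removed(self, executor_id):
        # Models the reaction to the SparkListenerExecutorRemoved event.
        self.executor_last_seen.pop(executor_id, None)

    def on_heartbeat(self, executor_id, now):
        if executor_id in self.executor_last_seen:
            self.executor_last_seen[executor_id] = now
            return HeartbeatResponse(reregister_block_manager=False)
        # Unknown executor: ask it to reregister its block manager.
        return HeartbeatResponse(reregister_block_manager=True)


receiver = ToyHeartbeatReceiver()
receiver.register("exec-1", now=0)
before = receiver.on_heartbeat("exec-1", now=10)

# The executor is removed on the driver while it is still shutting down ...
receiver.on_executor_removed("exec-1")
# ... and one more heartbeat from it arrives afterwards.
late = receiver.on_heartbeat("exec-1", now=20)

print(before.reregister_block_manager, late.reregister_block_manager)
```

In this model the late heartbeat is indistinguishable from one sent by an executor the driver has never seen, which is why the response asks for a reregistration.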
> *Proposed Solution:*
> The "reportHeartBeat" method is not aware that the Executor is shutting 
> down. It should check "executorShutdown" before reregistering. 
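
A minimal sketch of the proposed guard, again as a toy Python model rather than the actual Scala change: `ToyExecutor`, `ToyBlockManager`, and `handle_heartbeat_response` are hypothetical names, and a `threading.Event` stands in for the "executorShutdown" flag.

```python
import threading


class ToyBlockManager:
    """Hypothetical block manager that only counts reregistrations."""

    def __init__(self):
        self.reregister_calls = 0

    def reregister(self):
        self.reregister_calls += 1


class ToyExecutor:
    """Hypothetical executor; models only the shutdown flag and the guard."""

    def __init__(self, block_manager):
        self.block_manager = block_manager
        # Stand-in for the "executorShutdown" flag named in the issue.
        self.executor_shutdown = threading.Event()

    def stop(self):
        self.executor_shutdown.set()

    def handle_heartbeat_response(self, reregister_block_manager):
        # Reregister only if the driver asked for it AND we are not shutting down.
        if reregister_block_manager and not self.executor_shutdown.is_set():
            self.block_manager.reregister()


block_manager = ToyBlockManager()
executor = ToyExecutor(block_manager)

executor.handle_heartbeat_response(True)  # still running: reregisters
executor.stop()
executor.handle_heartbeat_response(True)  # shutting down: request is ignored

print(block_manager.reregister_calls)
```

Note that the check and the call are not atomic in this sketch, so a shutdown that begins between them could still slip through. It narrows the window described in the issue rather than closing it completely.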



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34949) Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down

2021-04-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314202#comment-17314202
 ] 

Apache Spark commented on SPARK-34949:
--

User 'sumeetgajjar' has created a pull request for this issue:
https://github.com/apache/spark/pull/32043

> Executor.reportHeartBeat reregisters blockManager even when Executor is 
> shutting down
> -
>
> Key: SPARK-34949
> URL: https://issues.apache.org/jira/browse/SPARK-34949
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0
> Environment: Resource Manager: K8s
> Reporter: Sumeet
> Priority: Major
> Labels: Executor, heartbeat






[jira] [Commented] (SPARK-34949) Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down

2021-04-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314200#comment-17314200
 ] 

Apache Spark commented on SPARK-34949:
--

User 'sumeetgajjar' has created a pull request for this issue:
https://github.com/apache/spark/pull/32043

> Executor.reportHeartBeat reregisters blockManager even when Executor is 
> shutting down
> -
>
> Key: SPARK-34949
> URL: https://issues.apache.org/jira/browse/SPARK-34949
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0
> Environment: Resource Manager: K8s
> Reporter: Sumeet
> Priority: Major
> Labels: Executor, heartbeat






[jira] [Commented] (SPARK-34949) Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down

2021-04-02 Thread Sumeet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314194#comment-17314194
 ] 

Sumeet commented on SPARK-34949:


I am working on this.

> Executor.reportHeartBeat reregisters blockManager even when Executor is 
> shutting down
> -
>
> Key: SPARK-34949
> URL: https://issues.apache.org/jira/browse/SPARK-34949
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0
> Environment: Resource Manager: K8s
> Reporter: Sumeet
> Priority: Minor
> Labels: Executor, heartbeat


