[ https://issues.apache.org/jira/browse/SPARK-39984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575500#comment-17575500 ]
Apache Spark commented on SPARK-39984:
--------------------------------------

User 'kevin85421' has created a pull request for this issue:
https://github.com/apache/spark/pull/37411

> Check workerLastHeartbeat with master before HeartbeatReceiver expires an executor
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-39984
>                 URL: https://issues.apache.org/jira/browse/SPARK-39984
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.4.0
>            Reporter: Kai-Hsun Chen
>            Priority: Major
>
> Currently, the driver’s HeartbeatReceiver expires an executor if it does
> not receive a heartbeat from that executor for 120 seconds. 120 seconds is
> too long, but lowering the timeout threshold raises other challenges. For
> example, an executor that is performing GC cannot reply to any message, so
> a shorter timeout could expire an executor that is merely paused.
>
> Hence, this PR aims to provide a way to lower the timeout. Each worker
> sends heartbeats to the master periodically. If HeartbeatReceiver asks the
> master for the latest heartbeat from the worker that hosts the executor,
> it can determine whether the heartbeat loss is caused by network issues or
> by other issues (e.g. GC). If the heartbeat loss is not caused by network
> issues, HeartbeatReceiver puts the executor into a waitingList rather than
> expiring it immediately.
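
The decision described in the issue can be sketched roughly as follows. This is an illustrative Python model, not code from the pull request: the function name, the parameter names, and the timeout values used in the usage note are all hypothetical. It also assumes that a recent worker heartbeat at the master is taken as evidence that the network path is healthy, which is one reading of the description rather than something the issue states outright.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    ALIVE = "alive"    # executor heartbeat is recent enough
    WAIT = "wait"      # put the executor into the waiting list
    EXPIRE = "expire"  # expire the executor immediately


def decide(now_ms: int,
           executor_last_seen_ms: int,
           worker_last_heartbeat_ms: Optional[int],
           executor_timeout_ms: int,
           worker_timeout_ms: int) -> Decision:
    """Hypothetical model of the check described in SPARK-39984.

    worker_last_heartbeat_ms stands in for the value the driver would
    obtain from the master; None means the master has no record.
    """
    # The executor has reported recently: nothing to do.
    if now_ms - executor_last_seen_ms <= executor_timeout_ms:
        return Decision.ALIVE

    # The executor is silent, but its worker still reaches the master.
    # Assumption: this points to a non-network cause such as GC, so the
    # executor is parked instead of being expired right away.
    if (worker_last_heartbeat_ms is not None
            and now_ms - worker_last_heartbeat_ms <= worker_timeout_ms):
        return Decision.WAIT

    # Both the executor and its worker are silent: treat it as a
    # network or host failure and expire the executor.
    return Decision.EXPIRE
```

For example, with an assumed lowered executor timeout of 30 seconds and an assumed worker timeout of 60 seconds, `decide(100_000, 60_000, 95_000, 30_000, 60_000)` returns `Decision.WAIT`, because the executor has been silent for 40 seconds while its worker reported 5 seconds ago. How long an executor may stay in the waiting list before it is finally expired is not covered by this sketch.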