LantaoJin edited a comment on issue #25971: [SPARK-29298][CORE] Separate block manager heartbeat endpoint from driver endpoint URL: https://github.com/apache/spark/pull/25971#issuecomment-552060448

Thanks for the comment @jiangxb1987

> Please correct me if I'm wrong but I don't see approach to retry when `GetLocations*` requests timeout.

The `GetLocations` event never times out.
https://github.com/apache/spark/blob/e1ea806b3075d279b5f08a29fe4c1ad6d3c4191a/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala#L85

The `BlockManagerHeartbeat` event can time out, and when it does we treat the executor as lost.
https://github.com/apache/spark/blob/70987d8144f4f2c094f3b82d0c4a98e818366225/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L271

But with a busy block manager, executors that are not actually dead can be mistakenly treated as lost because of this spurious timeout. That is what this PR fixes.

> so other events do not have timeout? or they will retry if timeout?

I wasn't sure before, but I believe so: they do not time out. `BlockManagerHeartbeat` is the only one I see sent with a timeout parameter:

```scala
driverEndpoint.askSync[T](BlockManagerHeartbeat, new RpcTimeout(..))
```

> I won't call an async message Heartbeat.

Sorry, I still keep it sync.
https://github.com/apache/spark/blob/7b8b398633789b65d116ce716d6fb1afcded0427/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L270
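For context, the failure mode described above can be sketched without any Spark internals: a blocking ask with a deadline, where a timeout reply is indistinguishable from a dead executor. All names below (`heartbeatAlive`, `neverReplies`) are illustrative, not Spark's actual API; this is a minimal sketch of the pattern, not the PR's implementation.

```scala
import scala.concurrent.{Await, Future, Promise, TimeoutException}
import scala.concurrent.duration._

// Illustrative sketch of a synchronous heartbeat, mirroring the shape of
// driverEndpoint.askSync[Boolean](BlockManagerHeartbeat(..), new RpcTimeout(..)).
// Returns false ("executor lost") when no reply arrives within the timeout.
def heartbeatAlive(reply: Future[Boolean], timeout: FiniteDuration): Boolean =
  try {
    Await.result(reply, timeout) // like askSync: block until reply or timeout
  } catch {
    // A busy endpoint that cannot answer in time looks exactly like a dead
    // executor from here -- the misclassification the PR addresses by moving
    // heartbeats off the busy driver endpoint.
    case _: TimeoutException => false
  }

// A reply that never arrives simulates an overloaded block manager master.
val neverReplies = Promise[Boolean]().future
println(heartbeatAlive(neverReplies, 100.millis))          // prints "false"
println(heartbeatAlive(Future.successful(true), 100.millis)) // prints "true"
```

The point of the sketch: the caller cannot tell "slow" from "dead", so giving heartbeats a dedicated endpoint (rather than sharing the driver endpoint's queue) reduces spurious timeouts.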