[GitHub] [spark] LantaoJin edited a comment on issue #25971: [SPARK-29298][CORE] Separate block manager heartbeat endpoint from driver endpoint
LantaoJin edited a comment on issue #25971: [SPARK-29298][CORE] Separate block manager heartbeat endpoint from driver endpoint URL: https://github.com/apache/spark/pull/25971#issuecomment-552390189

https://github.com/apache/spark/pull/25971#issuecomment-551008777 Our testing also illustrated this. Honestly speaking, this patch cannot resolve every problem around a hot driver (there are too many pieces to consider), but I think it fixes one of them and helps a long-running thrift-server reach production quality.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
LantaoJin edited a comment on issue #25971: [SPARK-29298][CORE] Separate block manager heartbeat endpoint from driver endpoint URL: https://github.com/apache/spark/pull/25971#issuecomment-552389230

I paid much more attention to driver death, since we run the thriftserver as a long-running service. Failed tasks/jobs may be retried successfully, or resubmitted by upper-layer scheduling tools/users; but, remarkably, the driver (thriftserver) lives longer and more stably with this patch. @Ngone51 Imagine this: in our production, the driver is busy, but not busy for its whole lifetime (unlike a stress test); jobs/tasks may fail and executors may be lost sometimes. For a long-running service we can tolerate jobs/tasks occasionally failing while the driver is busy, as long as the driver stays alive. What we cannot accept is the service entering downtime whenever a hot driver causes mass executor loss.
LantaoJin edited a comment on issue #25971: [SPARK-29298][CORE] Separate block manager heartbeat endpoint from driver endpoint URL: https://github.com/apache/spark/pull/25971#issuecomment-552060448

Thanks for the comment @jiangxb1987

> Please correct me if I'm wrong but I don't see approach to retry when `GetLocations*` requests timeout.

The `GetLocations` event never times out. https://github.com/apache/spark/blob/e1ea806b3075d279b5f08a29fe4c1ad6d3c4191a/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala#L85

The `BlockManagerHeartbeat` event can time out, and when it does we treat it as an executor lost. https://github.com/apache/spark/blob/70987d8144f4f2c094f3b82d0c4a98e818366225/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L271

But with a busy block manager, the executors are not actually dead; they are mistakenly treated as lost because of that timeout. That is what this PR fixes.

> so other events do not have timeout? or they will retry if timeout?

Previously I wasn't certain about that, but I believe they do not time out. `BlockManagerHeartbeat` is the only event I see sent with a timeout parameter:
```scala
driverEndpoint.askSync[T](BlockManagerHeartbeat, new RpcTimeout(..))
```

> I won't call an async message Heartbeat.

Sorry, I still keep it sync. https://github.com/apache/spark/blob/7b8b398633789b65d116ce716d6fb1afcded0427/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L270
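The timeout-as-lost behavior described above can be sketched in a self-contained form. This is only an illustration: `askHeartbeat`, `executorSeemsAlive`, and the durations are hypothetical stand-ins, not Spark APIs.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical stand-in for driverEndpoint.askSync: a busy endpoint answers slowly.
def askHeartbeat(endpointBusyMs: Long): Future[Boolean] =
  Future { Thread.sleep(endpointBusyMs); true }

// Mirrors the behavior described in the comment: a heartbeat reply that exceeds
// its timeout is interpreted as "executor lost", even if the executor is healthy.
def executorSeemsAlive(endpointBusyMs: Long, timeout: FiniteDuration): Boolean =
  try Await.result(askHeartbeat(endpointBusyMs), timeout)
  catch { case _: TimeoutException => false }

val idleEndpoint = executorSeemsAlive(endpointBusyMs = 0, timeout = 500.millis)
val busyEndpoint = executorSeemsAlive(endpointBusyMs = 2000, timeout = 500.millis)
println(s"idle endpoint: alive=$idleEndpoint, busy endpoint: alive=$busyEndpoint")
```

The point of the sketch is that the executor's code is identical in both cases; only the load on the endpoint answering the ask differs.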
LantaoJin edited a comment on issue #25971: [SPARK-29298][CORE] Separate block manager heartbeat endpoint from driver endpoint URL: https://github.com/apache/spark/pull/25971#issuecomment-551150480

Indeed, the handling of many other events is synchronous, but the heartbeat can be handled asynchronously. This fix avoids losing too many executors as collateral damage.
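The separation being argued for here can be illustrated with a minimal sketch, using plain `java.util.concurrent` single-threaded executors as stand-ins for RPC endpoints (nothing below is Spark API):

```scala
import java.util.concurrent.{Executors, TimeUnit, TimeoutException}

// Shared "endpoint": one single-threaded queue, so one slow synchronous message
// (e.g. a heavy block-manager request) delays every heartbeat queued behind it.
val shared = Executors.newSingleThreadExecutor()
shared.submit(new Runnable { def run(): Unit = Thread.sleep(2000) })
val hbShared = shared.submit(new Runnable { def run(): Unit = () })
val sharedTimedOut =
  try { hbShared.get(500, TimeUnit.MILLISECONDS); false }
  catch { case _: TimeoutException => true }

// Dedicated heartbeat "endpoint": its queue carries only heartbeats,
// so the busy shared queue cannot delay it past the timeout.
val dedicated = Executors.newSingleThreadExecutor()
val hbDedicated = dedicated.submit(new Runnable { def run(): Unit = () })
val dedicatedTimedOut =
  try { hbDedicated.get(500, TimeUnit.MILLISECONDS); false }
  catch { case _: TimeoutException => true }

println(s"shared: timedOut=$sharedTimedOut, dedicated: timedOut=$dedicatedTimedOut")
shared.shutdownNow(); dedicated.shutdown()
```

The design choice this models: moving heartbeats to their own endpoint does not make the busy work faster, it just stops unrelated slow messages from starving heartbeat processing and triggering false executor-lost events.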
LantaoJin edited a comment on issue #25971: [SPARK-29298][CORE] Separate block manager heartbeat endpoint from driver endpoint URL: https://github.com/apache/spark/pull/25971#issuecomment-551146116

Maybe, but I didn't check the failure handling of every event type. What I can confirm is that frequent executor loss makes the situation worse, and in our case it is fatal. In practice, performance here did improve after the heartbeat event was separated, though I'm not sure other events such as `removeExecutor` handle timeouts gracefully. In our practice, availability increased noticeably under a busy BlockManagerMaster, and jobs finally succeed.