[GitHub] [spark] Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend
Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend URL: https://github.com/apache/spark/pull/26980#discussion_r361350788 ## File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala ## @@ -199,14 +201,25 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) if (now - lastSeenMs > executorTimeoutMs) { logWarning(s"Removing executor $executorId with no recent heartbeats: " + s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms") -scheduler.executorLost(executorId, SlaveLost("Executor heartbeat " + Review comment: > Since we assume killing executors may take a while, looks reasonable to call scheduler.executorLost and re-submit tasks earlier. I noticed a wired problem of this. That is when we call `scheduler.executorLost` in `HeartbeatReceiver`, `scheduler` knows which executor lost, but `CoarseGrainedSchedulerBackend` doesn't. So, even if we want to re-submit tasks earlier, tasks can still be launched on that lost executor. But for other two places where we call `scheduler.executorLost`, it guarantees that `CoarseGrainedSchedulerBackend` has realized about the lost/killed executor. So, I think that `HeartbeatReceiver` may has used it incorrectly, here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend
Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend URL: https://github.com/apache/spark/pull/26980#discussion_r361348573 ## File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala ## @@ -199,14 +201,25 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) if (now - lastSeenMs > executorTimeoutMs) { logWarning(s"Removing executor $executorId with no recent heartbeats: " + s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms") -scheduler.executorLost(executorId, SlaveLost("Executor heartbeat " + Review comment: No serious hurt, but gives waring/error says that the executor has already been removed, etc. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend
Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend URL: https://github.com/apache/spark/pull/26980#discussion_r361243811 ## File path: core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala ## @@ -153,7 +154,6 @@ class HeartbeatReceiverSuite heartbeatReceiverClock.advance(executorTimeout) heartbeatReceiverRef.askSync[Boolean](ExpireDeadHosts) // Only the second executor should be expired as a dead host -verify(scheduler).executorLost(meq(executorId2), any()) Review comment: Remove this because we no longer invoke `scheduler.executorLost` in `HeartbeatReceiver`'s main thread. And asserts below have already proved that `executorId2` has been expired. So, I think it should be find to remove. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend
Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend URL: https://github.com/apache/spark/pull/26980#discussion_r361243807 ## File path: core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala ## @@ -153,7 +154,6 @@ class HeartbeatReceiverSuite heartbeatReceiverClock.advance(executorTimeout) heartbeatReceiverRef.askSync[Boolean](ExpireDeadHosts) // Only the second executor should be expired as a dead host -verify(scheduler).executorLost(meq(executorId2), any()) Review comment: Remove this because we no longer invoke `scheduler.executorLost` in `HeartbeatReceiver`'s main thread. And asserts below have already proved that `executorId2` has been expired. So, I think it should be find to remove. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend
Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend URL: https://github.com/apache/spark/pull/26980#discussion_r361243811 ## File path: core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala ## @@ -153,7 +154,6 @@ class HeartbeatReceiverSuite heartbeatReceiverClock.advance(executorTimeout) heartbeatReceiverRef.askSync[Boolean](ExpireDeadHosts) // Only the second executor should be expired as a dead host -verify(scheduler).executorLost(meq(executorId2), any()) Review comment: Remove this because we no longer invoke `scheduler.executorLost` in `HeartbeatReceiver`'s main thread. And asserts below have already proved that `executorId2` has been expired. So, I think it should be fine to remove. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend
Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend URL: https://github.com/apache/spark/pull/26980#discussion_r360858546 ## File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala ## @@ -199,14 +201,20 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) if (now - lastSeenMs > executorTimeoutMs) { logWarning(s"Removing executor $executorId with no recent heartbeats: " + s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms") -scheduler.executorLost(executorId, SlaveLost("Executor heartbeat " + - s"timed out after ${now - lastSeenMs} ms")) Review comment: I think this is needed for non-`TaskSchedulerImpl` ? This can be a behavior change for other TaskScheduler, I'll revert this later. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend
Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend URL: https://github.com/apache/spark/pull/26980#discussion_r360819689 ## File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala ## @@ -199,14 +201,20 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) if (now - lastSeenMs > executorTimeoutMs) { logWarning(s"Removing executor $executorId with no recent heartbeats: " + s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms") -scheduler.executorLost(executorId, SlaveLost("Executor heartbeat " + - s"timed out after ${now - lastSeenMs} ms")) - // Asynchronously kill the executor to avoid blocking the current thread +// Asynchronously kill the executor to avoid blocking the current thread killExecutorThread.submit(new Runnable { override def run(): Unit = Utils.tryLogNonFatalError { // Note: we want to get an executor back after expiring this one, // so do not simply call `sc.killExecutor` here (SPARK-8119) sc.killAndReplaceExecutor(executorId) +// In case of the executors which are not gracefully shut down, we should remove Review comment: Yeah, only. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend
Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend URL: https://github.com/apache/spark/pull/26980#discussion_r360819557 ## File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala ## @@ -199,14 +201,20 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) if (now - lastSeenMs > executorTimeoutMs) { logWarning(s"Removing executor $executorId with no recent heartbeats: " + s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms") -scheduler.executorLost(executorId, SlaveLost("Executor heartbeat " + - s"timed out after ${now - lastSeenMs} ms")) Review comment: If we fail to remove it from `CoarseGrainedSchedulerBackend`, then, it means the executor has been removed previously(e.g. `onDisconnected()`). And `scheduler.executorLost()` must has been called, too. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend
Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend URL: https://github.com/apache/spark/pull/26980#discussion_r360809654 ## File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala ## @@ -199,14 +201,20 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) if (now - lastSeenMs > executorTimeoutMs) { logWarning(s"Removing executor $executorId with no recent heartbeats: " + s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms") -scheduler.executorLost(executorId, SlaveLost("Executor heartbeat " + - s"timed out after ${now - lastSeenMs} ms")) - // Asynchronously kill the executor to avoid blocking the current thread +// Asynchronously kill the executor to avoid blocking the current thread killExecutorThread.submit(new Runnable { override def run(): Unit = Utils.tryLogNonFatalError { // Note: we want to get an executor back after expiring this one, // so do not simply call `sc.killExecutor` here (SPARK-8119) sc.killAndReplaceExecutor(executorId) +// In case of the executors which are not gracefully shut down, we should remove Review comment: Yeah, in this case, `CoarseGrainedSchedulerBackend` can't handle non-graceful executor shutdown well. But I am not sure what do you mean by "only", here? What other factors do you also care about? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend
Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend URL: https://github.com/apache/spark/pull/26980#discussion_r360796114 ## File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala ## @@ -199,14 +201,20 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) if (now - lastSeenMs > executorTimeoutMs) { logWarning(s"Removing executor $executorId with no recent heartbeats: " + s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms") -scheduler.executorLost(executorId, SlaveLost("Executor heartbeat " + - s"timed out after ${now - lastSeenMs} ms")) - // Asynchronously kill the executor to avoid blocking the current thread +// Asynchronously kill the executor to avoid blocking the current thread killExecutorThread.submit(new Runnable { override def run(): Unit = Utils.tryLogNonFatalError { // Note: we want to get an executor back after expiring this one, // so do not simply call `sc.killExecutor` here (SPARK-8119) sc.killAndReplaceExecutor(executorId) +// In case of the executors which are not gracefully shut down, we should remove Review comment: No, no. Please pay attention to the comment above and note that, here, we're not saying that this issue happens due to `CoarseGrainedSchedulerBackend` may shut down executors non-gracefully. But we say that it may happens due to the executor itself is not gracefully shut down. It's possible that `CoarseGrainedSchedulerBackend` asks someone executor to shutdown and the executor may shutdown non-gracefully. And it's also possible that the host where the executor located suddenly crushed that makes the executor shutdown non-gracefully. Shutdown non-gracefully here only means that an executor shutdown but `CoarseGrainedSchedulerBackend` doesn't receive disconnect event from it. It really doesn't matter what causes the executor shutdown non-gracefully. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend
Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend URL: https://github.com/apache/spark/pull/26980#discussion_r360790014 ## File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala ## @@ -199,14 +201,20 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) if (now - lastSeenMs > executorTimeoutMs) { logWarning(s"Removing executor $executorId with no recent heartbeats: " + s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms") -scheduler.executorLost(executorId, SlaveLost("Executor heartbeat " + - s"timed out after ${now - lastSeenMs} ms")) - // Asynchronously kill the executor to avoid blocking the current thread +// Asynchronously kill the executor to avoid blocking the current thread killExecutorThread.submit(new Runnable { override def run(): Unit = Utils.tryLogNonFatalError { // Note: we want to get an executor back after expiring this one, // so do not simply call `sc.killExecutor` here (SPARK-8119) sc.killAndReplaceExecutor(executorId) +// In case of the executors which are not gracefully shut down, we should remove Review comment: Match `CoarseGrainedSchedulerBackend` covers all scheduler backend(Standalone, YARN, K8S, Mesos) that officially supported by Spark and we can guarantee that this issue can be resolved internally. But for other plugged scheduler backend, Spark seems has nothing to do with current `SchedulerBackend` interface at least for this issue. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend
Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend URL: https://github.com/apache/spark/pull/26980#discussion_r360782760 ## File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala ## @@ -199,14 +201,20 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) if (now - lastSeenMs > executorTimeoutMs) { logWarning(s"Removing executor $executorId with no recent heartbeats: " + s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms") -scheduler.executorLost(executorId, SlaveLost("Executor heartbeat " + - s"timed out after ${now - lastSeenMs} ms")) Review comment: Hmm...This may can be a problem... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend
Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend URL: https://github.com/apache/spark/pull/26980#discussion_r360782486 ## File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala ## @@ -199,14 +201,20 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) if (now - lastSeenMs > executorTimeoutMs) { logWarning(s"Removing executor $executorId with no recent heartbeats: " + s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms") -scheduler.executorLost(executorId, SlaveLost("Executor heartbeat " + - s"timed out after ${now - lastSeenMs} ms")) - // Asynchronously kill the executor to avoid blocking the current thread +// Asynchronously kill the executor to avoid blocking the current thread killExecutorThread.submit(new Runnable { override def run(): Unit = Utils.tryLogNonFatalError { // Note: we want to get an executor back after expiring this one, // so do not simply call `sc.killExecutor` here (SPARK-8119) sc.killAndReplaceExecutor(executorId) +// In case of the executors which are not gracefully shut down, we should remove Review comment: > Do you mean CoarseGrainedSchedulerBackend is the only one that may shut down executors non-gracefully? No. IIUC, it's executor's own behavior but not related to scheduler backend. An executor can be self-lost(no heartbeat) but still alive, e.g. due to network problem. > And if it shuts down gracefully, is it OK to have 2 RemoveExecutor events? Yes, we have protection for this case. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend
Ngone51 commented on a change in pull request #26980: [SPARK-27348][Core] HeartbeatReceiver should remove lost executors from CoarseGrainedSchedulerBackend URL: https://github.com/apache/spark/pull/26980#discussion_r360714352 ## File path: core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala ## @@ -199,14 +201,20 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, clock: Clock) if (now - lastSeenMs > executorTimeoutMs) { logWarning(s"Removing executor $executorId with no recent heartbeats: " + s"${now - lastSeenMs} ms exceeds timeout $executorTimeoutMs ms") -scheduler.executorLost(executorId, SlaveLost("Executor heartbeat " + - s"timed out after ${now - lastSeenMs} ms")) Review comment: Since remove executor in `CoarseGrainedSchedulerBackend` will always do `scheduler.executorLost()`, so it's no longer needed here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org