[GitHub] [spark] agrawaldevesh commented on a change in pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-20 Thread GitBox


agrawaldevesh commented on a change in pull request #29032:
URL: https://github.com/apache/spark/pull/29032#discussion_r457795979



##
File path: core/src/main/scala/org/apache/spark/scheduler/ExecutorDecommissionInfo.scala
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+/**
+ * Provides more detail when an executor is being decommissioned.
+ * @param message Human readable reason for why the decommissioning is happening.
+ * @param isHostDecommissioned Whether the host (aka the `node` or `worker` in other places) is
+ *   being decommissioned too. Used to infer if the shuffle data might
+ *   be lost if external shuffle service is enabled.

Review comment:
   Actually, sorry ... "even if" is indeed okay here :-) 
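
   For context, the doc comment in the diff above describes the new struct; a minimal sketch of the case class it documents (field names follow the `@param` tags; not necessarily the exact PR code):

```scala
// Sketch of the struct documented above: a human-readable reason plus a flag
// saying whether the whole host (worker/node) is going away too.
case class ExecutorDecommissionInfo(message: String, isHostDecommissioned: Boolean)

// Example: a decommission triggered by the whole worker going away, which
// implies any host-local external-shuffle-service data may be lost as well.
val info = ExecutorDecommissionInfo("worker decommissioned", isHostDecommissioned = true)
assert(info.isHostDecommissioned)
```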





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] agrawaldevesh commented on a change in pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-20 Thread GitBox


agrawaldevesh commented on a change in pull request #29032:
URL: https://github.com/apache/spark/pull/29032#discussion_r457770003



##
File path: core/src/main/scala/org/apache/spark/deploy/master/Master.scala
##
@@ -871,7 +871,7 @@ private[deploy] class Master(
 logInfo("Telling app of decommission executors")
 exec.application.driver.send(ExecutorUpdated(
   exec.id, ExecutorState.DECOMMISSIONED,
-  Some("worker decommissioned"), None, workerLost = false))
+  Some("worker decommissioned"), None, workerLost = true))

Review comment:
   On second thoughts, this is a bigger change: I will just explain the 
`workerLost = true` setting better in the code. 
   
   The main problem is that the Deploy messages are between the Master and the 
Driver, while the ExecutorDecommissionInfo is between the Driver and the 
Executors. In any case, each cluster manager would need to have its own way of 
somehow communicating the decommissioning metadata to the driver, and this 
workerLost could be the Master's internal way.
   
   The alternatives would be to either introduce another 
`ExecutorDecommissioned(id, MasterToDriverExecutorDecommissionInfo)` struct, or 
enhance `ExecutorUpdated.workerLost` to be a full Struct/Any that can take the 
MasterToDriverExecutorDecommissionInfo. 
   
   This all seems like a bit too much machinery for now, so I will just comment 
it inline in the code. 
   
   Thank you for raising this question :-) 
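
   A rough sketch of the alternative being weighed above (all names here are hypothetical, taken from the comment itself, not from the actual PR):

```scala
// Hypothetical alternative: a dedicated Master -> Driver message carrying a
// structured decommission payload, instead of widening ExecutorUpdated.workerLost.
case class MasterToDriverExecutorDecommissionInfo(
    message: String,
    workerAlsoDecommissioned: Boolean)

case class ExecutorDecommissioned(
    executorId: String,
    info: MasterToDriverExecutorDecommissionInfo)

val msg = ExecutorDecommissioned(
  "0",
  MasterToDriverExecutorDecommissionInfo("worker decommissioned", workerAlsoDecommissioned = true))
assert(msg.info.workerAlsoDecommissioned)
```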








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-20 Thread GitBox


agrawaldevesh commented on a change in pull request #29032:
URL: https://github.com/apache/spark/pull/29032#discussion_r457765945



##
File path: core/src/main/scala/org/apache/spark/scheduler/ExecutorDecommissionInfo.scala
##
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+/**
+ * Provides more detail when an executor is being decommissioned.
+ * @param message Human readable reason for why the decommissioning is happening.
+ * @param isHostDecommissioned Whether the host (aka the `node` or `worker` in other places) is
+ *   being decommissioned too. Used to infer if the shuffle data might
+ *   be lost if external shuffle service is enabled.

Review comment:
   Changed it to "in case", which felt more accurate: barring a few 
deployments, the external shuffle service is host-local, so if a host is 
decommissioned then its shuffle service will invariably go away, and thus so 
would its shuffle data. 
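
   The host-locality argument above can be sketched as a small predicate (an illustrative sketch under the assumptions stated in the comment, not PR code):

```scala
case class ExecutorDecommissionInfo(message: String, isHostDecommissioned: Boolean)

// With a host-local external shuffle service, shuffle data outlives the
// executor only while the host itself stays up; without the service, the
// shuffle files die with the executor regardless.
def shuffleDataMayBeLost(
    info: ExecutorDecommissionInfo,
    externalShuffleServiceEnabled: Boolean): Boolean =
  if (externalShuffleServiceEnabled) info.isHostDecommissioned
  else true

assert(!shuffleDataMayBeLost(
  ExecutorDecommissionInfo("executor only", isHostDecommissioned = false),
  externalShuffleServiceEnabled = true))
assert(shuffleDataMayBeLost(
  ExecutorDecommissionInfo("host going away", isHostDecommissioned = true),
  externalShuffleServiceEnabled = true))
```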








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-20 Thread GitBox


agrawaldevesh commented on a change in pull request #29032:
URL: https://github.com/apache/spark/pull/29032#discussion_r457764507



##
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##
@@ -166,11 +166,12 @@ private[spark] class CoarseGrainedExecutorBackend(
 exitExecutor(1, "Received LaunchTask command but executor was null")
   } else {
 if (decommissioned) {
-  logError("Asked to launch a task while decommissioned.")
+  val msg = "Asked to launch a task while decommissioned."
+  logError(msg)

Review comment:
   It's not broken into two lines for style. I wanted to reuse the "msg" 
down below. Same deal elsewhere in this PR. 








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-20 Thread GitBox


agrawaldevesh commented on a change in pull request #29032:
URL: https://github.com/apache/spark/pull/29032#discussion_r457764260



##
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##
@@ -166,11 +166,12 @@ private[spark] class CoarseGrainedExecutorBackend(
 exitExecutor(1, "Received LaunchTask command but executor was null")
   } else {
 if (decommissioned) {
-  logError("Asked to launch a task while decommissioned.")
+  val msg = "Asked to launch a task while decommissioned."
+  logError(msg)
   driver match {
 case Some(endpoint) =>
   logInfo("Sending DecommissionExecutor to driver.")
-  endpoint.send(DecommissionExecutor(executorId))
+  endpoint.send(DecommissionExecutor(executorId, ExecutorDecommissionInfo(msg, false)))

Review comment:
   Good idea. 








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-20 Thread GitBox


agrawaldevesh commented on a change in pull request #29032:
URL: https://github.com/apache/spark/pull/29032#discussion_r457764133



##
File path: core/src/main/scala/org/apache/spark/deploy/master/Master.scala
##
@@ -871,7 +871,7 @@ private[deploy] class Master(
 logInfo("Telling app of decommission executors")
 exec.application.driver.send(ExecutorUpdated(
   exec.id, ExecutorState.DECOMMISSIONED,
-  Some("worker decommissioned"), None, workerLost = false))
+  Some("worker decommissioned"), None, workerLost = true))

Review comment:
   The short answer is that I sort of hacked it here :-)
   
   So on line 186 of StandaloneAppClient.scala in this PR, I actually plumb the 
workerLost (= true) into ExecutorDecommissionInfo.isHostDecommissioned. 
   
   The proper thing to do here would be to not communicate executor 
decommissioning via ExecutorUpdated in the standalone codepath, but instead 
have a separate ExecutorDecommissioned message that can take the 
ExecutorDecommissionInfo. The ExecutorDecommissionInfo should be prepared by 
the cluster manager directly, since it has the most context about why the 
decommission is happening (i.e., exactly which pieces would be going away). 
   
   This is just a slightly bigger change and maybe I should just do it here. 
Let me try it.
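
   The "hack" described above, reusing the boolean `workerLost` flag as `isHostDecommissioned`, can be sketched like this (an illustrative sketch; the real plumbing lives in StandaloneAppClient.scala and the helper name here is hypothetical):

```scala
case class ExecutorDecommissionInfo(message: String, isHostDecommissioned: Boolean)

// Sketch: the standalone client receives an ExecutorUpdated-style payload and
// folds its workerLost flag into the decommission info handed to the scheduler.
def toDecommissionInfo(message: Option[String], workerLost: Boolean): ExecutorDecommissionInfo =
  ExecutorDecommissionInfo(message.getOrElse("decommissioned"), isHostDecommissioned = workerLost)

val info = toDecommissionInfo(Some("worker decommissioned"), workerLost = true)
assert(info.isHostDecommissioned)
```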








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-16 Thread GitBox


agrawaldevesh commented on a change in pull request #29032:
URL: https://github.com/apache/spark/pull/29032#discussion_r456023348



##
File path: core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala
##
@@ -101,7 +101,8 @@ private[spark] trait TaskScheduler {
   /**
* Process a decommissioning executor.
*/
-  def executorDecommission(executorId: String): Unit
+  def executorDecommission(

Review comment:
   Okay. I was just trying to avoid changing a bunch of mock/test 
implementations of this interface. Will revert to having this be purely 
abstract.
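
   The purely abstract signature discussed above would look roughly like this (a sketch; the parameter names and the mock are assumptions, not PR code):

```scala
case class ExecutorDecommissionInfo(message: String, isHostDecommissioned: Boolean)

// Dropping the default implementation forces every concrete TaskScheduler
// (including test mocks) to handle the new decommission-info parameter.
trait TaskScheduler {
  /** Process a decommissioning executor. */
  def executorDecommission(executorId: String, decommissionInfo: ExecutorDecommissionInfo): Unit
}

// A trivial mock, like the test implementations mentioned in the comment.
class RecordingScheduler extends TaskScheduler {
  var last: Option[(String, ExecutorDecommissionInfo)] = None
  override def executorDecommission(executorId: String, decommissionInfo: ExecutorDecommissionInfo): Unit =
    last = Some((executorId, decommissionInfo))
}

val sched = new RecordingScheduler
sched.executorDecommission("exec-1", ExecutorDecommissionInfo("scale down", isHostDecommissioned = false))
assert(sched.last.exists(_._1 == "exec-1"))
```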








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-15 Thread GitBox


agrawaldevesh commented on a change in pull request #29032:
URL: https://github.com/apache/spark/pull/29032#discussion_r455401121



##
File path: core/src/main/scala/org/apache/spark/scheduler/DecommissionInfo.scala
##
@@ -0,0 +1,27 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+/**
+ * Provides more detail about a decommissioning event.
+ * @param message Human readable reason for why the decommissioning is happening.
+ * @param isWorkerDecommissioned Whether the worker is being decommissioned too.
+ *   Used to know if the shuffle data might be lost too.
+ */
+private[spark]
+case class DecommissionInfo(message: String, isWorkerDecommissioned: Boolean)

Review comment:
   Yeah. It's actually used in https://github.com/apache/spark/pull/29014 
(which builds atop this one). I pulled the usage part into that PR since it was 
more closely coupled there.








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-15 Thread GitBox


agrawaldevesh commented on a change in pull request #29032:
URL: https://github.com/apache/spark/pull/29032#discussion_r455399122



##
File path: core/src/main/scala/org/apache/spark/scheduler/DecommissionInfo.scala
##
@@ -0,0 +1,27 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+/**
+ * Provides more detail about a decommissioning event
+ * @param message human readable reason for why the decommissioning is happening
+ * @param workerAlsoDecommissioned whether the worker is being decommissioned too.
+ *   Used to know if the shuffle data might be lost too.
+ */
+private[spark]
+case class DecommissionInfo(message: String, workerAlsoDecommissioned: Boolean)

Review comment:
   Fair enough, I can see why the "Worker" nomenclature is confusing, although 
`ExecutorProcessLost` has a `workerLost` field that can (in theory) be set by 
any scheduler backend. I will rename/reword to make this independence clearer. 








[GitHub] [spark] agrawaldevesh commented on a change in pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-13 Thread GitBox


agrawaldevesh commented on a change in pull request #29032:
URL: https://github.com/apache/spark/pull/29032#discussion_r453796305



##
File path: core/src/main/scala/org/apache/spark/scheduler/DecommissionInfo.scala
##
@@ -0,0 +1,27 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+/**
+ * Provides more detail about a decommissioning event
+ * @param message human readable reason for why the decommissioning is happening
+ * @param workerAlsoDecommissioned whether the worker is being decommissioned too.
+ *   Used to know if the shuffle data might be lost too.
+ */
+private[spark]
+case class DecommissionInfo(message: String, workerAlsoDecommissioned: Boolean)

Review comment:
   DecommissionInfo is just an `info` struct. It's not the loss reason 
itself, so it shouldn't overload ExecutorLossReason. 
   
   I didn't get the part about TaskScheduler.executorDecommission supporting 
other cluster managers: TaskScheduler is about task scheduling, not 
coarse-grained executor scheduling. I think it should be totally unaware of 
whether the decommission signal came via YARN or K8s or whatever other cluster 
manager we have.











[GitHub] [spark] agrawaldevesh commented on a change in pull request #29032: [SPARK-32217] Plumb whether a worker would also be decommissioned along with executor

2020-07-13 Thread GitBox


agrawaldevesh commented on a change in pull request #29032:
URL: https://github.com/apache/spark/pull/29032#discussion_r453794030



##
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
##
@@ -136,6 +136,8 @@ private[spark] class TaskSchedulerImpl(
   // IDs of the tasks running on each executor
   private val executorIdToRunningTaskIds = new HashMap[String, HashSet[Long]]
 
+  private val executorsPendingDecommission = new HashMap[String, DecommissionInfo]

Review comment:
   I was debating about this. I think I am now convinced of moving it to 
#29014 :-). 
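
   A sketch of the bookkeeping this map enables (illustrative only; per the comment above, this usage was ultimately moved to #29014):

```scala
import scala.collection.mutable

case class DecommissionInfo(message: String, workerAlsoDecommissioned: Boolean)

// Remember each executor's decommission info until the executor is removed,
// so the removal path can tell whether its shuffle data went away too.
val executorsPendingDecommission = new mutable.HashMap[String, DecommissionInfo]
executorsPendingDecommission("exec-1") = DecommissionInfo("spot reclaim", workerAlsoDecommissioned = true)

val removed = executorsPendingDecommission.remove("exec-1")
assert(removed.exists(_.workerAlsoDecommissioned))
assert(executorsPendingDecommission.isEmpty)
```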




