[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-11-25 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-64323860
  
Andrew's got a patch for this: #3447


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-11-24 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-64316156
  
It looks like this patch may have introduced a race-condition / bug during 
multi-master failover: https://issues.apache.org/jira/browse/SPARK-4592.  I'm 
working on a fix, but thought I'd mention the JIRA here in case any of this 
patch's reviewers would be interested in providing feedback.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59833561
  
Don't worry about it. This test is a little flaky and will be fixed 
shortly. I highly doubt that the test failure is caused by this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread mccheah
Github user mccheah commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59824518
  
The PR doesn't seem to be related to the unit tests that failed. How shall 
we tackle this issue?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59823485
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21926/consoleFull)
 for   PR 2828 at commit 
[`83f8bc9`](https://github.com/apache/spark/commit/83f8bc9a18c9d2663127f4f8142ae3f5273db2d2).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class ReconnectWorker(masterUrl: String) extends DeployMessage`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59823491
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21926/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2828


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59815368
  
This looks good to me.  Thanks!  I'm going to merge this into `master`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59814881
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21926/consoleFull)
 for   PR 2828 at commit 
[`83f8bc9`](https://github.com/apache/spark/commit/83f8bc9a18c9d2663127f4f8142ae3f5273db2d2).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59814026
  
sure, I created the JIRA: https://issues.apache.org/jira/browse/SPARK-4011



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r19102413
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -166,26 +178,47 @@ private[spark] class Worker(
 }
   }
 
-  def registerWithMaster() {
-tryRegisterAllMasters()
-var retries = 0
-registrationRetryTimer = Some {
-  context.system.scheduler.schedule(REGISTRATION_TIMEOUT, 
REGISTRATION_TIMEOUT) {
-Utils.tryOrExit {
-  retries += 1
-  if (registered) {
-registrationRetryTimer.foreach(_.cancel())
-  } else if (retries >= REGISTRATION_RETRIES) {
-logError("All masters are unresponsive! Giving up.")
-System.exit(1)
-  } else {
-tryRegisterAllMasters()
+  private def retryConnectToMaster() {
+logInfo("ping")
+Utils.tryOrExit {
+  connectionAttemptCount += 1
+  if (registered) {
+registrationRetryTimer.foreach(_.cancel())
+registrationRetryTimer = None
+  } else if (connectionAttemptCount <= TOTAL_REGISTRATION_RETRIES) {
+tryRegisterAllMasters()
+if (connectionAttemptCount == INITIAL_REGISTRATION_RETRIES) {
+  registrationRetryTimer.foreach(_.cancel())
+  registrationRetryTimer = Some {
+
context.system.scheduler.schedule(PROLONGED_REGISTRATION_RETRY_INTERVAL,
+  PROLONGED_REGISTRATION_RETRY_INTERVAL)(retryConnectToMaster)
   }
 }
+  } else {
+logError("All masters are unresponsive! Giving up.")
+System.exit(1)
   }
 }
   }
 
+  def registerWithMaster() {
+// DisassociatedEvent may be triggered multiple times, so don't 
attempt registration
+// if there are outstanding registration attempts scheduled.
+registrationRetryTimer match {
+  case None =>
+registered = false
+tryRegisterAllMasters()
+connectionAttemptCount = 0
+registrationRetryTimer = Some {
+  
context.system.scheduler.schedule(INITIAL_REGISTRATION_RETRY_INTERVAL,
+  INITIAL_REGISTRATION_RETRY_INTERVAL)(retryConnectToMaster)
--- End diff --

I'd indent this line by another two spaces to make it more obvious that 
it's a continuation of the previous line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r19102299
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -64,8 +66,17 @@ private[spark] class Worker(
   // Send a heartbeat every (heartbeat timeout) / 4 milliseconds
   val HEARTBEAT_MILLIS = conf.getLong("spark.worker.timeout", 60) * 1000 / 
4
 
-  val REGISTRATION_TIMEOUT = 20.seconds
-  val REGISTRATION_RETRIES = 3
+  val INITIAL_REGISTRATION_RETRIES = 6
--- End diff --

Maybe add a comment above this line to say that this is modeled after 
Hadoop's design.  This will help future maintainers to understand this code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59812443
  
As a general principle, you should use the most private access modifiers 
that are sufficient.  We can always make methods / fields _more_ visible, but 
it's much harder to remove / change functionality once it's been exposed to 
other components.

W.r.t. refactoring, I agree with Mark: a large-scale refactoring of access 
modifiers should happen in a separate PR, not here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59811803
  
A legitimate concern, and certainly something that could be worked up into 
a JIRA issue and separate pull request.  But it's not a very pressing issue 
since nothing is in the public API, and a larger refactoring of Worker 
shouldn't be conflated with this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r19101917
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -166,26 +178,47 @@ private[spark] class Worker(
 }
   }
 
-  def registerWithMaster() {
-tryRegisterAllMasters()
-var retries = 0
-registrationRetryTimer = Some {
-  context.system.scheduler.schedule(REGISTRATION_TIMEOUT, 
REGISTRATION_TIMEOUT) {
-Utils.tryOrExit {
-  retries += 1
-  if (registered) {
-registrationRetryTimer.foreach(_.cancel())
-  } else if (retries >= REGISTRATION_RETRIES) {
-logError("All masters are unresponsive! Giving up.")
-System.exit(1)
-  } else {
-tryRegisterAllMasters()
+  private def retryConnectToMaster() {
+logInfo("ping")
--- End diff --

This log message could be more informative.  I'd say something like

```scala
logInfo(s"Attempting to connect to master (attempt # 
$connectionAttemptCount)")
```

I'd also move this into the `Utils.tryOrExit` block so that we print the 
incremented `connectionAttemptCount`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59810532
  
@markhamstra , yeah, my concern is just this, though Worker is marked as 
private[spark], is it a good practice to expose every detail in the 
implementation to the other components?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59810037
  
@CodingCat, Worker is private[spark], so what is the nature of your 
concern?  In fact, I'm wondering whether we really want the changes in this PR 
that make some methods inaccessible from the rest of spark.  I haven't looked 
at the accessibility of Worker's methods in detail to say for certain what the 
correct modifier should be in each case; but if we want to change them, that's 
a refactoring that can and should be addressed in another PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59807896
  
@JoshRosen , this is awesome to test Spark integration with Docker

@mccheah , this PR is LGTM now, except that we exposed too many 
should-be-private members in Worker (not your fault, existing in the current 
code).. not sure about the reason@pwendell @markhamstra you have some 
insights about this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-20 Thread mccheah
Github user mccheah commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59803881
  
@JoshRosen agreed with @ash211, this is really good.

Are there any actual comments on the PR, or can it be merged? =)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-18 Thread ash211
Github user ash211 commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59639085
  
This is EXCELLENT work @JoshRosen !  Looking forward to future integration 
tests that cover these sorts of behaviors.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-18 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59602394
  
**tl;dr**: _this patch looks pretty good to me based on the testing that 
I've done so far.  For my own interest / fun, I'd like to find a way to extend 
my test coverage to include the "worker-initiated reconnect" and "master 
restart" cases, but my tests shouldn't necessarily block the merging / review 
of this patch._

To summarize my understanding of the failure scenarios that this PR 
addresses:

- A worker becomes disassociated from the master, due to either:
  - The master failing and restarting
  - A transient network issue

These scenarios are similar but there's one subtle distinction.  In the 
first scenario, the master forgets all previously-registered workers.  In the 
second scenario, the master can remember that a worker was 
previously-registered even though it may now be disassociated.

In some of these scenarios, a disconnection may be reflected at the master, 
worker, or both (perhaps at different times).  For example, a master might 
deregister a worker if it has not received Spark-level heartbeats from it, or a 
worker might disassociate from a master due to the Akka failure detector being 
triggered.
 
After this PR, there are two paths that can lead to a worker reconnection:

- A master (which stayed alive) receives a heartbeat from 
previously-registered but now de-registered worker and asks that worker to 
reconnect.
- A worker discovers that it has become disassociated from the master and 
attempts to initiate a reconnection.

I've been working on building a Docker-based integration testing framework 
for testing these sorts of Spark Standalone fault-tolerance issues (to 
hopefully be released publicly sometime soon).

I thought it would be interesting to test the "master stays alive but 
deregisters workers due to not receiving heartbeats" case by simulating network 
issues.  In my testing framework, I added a 
[Jepsen](https://github.com/aphyr/jepsen)-inspired network fault-injector which 
updates `iptables` rules in a boot2docker VM in order to temporarily break 
network links.  Here's the actual code that I wrote to test this PR:

```scala
test("workers should reconnect to master if disconnected due to transient 
network issues") {
  // Regression test for SPARK-3736
  val env = Seq(
"SPARK_MASTER_OPTS" -> "-Dspark.worker.timeout=2",
"SPARK_WORKER_OPTS" -> "-Dspark.worker.timeout=2 -Dspark.akka.timeout=1 
-Dspark.akka.failure-detector.threshold=1 -Dspark.akka.heartbeat.interval=1"
  )
  cluster = SparkClusters.createStandaloneCluster(env, numWorkers = 1)
  val master = cluster.masters.head
  val worker = cluster.workers.head
  master.getState.liveWorkerIPs.size should be (1)
  println("Cluster launched with one worker")

  networkFaultInjector.dropTraffic(master.container, worker.container)
  networkFaultInjector.dropTraffic(worker.container, master.container)
  eventually(timeout(30 seconds), interval(1 seconds)) {
master.getState.liveWorkerIPs.size should be (0)
  }
  println("Master shows that zero workers are registered after network 
connection fails")

  networkFaultInjector.restore()
  eventually(timeout(30 seconds), interval(1 seconds)) {
master.getState.liveWorkerIPs.size should be (1)
  }
  println("Master shows one worker after network connection is restored")
}
```

While running this against the current Spark master: after I kill the 
network connection between the master and worker, the master more-or-less 
immediately times out the worker and disconnects it.  However, the worker 
doesn't realize that it has become deregistered from the master.  This happens 
because the master detects worker liveness using our own heartbeat mechanism, 
whereas the worker detects master liveness using Akka's failure-detection 
mechanisms (to see this, note that the worker's `masterDisconnected()` function 
is only invoked from the `DisassociatedEvent` message handler).

As a result, we end up in a scenario where the master receives a heartbeat 
from the de-registered workers that does not realize that it has been 
deregistered.  This PR addresses this case by having the master explicitly ask 
the worker to reconnect (via the `ReconnectWorker` message).  Thanks to this 
mechanism, my test passes with this PR's code!

I'm still working on testing the case where the worker receives a 
DisassociationEvent and initiates the reconnection itself.  To do this, I'll 
need to figure out how to configure the Akka failure detector so that it 
quickly fails in my testing suite.  I'll also need to add a way to query the 
worker to ask whether it has become disconnected from the master so that I can 
drop packets for long enough in order to cau

[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59598071
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21873/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59598067
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21873/consoleFull)
 for   PR 2828 at commit 
[`fe0e02f`](https://github.com/apache/spark/commit/fe0e02feaa8ac3e01ea7e90240e46a3d5a276864).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class ReconnectWorker(masterUrl: String) extends DeployMessage`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59596562
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21873/consoleFull)
 for   PR 2828 at commit 
[`fe0e02f`](https://github.com/apache/spark/commit/fe0e02feaa8ac3e01ea7e90240e46a3d5a276864).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-17 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r19011614
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -362,9 +372,19 @@ private[spark] class Worker(
 }
   }
 
-  def masterDisconnected() {
+  private def masterDisconnected() {
 logError("Connection to master failed! Waiting for master to 
reconnect...")
 connected = false
+scheduleAttemptsToReconnectToMaster()
+  }
+
+  private def scheduleAttemptsToReconnectToMaster() {
--- End diff --

Thank you, @ash211 

I think prolonging the interval between attempts and adding some jitter 
there is very reasonablemaybe in this patch, we can also change the current 
implementation of registrationRetryTimer?@mccheah 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r19002923
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -362,9 +372,19 @@ private[spark] class Worker(
 }
   }
 
-  def masterDisconnected() {
+  private def masterDisconnected() {
 logError("Connection to master failed! Waiting for master to 
reconnect...")
 connected = false
+scheduleAttemptsToReconnectToMaster()
+  }
+
+  private def scheduleAttemptsToReconnectToMaster() {
--- End diff --

I dug into Hadoop source and actually found out that the default policy for 
Hadoop reconnects is to retry every 10 seconds for 6 attempts, and then every 
60 seconds for 10 attempts.  Each attempt also has a fuzz factor applied of 
[0.5t, 1.5t] to prevent a thundering herd of reconnect attempts across the 
cluster.

I don't have a strong opinion on infinite vs ~10min of retries -- I'd vote 
for following Hadoop's lead unless presented with compelling arguments to do 
something different.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59439376
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21822/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59439367
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21822/consoleFull)
 for   PR 2828 at commit 
[`94ddeca`](https://github.com/apache/spark/commit/94ddeca5390c5746b767b99ab8086d651e474978).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class ReconnectWorker(masterUrl: String) extends DeployMessage`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59435763
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21818/consoleFull)
 for   PR 2828 at commit 
[`a698e35`](https://github.com/apache/spark/commit/a698e356b05129ef2e7b9fadd73a1f2d9184c5a0).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class ReconnectWorker(masterUrl: String) extends DeployMessage`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59435771
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21818/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread mccheah
Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18988742
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -362,9 +372,19 @@ private[spark] class Worker(
 }
   }
 
-  def masterDisconnected() {
+  private def masterDisconnected() {
 logError("Connection to master failed! Waiting for master to 
reconnect...")
 connected = false
+scheduleAttemptsToReconnectToMaster()
+  }
+
+  private def scheduleAttemptsToReconnectToMaster() {
--- End diff --

@ash211 ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18988140
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -362,9 +372,19 @@ private[spark] class Worker(
 }
   }
 
-  def masterDisconnected() {
+  private def masterDisconnected() {
 logError("Connection to master failed! Waiting for master to 
reconnect...")
 connected = false
+scheduleAttemptsToReconnectToMaster()
+  }
+
+  private def scheduleAttemptsToReconnectToMaster() {
--- End diff --

not sure about the motivation of that Hadoop let tasktracker retries 
forever.might be different with our case


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18988031
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -362,9 +372,19 @@ private[spark] class Worker(
 }
   }
 
-  def masterDisconnected() {
+  private def masterDisconnected() {
 logError("Connection to master failed! Waiting for master to 
reconnect...")
 connected = false
+scheduleAttemptsToReconnectToMaster()
+  }
+
+  private def scheduleAttemptsToReconnectToMaster() {
--- End diff --

hmmm.now, I think exit after several retries might be better, 

In your case, without restarting the worker after the restarting master may 
bring some problems, especially when the user didn't set RECOVERY_MODE,  all 
application information is lost, for instance, the application whose resource 
requirement hasn't been filled will not be served anymorethe complete 
system will run in a weird status, so you eventually need to restart the 
applications (i.e. kill executors -> restart , which is equivalent to restart 
all workers)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread mccheah
Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18986941
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -362,9 +372,19 @@ private[spark] class Worker(
 }
   }
 
-  def masterDisconnected() {
+  private def masterDisconnected() {
 logError("Connection to master failed! Waiting for master to 
reconnect...")
 connected = false
+scheduleAttemptsToReconnectToMaster()
+  }
+
+  private def scheduleAttemptsToReconnectToMaster() {
--- End diff --

One other option is to have the logic factored out into a separate block 
that passes in an optional number of times to retry, and set the option to None 
in reconnect and set the option to the appropriate number on initial startup.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread mccheah
Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18986702
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -362,9 +372,19 @@ private[spark] class Worker(
 }
   }
 
-  def masterDisconnected() {
+  private def masterDisconnected() {
 logError("Connection to master failed! Waiting for master to 
reconnect...")
 connected = false
+scheduleAttemptsToReconnectToMaster()
+  }
+
+  private def scheduleAttemptsToReconnectToMaster() {
--- End diff --

I'm okay with leaving it retrying indefinitely. The user may not notice the 
error until much later, and then reboot the master. If the workers decide to 
stop trying, the user will need to bounce the workers as well.

I agree with having the logic being very similar is a bit of a pain, but 
these are really two different scenarios, so I could foresee such 
nearly-duplicated logic being justified in either direction.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59430399
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21822/consoleFull)
 for   PR 2828 at commit 
[`94ddeca`](https://github.com/apache/spark/commit/94ddeca5390c5746b767b99ab8086d651e474978).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18986488
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -362,9 +372,19 @@ private[spark] class Worker(
 }
   }
 
-  def masterDisconnected() {
+  private def masterDisconnected() {
 logError("Connection to master failed! Waiting for master to 
reconnect...")
 connected = false
+scheduleAttemptsToReconnectToMaster()
+  }
+
+  private def scheduleAttemptsToReconnectToMaster() {
--- End diff --

I seethen I will vote to do something different with Hadoop by reusing 
registrationRetryTimerotherwise the inconsistency of the logic in the two 
*similar* code blocks makes the program a bit fishy 




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread mccheah
Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18986188
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -362,9 +372,19 @@ private[spark] class Worker(
 }
   }
 
-  def masterDisconnected() {
+  private def masterDisconnected() {
 logError("Connection to master failed! Waiting for master to 
reconnect...")
 connected = false
+scheduleAttemptsToReconnectToMaster()
+  }
+
+  private def scheduleAttemptsToReconnectToMaster() {
--- End diff --

In that case we can't directly use registrationRetryTimer, as that 
explicitly kills the worker after a certain number of retries.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18985880
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -243,6 +249,10 @@ private[spark] class Worker(
 System.exit(1)
   }
 
+case ReconnectWorker(masterUrl) =>
+  logWarning(s"Master with url $masterUrl requested this worker to 
reconnect.")
--- End diff --

Why Warning?  Seems more natural to me for the disconnect/failure to reply 
to be a WARN, but the subsequent reconnect request and related actions to just 
be INFO-level events.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18985861
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -362,9 +372,19 @@ private[spark] class Worker(
 }
   }
 
-  def masterDisconnected() {
+  private def masterDisconnected() {
 logError("Connection to master failed! Waiting for master to 
reconnect...")
 connected = false
+scheduleAttemptsToReconnectToMaster()
+  }
+
+  private def scheduleAttemptsToReconnectToMaster() {
--- End diff --

according to @ash211 , "The preferred alternative is to follow what Hadoop 
does – when there's a disconnect, attempt to reconnect at a particular 
interval until successful (I think it repeats indefinitely every 10sec).", I 
think we can do the same thing...just let the thread try infinitely 




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59425990
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21818/consoleFull)
 for   PR 2828 at commit 
[`a698e35`](https://github.com/apache/spark/commit/a698e356b05129ef2e7b9fadd73a1f2d9184c5a0).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59425292
  
add to whitelist


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread mccheah
Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18978981
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -362,9 +372,19 @@ private[spark] class Worker(
 }
   }
 
-  def masterDisconnected() {
+  private def masterDisconnected() {
 logError("Connection to master failed! Waiting for master to 
reconnect...")
 connected = false
+scheduleAttemptsToReconnectToMaster()
+  }
+
+  private def scheduleAttemptsToReconnectToMaster() {
--- End diff --

The logic would need to be refactored a bit, but it might be doable. It 
uses the registered flag to determine if it should stop attempts to 
re-register, and otherwise attempts to reconnect.

If we toggle the registered flag upon disassociation as well we might be 
able to just call registerWithMaster(). However, I'm not sure what the 
implications of that are. Also, do we necessarily want the worker to give up 
reconnection after a certain number of retries in this case?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18978412
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -362,9 +372,19 @@ private[spark] class Worker(
 }
   }
 
-  def masterDisconnected() {
+  private def masterDisconnected() {
 logError("Connection to master failed! Waiting for master to 
reconnect...")
 connected = false
+scheduleAttemptsToReconnectToMaster()
+  }
+
+  private def scheduleAttemptsToReconnectToMaster() {
--- End diff --

is it possible to reuse registrationRetryTimer in Worker?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread mccheah
Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18977288
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala 
---
@@ -341,7 +341,11 @@ private[spark] class Master(
 case Some(workerInfo) =>
   workerInfo.lastHeartbeat = System.currentTimeMillis()
 case None =>
-  logWarning("Got heartbeat from unregistered worker " + workerId)
+  if (workers.map(_.id).contains(workerId)) {
--- End diff --

The above observation is correct - only workers that have previously 
registered with the master are allowed to reconnect. Workers that are 
connecting for the first time shouldn't be allowed to spawn a heartbeat and 
have the master send back a reconnection message.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18977018
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala 
---
@@ -341,7 +341,11 @@ private[spark] class Master(
 case Some(workerInfo) =>
   workerInfo.lastHeartbeat = System.currentTimeMillis()
 case None =>
-  logWarning("Got heartbeat from unregistered worker " + workerId)
+  if (workers.map(_.id).contains(workerId)) {
--- End diff --

@ash211 what he is trying to do seems to be that, only before we decide 
this worker is DEAD, we allow the reconnect


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18976619
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -365,6 +375,16 @@ private[spark] class Worker(
   def masterDisconnected() {
 logError("Connection to master failed! Waiting for master to 
reconnect...")
 connected = false
+scheduleAttemptsToReconnectToMaster()
+  }
+
+  def scheduleAttemptsToReconnectToMaster() {
--- End diff --

private?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18976594
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -365,6 +375,16 @@ private[spark] class Worker(
   def masterDisconnected() {
--- End diff --

shall this method be private? we call it somewhere else?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18976148
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala 
---
@@ -341,7 +341,11 @@ private[spark] class Master(
 case Some(workerInfo) =>
   workerInfo.lastHeartbeat = System.currentTimeMillis()
 case None =>
-  logWarning("Got heartbeat from unregistered worker " + workerId)
+  if (workers.map(_.id).contains(workerId)) {
--- End diff --

Should this have a not in it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/2828#discussion_r18976036
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala 
---
@@ -94,6 +96,7 @@ private[spark] class Worker(
   val finishedExecutors = new HashMap[String, ExecutorRunner]
   val drivers = new HashMap[String, DriverRunner]
   val finishedDrivers = new HashMap[String, DriverRunner]
+  var scheduledReconnectMessage: Option[Cancellable] = None
--- End diff --

scheduledReconnectTask? when I looked at this variable, I expected it to be 
some case class representing the message itself


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread mccheah
Github user mccheah commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59408043
  
One remark is that there are no automated tests in this commit for now.

I was unsuccessful in setting up TestKit to emulate a worker and master 
sending messages to each other. I also have not seen any other unit tests that 
test message passing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2828#issuecomment-59406399
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3736] Workers reconnect when disassocia...

2014-10-16 Thread mccheah
GitHub user mccheah opened a pull request:

https://github.com/apache/spark/pull/2828

[SPARK-3736] Workers reconnect when disassociated from the master.

Before, if the master node is killed and restarted, the worker nodes
would not attempt to reconnect to the Master. Therefore, when the Master
node was restarted, the worker nodes needed to be restarted as well.

Now, when the Master node is disconnected, the worker nodes will
continuously ping the master node in attempts to reconnect to it. Once
the master node restarts, it will detect one of the registration
requests from its former workers. The result is that the cluster
re-enters a healthy state.

In addition, when the master does not receive a heartbeat from the
worker, the worker was removed; however, when the worker sent a
heartbeat to the master, the master used to ignore the heartbeat. Now,
a master that receives a heartbeat from a worker that had been
disconnected will request the worker to re-attempt the registration
process, at which point the worker will send a RegisterWorker request
and be re-connected accordingly.

Re-connection attempts per worker are submitted every N seconds, where N
is configured by the property spark.worker.reconnect.interval - this has
a default of 60 seconds right now.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mccheah/spark reconnect-dead-workers

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2828.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2828


commit b5b34af964199af296e12490413225f55d93a6cd
Author: mcheah 
Date:   2014-10-15T22:27:21Z

[SPARK-3736] Workers reconnect when disassociated from the master.

Before, if the master node is killed and restarted, the worker nodes
would not attempt to reconnect to the Master. Therefore, when the Master
node was restarted, the worker nodes needed to be restarted as well.

Now, when the Master node is disconnected, the worker nodes will
continuously ping the master node in attempts to reconnect to it. Once
the master node restarts, it will detect one of the registration
requests from its former workers. The result is that the cluster
re-enters a healthy state.

In addition, when the master does not receive a heartbeat from the
worker, the worker was removed; however, when the worker sent a
heartbeat to the master, the master used to ignore the heartbeat. Now,
a master that receives a heartbeat from a worker that had been
disconnected will request the worker to re-attempt the registration
process, at which point the worker will send a RegisterWorker request
and be re-connected accordingly.

Re-connection attempts per worker are submitted every N seconds, where N
is configured by the property spark.worker.reconnect.interval - this has
a default of 60 seconds right now.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org