[jira] [Updated] (SPARK-1685) retryTimer not canceled on actor restart in Worker and AppClient

2014-04-30 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-1685:


Description: 
Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster 
when those Actors start.  The attempt at registration is accomplished by 
starting a retryTimer via the Akka scheduler that will use the registered 
timeout interval and retry number to make repeated attempts to register with 
all known Masters before giving up and either marking as dead or calling 
System.exit.

The receive methods of these actors can, however, throw exceptions, which will 
lead to the actor restarting, registerWithMaster being called again on restart, 
and another retryTimer being scheduled without canceling the already running 
retryTimer.  Assuming that the rest of the restart logic is correct for 
these actors (which I don't believe is actually a given), having multiple 
retryTimers running creates at least one condition in which the restarted 
actor will not be able to make the full number of retry attempts before an 
earlier retryTimer takes its give-up action.

Canceling the retryTimer in the actor's postStop hook should suffice. 

  was:
Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster 
when those Actors start.  The attempt at registration is accomplished by 
starting a retryTimer via the Akka scheduler that will use the registered 
timeout interval and retry number to make repeated attempts to register with 
all known Masters before giving up and either marking as dead or calling 
System.exit.

The receive methods of these actors can, however, throw exceptions, which will 
lead to the actors restarting, registerWithMaster being called again on 
restart, and another retryTimer being scheduled without canceling the already 
running retryTimer.  Assuming that all of the rest of the restart logic is 
correct for these actors (which I don't believe is actually a given), having 
multiple retryTimers running presents at least a condition in which the 
restarted actor will not be able to make the full number of retry attempts 
before an earlier retryTimer takes the give up action.

Canceling the retryTimer in the actor's postStop hook should suffice. 


 retryTimer not canceled on actor restart in Worker and AppClient
 

             Key: SPARK-1685
             URL: https://issues.apache.org/jira/browse/SPARK-1685
         Project: Spark
      Issue Type: Bug
      Components: Spark Core
Affects Versions: 0.9.0, 1.0.0, 0.9.1
        Reporter: Mark Hamstra
        Assignee: Mark Hamstra

 Both deploy.worker.Worker and deploy.client.AppClient try to 
 registerWithMaster when those Actors start.  The attempt at registration is 
 accomplished by starting a retryTimer via the Akka scheduler that will use 
 the registered timeout interval and retry number to make repeated attempts to 
 register with all known Masters before giving up and either marking as dead 
 or calling System.exit.

 The receive methods of these actors can, however, throw exceptions, which 
 will lead to the actor restarting, registerWithMaster being called again on 
 restart, and another retryTimer being scheduled without canceling the already 
 running retryTimer.  Assuming that the rest of the restart logic is correct 
 for these actors (which I don't believe is actually a given), having multiple 
 retryTimers running creates at least one condition in which the restarted 
 actor will not be able to make the full number of retry attempts before an 
 earlier retryTimer takes its give-up action.

 Canceling the retryTimer in the actor's postStop hook should suffice. 
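
The effect described in this issue can be modeled without Akka. The sketch below is an illustration only, not Spark or Akka code: it stands in for the Akka scheduler with a java.util.concurrent.ScheduledExecutorService, and every name other than registerWithMaster, retryTimer, preStart and postStop is hypothetical. It relies on Akka's default restart hooks, where the failed instance's postStop runs (via preRestart) before the replacement's preStart runs (via postRestart); here one object plays both incarnations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative model only. RetryTimerSketch, RegisteringActor and
// activeTimersAfterRestart are hypothetical names, not Spark classes.
class RetryTimerSketch {

    static class RegisteringActor {
        private final ScheduledExecutorService scheduler;
        private final boolean cancelInPostStop;
        // Every timer ever scheduled, so leaked ones can be counted.
        private final List<ScheduledFuture<?>> scheduledTimers = new ArrayList<>();
        // Timer owned by the current incarnation of the actor.
        private ScheduledFuture<?> retryTimer;

        RegisteringActor(ScheduledExecutorService scheduler, boolean cancelInPostStop) {
            this.scheduler = scheduler;
            this.cancelInPostStop = cancelInPostStop;
        }

        void preStart() {
            registerWithMaster();
        }

        void registerWithMaster() {
            // Real code would resend the registration and eventually give up.
            // Here the task is a no-op with a long period so it never fires.
            retryTimer = scheduler.scheduleAtFixedRate(() -> { }, 1, 1, TimeUnit.HOURS);
            scheduledTimers.add(retryTimer);
        }

        void postStop() {
            // The proposed fix: cancel the timer when the actor stops.
            if (cancelInPostStop && retryTimer != null) {
                retryTimer.cancel(false);
                retryTimer = null;
            }
        }

        // Models a restart: stop hook of the old incarnation, then the
        // start hook of the new one.
        void restart() {
            postStop();
            preStart();
        }

        long activeTimers() {
            return scheduledTimers.stream().filter(t -> !t.isCancelled()).count();
        }
    }

    // Starts an actor, restarts it once, and reports how many timers remain.
    static long activeTimersAfterRestart(boolean cancelInPostStop) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            RegisteringActor actor = new RegisteringActor(scheduler, cancelInPostStop);
            actor.preStart();
            actor.restart();
            return actor.activeTimers();
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("without cancel: " + activeTimersAfterRestart(false)); // 2
        System.out.println("with cancel:    " + activeTimersAfterRestart(true));  // 1
    }
}
```

Without the cancel, the timer from the first incarnation is still scheduled after the restart, which corresponds to the earlier retryTimer that can reach its give-up action first.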



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1685) retryTimer not canceled on actor restart in Worker and AppClient

2014-04-30 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-1685:


Description: 
Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster 
when those Actors start.  The attempt at registration is accomplished by 
starting a retryTimer via the Akka scheduler that will use the registered 
timeout interval and retry number to make repeated attempts to register with 
all known Masters before giving up and either marking as dead or calling 
System.exit.

The receive methods of these actors can, however, throw exceptions, which will 
lead to the actor restarting, registerWithMaster being called again on restart, 
and another retryTimer being scheduled without canceling the already running 
retryTimer.  Assuming that the rest of the restart logic is correct for 
these actors (which I don't believe is actually a given), having multiple 
retryTimers running creates at least one condition in which the restarted 
actor may not be able to make the full number of retry attempts before an 
earlier retryTimer takes its give-up action.

Canceling the retryTimer in the actor's postStop hook should suffice. 

  was:
Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster 
when those Actors start.  The attempt at registration is accomplished by 
starting a retryTimer via the Akka scheduler that will use the registered 
timeout interval and retry number to make repeated attempts to register with 
all known Masters before giving up and either marking as dead or calling 
System.exit.

The receive methods of these actors can, however, throw exceptions, which will 
lead to the actor restarting, registerWithMaster being called again on restart, 
and another retryTimer being scheduled without canceling the already running 
retryTimer.  Assuming that all of the rest of the restart logic is correct for 
these actors (which I don't believe is actually a given), having multiple 
retryTimers running presents at least a condition in which the restarted actor 
will not be able to make the full number of retry attempts before an earlier 
retryTimer takes the give up action.

Canceling the retryTimer in the actor's postStop hook should suffice. 


 retryTimer not canceled on actor restart in Worker and AppClient
 

             Key: SPARK-1685
             URL: https://issues.apache.org/jira/browse/SPARK-1685
         Project: Spark
      Issue Type: Bug
      Components: Spark Core
Affects Versions: 0.9.0, 1.0.0, 0.9.1
        Reporter: Mark Hamstra
        Assignee: Mark Hamstra

 Both deploy.worker.Worker and deploy.client.AppClient try to 
 registerWithMaster when those Actors start.  The attempt at registration is 
 accomplished by starting a retryTimer via the Akka scheduler that will use 
 the registered timeout interval and retry number to make repeated attempts to 
 register with all known Masters before giving up and either marking as dead 
 or calling System.exit.

 The receive methods of these actors can, however, throw exceptions, which 
 will lead to the actor restarting, registerWithMaster being called again on 
 restart, and another retryTimer being scheduled without canceling the already 
 running retryTimer.  Assuming that the rest of the restart logic is correct 
 for these actors (which I don't believe is actually a given), having multiple 
 retryTimers running creates at least one condition in which the restarted 
 actor may not be able to make the full number of retry attempts before an 
 earlier retryTimer takes its give-up action.

 Canceling the retryTimer in the actor's postStop hook should suffice. 


