[jira] [Updated] (SPARK-1685) retryTimer not canceled on actor restart in Worker and AppClient
[ https://issues.apache.org/jira/browse/SPARK-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Hamstra updated SPARK-1685:

Description:

Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster when those Actors start. The attempt at registration is accomplished by starting a retryTimer via the Akka scheduler that will use the registered timeout interval and retry number to make repeated attempts to register with all known Masters before giving up and either marking as dead or calling System.exit.

The receive methods of these actors can, however, throw exceptions, which will lead to the actor restarting, registerWithMaster being called again on restart, and another retryTimer being scheduled without canceling the already running retryTimer.

Assuming that all of the rest of the restart logic is correct for these actors (which I don't believe is actually a given), having multiple retryTimers running presents at least a condition in which the restarted actor will not be able to make the full number of retry attempts before an earlier retryTimer takes the give up action. Canceling the retryTimer in the actor's postStop hook should suffice.

was:

Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster when those Actors start. The attempt at registration is accomplished by starting a retryTimer via the Akka scheduler that will use the registered timeout interval and retry number to make repeated attempts to register with all known Masters before giving up and either marking as dead or calling System.exit.

The receive methods of these actors can, however, throw exceptions, which will lead to the actors restarting, registerWithMaster being called again on restart, and another retryTimer being scheduled without canceling the already running retryTimer.

Assuming that all of the rest of the restart logic is correct for these actors (which I don't believe is actually a given), having multiple retryTimers running presents at least a condition in which the restarted actor will not be able to make the full number of retry attempts before an earlier retryTimer takes the give up action. Canceling the retryTimer in the actor's postStop hook should suffice.

retryTimer not canceled on actor restart in Worker and AppClient

Key: SPARK-1685
URL: https://issues.apache.org/jira/browse/SPARK-1685
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Mark Hamstra
Assignee: Mark Hamstra

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1685) retryTimer not canceled on actor restart in Worker and AppClient
[ https://issues.apache.org/jira/browse/SPARK-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Hamstra updated SPARK-1685:

Description:

Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster when those Actors start. The attempt at registration is accomplished by starting a retryTimer via the Akka scheduler that will use the registered timeout interval and retry number to make repeated attempts to register with all known Masters before giving up and either marking as dead or calling System.exit.

The receive methods of these actors can, however, throw exceptions, which will lead to the actor restarting, registerWithMaster being called again on restart, and another retryTimer being scheduled without canceling the already running retryTimer.

Assuming that all of the rest of the restart logic is correct for these actors (which I don't believe is actually a given), having multiple retryTimers running presents at least a condition in which the restarted actor may not be able to make the full number of retry attempts before an earlier retryTimer takes the give up action. Canceling the retryTimer in the actor's postStop hook should suffice.

was:

Both deploy.worker.Worker and deploy.client.AppClient try to registerWithMaster when those Actors start. The attempt at registration is accomplished by starting a retryTimer via the Akka scheduler that will use the registered timeout interval and retry number to make repeated attempts to register with all known Masters before giving up and either marking as dead or calling System.exit.

The receive methods of these actors can, however, throw exceptions, which will lead to the actor restarting, registerWithMaster being called again on restart, and another retryTimer being scheduled without canceling the already running retryTimer.

Assuming that all of the rest of the restart logic is correct for these actors (which I don't believe is actually a given), having multiple retryTimers running presents at least a condition in which the restarted actor will not be able to make the full number of retry attempts before an earlier retryTimer takes the give up action. Canceling the retryTimer in the actor's postStop hook should suffice.

retryTimer not canceled on actor restart in Worker and AppClient

Key: SPARK-1685
URL: https://issues.apache.org/jira/browse/SPARK-1685
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Mark Hamstra
Assignee: Mark Hamstra
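
The failure mode and the proposed postStop fix can be illustrated with a small model. This is only a sketch under stated assumptions, not Spark's code: it does not use Akka, a java.util.concurrent scheduler stands in for the Akka scheduler, and RegisteringActor, cancelOnStop, restart, leakedTimerAfterRestart, and RetryTimerDemo are hypothetical names invented for the example. The restart is modeled as "postStop on the old instance, then preStart on a fresh one", which matches Akka's default restart hooks (the default preRestart calls postStop, and the default postRestart calls preStart).

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, ScheduledFuture, TimeUnit}

// Hypothetical stand-in for Worker/AppClient; not Spark's actual code.
// The lifecycle method names mirror Akka's preStart/postStop hooks.
class RegisteringActor(scheduler: ScheduledExecutorService, cancelOnStop: Boolean) {
  // Handle to the scheduled retry task, if one is running.
  var retryTimer: Option[ScheduledFuture[_]] = None

  // Called on start and again after every restart.
  def preStart(): Unit = registerWithMaster()

  def registerWithMaster(): Unit = {
    val task = new Runnable {
      // A real actor would re-send its registration message here.
      def run(): Unit = ()
    }
    // Long period so the task never fires during the demo.
    retryTimer = Some(scheduler.scheduleAtFixedRate(task, 1, 1, TimeUnit.HOURS))
  }

  // Called when the actor stops, including as part of a restart.
  def postStop(): Unit =
    if (cancelOnStop) {
      retryTimer.foreach(_.cancel(false))
      retryTimer = None
    }
}

object RetryTimerDemo {
  // Models a restart: stop the failed instance, then start a fresh one.
  // Returns the old instance's timer handle so the caller can inspect it.
  def restart(old: RegisteringActor, scheduler: ScheduledExecutorService,
              cancelOnStop: Boolean): (Option[ScheduledFuture[_]], RegisteringActor) = {
    val oldTimer = old.retryTimer
    old.postStop()
    val fresh = new RegisteringActor(scheduler, cancelOnStop)
    fresh.preStart()
    (oldTimer, fresh)
  }

  // True if the first instance's timer is still live after one restart.
  def leakedTimerAfterRestart(cancelOnStop: Boolean): Boolean = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    try {
      val first = new RegisteringActor(scheduler, cancelOnStop)
      first.preStart()
      val (oldTimer, _) = restart(first, scheduler, cancelOnStop)
      oldTimer.exists(t => !t.isCancelled)
    } finally {
      scheduler.shutdownNow()
    }
  }

  def main(args: Array[String]): Unit = {
    println(s"without cancel: leaked = ${leakedTimerAfterRestart(cancelOnStop = false)}")
    println(s"with cancel:    leaked = ${leakedTimerAfterRestart(cancelOnStop = true)}")
  }
}
```

In this model the first timer survives the restart when postStop does nothing, so two timers would be counting toward the give up action at once; with the cancel in postStop, only the restarted instance's timer remains.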