[ 
https://issues.apache.org/jira/browse/SLIDER-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837522#comment-15837522
 ] 

Gour Saha commented on SLIDER-1189:
-----------------------------------

[~billie.rinaldi] this is a very good find. This fix will significantly improve 
the stability of Slider applications.

Overall the patch looks good. A few comments:

1. The method resetAMData should be renamed to something more descriptive, 
such as readAMDataFromRegistry.

2. In the registerWithServer method, do you think it makes sense to introduce a 
registerRetryCount variable (similar to heartBeatRetryCount), so that the agent 
tries to connect to the AM a few times with the newly read values rather than 
reading from the registry on every exception? That would minimize the number of 
registry hits if the AM is down for a really long time.
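
To make point 2 concrete, here is a minimal sketch in Python of what I have in 
mind. It is not taken from the patch or from the agent code: apart from the 
registerRetryCount idea itself, every name here (register_with_server, 
read_am_data, connect, RegistrationError) is hypothetical, and the registry and 
the AM are replaced by plain callables.

```python
class RegistrationError(Exception):
    """Raised by the (hypothetical) connect callable when the AM is unreachable."""


def register_with_server(read_am_data, connect, register_retry_count=3,
                         max_registry_reads=None):
    """Register with the AM, re-reading the registry only after
    register_retry_count consecutive failures against the same AM data.

    read_am_data: callable returning the AM address (one registry hit per call).
    connect: callable taking the address; raises RegistrationError on failure.
    Returns a tuple (response, number_of_registry_reads).
    """
    registry_reads = 0
    while max_registry_reads is None or registry_reads < max_registry_reads:
        am_address = read_am_data()  # the only place the registry is hit
        registry_reads += 1
        for _ in range(register_retry_count):
            try:
                return connect(am_address), registry_reads
            except RegistrationError:
                # Retry with the same AM data; a real agent would sleep here.
                continue
    raise RegistrationError("gave up after %d registry reads" % registry_reads)


# Simulated scenario: the registry holds stale AM data for the first two reads.
addresses = iter(["stale-am:1", "stale-am:1", "new-am:2"])
attempts = []


def fake_read_am_data():
    return next(addresses)


def fake_connect(address):
    attempts.append(address)
    if address.startswith("stale"):
        raise RegistrationError(address)
    return "registered with " + address


response, reads = register_with_server(fake_read_am_data, fake_connect,
                                       register_retry_count=3)
```

In this simulated run the agent makes 7 connection attempts but hits the 
registry only 3 times. Re-reading on every exception would have cost one 
registry read per failed attempt.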

> Agent never connects to new AM
> ------------------------------
>
>                 Key: SLIDER-1189
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1189
>             Project: Slider
>          Issue Type: Bug
>          Components: agent
>            Reporter: Billie Rinaldi
>            Assignee: Billie Rinaldi
>            Priority: Critical
>             Fix For: Slider 1.0.0
>
>         Attachments: SLIDER-1189.1.patch
>
>
> In testing RM and AM failure scenarios, I killed my RM, killed the AM, waited 
> for a bit, then restarted the RM. The AM is restarted, but running agents 
> never connect to the new AM. The AM data is re-read from the ZK registry once 
> if the heartbeat retry threshold is reached, at which point the agent tries 
> re-registering with the AM. However, if the AM data is stale at that point, 
> it never re-reads the data from the ZK registry, and retries registering with 
> the nonexistent AM forever (until it is timed out due to heartbeat loss and 
> killed by the new AM).
> Note this happens when AM restart is delayed more than about a minute, which 
> can occur if the RM is down or the RM is up but busy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
