[ https://issues.apache.org/jira/browse/SLIDER-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gour Saha updated SLIDER-1189:
------------------------------
    Fix Version/s:     (was: Slider 1.0.0)
                   Slider 0.92

> Agent never connects to new AM if AM restart takes too long
> -----------------------------------------------------------
>
>                 Key: SLIDER-1189
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1189
>             Project: Slider
>          Issue Type: Bug
>          Components: agent
>            Reporter: Billie Rinaldi
>            Assignee: Billie Rinaldi
>            Priority: Critical
>             Fix For: Slider 0.92
>
>         Attachments: SLIDER-1189.1.patch, SLIDER-1189.2.patch,
> SLIDER-1189.3.patch
>
>
> In testing RM and AM failure scenarios, I killed my RM, killed the AM, waited
> for a bit, then restarted the RM. The AM is restarted, but running agents
> never connect to the new AM. The AM data is re-read from the ZK registry once
> if the heartbeat retry threshold is reached, at which point the agent tries
> re-registering with the AM. However, if the AM data is stale at that point,
> it never re-reads the data from the ZK registry, and retries registering with
> the nonexistent AM forever (until it is timed out due to heartbeat loss and
> killed by the new AM).
> Note this happens when AM restart is delayed more than about a minute, which
> can occur if the RM is down or the RM is up but busy.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
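
For illustration only, the failure mode described in the issue (registration retried forever against a stale AM address) can be avoided by looking the address up again before every registration attempt. The sketch below is not taken from the attached patches and does not use the real Slider agent code: `register_with_am`, `lookup_am_address`, and `try_register` are hypothetical names, and the registry is simulated with an in-memory iterator rather than ZooKeeper.

```python
def register_with_am(lookup_am_address, try_register, max_attempts=10,
                     sleep=lambda seconds: None, delay_seconds=0):
    """Retry registration, refreshing the AM address before every attempt.

    lookup_am_address: hypothetical callable returning the AM address
        currently published in the registry.
    try_register: hypothetical callable(address) -> bool, True on success.
    """
    for attempt in range(1, max_attempts + 1):
        # Re-read the registry on every attempt so a stale address
        # is never retried indefinitely.
        address = lookup_am_address()
        if try_register(address):
            return address, attempt
        sleep(delay_seconds)
    raise RuntimeError(
        "could not register with any AM after %d attempts" % max_attempts)


# Simulated scenario: the registry still returns the old AM for the first
# two lookups, then the restarted AM publishes its new address.
registry_answers = iter(["old-am:1024", "old-am:1024", "new-am:2048"])
live_am = "new-am:2048"

address, attempts = register_with_am(
    lookup_am_address=lambda: next(registry_answers),
    try_register=lambda addr: addr == live_am,
)
print(address, attempts)
```

In this simulation the agent succeeds on the third attempt, once the registry reflects the restarted AM. An agent that cached the first lookup would keep contacting `old-am:1024` until it ran out of attempts.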