[ https://issues.apache.org/jira/browse/SLIDER-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
kyungwan nam updated SLIDER-1221:
---------------------------------
    Attachment: SLIDER-1221.patch

I'm attaching the patch.

> the way to cope against SliderAM split brain
> --------------------------------------------
>
>                 Key: SLIDER-1221
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1221
>             Project: Slider
>          Issue Type: Bug
>            Reporter: kyungwan nam
>         Attachments: SLIDER-1221.patch
>
>
> I have run into a problem that looks like a “SliderAM split brain”.
> Normally, when the AM fails, the RM launches a new one.
> But this can also happen without the AM failing, for example when there is a
> networking issue between the AM and the RM, because the RM launches a new AM
> if it receives no heartbeat from the AM for some time
> (yarn.am.liveness-monitor.expiry-interval-ms).
> In that case, the previous AM and the new AM can coexist, and the containers
> keep their connection to the previous AM.
> This can cause many problems.
> The new AM does not know about the containers launched by the previous AM.
> As a result, those containers could all be killed at once after some time.
> The slider-agent should register with the new SliderAM as soon as possible.
> I think this could be improved as follows.
> - The SliderAM records the time at which each heartbeat response arrives
>   from the RM.
> - The SliderAM sends a “stale SliderAM” message to the slider-agent if there
>   has been no AM-RM heartbeat for some time (“stale.slider.am.interval”).
> - When the slider-agent receives “stale SliderAM”, it should try to discover
>   the new SliderAM and, if it finds one, register with it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
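
For illustration only, the first two steps proposed in the issue description (recording the time of the last RM heartbeat response, and reporting "stale SliderAM" once that time is too old) might be sketched as follows. This is not taken from SLIDER-1221.patch. The class name StaleAmDetector, its methods, and the status strings are hypothetical, and the threshold passed to the constructor stands in for the proposed “stale.slider.am.interval” setting.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch, not the code from the attached patch.
 * Remembers when the RM last answered an AM heartbeat and decides
 * whether this AM should consider itself stale.
 */
final class StaleAmDetector {
    // Assumed to come from a setting such as "stale.slider.am.interval".
    private final long staleIntervalMs;
    // Injectable clock so the logic can be exercised without sleeping.
    private final LongSupplier clock;
    // Time of the most recent heartbeat response from the RM.
    private volatile long lastRmResponseMs;

    StaleAmDetector(long staleIntervalMs, LongSupplier clock) {
        this.staleIntervalMs = staleIntervalMs;
        this.clock = clock;
        this.lastRmResponseMs = clock.getAsLong();
    }

    /** Call whenever a heartbeat response arrives from the RM. */
    void onRmHeartbeatResponse() {
        lastRmResponseMs = clock.getAsLong();
    }

    /** True if no RM response has been seen for longer than the interval. */
    boolean isStale() {
        return clock.getAsLong() - lastRmResponseMs > staleIntervalMs;
    }

    /** Status the AM would attach to its reply to a slider-agent heartbeat. */
    String agentHeartbeatStatus() {
        return isStale() ? "STALE_SLIDER_AM" : "OK";
    }
}
```

In a real AM the clock would be something like System::currentTimeMillis, and an agent that receives the stale status would start looking for the new SliderAM. How the agent discovers it is left open here, as it is in the issue description.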