[ 
https://issues.apache.org/jira/browse/SLIDER-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated SLIDER-1221:
---------------------------------
    Attachment: SLIDER-1221.patch

I'm attaching the patch.

> the way to cope against SliderAM split brain
> --------------------------------------------
>
>                 Key: SLIDER-1221
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1221
>             Project: Slider
>          Issue Type: Bug
>            Reporter: kyungwan nam
>         Attachments: SLIDER-1221.patch
>
>
> I have run into a problem that looks like a “Slider-AM split brain”.
> Normally, when the AM fails, the RM launches a new one.
> However, this can also happen without the AM failing, for example when there
> is a networking issue between the AM and the RM, because the RM launches a
> new AM if it receives no heartbeat from the current AM for some time
> (yarn.am.liveness-monitor.expiry-interval-ms).
> In that case, the previous AM and the new AM can coexist, and the containers
> keep their connection to the previous AM.
> This can cause many problems.
> The new AM does not know about the containers launched by the previous AM.
> As a result, those containers could all be killed at the same time after a while.
> The slider-agent should register with the new SliderAM as soon as possible.
> I think this could be improved as follows.
> - SliderAM records the time at which a heartbeat response arrives from the RM.
> - SliderAM sends a “stale SliderAM” message to the slider-agent if there has
> been no AM-RM heartbeat for some time (“stale.slider.am.interval”).
> - When the slider-agent receives “stale SliderAM”, it should try to discover
> the new SliderAM and, if one is found, register with it.
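
The three steps proposed above can be sketched roughly as follows. This is only a minimal illustration of the idea, not the code in the attached patch: the class AmRmHeartbeatTracker, the "staleSliderAM" response field, and the discover_am callback are all hypothetical names chosen for this example, and the clock is injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable, Optional


class AmRmHeartbeatTracker:
    """Hypothetical AM-side helper: remembers when the RM last answered."""

    def __init__(self, stale_interval_ms: int,
                 clock: Callable[[], float] = time.monotonic):
        # stale_interval_ms plays the role of "stale.slider.am.interval".
        self._stale_interval_s = stale_interval_ms / 1000.0
        self._clock = clock
        self._last_response = clock()

    def on_rm_heartbeat_response(self) -> None:
        # Step 1: record the time at which an RM heartbeat response arrived.
        self._last_response = self._clock()

    def is_stale(self) -> bool:
        # The AM considers itself stale once the RM has been silent too long.
        return self._clock() - self._last_response > self._stale_interval_s


def build_agent_heartbeat_response(tracker: AmRmHeartbeatTracker) -> dict:
    # Step 2: tell the agent whether this AM is stale.
    # The field name "staleSliderAM" is an assumption made for this sketch.
    return {"staleSliderAM": tracker.is_stale()}


def handle_am_response(response: dict, current_am: str,
                       discover_am: Callable[[], Optional[str]]) -> str:
    # Step 3 (agent side): on a stale notice, look for a newer AM and switch
    # to it if one is found; otherwise stay with the current AM.
    if not response.get("staleSliderAM"):
        return current_am
    candidate = discover_am()
    if candidate is not None and candidate != current_am:
        return candidate
    return current_am
```

In this sketch the agent keeps talking to its current AM when discovery finds nothing, so a transient AM-RM network problem alone does not cause it to drop the connection.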



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
