kyungwan nam created SLIDER-1221:
------------------------------------

             Summary: the way to cope against SliderAM split brain
                 Key: SLIDER-1221
                 URL: https://issues.apache.org/jira/browse/SLIDER-1221
             Project: Slider
          Issue Type: Bug
            Reporter: kyungwan nam


I have run into a problem that looks like a “Slider-AM split brain”.
Normally, when the AM fails, the RM launches a new one.
However, this can also happen without the AM failing, for example when there is a
networking issue between the AM and the RM.
This is because the RM launches a new AM if it receives no heartbeat from the AM for
some time (yarn.am.liveness-monitor.expiry-interval-ms).
In that case, the previous AM and the new AM can coexist, and the containers stay
connected to the previous AM.
This can cause a lot of problems.
The new AM does not know about the containers launched by the previous AM.
As a result, those containers could all be killed at the same time after a while.
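
To make the scenario above concrete, here is a minimal sketch of the expiry logic as I understand it. It is a simplified model and not the RM's actual code: the names AmAttempt, expire_and_relaunch and running_ams are made up for illustration, and it assumes the old AM process is not stopped when the RM gives up on it.

```python
import dataclasses


@dataclasses.dataclass
class AmAttempt:
    """Hypothetical record of one AM attempt as seen by the RM."""
    attempt_id: int
    last_heartbeat_ms: int
    # Whether the AM process is really running, independent of what the RM believes.
    process_alive: bool = True


def expire_and_relaunch(attempts, now_ms, expiry_interval_ms):
    """Simplified RM-side check.

    If the newest attempt has not sent a heartbeat within expiry_interval_ms,
    start a new attempt. The old attempt is NOT stopped here (assumption:
    the RM cannot reach it), which is what produces the split brain.
    """
    current = attempts[-1]
    if now_ms - current.last_heartbeat_ms > expiry_interval_ms:
        attempts.append(AmAttempt(current.attempt_id + 1, now_ms))
    return attempts


def running_ams(attempts):
    """Ids of all AM processes that are still running."""
    return [a.attempt_id for a in attempts if a.process_alive]
```

With a network partition, the first attempt misses its heartbeats but keeps running, so after the expiry interval both attempts show up in running_ams().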

The slider-agent should register with the new SliderAM as soon as possible.
I think this could be improved as follows.

- The SliderAM records the time at which each heartbeat response arrives from the RM.
- The SliderAM sends a “stale SliderAM” message to the slider-agent if there has been no
AM-RM heartbeat response for some time (“stale.slider.am.interval”).
- When the slider-agent receives “stale SliderAM”, it should try to
discover the new SliderAM and, if one is found, register with it.
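
The three steps above could be sketched roughly as follows. This is only an illustration of the idea, not existing Slider code: the classes SliderAmSketch and AgentSketch, their methods, and the discover_am callback are all hypothetical, and the stale interval is passed in directly instead of being read from “stale.slider.am.interval”.

```python
import time

STALE = "stale SliderAM"
OK = "ok"


class SliderAmSketch:
    """Hypothetical AM side: remembers when the RM last answered a heartbeat."""

    def __init__(self, stale_interval_s, clock=time.monotonic):
        self._stale_interval_s = stale_interval_s
        self._clock = clock
        self._last_rm_response = clock()

    def on_rm_heartbeat_response(self):
        # Step 1: record the arrival time of every RM heartbeat response.
        self._last_rm_response = self._clock()

    def is_stale(self):
        return self._clock() - self._last_rm_response > self._stale_interval_s

    def handle_agent_heartbeat(self):
        # Step 2: tell the agent when this AM has lost contact with the RM.
        return STALE if self.is_stale() else OK


class AgentSketch:
    """Hypothetical agent side: switches to a new AM when told the old one is stale."""

    def __init__(self, am, discover_am):
        self.am = am
        # discover_am is an assumed callback that returns the currently
        # published AM, or None if no AM can be found.
        self._discover_am = discover_am

    def heartbeat(self):
        status = self.am.handle_agent_heartbeat()
        if status == STALE:
            # Step 3: look for a new AM and register with it if one exists.
            candidate = self._discover_am()
            if candidate is not None and candidate is not self.am:
                self.am = candidate
        return status
```

If no new AM can be discovered yet, the agent in this sketch simply stays with the old one and tries again on its next heartbeat.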




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
