Hey Tengfei,

Aurora's health checks cannot differentiate a service instance that has
deadlocked from one that is merely extremely slow. The decision to restart is
then made by the executor, without central coordination by the scheduler. Your
best course of action is therefore to prevent the overload in the first place,
for example via load shedding and graceful degradation. You can find further
details in the Google SRE Book [1].
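
To illustrate the load-shedding part, here is a minimal sketch in plain Python
(stdlib only, nothing Aurora-specific); the port and the MAX_IN_FLIGHT value
are made up and would have to be tuned per service:

    import threading
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    # Assumed per-instance capacity; keep it below the point where the
    # instance becomes too slow to answer its health checks.
    MAX_IN_FLIGHT = 64
    _slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Shed load: fail fast with a 503 instead of queueing until the
            # caller (or the health check) times out.
            if not _slots.acquire(blocking=False):
                self.send_response(503)
                self.send_header("Retry-After", "1")
                self.end_headers()
                return
            try:
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"ok\n")  # real request handling goes here
            finally:
                _slots.release()

    if __name__ == "__main__":
        ThreadingHTTPServer(("", 8080), Handler).serve_forever()

Rejected requests are cheap, so the instance stays responsive and keeps
passing its health check even while callers retry.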

Specifically, you will want to do tight(er) health checking in your load
balancers, so that instances drop out of rotation before they hit their
capacity limit. In addition, I have had good experience protecting instances
with a rate-limiting HAProxy/Nginx that runs as a sidecar within the Aurora
tasks.
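
To make the sidecar layout concrete, here is a rough .aurora sketch. All
names, ports and resource numbers are placeholders, haproxy.cfg.tmpl is an
assumed file shipped with the package, and the actual rate limiting (e.g.
maxconn) lives in that HAProxy config:

    # Sketch only: run HAProxy next to the service inside the same Aurora task.
    app = Process(
      name = 'my_service',
      cmdline = './my_service --port 8080')  # service listens on a local port

    proxy = Process(
      name = 'haproxy',
      # Render the Aurora-assigned port into the config, then run HAProxy in
      # the foreground; haproxy.cfg.tmpl caps connections and forwards to
      # 127.0.0.1:8080.
      cmdline = 'sed "s/__PORT__/{{thermos.ports[http]}}/" haproxy.cfg.tmpl '
                '> haproxy.cfg && haproxy -db -f haproxy.cfg')

    task = Task(
      name = 'my_service',
      processes = [app, proxy],
      resources = Resources(cpu = 1.0, ram = 512 * MB, disk = 256 * MB))

    job = Job(
      cluster = 'my_cluster',
      role = 'www-data',
      environment = 'prod',
      name = 'my_service',
      task = task,
      instances = 20)

    jobs = [job]

The load balancer then only talks to the HAProxy port, so excess traffic is
rejected at the sidecar before it can push the service past the point where
it stops answering health checks.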

I hope this gets you started.

Best regards,
Stephan

[1] 
https://landing.google.com/sre/book/chapters/addressing-cascading-failures.html


On 18.06.18, 21:45, "Tengfei Mu" <tengfei...@gmail.com> wrote:

    Hi,
    
    We have had a few incidents where a service under an unexpected
    traffic/load spike starts to respond slowly or fail its health checks,
    which then caused massive instance rescheduling in Aurora. This can turn
    into a vicious cycle: instances being rescheduled (and restarted) put more
    load on the remaining instances, until more and more instances are
    hammered down. Can anyone share best practices/lessons for preventing such
    outages caused by dynamic rescheduling in a production cluster?
    
    
    Best,
    Tengfei
    
