Hey Tengfei,

the Aurora health checks cannot differentiate a service instance that has deadlocked from one that is merely very slow. The decision to restart is then made by the executor, without central coordination by the scheduler. Your best course of action is therefore to prevent the overload in the first place, for example via load shedding and graceful degradation. You can find further details in the Google SRE Book [1].
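To make the load-shedding part concrete, here is a rough sketch in plain Go (nothing Aurora-specific): a handler wrapper that fails fast once too many requests are in flight. The limit of 100 and the port are placeholders you would replace with numbers derived from load-testing your service.

// Minimal load-shedding sketch: reject excess requests immediately
// instead of letting them queue up and slow everything else down.
package main

import (
    "net/http"
    "sync/atomic"
)

const maxInFlight = 100 // placeholder capacity, derive from load tests

var inFlight int64

func shed(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if atomic.AddInt64(&inFlight, 1) > maxInFlight {
            atomic.AddInt64(&inFlight, -1)
            // Fail fast with a cheap 503 so latency stays bounded for everyone else.
            http.Error(w, "overloaded", http.StatusServiceUnavailable)
            return
        }
        defer atomic.AddInt64(&inFlight, -1)
        next.ServeHTTP(w, r)
    })
}

func main() {
    work := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok")) // stand-in for the real request handling
    })
    http.Handle("/", shed(work))
    http.ListenAndServe(":8080", nil)
}

The rejected requests are cheap to serve, so a spike degrades into fast 503s instead of into timeouts, failed health checks, and rescheduling.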
Specifically, you will want to do tight(er) health checking in your load balancers, so that instances drop out of rotation before they hit their capacity limit. In addition, I have had good experience with also protecting each instance with a rate-limiting HAProxy/Nginx that runs as a side-car within the Aurora task (a rough sketch of that combination follows at the very end of this mail, below the quoted message).

I hope this gets you started.

Best regards,
Stephan

[1] https://landing.google.com/sre/book/chapters/addressing-cascading-failures.html

On 18.06.18, 21:45, "Tengfei Mu" <tengfei...@gmail.com> wrote:

Hi,

We have had a few incidents where a service under an unexpected traffic/load spike starts to respond slowly or fail its health checks, which then causes massive instance rescheduling in Aurora. This can turn into a vicious cycle: instances being rescheduled (restarted) put more load on the remaining instances, and more and more instances get hammered down. Can anyone share best practices/lessons for preventing such outages caused by dynamic rescheduling in a production cluster?

Best,
Tengfei
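P.S.: Here is a rough stand-in, again in plain Go, for what the limiting side-car plus tighter health check would give you. In practice you would use HAProxy or Nginx for the proxying part; the ports, the cap of 50 concurrent requests, and the 80% health threshold are all made-up numbers to illustrate the idea.

// Stand-in for the limiting side-car: a tiny reverse proxy in front of
// the real service that caps concurrent requests and reports itself
// unhealthy before it is saturated.
package main

import (
    "net/http"
    "net/http/httputil"
    "net/url"
)

func main() {
    // The real service listens only on localhost inside the task (assumed port).
    backend, _ := url.Parse("http://127.0.0.1:9090")
    proxy := httputil.NewSingleHostReverseProxy(backend)

    // Semaphore capping concurrent requests, similar in spirit to HAProxy's maxconn.
    sem := make(chan struct{}, 50)

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        select {
        case sem <- struct{}{}:
            defer func() { <-sem }()
            proxy.ServeHTTP(w, r)
        default:
            // Over the cap: shed the request instead of piling it onto the backend.
            http.Error(w, "overloaded", http.StatusServiceUnavailable)
        }
    })

    // Health endpoint for the load balancer: report unhealthy once we are
    // close to the cap, so the instance drops out of rotation *before* it
    // hits its capacity limit (the 80% threshold is a guess; tune it).
    http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        if len(sem) >= cap(sem)*8/10 {
            http.Error(w, "near capacity", http.StatusServiceUnavailable)
            return
        }
        w.Write([]byte("ok"))
    })

    http.ListenAndServe(":8080", nil)
}

The point of combining the two is that the load balancer stops sending new traffic slightly before the instance starts shedding, so a spike is absorbed by the still-healthy part of the fleet instead of cascading into health-check failures and rescheduling.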