Hi,

We have had a few incidents where a service came under an unexpected traffic/load spike, its containers started responding slowly or failing health checks, and that triggered massive instance rescheduling in Aurora. This can become a vicious cycle: instances that are being rescheduled (and restarted) push more load onto the remaining instances, so more and more instances get knocked down. Can anyone share best practices or lessons learned for preventing this kind of outage caused by dynamic rescheduling in a production cluster?

Best,
Tengfei