DavidMcLaughlin opened a new issue #30: Count number of times partitioned tasks reenter the cluster as healthy
URL: https://github.com/apache/aurora/issues/30

Currently, when a task becomes PARTITIONED and is then marked LOST, Aurora schedules a replacement. Later, the old task can report that it is healthy, at which point Aurora kills it. Receiving this signal is a strong indicator that extending timeouts would avoid unnecessary churn in the cluster.

Add a metric that tracks how often this happens.
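The counting rule described above can be sketched roughly as follows. This is only an illustration of the bookkeeping, not Aurora's implementation: the class `PartitionReentryCounter`, its `observe` method, and the use of `RUNNING` to stand in for "the task reported itself healthy" are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical status names, loosely modelled on the states named in the issue.
PARTITIONED = "PARTITIONED"
LOST = "LOST"
RUNNING = "RUNNING"  # assumption: stands in for "the task reported itself healthy"


@dataclass
class PartitionReentryCounter:
    """Counts healthy reports from tasks already declared LOST after a partition."""

    reentries: int = 0
    _partitioned: set = field(default_factory=set)
    _lost_after_partition: set = field(default_factory=set)

    def observe(self, task_id: str, status: str) -> None:
        """Feed one status update for one task into the counter."""
        if status == PARTITIONED:
            self._partitioned.add(task_id)
        elif status == LOST:
            # Only a LOST that follows PARTITIONED is relevant here.
            if task_id in self._partitioned:
                self._partitioned.discard(task_id)
                self._lost_after_partition.add(task_id)
        elif status == RUNNING:
            if task_id in self._lost_after_partition:
                # The task came back after a replacement was already scheduled.
                self._lost_after_partition.discard(task_id)
                self.reentries += 1
            # A task that recovers before being marked LOST is not counted.
            self._partitioned.discard(task_id)
```

A task that recovers while still PARTITIONED is deliberately left out of the count, since in that case no replacement was scheduled and no churn occurred. A high `reentries` value relative to the number of partition events would be the hint that timeouts are too short.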
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services