[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567175#comment-15567175 ]
David McLaughlin commented on AURORA-1791: ------------------------------------------ This was the proposal in Maxim's design doc: {quote} The new approach has an inherent risk of an update getting “stuck” due to an instance persistently failing health checks (e.g. due to deadlock during application startup or failure to bind to health port). To mitigate this, the initial_interval_secs will be repurposed to serve as a grace interval for failing health checks. *Specifically, any health check failures will be ignored during the grace interval*. There are 2 possible cases given the above: Instance responds OK before initial_interval_secs expires → task moves into RUNNING. Instance fails to respond OK and the initial_interval_secs expires → task moves into FAILED. {quote} Source: https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit# (under Proposed Flow Overview) > Commit ca683 is not backwards compatible. > ----------------------------------------- > > Key: AURORA-1791 > URL: https://issues.apache.org/jira/browse/AURORA-1791 > Project: Aurora > Issue Type: Bug > Reporter: Zameer Manji > Assignee: Kai Huang > Priority: Blocker > > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] > is not backwards compatible. The last section of the commit > {quote} > 4. Modified the Health Checker and redefined the meaning > initial_interval_secs. > {quote} > has serious, unintended consequences. > Consider the following health check config: > {noformat} > initial_interval_secs: 10 > interval_secs: 5 > max_consecutive_failures: 1 > {noformat} > On the 0.16.0 executor, no health checking will occur for the first 10 > seconds. Here the earliest a task can cause failure is at the 10th second. > On master, health checking starts right away which means the task can fail at > the first second since {{max_consecutive_failures}} is set to 1. > This is not backwards compatible and needs to be fixed. > I think a good solution would be to revert the meaning change to > initial_interval_secs and have the task transition into RUNNING when > {{max_consecutive_successes}} is met. > An investigation shows {{initial_interval_secs}} was set to 5 but the task > failed health checks right away: > {noformat} > D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. > Performing health check. > D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures > counter. > D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired. > W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum > consecutive successes. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)