[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zameer Manji updated AURORA-1791: --------------------------------- Description: The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] is not backwards compatible. The last section of the commit {quote} 4. Modified the Health Checker and redefined the meaning initial_interval_secs. {quote} has serious, unintended consequences. Consider the following health check config: {noformat} initial_interval_secs: 10 interval_secs: 5 max_consecutive_failures: 1 {noformat} On the 0.16.0 executor, no health checking will occur for the first 10 seconds. Here the earliest a task can cause failure is at the 10th second. On master, health checking starts right away which means the task can fail at the first second since {{max_consecutive_failures}} is set to 1. This is not backwards compatible and needs to be fixed. I think a good solution would be to revert the meaning change to initial_interval_secs and have the task transition into RUNNING when {{max_consecutive_successes}} is met. An investigation shows {{initial_interval_secs}} was set to 5 but the task failed health checks right away: {noformat} D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. Performing health check. D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures counter. D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired. W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum consecutive successes. {noformat} was: The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] is not backwards compatible. The last section of the commit {quote} 4. Modified the Health Checker and redefined the meaning initial_interval_secs. {quote} has serious, unintended consequences. Consider the following health check config: {noformat} initial_interval_secs: 10 interval_secs: 5 max_consecutive_failures: 1 {noformat} On the 0.16.0 executor, no health checking will occur for the first 10 seconds. Here the earliest a task can cause failure is at the 10th second. On master, health checking starts right away which means the task can fail at the first second since {{max_consecutive_failures}} is set to 1. This is not backwards compatible and needs to be fixed. I think a good solution would be to revert the meaning change to initial_interval_secs and have the task transition into RUNNING when {{max_consecutive_successes}} is met. > Commit ca683 is not backwards compatible. > ----------------------------------------- > > Key: AURORA-1791 > URL: https://issues.apache.org/jira/browse/AURORA-1791 > Project: Aurora > Issue Type: Bug > Reporter: Zameer Manji > Assignee: Kai Huang > Priority: Blocker > > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] > is not backwards compatible. The last section of the commit > {quote} > 4. Modified the Health Checker and redefined the meaning > initial_interval_secs. > {quote} > has serious, unintended consequences. > Consider the following health check config: > {noformat} > initial_interval_secs: 10 > interval_secs: 5 > max_consecutive_failures: 1 > {noformat} > On the 0.16.0 executor, no health checking will occur for the first 10 > seconds. Here the earliest a task can cause failure is at the 10th second. > On master, health checking starts right away which means the task can fail at > the first second since {{max_consecutive_failures}} is set to 1. > This is not backwards compatible and needs to be fixed. > I think a good solution would be to revert the meaning change to > initial_interval_secs and have the task transition into RUNNING when > {{max_consecutive_successes}} is met. > An investigation shows {{initial_interval_secs}} was set to 5 but the task > failed health checks right away: > {noformat} > D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. > Performing health check. > D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures > counter. > D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired. > W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum > consecutive successes. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)