[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567768#comment-15567768
 ] 

Kai Huang commented on AURORA-1791:
-----------------------------------

To sum up, the issue is caused by failing to reach min_consecutive_successes, 
not by exceeding max_consecutive_failures. 

In commit ca683, I keep updating the failure counter but ignore it until 
initial_interval_secs expires. This does not cause any problems, but the 
behavior is not obvious to readers. I've changed it to update the failure 
counter only after initial_interval_secs expires.
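
A minimal sketch of the changed counter behavior, not the actual 
health_checker.py code. The class name FailureTracker, the elapsed_secs 
argument, and the rule that a task becomes unhealthy once failures exceed 
max_consecutive_failures are assumptions made for illustration:

```python
class FailureTracker:
    """Hypothetical illustration; not the real Aurora health checker."""

    def __init__(self, initial_interval_secs, max_consecutive_failures):
        self.initial_interval_secs = initial_interval_secs
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record(self, elapsed_secs, check_passed):
        """Record one check result; return True while the task counts as healthy."""
        if check_passed:
            # Any success resets the consecutive failure counter.
            self.consecutive_failures = 0
            return True
        if elapsed_secs < self.initial_interval_secs:
            # Still inside the initial interval: the failure is not counted.
            return True
        self.consecutive_failures += 1
        # Assumption: unhealthy once failures exceed the tolerated maximum.
        return self.consecutive_failures <= self.max_consecutive_failures
```

With initial_interval_secs=10 and max_consecutive_failures=1, a failure at 
second 1 leaves the counter at 0, and only failures from second 10 onward 
are counted.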

For the root cause of the issue, min_consecutive_successes, we have two options 
here:

(a) Do health checks periodically as defined. Even if initial_interval_secs 
expires and min_consecutive_successes has not been reached (because periodic 
checks can miss some successes), we do not fail the health check right away. 
Instead, we rely on the latest health check to confirm that the task is 
already in a healthy state. 

(b) Do an additional health check whenever initial_interval_secs expires.

In my recent review request, I implemented (a). This is based on the assumption 
that if a task responds OK before initial_interval_secs expires, it will still 
respond OK on the next health check. However, the task may fail to respond OK 
until we perform this additional health check. The instance will very likely 
be healthy afterwards, but should we fail the health check according to the 
definition?
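
One possible reading of options (a) and (b), sketched as the decision made 
at the moment initial_interval_secs expires. The function names, their 
arguments, and the choice to let the extra check in (b) count toward the 
success streak are assumptions, not the code in the review request:

```python
def healthy_at_expiry_option_a(successes, min_consecutive_successes,
                               last_check_passed):
    """(a) No extra probe: fall back to the most recent periodic result."""
    if successes >= min_consecutive_successes:
        return True
    # Minimum not reached: do not fail right away, trust the latest check.
    return last_check_passed


def healthy_at_expiry_option_b(successes, min_consecutive_successes,
                               run_check):
    """(b) Run one additional check at expiry (run_check is a callable)."""
    if successes >= min_consecutive_successes:
        return True
    # Assumption: the extra check extends or resets the success streak.
    successes = successes + 1 if run_check() else 0
    return successes >= min_consecutive_successes
```

Under these assumptions, (a) accepts a task whose last periodic check 
passed even if the minimum was never reached, while (b) still requires the 
minimum to be met after the extra check.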

> Commit ca683 is not backwards compatible.
> -----------------------------------------
>
>                 Key: AURORA-1791
>                 URL: https://issues.apache.org/jira/browse/AURORA-1791
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Zameer Manji
>            Assignee: Kai Huang
>            Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>       initial_interval_secs: 10
>       interval_secs: 5
>       max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> {{initial_interval_secs}} and have the task transition into RUNNING when 
> {{min_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
