[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567886#comment-15567886
 ] 

David McLaughlin commented on AURORA-1791:
------------------------------------------

I don't think so. There are easy-to-fix errors in this code without having to 
move to any event-driven approach. 

The log lines that Zameer posted are triggered during a run in which the health 
checker is reporting healthy. The main logic error is that we take a 
*pessimistic approach* to checking interval expiration. 

Specifically, this block:

{code}
    if not self._expired:
      if self.clock.time() - self.start_time > self.initial_interval:
        log.debug('Initial interval expired.')
        self._expired = True
        if not self.health_check_passed:
          log.warning('Failed to reach minimum consecutive successes.')
          self.healthy = False
      else:
        if self.current_consecutive_successes >= self.min_consecutive_successes:
          log.info('Reached minimum consecutive successes.')
          self.health_check_passed = True
{code}

We could be in a situation where current_consecutive_successes meets the 
minimum criteria but we decide to expire if we're even a millisecond over the 
interval.  You could rewrite this as:

{code}
    if not self._expired:
        if self.current_consecutive_successes >= self.min_consecutive_successes:
          log.info('Reached minimum consecutive successes.')
          self.health_check_passed = True
       
        if self.clock.time() - self.start_time > self.initial_interval:
          log.debug('Initial interval expired.')
          self._expired = True
          if not self.health_check_passed:
            log.warning('Failed to reach minimum consecutive successes.')
            self.healthy = False
{code}

And I think as long as the current healthiness meets the minimum consecutive 
successes, the task would enter RUNNING state. 

> Commit ca683 is not backwards compatible.
> -----------------------------------------
>
>                 Key: AURORA-1791
>                 URL: https://issues.apache.org/jira/browse/AURORA-1791
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Zameer Manji
>            Assignee: Kai Huang
>            Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>       initial_interval_secs: 10
>       interval_secs: 5
>       max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to