[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569618#comment-15569618 ] Kai Huang commented on AURORA-1791: --- The ticket to track is: https://issues.apache.org/jira/browse/AURORA-1793 > Commit ca683 is not backwards compatible. > - > > Key: AURORA-1791 > URL: https://issues.apache.org/jira/browse/AURORA-1791 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Kai Huang >Priority: Blocker > > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] > is not backwards compatible. The last section of the commit > {quote} > 4. Modified the Health Checker and redefined the meaning > initial_interval_secs. > {quote} > has serious, unintended consequences. > Consider the following health check config: > {noformat} > initial_interval_secs: 10 > interval_secs: 5 > max_consecutive_failures: 1 > {noformat} > On the 0.16.0 executor, no health checking will occur for the first 10 > seconds. Here the earliest a task can cause failure is at the 10th second. > On master, health checking starts right away which means the task can fail at > the first second since {{max_consecutive_failures}} is set to 1. > This is not backwards compatible and needs to be fixed. > I think a good solution would be to revert the meaning change to > initial_interval_secs and have the task transition into RUNNING when > {{max_consecutive_successes}} is met. > An investigation shows {{initial_interval_secs}} was set to 5 but the task > failed health checks right away: > {noformat} > D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. > Performing health check. > D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures > counter. > D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired. > W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum > consecutive successes. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569600#comment-15569600 ] Kai Huang commented on AURORA-1791: --- We've decided to revert the commit. The changes that directly causes problems is: Modify executor state transition logic to rely on health checks (if enabled). commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 There are two downstream commits that depends on the above commit: Add min_consecutive_health_checks in HealthCheckConfig commit ed72b1bf662d1e29d2bb483b317c787630c26a9e Add support for receiving min_consecutive_successes in health checker commit e91130e49445c3933b6e27f5fde18c3a0e61b87a We will drop all three of these commits and revert back to one commit before the problematic commit: Running task ssh without an instance should pick a random instance commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c I will create a separate ticket for people to track the reversion. > Commit ca683 is not backwards compatible. > - > > Key: AURORA-1791 > URL: https://issues.apache.org/jira/browse/AURORA-1791 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Kai Huang >Priority: Blocker > > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] > is not backwards compatible. The last section of the commit > {quote} > 4. Modified the Health Checker and redefined the meaning > initial_interval_secs. > {quote} > has serious, unintended consequences. > Consider the following health check config: > {noformat} > initial_interval_secs: 10 > interval_secs: 5 > max_consecutive_failures: 1 > {noformat} > On the 0.16.0 executor, no health checking will occur for the first 10 > seconds. Here the earliest a task can cause failure is at the 10th second. > On master, health checking starts right away which means the task can fail at > the first second since {{max_consecutive_failures}} is set to 1. > This is not backwards compatible and needs to be fixed. > I think a good solution would be to revert the meaning change to > initial_interval_secs and have the task transition into RUNNING when > {{max_consecutive_successes}} is met. > An investigation shows {{initial_interval_secs}} was set to 5 but the task > failed health checks right away: > {noformat} > D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. > Performing health check. > D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures > counter. > D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired. > W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum > consecutive successes. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567954#comment-15567954 ] David McLaughlin commented on AURORA-1791: -- Given the lack of test coverage I've found just looking at a single function, I would seriously recommend we roll back the commit (or will it be commits?) rather than rush a patch in order to fix master. Any objections? cc/ [~zmanji] and [~jcohen] > Commit ca683 is not backwards compatible. > - > > Key: AURORA-1791 > URL: https://issues.apache.org/jira/browse/AURORA-1791 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Kai Huang >Priority: Blocker > > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] > is not backwards compatible. The last section of the commit > {quote} > 4. Modified the Health Checker and redefined the meaning > initial_interval_secs. > {quote} > has serious, unintended consequences. > Consider the following health check config: > {noformat} > initial_interval_secs: 10 > interval_secs: 5 > max_consecutive_failures: 1 > {noformat} > On the 0.16.0 executor, no health checking will occur for the first 10 > seconds. Here the earliest a task can cause failure is at the 10th second. > On master, health checking starts right away which means the task can fail at > the first second since {{max_consecutive_failures}} is set to 1. > This is not backwards compatible and needs to be fixed. > I think a good solution would be to revert the meaning change to > initial_interval_secs and have the task transition into RUNNING when > {{max_consecutive_successes}} is met. > An investigation shows {{initial_interval_secs}} was set to 5 but the task > failed health checks right away: > {noformat} > D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. > Performing health check. > D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures > counter. > D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired. > W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum > consecutive successes. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567886#comment-15567886 ] David McLaughlin commented on AURORA-1791: -- I don't think so. There are easy-to-fix errors in this code without having to move to any event-driven approach. The log lines that Zameer posted are triggered during a run in which the health checker is reporting healthy. The main logic error is that we take a *pessimistic approach* to checking interval expiration. Specifically, this block: {code} if not self._expired: if self.clock.time() - self.start_time > self.initial_interval: log.debug('Initial interval expired.') self._expired = True if not self.health_check_passed: log.warning('Failed to reach minimum consecutive successes.') self.healthy = False else: if self.current_consecutive_successes >= self.min_consecutive_successes: log.info('Reached minimum consecutive successes.') self.health_check_passed = True {code} We could be in a situation where current_consecutive_successes meets the minimum criteria but we decide to expire if we're even a millisecond over the interval. You could rewrite this as: {code} if not self._expired: if self.current_consecutive_successes >= self.min_consecutive_successes: log.info('Reached minimum consecutive successes.') self.health_check_passed = True if self.clock.time() - self.start_time > self.initial_interval: log.debug('Initial interval expired.') self._expired = True if not self.health_check_passed: log.warning('Failed to reach minimum consecutive successes.') self.healthy = False {code} And I think as long as the current healthiness meets the minimum consecutive successes, the task would enter RUNNING state. > Commit ca683 is not backwards compatible. > - > > Key: AURORA-1791 > URL: https://issues.apache.org/jira/browse/AURORA-1791 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Kai Huang >Priority: Blocker > > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] > is not backwards compatible. The last section of the commit > {quote} > 4. Modified the Health Checker and redefined the meaning > initial_interval_secs. > {quote} > has serious, unintended consequences. > Consider the following health check config: > {noformat} > initial_interval_secs: 10 > interval_secs: 5 > max_consecutive_failures: 1 > {noformat} > On the 0.16.0 executor, no health checking will occur for the first 10 > seconds. Here the earliest a task can cause failure is at the 10th second. > On master, health checking starts right away which means the task can fail at > the first second since {{max_consecutive_failures}} is set to 1. > This is not backwards compatible and needs to be fixed. > I think a good solution would be to revert the meaning change to > initial_interval_secs and have the task transition into RUNNING when > {{max_consecutive_successes}} is met. > An investigation shows {{initial_interval_secs}} was set to 5 but the task > failed health checks right away: > {noformat} > D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. > Performing health check. > D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures > counter. > D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired. > W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum > consecutive successes. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567803#comment-15567803 ] Kai Huang commented on AURORA-1791: --- An issue to implement (b) is that the health checker thread might be sleeping while initial_interval_secs expires. We will need a event-driven mechanism to notify the health checker to wake up and do a health check when initial_interval_secs expires. This seems requires a lot of refactoring. > Commit ca683 is not backwards compatible. > - > > Key: AURORA-1791 > URL: https://issues.apache.org/jira/browse/AURORA-1791 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Kai Huang >Priority: Blocker > > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] > is not backwards compatible. The last section of the commit > {quote} > 4. Modified the Health Checker and redefined the meaning > initial_interval_secs. > {quote} > has serious, unintended consequences. > Consider the following health check config: > {noformat} > initial_interval_secs: 10 > interval_secs: 5 > max_consecutive_failures: 1 > {noformat} > On the 0.16.0 executor, no health checking will occur for the first 10 > seconds. Here the earliest a task can cause failure is at the 10th second. > On master, health checking starts right away which means the task can fail at > the first second since {{max_consecutive_failures}} is set to 1. > This is not backwards compatible and needs to be fixed. > I think a good solution would be to revert the meaning change to > initial_interval_secs and have the task transition into RUNNING when > {{max_consecutive_successes}} is met. > An investigation shows {{initial_interval_secs}} was set to 5 but the task > failed health checks right away: > {noformat} > D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. > Performing health check. > D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures > counter. > D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired. > W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum > consecutive successes. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566749#comment-15566749 ] Stephan Erb commented on AURORA-1791: - Another alternative would be to only ever increment the {{current_consecutive_failures}} counter when the {{initial_interval_sec}} has expired. > Commit ca683 is not backwards compatible. > - > > Key: AURORA-1791 > URL: https://issues.apache.org/jira/browse/AURORA-1791 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Kai Huang >Priority: Blocker > > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] > is not backwards compatible. The last section of the commit > {quote} > 4. Modified the Health Checker and redefined the meaning > initial_interval_secs. > {quote} > has serious, unintended consequences. > Consider the following health check config: > {noformat} > initial_interval_secs: 10 > interval_secs: 5 > max_consecutive_failures: 1 > {noformat} > On the 0.16.0 executor, no health checking will occur for the first 10 > seconds. Here the earliest a task can cause failure is at the 10th second. > On master, health checking starts right away which means the task can fail at > the first second since {{max_consecutive_failures}} is set to 1. > This is not backwards compatible and needs to be fixed. > I think a good solution would be to revert the meaning change to > initial_interval_secs and have the task transition into RUNNING when > {{max_consecutive_successes}} is met. > An investigation shows {{initial_interval_secs}} was set to 5 but the task > failed health checks right away: > {noformat} > D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. > Performing health check. > D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures > counter. > D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired. > W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum > consecutive successes. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566721#comment-15566721 ] Kai Huang commented on AURORA-1791: --- In your case, it's likely your task becomes health at the 6th second. However, within the 10 secs initial_interval_secs, we only performed one health check at the 5th second. So by definition, we "failed" to detect at least one health check within the given period, so we moved the task state to FAILED. We can enforce a health check at the end of the initial_interval_secs. > Commit ca683 is not backwards compatible. > - > > Key: AURORA-1791 > URL: https://issues.apache.org/jira/browse/AURORA-1791 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Kai Huang >Priority: Blocker > > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] > is not backwards compatible. The last section of the commit > {quote} > 4. Modified the Health Checker and redefined the meaning > initial_interval_secs. > {quote} > has serious, unintended consequences. > Consider the following health check config: > {noformat} > initial_interval_secs: 10 > interval_secs: 5 > max_consecutive_failures: 1 > {noformat} > On the 0.16.0 executor, no health checking will occur for the first 10 > seconds. Here the earliest a task can cause failure is at the 10th second. > On master, health checking starts right away which means the task can fail at > the first second since {{max_consecutive_failures}} is set to 1. > This is not backwards compatible and needs to be fixed. > I think a good solution would be to revert the meaning change to > initial_interval_secs and have the task transition into RUNNING when > {{max_consecutive_successes}} is met. > An investigation shows {{initial_interval_secs}} was set to 5 but the task > failed health checks right away: > {noformat} > D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. > Performing health check. > D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures > counter. > D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired. > W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum > consecutive successes. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566711#comment-15566711 ] Kai Huang commented on AURORA-1791: --- Thanks for pointing it out. The current implementation is: if we failed to see(even if it exist) enough successful health checks within the initial_interval_secs, we will move the task state to failed. This behavior is not back-ward compatible and not safe. I'll revised the design. > Commit ca683 is not backwards compatible. > - > > Key: AURORA-1791 > URL: https://issues.apache.org/jira/browse/AURORA-1791 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Kai Huang >Priority: Blocker > > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] > is not backwards compatible. The last section of the commit > {quote} > 4. Modified the Health Checker and redefined the meaning > initial_interval_secs. > {quote} > has serious, unintended consequences. > Consider the following health check config: > {noformat} > initial_interval_secs: 10 > interval_secs: 5 > max_consecutive_failures: 1 > {noformat} > On the 0.16.0 executor, no health checking will occur for the first 10 > seconds. Here the earliest a task can cause failure is at the 10th second. > On master, health checking starts right away which means the task can fail at > the first second since {{max_consecutive_failures}} is set to 1. > This is not backwards compatible and needs to be fixed. > I think a good solution would be to revert the meaning change to > initial_interval_secs and have the task transition into RUNNING when > {{max_consecutive_successes}} is met. > An investigation shows {{initial_interval_secs}} was set to 5 but the task > failed health checks right away: > {noformat} > D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. > Performing health check. > D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures > counter. > D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired. > W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum > consecutive successes. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1557#comment-1557 ] Zameer Manji commented on AURORA-1791: -- Note, I could be wrong here but this was deployed to a cluster and tasks that were healthy before started to fail. > Commit ca683 is not backwards compatible. > - > > Key: AURORA-1791 > URL: https://issues.apache.org/jira/browse/AURORA-1791 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Kai Huang >Priority: Blocker > > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] > is not backwards compatible. The last section of the commit > {quote} > 4. Modified the Health Checker and redefined the meaning > initial_interval_secs. > {quote} > has serious, unintended consequences. > Consider the following health check config: > {noformat} > initial_interval_secs: 10 > interval_secs: 5 > max_consecutive_failures: 1 > {noformat} > On the 0.16.0 executor, no health checking will occur for the first 10 > seconds. Here the earliest a task can cause failure is at the 10th second. > On master, health checking starts right away which means the task can fail at > the first second since {{max_consecutive_failures}} is set to 1. > This is not backwards compatible and needs to be fixed. > I think a good solution would be to revert the meaning change to > initial_interval_secs and have the task transition into RUNNING when > {{max_consecutive_successes}} is met. -- This message was sent by Atlassian JIRA (v6.3.4#6332)