-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52766/
-----------------------------------------------------------

Review request for Aurora, Joshua Cohen and Zameer Manji.


Bugs: AURORA-1791
    https://issues.apache.org/jira/browse/AURORA-1791


Repository: aurora


Description
-------

Fix a bug in commit ca683cb. The commit is related to this review 
https://reviews.apache.org/r/51876/. Please see it for more details and 
backgrounds.

Currently, health checks are performed during a grace period called 
initial_interval_secs. It is likely that HealthChecker fails to see sufficient 
number of successes before the intitial_interval_secs expires. For example, for 
a task with HealthCheckConfig(initital_interval_secs=15, interval_secs=10, 
min_consecutive_successes=1). If the task sleeps during the first 12 seconds 
and becomes healthy afterwards, the health checker will report the task status 
as "TASK_FAILED" and miss the "healthy" status between second 12-15. This is 
because only one health check is performed at second 10 before the 
initial_interval_secs expires. This is an implementation flaw that breaks 
backward-compatability. 

To address this problem, I rewrite the function that is responsible for 
updating the failure counts and the healthy status. The expected behavior is 
that for the task described above, the health checker will performs a health 
check after the initial_interval_secs expires and sets the health check status 
to be healthy. Please see this review for more details.

Will add some more tests since the current e2e tests does not include the above 
test case.


Diffs
-----

  src/main/python/apache/aurora/executor/common/health_checker.py 
1e0be108b49480d57c5ab94b1d2903bb57bae20a 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
28769dca68a6353fc1283a8bb279fae05173aaac 

Diff: https://reviews.apache.org/r/52766/diff/


Testing
-------

./build-support/jenkins/build.sh

./pants test.pytest src/test/python/apache/aurora/executor::

./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh

Modified the test in http_example.py. Let the http server sleep for the first 
10 seconds.

Launch a job that contains the task with Default 
HealthCheckConfig(initial_interval_secs=15, interval_secs=10, 
min_consecutive_successes=1) in vagrant aurora cluster. The task transitions to 
TASK_RUNNING state after ~20 seconds.


File Attachments
----------------

Task with default Health Check Config
  
https://reviews.apache.org/media/uploaded/files/2016/10/12/64cf6610-9294-46cb-b159-6e5721da5fff__Screen_Shot_2016-10-11_at_6.17.00_PM.png


Thanks,

Kai Huang

Reply via email to