Re: Review Request 52766: Fix a bug in insufficient successes during initial_interval_secs

Kai Huang Tue, 11 Oct 2016 22:43:01 -0700


> On Oct. 12, 2016, 1:49 a.m., David McLaughlin wrote:
> > src/test/python/apache/aurora/executor/common/test_health_checker.py, line 
> > 500
> > <https://reviews.apache.org/r/52766/diff/1/?file=1531886#file1531886line500>
> >
> >     Maybe I'm misunderstanding something - but this test suite seems to 
> > lack any test for the value of hc.health_check_passed? This is what 
> > ultimately signals that the task should move into RUNNING:
> >     
> >     
> > https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/health_checker.py#L222


Thanks for the reminder. Added test coverage for this value. 
initial_interval_secs can expire either at (a) before TASK_RUNNING or at (b) 
after TASK_RUNNING.

In zameer's job, the status returned in case (a) is "not health_check_passed & 
not healthy".
One additional health check is required at the boundary (c) to further ensure 
the task passed health check.


We do not detect this because test_task_health_ok() in test_thermos_executor 
only covers case (b). 
https://github.com/apache/aurora/blob/master/src/test/python/apache/aurora/executor/test_thermos_executor.py#L458


                                   achieve min successes
                                     
                                   health_check_passed & healthy                
          
    TASK_STARTING ---(a)----(c)-------------------------------------------> 
TASK_RUNNING---(b)-----------------> TASK_FINISHED
                            |                                                   
                    |   
                            |                                                   
                    | reach max failure limit
                            |fail to achieve min successes                      
                    |
                            |                                                   
                    | health_check_passed & not healthy
                            |not health_check_passed & not healthy              
                   \|/  
                            |                                                   
               TASK_FAILED
                           \|/
                       TASK_FAILED


- Kai


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52766/#review152252
-----------------------------------------------------------


On Oct. 12, 2016, 5:01 a.m., Kai Huang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52766/
> -----------------------------------------------------------
> 
> (Updated Oct. 12, 2016, 5:01 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Zameer Manji.
> 
> 
> Bugs: AURORA-1791
>     https://issues.apache.org/jira/browse/AURORA-1791
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Fix a bug in commit ca683cb. The commit is related to this review 
> https://reviews.apache.org/r/51876/. Please see it for more details and 
> backgrounds.
> 
> Currently, health checks are performed during a grace period called 
> initial_interval_secs. It is likely that HealthChecker fails to see 
> sufficient number of successes before the intitial_interval_secs expires. For 
> example, for a task with HealthCheckConfig(initital_interval_secs=15, 
> interval_secs=10, min_consecutive_successes=1). If the task sleeps during the 
> first 12 seconds and becomes healthy afterwards, the health checker will 
> report the task status as "TASK_FAILED" and miss the "healthy" status between 
> second 12-15. This is because only one health check is performed at second 10 
> before the initial_interval_secs expires. This is an implementation flaw that 
> breaks backward-compatability. 
> 
> To address this problem, I rewrite the function that is responsible for 
> updating the failure counts and the healthy status. The expected behavior is 
> that for the task described above, the health checker will performs a health 
> check after the initial_interval_secs expires and sets the health check 
> status to be healthy. Please see this review for more details.
> 
> Will add some more tests since the current e2e tests does not include the 
> above test case.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 1e0be108b49480d57c5ab94b1d2903bb57bae20a 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> 28769dca68a6353fc1283a8bb279fae05173aaac 
> 
> Diff: https://reviews.apache.org/r/52766/diff/
> 
> 
> Testing
> -------
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> Modified the test in http_example.py. Let the http server sleep for the first 
> 10 seconds.
> 
> Launch a job that contains the task with Default 
> HealthCheckConfig(initial_interval_secs=15, interval_secs=10, 
> min_consecutive_successes=1) in vagrant aurora cluster. The task transitions 
> to TASK_RUNNING state after ~20 seconds.
> 
> 
> File Attachments
> ----------------
> 
> Task with default Health Check Config
>   
> https://reviews.apache.org/media/uploaded/files/2016/10/12/64cf6610-9294-46cb-b159-6e5721da5fff__Screen_Shot_2016-10-11_at_6.17.00_PM.png
> 
> 
> Thanks,
> 
> Kai Huang
> 
>

Re: Review Request 52766: Fix a bug in insufficient successes during initial_interval_secs

Reply via email to