> On April 14, 2017, 2:26 p.m., Santhosh Kumar Shanmugham wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py
> > Lines 163-166 (patched)
> > <https://reviews.apache.org/r/58462/diff/1/?file=1692816#file1692816line163>
> >
> >     This will cause a task to get stuck in `STARTING` since `self.running` 
> > will never be set to `True`.
> >     
> >     Can you explain the particular usecase here? Also add a test case to 
> > exercise this branch.
> 
> Vladimir Khalatyan wrote:
>     The idea is to make HealthCheck process to start after some of the setup 
> processes are finished. With the current approach it's possible to addjust 
> the "starting" point of the HealthCheck process by changing 
> initial_interval_secs. But it means that we rely on the timing which doesn't 
> guarantee anything.  
>     The idea of HealthCheck "snoozing" is ignore any status of the 
> healthcheck unless some process tells HealthCheck to start checking the 
> health of the service.
>     
>     Example (simplified one):
>      Let's assume we start two processes on the machine: the LB registration 
> and the UWSGI process. Let's say the uwsgi process requires some time to warm 
> up. The LB registration depends on the load on LB, how soon uwsgi warms up, 
> etc. So the actual moment when the application becomes available can vary 
> from couple of seconds to minutes and we can not rely on 
> initial_interval_secs. So we create a .healthchecksnooze file and ignore all 
> results of the healthcheck unless this file is there. In a meanwhile the LB 
> registration process will try to register service some number of times ( < 
> max_failures) and delete the .healthchecksnooze after it succeeds. Since this 
> particular moment the healthcheck will start incrementing the concecutive 
> successes or failures and we can determine whether the deployment is 
> successfull or not. 
>      So with this approach we can specify the "starting" point of health 
> checking more accurately and dependent on other processes. 
>      
>      Here by "starting" point of the health check I mean the checking of the 
> application health and changing the consecutive successes or failures, not 
> the actual system process.
> 
> Santhosh Kumar Shanmugham wrote:
>     > "So the actual moment when the application becomes available can vary 
> from couple of seconds to minutes and we can not rely on 
> initial_interval_secs."
>     
>     The current implementation addresses this problem of 
> `initial_interval_secs` not responding faster with varying startup times. It 
> achieves this by performing `health checks` during the startup time 
> (`initial_interval_secs`) but ignores all failures during this period, 
> however successful health checks now count towards transitioning the task to 
> a healthy (RUNNING) state. Thereby it can accomodate both slow startup as 
> well as fast startup without making the faster startup instances from waiting 
> until the entire `initial_interval_secs` has expired.
>     
>     However for your change in particular, you might also need to account for 
> `_should_enforce_deadline` - which will treat a task as unhealthy if it runs 
> out of attempts.

I was just looking at the docs and your usecase of health-check snoozing is 
vastly different from the usecase the documentation - 

> You can pause health checking by touching a file inside of your sandbox, 
> named .healthchecksnooze. As long as that file is present, health checks will 
> be disabled, enabling users to gather core dumps or other performance 
> measurements without worrying about Aurora’s health check killing their 
> process.

Using health check snoozing to achieve synchronization of health checks, is 
indicative of how inflexible the health-checking mechanism is in reality. :(


- Santhosh Kumar


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58462/#review172032
-----------------------------------------------------------


On April 14, 2017, 1:35 p.m., Vladimir Khalatyan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58462/
> -----------------------------------------------------------
> 
> (Updated April 14, 2017, 1:35 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Fix bug. Do not increase current_consecutive_successes if .healthchecksnooze 
> present
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> e9e4129af2db5202a82e9f6d54109a00bbae97ce 
> 
> 
> Diff: https://reviews.apache.org/r/58462/diff/1/
> 
> 
> Testing
> -------
> 
> The Health Check is succeeding when the .healthchecksnooze is present. But it 
> should just snooze which means there shouldn't be any increase in consecutive 
> successes or consecutive failures.
> 
> 
> Thanks,
> 
> Vladimir Khalatyan
> 
>

Reply via email to