Re: Review Request 58462: Fix bug. Do not increase current_consecutive_successes if .healthchecksnooze present

Santhosh Kumar Shanmugham Wed, 19 Apr 2017 01:36:05 -0700


> On April 14, 2017, 2:26 p.m., Santhosh Kumar Shanmugham wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py
> > Lines 163-166 (patched)
> > <https://reviews.apache.org/r/58462/diff/1/?file=1692816#file1692816line163>
> >
> >     This will cause a task to get stuck in `STARTING` since `self.running` 
> > will never be set to `True`.
> >     
> >     Can you explain the particular usecase here? Also add a test case to 
> > exercise this branch.
> 
> Vladimir Khalatyan wrote:
>     The idea is to make HealthCheck process to start after some of the setup 
> processes are finished. With the current approach it's possible to addjust 
> the "starting" point of the HealthCheck process by changing 
> initial_interval_secs. But it means that we rely on the timing which doesn't 
> guarantee anything.  
>     The idea of HealthCheck "snoozing" is ignore any status of the 
> healthcheck unless some process tells HealthCheck to start checking the 
> health of the service.
>     
>     Example (simplified one):
>      Let's assume we start two processes on the machine: the LB registration 
> and the UWSGI process. Let's say the uwsgi process requires some time to warm 
> up. The LB registration depends on the load on LB, how soon uwsgi warms up, 
> etc. So the actual moment when the application becomes available can vary 
> from couple of seconds to minutes and we can not rely on 
> initial_interval_secs. So we create a .healthchecksnooze file and ignore all 
> results of the healthcheck unless this file is there. In a meanwhile the LB 
> registration process will try to register service some number of times ( < 
> max_failures) and delete the .healthchecksnooze after it succeeds. Since this 
> particular moment the healthcheck will start incrementing the concecutive 
> successes or failures and we can determine whether the deployment is 
> successfull or not. 
>      So with this approach we can specify the "starting" point of health 
> checking more accurately and dependent on other processes. 
>      
>      Here by "starting" point of the health check I mean the checking of the 
> application health and changing the consecutive successes or failures, not 
> the actual system process.
> 
> Santhosh Kumar Shanmugham wrote:
>     > "So the actual moment when the application becomes available can vary 
> from couple of seconds to minutes and we can not rely on 
> initial_interval_secs."
>     
>     The current implementation addresses this problem of 
> `initial_interval_secs` not responding faster with varying startup times. It 
> achieves this by performing `health checks` during the startup time 
> (`initial_interval_secs`) but ignores all failures during this period, 
> however successful health checks now count towards transitioning the task to 
> a healthy (RUNNING) state. Thereby it can accomodate both slow startup as 
> well as fast startup without making the faster startup instances from waiting 
> until the entire `initial_interval_secs` has expired.
>     
>     However for your change in particular, you might also need to account for 
> `_should_enforce_deadline` - which will treat a task as unhealthy if it runs 
> out of attempts.
> 
> Santhosh Kumar Shanmugham wrote:
>     I was just looking at the docs and your usecase of health-check snoozing 
> is vastly different from the usecase the documentation - 
>     
>     > You can pause health checking by touching a file inside of your 
> sandbox, named .healthchecksnooze. As long as that file is present, health 
> checks will be disabled, enabling users to gather core dumps or other 
> performance measurements without worrying about Aurora’s health check killing 
> their process.
>     
>     Using health check snoozing to achieve synchronization of health checks, 
> is indicative of how inflexible the health-checking mechanism is in reality. 
> :(
> 
> Stephan Erb wrote:
>     Personally speaking, I would not want to run without a sane 
> `initial_interval_secs` (e.g. 5-10 minutes). In distributed systems, things 
> tend to stall from time to time. Not having a timeout to unstuck a deployment 
> seems very problematic to me.
>     
>     There is a recent discussion on the mailing list discovering a related 
> issue: 
> https://lists.apache.org/thread.html/a0b01ba8928dc637cff223d2ff74a8d00436850954d61150fc709ca4@%3Cuser.aurora.apache.org%3E
> 
> Vladimir Khalatyan wrote:
>     Santhosh, so what can you suggest in my case? Let's assume the LB 
> registration process depends on the latency, readiness of the service, etc. 
> Similar dependencied for uwsgi process. What is the initial_interval_secs 
> should I set? Or how can I determine what is the interval supposed to be? 1 
> minute? 10 minutes? 30 minutes? I can only guess. But if there are any delays 
> on the LB register or uwsgi warm up, everything could screw up. So it's not 
> flexible anymore. 
>     With the approach of disabling the health check process, it's flexible. 
> Ideally I don't want to health check my application unless it's ready to run. 
> The readiness could be checked on my side, inside my processes. And somehow I 
> should be able to tell health check that my application is ready for 
> checking. The ".healthchecksnooze" file could be a great candidate for that.
> 
> Vladimir Khalatyan wrote:
>     I am not saying that there is no control of the timeout. 
>     Let's simplify everything. We have process A (some setup process), 
> process B (another setup process) and process C (the main daemon process of 
> the application).
>     We would like to execute these processes in an order of A->B->C.
>     
>     Unfortunately currently there is no control of when to start the 
> HealthCheck (HC) unless trying to predict the initial_interval_secs (which 
> basically means don't accept failures unless all setups are finished and my 
> main process is started) and few other parameters. For example in case I add 
> one more process D between A and B, I would need to recalculate the 
> initial_interval_secs in order to have the system work properly. 
>     
>     Ideally you would like to start A, then B, then C and start the HC after 
> it. So the .healthchecksnooze file would be an identifier of when to start 
> the HC (unless it's possible to have a dependency for the HC, which would be 
> awesome and simplify everything). What it means is that "Don't run the HC if 
> this file exists". So after all processes A,B and C are executed properly, 
> one of those processes (for example C) can delete the .healthchecksnooze file 
> and that basically means that we can start a HC. We still can have 
> initial_interval_secs if we want but probably min_consecutive_successes and 
> max_consecutive_failures are enough. With this file we can shift the starting 
> point of healthchecking to whenever we want depending on our configurations 
> instead of having it running right at the beginning of the Task.
>     
>     And still we have the controlling configurations for each of these 
> processes. You can specify what is timeout for A, how many max failures 
> acceptable for B, etc. So if something "stuck" at some point and reach the 
> failure requirements the deployment would fail.


Vladimir, I am fine with the change. Personally I would prefer a first-class 
support for notifying when to start HCs. For instance via a HTTP endpoint 
(/startHealthCheck).

Having said that my comment still stands which will be an issue for the 
implementation - In addition to skipping the increments of 
`min_consecutive_successes` we also need to need to skip increments to the 
`attempts`


- Santhosh Kumar


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58462/#review172032
-----------------------------------------------------------


On April 14, 2017, 1:35 p.m., Vladimir Khalatyan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58462/
> -----------------------------------------------------------
> 
> (Updated April 14, 2017, 1:35 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Fix bug. Do not increase current_consecutive_successes if .healthchecksnooze 
> present
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> e9e4129af2db5202a82e9f6d54109a00bbae97ce 
> 
> 
> Diff: https://reviews.apache.org/r/58462/diff/1/
> 
> 
> Testing
> -------
> 
> The Health Check is succeeding when the .healthchecksnooze is present. But it 
> should just snooze which means there shouldn't be any increase in consecutive 
> successes or consecutive failures.
> 
> 
> Thanks,
> 
> Vladimir Khalatyan
> 
>

Re: Review Request 58462: Fix bug. Do not increase current_consecutive_successes if .healthchecksnooze present

Reply via email to