> On April 14, 2017, 2:26 p.m., Santhosh Kumar Shanmugham wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py
> > Lines 163-166 (patched)
> > <https://reviews.apache.org/r/58462/diff/1/?file=1692816#file1692816line163>
> >
> >     This will cause a task to get stuck in `STARTING` since `self.running`
> >     will never be set to `True`.
> >
> >     Can you explain the particular use case here? Also add a test case to
> >     exercise this branch.
>
> Vladimir Khalatyan wrote:
>     The idea is to make the HealthCheck process start only after some of the
>     setup processes have finished. With the current approach it's possible to
>     adjust the "starting" point of the HealthCheck process by changing
>     initial_interval_secs, but that means we rely on timing, which doesn't
>     guarantee anything.
>     The idea of HealthCheck "snoozing" is to ignore any status of the
>     healthcheck until some process tells HealthCheck to start checking the
>     health of the service.
>
>     Example (simplified):
>     Let's assume we start two processes on the machine: the LB registration
>     and the UWSGI process. Let's say the uwsgi process requires some time to
>     warm up. The LB registration depends on the load on the LB, how soon
>     uwsgi warms up, etc. So the actual moment when the application becomes
>     available can vary from couple of seconds to minutes and we can not rely
>     on initial_interval_secs. So we create a .healthchecksnooze file and
>     ignore all results of the healthcheck as long as this file is there. In
>     the meantime, the LB registration process will try to register the
>     service some number of times (< max_failures) and delete the
>     .healthchecksnooze file after it succeeds. From that moment on, the
>     healthcheck will start incrementing the consecutive successes or
>     failures, and we can determine whether the deployment is successful or
>     not.
>     So with this approach we can specify the "starting" point of health
>     checking more accurately and make it depend on other processes.
>
>     Here, by the "starting" point of the health check I mean the point at
>     which the application's health starts being checked and the consecutive
>     successes or failures start changing, not the start of the actual system
>     process.
>
> Santhosh Kumar Shanmugham wrote:
>     > "So the actual moment when the application becomes available can vary
>     > from couple of seconds to minutes and we can not rely on
>     > initial_interval_secs."
>
>     The current implementation addresses this problem of
>     `initial_interval_secs` not adapting to varying startup times. It does
>     so by performing health checks during the startup period
>     (`initial_interval_secs`) but ignoring all failures during this period,
>     while successful health checks now count towards transitioning the task
>     to a healthy (RUNNING) state. It can thereby accommodate both slow and
>     fast startup without making faster-starting instances wait until the
>     entire `initial_interval_secs` has expired.
>
>     For your change in particular, however, you might also need to account
>     for `_should_enforce_deadline`, which will treat a task as unhealthy if
>     it runs out of attempts.
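
To make the counting rules in this thread concrete, here is a minimal sketch of them as a toy model. It is not the code in `health_checker.py`: the class `HealthState`, its fields, and the `record` method are hypothetical names, and `forgiving_attempts` stands in for the number of checks that fit inside `initial_interval_secs`. It assumes that a snoozed check changes no counter at all, which is the behaviour the patch proposes.

```python
from dataclasses import dataclass


@dataclass
class HealthState:
    """Toy model of consecutive success/failure counting (hypothetical)."""

    min_consecutive_successes: int
    max_consecutive_failures: int
    forgiving_attempts: int  # checks that fall inside the startup period
    attempts: int = 0
    consecutive_successes: int = 0
    consecutive_failures: int = 0
    running: bool = False  # True once the task would move to RUNNING
    healthy: bool = True

    def record(self, check_passed: bool, snoozed: bool = False) -> None:
        if snoozed:
            # Assumed snooze semantics: the result is dropped entirely,
            # so neither counter (nor the attempt count) changes.
            return
        self.attempts += 1
        if check_passed:
            # Successes always count, even during the startup period.
            self.consecutive_successes += 1
            self.consecutive_failures = 0
            if self.consecutive_successes >= self.min_consecutive_successes:
                self.running = True
        else:
            self.consecutive_successes = 0
            # Failures are ignored until the startup period is over.
            if self.attempts > self.forgiving_attempts:
                self.consecutive_failures += 1
                if self.consecutive_failures > self.max_consecutive_failures:
                    self.healthy = False


state = HealthState(min_consecutive_successes=2,
                    max_consecutive_failures=0,
                    forgiving_attempts=3)
state.record(False)                # inside the startup period: ignored
state.record(True, snoozed=True)   # snoozed: nothing changes
state.record(True)
state.record(True)                 # second real success: task is RUNNING
```

In this sketch a snoozed check does not consume an attempt either. Whether it should, given that the deadline logic counts attempts, is exactly the open question about `_should_enforce_deadline` raised above.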

I was just looking at the docs, and your use case for health-check snoozing is
vastly different from the use case in the documentation:

> You can pause health checking by touching a file inside of your sandbox,
> named .healthchecksnooze. As long as that file is present, health checks will
> be disabled, enabling users to gather core dumps or other performance
> measurements without worrying about Aurora’s health check killing their
> process.

Using health check snoozing to synchronize health checks is indicative of how
inflexible the health-checking mechanism is in reality. :(


- Santhosh Kumar


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58462/#review172032
-----------------------------------------------------------


On April 14, 2017, 1:35 p.m., Vladimir Khalatyan wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58462/
> -----------------------------------------------------------
>
> (Updated April 14, 2017, 1:35 p.m.)
>
>
> Review request for Aurora, Joshua Cohen and Zameer Manji.
>
>
> Repository: aurora
>
>
> Description
> -------
>
> Fix bug. Do not increase current_consecutive_successes if .healthchecksnooze
> is present.
>
>
> Diffs
> -----
>
>   src/main/python/apache/aurora/executor/common/health_checker.py e9e4129af2db5202a82e9f6d54109a00bbae97ce
>
>
> Diff: https://reviews.apache.org/r/58462/diff/1/
>
>
> Testing
> -------
>
> The Health Check currently succeeds while .healthchecksnooze is present. It
> should just snooze instead, which means there shouldn't be any increase in
> consecutive successes or consecutive failures.
>
>
> Thanks,
>
> Vladimir Khalatyan
>