> On April 14, 2017, 2:26 p.m., Santhosh Kumar Shanmugham wrote: > > src/main/python/apache/aurora/executor/common/health_checker.py > > Lines 163-166 (patched) > > <https://reviews.apache.org/r/58462/diff/1/?file=1692816#file1692816line163> > > > > This will cause a task to get stuck in `STARTING` since `self.running` > > will never be set to `True`. > > > > Can you explain the particular usecase here? Also add a test case to > > exercise this branch. > > Vladimir Khalatyan wrote: > The idea is to make HealthCheck process to start after some of the setup > processes are finished. With the current approach it's possible to addjust > the "starting" point of the HealthCheck process by changing > initial_interval_secs. But it means that we rely on the timing which doesn't > guarantee anything. > The idea of HealthCheck "snoozing" is ignore any status of the > healthcheck unless some process tells HealthCheck to start checking the > health of the service. > > Example (simplified one): > Let's assume we start two processes on the machine: the LB registration > and the UWSGI process. Let's say the uwsgi process requires some time to warm > up. The LB registration depends on the load on LB, how soon uwsgi warms up, > etc. So the actual moment when the application becomes available can vary > from couple of seconds to minutes and we can not rely on > initial_interval_secs. So we create a .healthchecksnooze file and ignore all > results of the healthcheck unless this file is there. In a meanwhile the LB > registration process will try to register service some number of times ( < > max_failures) and delete the .healthchecksnooze after it succeeds. Since this > particular moment the healthcheck will start incrementing the concecutive > successes or failures and we can determine whether the deployment is > successfull or not. > So with this approach we can specify the "starting" point of health > checking more accurately and dependent on other processes. > > Here by "starting" point of the health check I mean the checking of the > application health and changing the consecutive successes or failures, not > the actual system process. > > Santhosh Kumar Shanmugham wrote: > > "So the actual moment when the application becomes available can vary > from couple of seconds to minutes and we can not rely on > initial_interval_secs." > > The current implementation addresses this problem of > `initial_interval_secs` not responding faster with varying startup times. It > achieves this by performing `health checks` during the startup time > (`initial_interval_secs`) but ignores all failures during this period, > however successful health checks now count towards transitioning the task to > a healthy (RUNNING) state. Thereby it can accomodate both slow startup as > well as fast startup without making the faster startup instances from waiting > until the entire `initial_interval_secs` has expired. > > However for your change in particular, you might also need to account for > `_should_enforce_deadline` - which will treat a task as unhealthy if it runs > out of attempts. > > Santhosh Kumar Shanmugham wrote: > I was just looking at the docs and your usecase of health-check snoozing > is vastly different from the usecase the documentation - > > > You can pause health checking by touching a file inside of your > sandbox, named .healthchecksnooze. As long as that file is present, health > checks will be disabled, enabling users to gather core dumps or other > performance measurements without worrying about Aurora’s health check killing > their process. > > Using health check snoozing to achieve synchronization of health checks, > is indicative of how inflexible the health-checking mechanism is in reality. > :( > > Stephan Erb wrote: > Personally speaking, I would not want to run without a sane > `initial_interval_secs` (e.g. 5-10 minutes). In distributed systems, things > tend to stall from time to time. Not having a timeout to unstuck a deployment > seems very problematic to me. > > There is a recent discussion on the mailing list discovering a related > issue: > https://lists.apache.org/thread.html/a0b01ba8928dc637cff223d2ff74a8d00436850954d61150fc709ca4@%3Cuser.aurora.apache.org%3E > > Vladimir Khalatyan wrote: > Santhosh, so what can you suggest in my case? Let's assume the LB > registration process depends on the latency, readiness of the service, etc. > Similar dependencied for uwsgi process. What is the initial_interval_secs > should I set? Or how can I determine what is the interval supposed to be? 1 > minute? 10 minutes? 30 minutes? I can only guess. But if there are any delays > on the LB register or uwsgi warm up, everything could screw up. So it's not > flexible anymore. > With the approach of disabling the health check process, it's flexible. > Ideally I don't want to health check my application unless it's ready to run. > The readiness could be checked on my side, inside my processes. And somehow I > should be able to tell health check that my application is ready for > checking. The ".healthchecksnooze" file could be a great candidate for that. > > Vladimir Khalatyan wrote: > I am not saying that there is no control of the timeout. > Let's simplify everything. We have process A (some setup process), > process B (another setup process) and process C (the main daemon process of > the application). > We would like to execute these processes in an order of A->B->C. > > Unfortunately currently there is no control of when to start the > HealthCheck (HC) unless trying to predict the initial_interval_secs (which > basically means don't accept failures unless all setups are finished and my > main process is started) and few other parameters. For example in case I add > one more process D between A and B, I would need to recalculate the > initial_interval_secs in order to have the system work properly. > > Ideally you would like to start A, then B, then C and start the HC after > it. So the .healthchecksnooze file would be an identifier of when to start > the HC (unless it's possible to have a dependency for the HC, which would be > awesome and simplify everything). What it means is that "Don't run the HC if > this file exists". So after all processes A,B and C are executed properly, > one of those processes (for example C) can delete the .healthchecksnooze file > and that basically means that we can start a HC. We still can have > initial_interval_secs if we want but probably min_consecutive_successes and > max_consecutive_failures are enough. With this file we can shift the starting > point of healthchecking to whenever we want depending on our configurations > instead of having it running right at the beginning of the Task. > > And still we have the controlling configurations for each of these > processes. You can specify what is timeout for A, how many max failures > acceptable for B, etc. So if something "stuck" at some point and reach the > failure requirements the deployment would fail.
Vladimir, I am fine with the change. Personally I would prefer a first-class support for notifying when to start HCs. For instance via a HTTP endpoint (/startHealthCheck). Having said that my comment still stands which will be an issue for the implementation - In addition to skipping the increments of `min_consecutive_successes` we also need to need to skip increments to the `attempts` - Santhosh Kumar ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/58462/#review172032 ----------------------------------------------------------- On April 14, 2017, 1:35 p.m., Vladimir Khalatyan wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/58462/ > ----------------------------------------------------------- > > (Updated April 14, 2017, 1:35 p.m.) > > > Review request for Aurora, Joshua Cohen and Zameer Manji. > > > Repository: aurora > > > Description > ------- > > Fix bug. Do not increase current_consecutive_successes if .healthchecksnooze > present > > > Diffs > ----- > > src/main/python/apache/aurora/executor/common/health_checker.py > e9e4129af2db5202a82e9f6d54109a00bbae97ce > > > Diff: https://reviews.apache.org/r/58462/diff/1/ > > > Testing > ------- > > The Health Check is succeeding when the .healthchecksnooze is present. But it > should just snooze which means there shouldn't be any increase in consecutive > successes or consecutive failures. > > > Thanks, > > Vladimir Khalatyan > >