Hi Veiko, On Mon, Sep 23, 2019 at 09:11:36AM +0000, Veiko Kukk wrote: > On 2019-08-28 11:13, Veiko Kukk wrote: > > Applied it to 1.9.10, after ~ 12h it ran into spinlock using 400% cpu > > (4 threads configured). Not sure if this is related to patch or is > > some new bug in 1.9.10. I've now replaced running instance with 1.9.10 > > without external check patch to see if this happens again. > > Now, after almost one month, with 1.9.10 (no patches) it happened again. All > external checks failed again and there was large amount of zombie external > check processes accumulated. > Unfortunately since I was not there doing reload, I can't tell timeframe or > exact amount of those processes.
No worries for this last point, we know what production is. I must say I forgot about this issue since we had another conversation on a similar subject with a few other people in github issue #141 where Lukas linked your issue as well. There I understood that there was a fundamental issue related to the call of fork() on a thread ID > 1, because our signal handlers registered before the creation of the threads are not shared by the threads and the SIGCHLD is not received for threads other than 1, so it can happen that no cleanup happens at all and that zombie process accumulate. I don't know if this could be the cause of the failure of your test with the thread_isolate() patch 1 month ago. I proposed a fix there, consisting in making sure that only thread 1 runs the external checks (by default any thread can do), which fixes the issue in artificially made up setups for me. But I didn't get a response, maybe the people had reverted by then or maybe they're still observing. I can encourage you to take it, I've now merged it so that I don't lose it anymore in a defunct issue. I'm attaching it here, it applies to 1.9 as well with an offset. Normally you'd still need the one about thread_isolate() that you first tried though. Just let us know. Now the other thing to keep in mind is that since it failed after one month, it would also be caused by another bug in 1.9 which would be affected by threads. Hoping this helps, Willy
>From 6dd4ac890b5810b0f0fe81725fda05ad3d052849 Mon Sep 17 00:00:00 2001 From: Willy Tarreau <w...@1wt.eu> Date: Tue, 3 Sep 2019 18:55:02 +0200 Subject: BUG/MEDIUM: check/threads: make external checks run exclusively on thread 1 See GH issues #141 for all the context. In short, registered signal handlers are not inherited by other threads during startup, which is normally not a problem, except that we need that the same thread as the one doing the fork() cleans up the old process using waitpid() once its death is reported via SIGCHLD, as happens in external checks. The only simple solution to this at the moment is to make sure that external checks are exclusively run on the first thread, the one which registered the signal handlers on startup. It will be far more than enough anyway given that external checks must not require to be load balanced on multiple threads! A more complex solution could be designed over the long term to let each thread deal with all signals but it sounds overkill. This must be backported as far as 1.8. --- src/checks.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/src/checks.c b/src/checks.c index e5dfdd73c8..b879100fac 100644 --- a/src/checks.c +++ b/src/checks.c @@ -2177,7 +2177,7 @@ static struct task *process_chk_proc(struct task *t, void *context, unsigned sho /* a success was detected */ check_notify_success(check); } - task_set_affinity(t, MAX_THREADS_MASK); + task_set_affinity(t, 1); check->state &= ~CHK_ST_INPROGRESS; pid_list_del(check->curpid); @@ -2425,8 +2425,13 @@ static int start_check_task(struct check *check, int mininter, int nbcheck, int srvpos) { struct task *t; + unsigned long thread_mask = MAX_THREADS_MASK; + + if (check->type == PR_O2_EXT_CHK) + thread_mask = 1; + /* task for the check */ - if ((t = task_new(MAX_THREADS_MASK)) == NULL) { + if ((t = task_new(thread_mask)) == NULL) { ha_alert("Starting [%s:%s] check: out of memory.\n", check->server->proxy->id, check->server->id); return 0; -- 2.20.1