Re: 1.9 external health checks fail suddenly

Willy Tarreau Mon, 23 Sep 2019 09:37:38 -0700

Hi Veiko,

On Mon, Sep 23, 2019 at 09:11:36AM +0000, Veiko Kukk wrote:
> On 2019-08-28 11:13, Veiko Kukk wrote:
> > Applied it to 1.9.10, after ~ 12h it ran into spinlock using 400% cpu
> > (4 threads configured). Not sure if this is related to patch or is
> > some new bug in 1.9.10. I've now replaced running instance with 1.9.10
> > without external check patch to see if this happens again.
> 
> Now, after almost one month, with 1.9.10 (no patches) it happened again. All
> external checks failed again and there was large amount of zombie external
> check processes accumulated.
> Unfortunately since I was not there doing reload, I can't tell timeframe or
> exact amount of those processes.


No worries for this last point, we know what production is. I must
say I forgot about this issue since we had another conversation on
a similar subject with a few other people in github issue #141 where
Lukas linked your issue as well.

There I understood that there was a fundamental issue related to the
call of fork() on a thread ID > 1, because our signal handlers registered
before the creation of the threads are not shared by the threads and the
SIGCHLD is not received for threads other than 1, so it can happen that
no cleanup happens at all and that zombie process accumulate. I don't
know if this could be the cause of the failure of your test with the
thread_isolate() patch 1 month ago. I proposed a fix there, consisting
in making sure that only thread 1 runs the external checks (by default
any thread can do), which fixes the issue in artificially made up setups
for me. But I didn't get a response, maybe the people had reverted by
then or maybe they're still observing.

I can encourage you to take it, I've now merged it so that I don't lose
it anymore in a defunct issue. I'm attaching it here, it applies to 1.9
as well with an offset. Normally you'd still need the one about
thread_isolate() that you first tried though. Just let us know.

Now the other thing to keep in mind is that since it failed after one
month, it would also be caused by another bug in 1.9 which would be
affected by threads.

Hoping this helps,
Willy

>From 6dd4ac890b5810b0f0fe81725fda05ad3d052849 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w...@1wt.eu>
Date: Tue, 3 Sep 2019 18:55:02 +0200
Subject: BUG/MEDIUM: check/threads: make external checks run exclusively on
 thread 1

See GH issues #141 for all the context. In short, registered signal
handlers are not inherited by other threads during startup, which is
normally not a problem, except that we need that the same thread as
the one doing the fork() cleans up the old process using waitpid()
once its death is reported via SIGCHLD, as happens in external checks.

The only simple solution to this at the moment is to make sure that
external checks are exclusively run on the first thread, the one
which registered the signal handlers on startup. It will be far more
than enough anyway given that external checks must not require to be
load balanced on multiple threads! A more complex solution could be
designed over the long term to let each thread deal with all signals
but it sounds overkill.

This must be backported as far as 1.8.
---
 src/checks.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/src/checks.c b/src/checks.c
index e5dfdd73c8..b879100fac 100644
--- a/src/checks.c
+++ b/src/checks.c
@@ -2177,7 +2177,7 @@ static struct task *process_chk_proc(struct task *t, void 
*context, unsigned sho
                        /* a success was detected */
                        check_notify_success(check);
                }
-               task_set_affinity(t, MAX_THREADS_MASK);
+               task_set_affinity(t, 1);
                check->state &= ~CHK_ST_INPROGRESS;

                pid_list_del(check->curpid);
@@ -2425,8 +2425,13 @@ static int start_check_task(struct check *check, int 
mininter,
                            int nbcheck, int srvpos)
 {
        struct task *t;
+       unsigned long thread_mask = MAX_THREADS_MASK;
+
+       if (check->type == PR_O2_EXT_CHK)
+               thread_mask = 1;
+
        /* task for the check */
-       if ((t = task_new(MAX_THREADS_MASK)) == NULL) {
+       if ((t = task_new(thread_mask)) == NULL) {
                ha_alert("Starting [%s:%s] check: out of memory.\n",
                         check->server->proxy->id, check->server->id);
                return 0;
-- 
2.20.1

Re: 1.9 external health checks fail suddenly

Reply via email to