Re: 1.9 external health checks fail suddenly
Hi Veiko, On Mon, Sep 23, 2019 at 09:11:36AM +, Veiko Kukk wrote: > On 2019-08-28 11:13, Veiko Kukk wrote: > > Applied it to 1.9.10, after ~ 12h it ran into spinlock using 400% cpu > > (4 threads configured). Not sure if this is related to patch or is > > some new bug in 1.9.10. I've now replaced running instance with 1.9.10 > > without external check patch to see if this happens again. > > Now, after almost one month, with 1.9.10 (no patches) it happened again. All > external checks failed again and there was large amount of zombie external > check processes accumulated. > Unfortunately since I was not there doing reload, I can't tell timeframe or > exact amount of those processes. No worries for this last point, we know what production is. I must say I forgot about this issue since we had another conversation on a similar subject with a few other people in github issue #141 where Lukas linked your issue as well. There I understood that there was a fundamental issue related to the call of fork() on a thread ID > 1, because our signal handlers registered before the creation of the threads are not shared by the threads and the SIGCHLD is not received for threads other than 1, so it can happen that no cleanup happens at all and that zombie process accumulate. I don't know if this could be the cause of the failure of your test with the thread_isolate() patch 1 month ago. I proposed a fix there, consisting in making sure that only thread 1 runs the external checks (by default any thread can do), which fixes the issue in artificially made up setups for me. But I didn't get a response, maybe the people had reverted by then or maybe they're still observing. I can encourage you to take it, I've now merged it so that I don't lose it anymore in a defunct issue. I'm attaching it here, it applies to 1.9 as well with an offset. Normally you'd still need the one about thread_isolate() that you first tried though. Just let us know. Now the other thing to keep in mind is that since it failed after one month, it would also be caused by another bug in 1.9 which would be affected by threads. Hoping this helps, Willy >From 6dd4ac890b5810b0f0fe81725fda05ad3d052849 Mon Sep 17 00:00:00 2001 From: Willy Tarreau Date: Tue, 3 Sep 2019 18:55:02 +0200 Subject: BUG/MEDIUM: check/threads: make external checks run exclusively on thread 1 See GH issues #141 for all the context. In short, registered signal handlers are not inherited by other threads during startup, which is normally not a problem, except that we need that the same thread as the one doing the fork() cleans up the old process using waitpid() once its death is reported via SIGCHLD, as happens in external checks. The only simple solution to this at the moment is to make sure that external checks are exclusively run on the first thread, the one which registered the signal handlers on startup. It will be far more than enough anyway given that external checks must not require to be load balanced on multiple threads! A more complex solution could be designed over the long term to let each thread deal with all signals but it sounds overkill. This must be backported as far as 1.8. --- src/checks.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/src/checks.c b/src/checks.c index e5dfdd73c8..b879100fac 100644 --- a/src/checks.c +++ b/src/checks.c @@ -2177,7 +2177,7 @@ static struct task *process_chk_proc(struct task *t, void *context, unsigned sho /* a success was detected */ check_notify_success(check); } - task_set_affinity(t, MAX_THREADS_MASK); + task_set_affinity(t, 1); check->state &= ~CHK_ST_INPROGRESS; pid_list_del(check->curpid); @@ -2425,8 +2425,13 @@ static int start_check_task(struct check *check, int mininter, int nbcheck, int srvpos) { struct task *t; + unsigned long thread_mask = MAX_THREADS_MASK; + + if (check->type == PR_O2_EXT_CHK) + thread_mask = 1; + /* task for the check */ - if ((t = task_new(MAX_THREADS_MASK)) == NULL) { + if ((t = task_new(thread_mask)) == NULL) { ha_alert("Starting [%s:%s] check: out of memory.\n", check->server->proxy->id, check->server->id); return 0; -- 2.20.1
Re: 1.9 external health checks fail suddenly
On 2019-08-28 11:13, Veiko Kukk wrote: Applied it to 1.9.10, after ~ 12h it ran into spinlock using 400% cpu (4 threads configured). Not sure if this is related to patch or is some new bug in 1.9.10. I've now replaced running instance with 1.9.10 without external check patch to see if this happens again. Now, after almost one month, with 1.9.10 (no patches) it happened again. All external checks failed again and there was large amount of zombie external check processes accumulated. Unfortunately since I was not there doing reload, I can't tell timeframe or exact amount of those processes. regards, Veiko
Re: 1.9 external health checks fail suddenly
On 2019-07-11 08:35, Willy Tarreau wrote: against your version. Normally it should work for 1.9 to 2.1. Applied it to 1.9.10, after ~ 12h it ran into spinlock using 400% cpu (4 threads configured). Not sure if this is related to patch or is some new bug in 1.9.10. I've now replaced running instance with 1.9.10 without external check patch to see if this happens again. best regards, Veiko
Re: 1.9 external health checks fail suddenly
Hi Veiko, On Wed, Jul 10, 2019 at 09:10:35AM +, Veiko Kukk wrote: > On 2019-07-09 14:29, Willy Tarreau wrote: > > I didn't have a patch but just did it. It was only compile-tested, > > please verify that it works as expected on a non-sensitive machine > > first! > > Hi, Willy > > Against what version should I run this patch? against your version. Normally it should work for 1.9 to 2.1. Willy
Re: 1.9 external health checks fail suddenly
On 2019-07-09 13:59, Lukas Tribus wrote: How are you currently working around this issue? Did you disable external checks? I'd assume failing checks have negative impact on production systems also. Since this has happened so far only 3 times during 2 months, we've just reloaded HAproxy when it happens. Regards, Veiko
Re: 1.9 external health checks fail suddenly
On 2019-07-09 14:29, Willy Tarreau wrote: I didn't have a patch but just did it. It was only compile-tested, please verify that it works as expected on a non-sensitive machine first! Hi, Willy Against what version should I run this patch? Veiko
Re: 1.9 external health checks fail suddenly
Hi Lukas, On Tue, Jul 09, 2019 at 03:59:04PM +0200, Lukas Tribus wrote: > Hello Veiko, > > > On Tue, 9 Jul 2019 at 15:40, Veiko Kukk wrote: > > > > On 2019-07-08 16:06, Lukas Tribus wrote: > > > The bug you may be affected by is: > > > https://github.com/haproxy/haproxy/issues/141 > > > > > > Can you check what happens with: > > > nbthread 1 > > > > I'm afraid I can't because those are production systems that won't be > > able to service with single thread, they have relatively high ssl > > termination load. > > You could probably raise nbproc at that point, if you can get away > with some stats issues ... > > How are you currently working around this issue? Did you disable > external checks? I'd assume failing checks have negative impact on > production systems also. > > > Willy, in issue #141 in sounds like you already have an idea how this > could be fixed, is there a patch that we can ask Veiko to try for > this? I didn't have a patch but just did it. It was only compile-tested, please verify that it works as expected on a non-sensitive machine first! Cheers, Willy >From 32205189f881b98cb0bbe6ed32178f2929e9a627 Mon Sep 17 00:00:00 2001 From: Willy Tarreau Date: Tue, 9 Jul 2019 16:27:39 +0200 Subject: WIP/BUG: checks: make sure we isolate the thread doing the fork --- src/checks.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/checks.c b/src/checks.c index d3920ce8d..46f93e58f 100644 --- a/src/checks.c +++ b/src/checks.c @@ -1977,8 +1977,10 @@ static int connect_proc_chk(struct task *t) block_sigchld(); + thread_isolate(); pid = fork(); if (pid < 0) { + thread_release(); ha_alert("Failed to fork process for external health check: %s. Aborting.\n", strerror(errno)); set_server_check_status(check, HCHK_STATUS_SOCKERR, strerror(errno)); @@ -2015,6 +2017,7 @@ static int connect_proc_chk(struct task *t) } /* Parent */ + thread_release(); if (check->result == CHK_RES_UNKNOWN) { if (pid_list_add(pid, t) != NULL) { t->expire = tick_add(now_ms, MS_TO_TICKS(check->inter)); -- 2.20.1
Re: 1.9 external health checks fail suddenly
Hello Veiko, On Tue, 9 Jul 2019 at 15:40, Veiko Kukk wrote: > > On 2019-07-08 16:06, Lukas Tribus wrote: > > The bug you may be affected by is: > > https://github.com/haproxy/haproxy/issues/141 > > > > Can you check what happens with: > > nbthread 1 > > I'm afraid I can't because those are production systems that won't be > able to service with single thread, they have relatively high ssl > termination load. You could probably raise nbproc at that point, if you can get away with some stats issues ... How are you currently working around this issue? Did you disable external checks? I'd assume failing checks have negative impact on production systems also. Willy, in issue #141 in sounds like you already have an idea how this could be fixed, is there a patch that we can ask Veiko to try for this? cheers, lukas
Re: 1.9 external health checks fail suddenly
On 2019-07-08 16:06, Lukas Tribus wrote: The bug you may be affected by is: https://github.com/haproxy/haproxy/issues/141 Can you check what happens with: nbthread 1 I'm afraid I can't because those are production systems that won't be able to service with single thread, they have relatively high ssl termination load. Veiko
Re: 1.9 external health checks fail suddenly
Hello, On Mon, 1 Jul 2019 at 12:27, Lukas Tribus wrote: > > > Sometimes (infrequently) all external checks hang and time out: > > > * Has happened with versions 1.9.4 and 1.9.8 on multiple servers with > > > nbproc 1 and nbthread set to (4-12) depending on server. > > > * Happens infrequently, last one happened after 10 days of uptime. > > > * External checks are written in python and write errors into their own > > > log file directly. When hanging happens, nothing is logged by external > > > check. > > > * Only external checks fail, common 'option httpcheck' does not fail at > > > the same time. > > > > External checks are not thread-safe, that's a bug. > > > > Could you try the suggest patch in: > > > > https://github.com/haproxy/haproxy/issues/140#issuecomment-507119534 > > Sorry, I think I got confused here with something else ... the bug is > about signals being blocked, which is not your problem. The bug you may be affected by is: https://github.com/haproxy/haproxy/issues/141 Can you check what happens with: nbthread 1 cheers, lukas
Re: 1.9 external health checks fail suddenly
On 2019-07-01 10:11, Veiko Kukk wrote: Hi Sometimes (infrequently) all external checks hang and time out: * Has happened with versions 1.9.4 and 1.9.8 on multiple servers with nbproc 1 and nbthread set to (4-12) depending on server. * Happens infrequently, last one happened after 10 days of uptime. * External checks are written in python and write errors into their own log file directly. When hanging happens, nothing is logged by external check. * Only external checks fail, common 'option httpcheck' does not fail at the same time. Might be useful to add that reload helps to get over, external health checks start working again.
Re: 1.9 external health checks fail suddenly
Hello, On Mon, 1 Jul 2019 at 12:14, Lukas Tribus wrote: > > Hello Veiko, > > > On Mon, 1 Jul 2019 at 12:12, Veiko Kukk wrote: > > > > Hi > > > > Sometimes (infrequently) all external checks hang and time out: > > * Has happened with versions 1.9.4 and 1.9.8 on multiple servers with > > nbproc 1 and nbthread set to (4-12) depending on server. > > * Happens infrequently, last one happened after 10 days of uptime. > > * External checks are written in python and write errors into their own > > log file directly. When hanging happens, nothing is logged by external > > check. > > * Only external checks fail, common 'option httpcheck' does not fail at > > the same time. > > External checks are not thread-safe, that's a bug. > > Could you try the suggest patch in: > > https://github.com/haproxy/haproxy/issues/140#issuecomment-507119534 Sorry, I think I got confused here with something else ... the bug is about signals being blocked, which is not your problem. Lukas
Re: 1.9 external health checks fail suddenly
Hello Veiko, On Mon, 1 Jul 2019 at 12:12, Veiko Kukk wrote: > > Hi > > Sometimes (infrequently) all external checks hang and time out: > * Has happened with versions 1.9.4 and 1.9.8 on multiple servers with > nbproc 1 and nbthread set to (4-12) depending on server. > * Happens infrequently, last one happened after 10 days of uptime. > * External checks are written in python and write errors into their own > log file directly. When hanging happens, nothing is logged by external > check. > * Only external checks fail, common 'option httpcheck' does not fail at > the same time. External checks are not thread-safe, that's a bug. Could you try the suggest patch in: https://github.com/haproxy/haproxy/issues/140#issuecomment-507119534 Thanks, Lukas