Re: 1.9 external health checks fail suddenly

2019-09-23 Thread Willy Tarreau
Hi Veiko,

On Mon, Sep 23, 2019 at 09:11:36AM +, Veiko Kukk wrote:
> On 2019-08-28 11:13, Veiko Kukk wrote:
> > Applied it to 1.9.10, after ~ 12h it ran into spinlock using 400% cpu
> > (4 threads configured). Not sure if this is related to patch or is
> > some new bug in 1.9.10. I've now replaced running instance with 1.9.10
> > without external check patch to see if this happens again.
> 
> Now, after almost one month, with 1.9.10 (no patches) it happened again. All
> external checks failed again and a large number of zombie external check
> processes had accumulated.
> Unfortunately, since I was not there when the reload was done, I can't tell
> the timeframe or the exact number of those processes.

No worries for this last point, we know what production is. I must
say I forgot about this issue since we had another conversation on
a similar subject with a few other people in github issue #141 where
Lukas linked your issue as well.

There I understood that there was a fundamental issue related to calling
fork() on a thread ID > 1: our signal handlers, registered before the
threads are created, are not shared by the threads, and SIGCHLD is not
received for threads other than 1, so it can happen that no cleanup
happens at all and that zombie processes accumulate. I don't know if this
could be the cause of the failure of your test with the thread_isolate()
patch 1 month ago. I proposed a fix there, consisting in making sure that
only thread 1 runs the external checks (by default any thread can run
them), which fixes the issue in artificially made up setups for me. But I
didn't get a response, maybe the people had reverted by then or maybe
they're still observing.
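
To illustrate the cleanup step described above, here is a minimal C sketch
(not HAProxy code) of what has to happen for every forked check process:
something in the parent must call waitpid() on it after it exits, otherwise
it stays around as a zombie. The helper names spawn_check() and
reap_children() are hypothetical, and the child simply exits instead of
running a real check script.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits immediately with `code`.
 * Returns the child's pid in the parent, or -1 if fork() failed. */
static pid_t spawn_check(int code)
{
	pid_t pid = fork();

	if (pid == 0)
		_exit(code);	/* child: stand-in for the external check script */
	return pid;
}

/* Reap every child that has already exited, without blocking.
 * Returns the number of children reaped by this call. If nothing in the
 * process ever does this, exited children remain as zombies. */
static int reap_children(void)
{
	int reaped = 0;
	int status;
	pid_t pid;

	for (;;) {
		pid = waitpid(-1, &status, WNOHANG);
		if (pid > 0) {
			reaped++;	/* one exited child collected */
			continue;
		}
		if (pid < 0 && errno == EINTR)
			continue;	/* interrupted by a signal, retry */
		break;	/* 0: children still running, -1: no children left */
	}
	return reaped;
}
```

In a real daemon, a function like reap_children() would typically run from
wherever SIGCHLD ends up being processed. If that code never runs for the
children forked by a given thread, nothing is reaped and the zombies pile up.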

I'd encourage you to try it. I've now merged it so that I don't lose it
again in a defunct issue. I'm attaching it here; it applies to 1.9 as
well, with an offset. Normally you'd still need the thread_isolate() patch
that you first tried, though. Just let us know.

Now the other thing to keep in mind is that since it failed after one
month, it could also be caused by another bug in 1.9 which would be
affected by threads.

Hoping this helps,
Willy

From 6dd4ac890b5810b0f0fe81725fda05ad3d052849 Mon Sep 17 00:00:00 2001
From: Willy Tarreau 
Date: Tue, 3 Sep 2019 18:55:02 +0200
Subject: BUG/MEDIUM: check/threads: make external checks run exclusively on
 thread 1

See GH issue #141 for all the context. In short, registered signal
handlers are not inherited by other threads during startup, which is
normally not a problem, except that we need the same thread as the one
doing the fork() to clean up the old process using waitpid() once its
death is reported via SIGCHLD, as happens in external checks.

The only simple solution to this at the moment is to make sure that
external checks are exclusively run on the first thread, the one
which registered the signal handlers on startup. It will be far more
than enough anyway given that external checks should not need to be
load balanced on multiple threads! A more complex solution could be
designed over the long term to let each thread deal with all signals
but it sounds overkill.

This must be backported as far as 1.8.
---
 src/checks.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/src/checks.c b/src/checks.c
index e5dfdd73c8..b879100fac 100644
--- a/src/checks.c
+++ b/src/checks.c
@@ -2177,7 +2177,7 @@ static struct task *process_chk_proc(struct task *t, void *context, unsigned sho
 			/* a success was detected */
 			check_notify_success(check);
 		}
-		task_set_affinity(t, MAX_THREADS_MASK);
+		task_set_affinity(t, 1);
 		check->state &= ~CHK_ST_INPROGRESS;
 
 		pid_list_del(check->curpid);
@@ -2425,8 +2425,13 @@ static int start_check_task(struct check *check, int mininter,
 			    int nbcheck, int srvpos)
 {
 	struct task *t;
+	unsigned long thread_mask = MAX_THREADS_MASK;
+
+	if (check->type == PR_O2_EXT_CHK)
+		thread_mask = 1;
+
 	/* task for the check */
-	if ((t = task_new(MAX_THREADS_MASK)) == NULL) {
+	if ((t = task_new(thread_mask)) == NULL) {
 		ha_alert("Starting [%s:%s] check: out of memory.\n",
 			 check->server->proxy->id, check->server->id);
 		return 0;
-- 
2.20.1
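
The effect of the mask change in the patch above can be pictured with a
small standalone C sketch. It assumes one bit per thread, with bit 0
standing for thread 1, which matches a mask of 1 meaning "thread 1 only" in
the patch. ALL_THREADS_MASK, check_thread_mask() and may_run_on() are
hypothetical names used for illustration, not HAProxy's API.

```c
#include <limits.h>
#include <stdbool.h>

/* Stand-in for MAX_THREADS_MASK: every thread bit set (assumption). */
#define ALL_THREADS_MASK (~0UL)

/* Pick the set of threads allowed to run a check task: external checks
 * are pinned to thread 1, all other checks may run on any thread. */
static unsigned long check_thread_mask(bool is_external)
{
	return is_external ? 1UL : ALL_THREADS_MASK;
}

/* Tell whether the 1-based thread number `tid` is allowed by `mask`. */
static bool may_run_on(unsigned long mask, unsigned int tid)
{
	if (tid == 0 || tid > sizeof(mask) * CHAR_BIT)
		return false;	/* out of range for this mask width */
	return (mask >> (tid - 1)) & 1UL;
}
```

With these definitions, an external check is only eligible for thread 1,
while any other check remains eligible for every thread, e.g. thread 4.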



Re: 1.9 external health checks fail suddenly

2019-09-23 Thread Veiko Kukk

On 2019-08-28 11:13, Veiko Kukk wrote:

> Applied it to 1.9.10, after ~ 12h it ran into spinlock using 400% cpu
> (4 threads configured). Not sure if this is related to patch or is
> some new bug in 1.9.10. I've now replaced running instance with 1.9.10
> without external check patch to see if this happens again.


Now, after almost one month, with 1.9.10 (no patches) it happened again.
All external checks failed again and a large number of zombie external
check processes had accumulated.
Unfortunately, since I was not there when the reload was done, I can't
tell the timeframe or the exact number of those processes.


regards,
Veiko



Re: 1.9 external health checks fail suddenly

2019-08-28 Thread Veiko Kukk

On 2019-07-11 08:35, Willy Tarreau wrote:

> against your version. Normally it should work for 1.9 to 2.1.


Applied it to 1.9.10, after ~ 12h it ran into spinlock using 400% cpu (4 
threads configured). Not sure if this is related to patch or is some new 
bug in 1.9.10. I've now replaced running instance with 1.9.10 without 
external check patch to see if this happens again.


best regards,
Veiko



Re: 1.9 external health checks fail suddenly

2019-07-11 Thread Willy Tarreau
Hi Veiko,

On Wed, Jul 10, 2019 at 09:10:35AM +, Veiko Kukk wrote:
> On 2019-07-09 14:29, Willy Tarreau wrote:
> > I didn't have a patch but just did it. It was only compile-tested,
> > please verify that it works as expected on a non-sensitive machine
> > first!
> 
> Hi, Willy
> 
> Against what version should I run this patch?

against your version. Normally it should work for 1.9 to 2.1.

Willy



Re: 1.9 external health checks fail suddenly

2019-07-10 Thread Veiko Kukk

On 2019-07-09 13:59, Lukas Tribus wrote:

> How are you currently working around this issue? Did you disable
> external checks? I'd assume failing checks have negative impact on
> production systems also.


Since this has happened only 3 times so far in 2 months, we've just
reloaded HAProxy when it happens.


Regards,
Veiko



Re: 1.9 external health checks fail suddenly

2019-07-10 Thread Veiko Kukk

On 2019-07-09 14:29, Willy Tarreau wrote:

> I didn't have a patch but just did it. It was only compile-tested,
> please verify that it works as expected on a non-sensitive machine
> first!


Hi, Willy

Against what version should I run this patch?

Veiko



Re: 1.9 external health checks fail suddenly

2019-07-09 Thread Willy Tarreau
Hi Lukas,

On Tue, Jul 09, 2019 at 03:59:04PM +0200, Lukas Tribus wrote:
> Hello Veiko,
> 
> 
> On Tue, 9 Jul 2019 at 15:40, Veiko Kukk wrote:
> >
> > On 2019-07-08 16:06, Lukas Tribus wrote:
> > > The bug you may be affected by is:
> > > https://github.com/haproxy/haproxy/issues/141
> > >
> > > Can you check what happens with:
> > > nbthread 1
> >
> > I'm afraid I can't because those are production systems that won't be
> > able to service with single thread, they have relatively high ssl
> > termination load.
> 
> You could probably raise nbproc at that point, if you can get away
> with some stats issues ...
> 
> How are you currently working around this issue? Did you disable
> external checks? I'd assume failing checks have negative impact on
> production systems also.
> 
> 
> Willy, in issue #141 it sounds like you already have an idea how this
> could be fixed, is there a patch that we can ask Veiko to try for
> this?

I didn't have a patch but just did it. It was only compile-tested,
please verify that it works as expected on a non-sensitive machine
first!

Cheers,
Willy

From 32205189f881b98cb0bbe6ed32178f2929e9a627 Mon Sep 17 00:00:00 2001
From: Willy Tarreau 
Date: Tue, 9 Jul 2019 16:27:39 +0200
Subject: WIP/BUG: checks: make sure we isolate the thread doing the fork

---
 src/checks.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/checks.c b/src/checks.c
index d3920ce8d..46f93e58f 100644
--- a/src/checks.c
+++ b/src/checks.c
@@ -1977,8 +1977,10 @@ static int connect_proc_chk(struct task *t)
 
 	block_sigchld();
 
+	thread_isolate();
 	pid = fork();
 	if (pid < 0) {
+		thread_release();
 		ha_alert("Failed to fork process for external health check: %s. Aborting.\n",
 			 strerror(errno));
 		set_server_check_status(check, HCHK_STATUS_SOCKERR, strerror(errno));
@@ -2015,6 +2017,7 @@ static int connect_proc_chk(struct task *t)
 	}
 
 	/* Parent */
+	thread_release();
 	if (check->result == CHK_RES_UNKNOWN) {
 		if (pid_list_add(pid, t) != NULL) {
 			t->expire = tick_add(now_ms, MS_TO_TICKS(check->inter));
-- 
2.20.1



Re: 1.9 external health checks fail suddenly

2019-07-09 Thread Lukas Tribus
Hello Veiko,


On Tue, 9 Jul 2019 at 15:40, Veiko Kukk wrote:
>
> On 2019-07-08 16:06, Lukas Tribus wrote:
> > The bug you may be affected by is:
> > https://github.com/haproxy/haproxy/issues/141
> >
> > Can you check what happens with:
> > nbthread 1
>
> I'm afraid I can't because those are production systems that won't be
> able to service with single thread, they have relatively high ssl
> termination load.

You could probably raise nbproc at that point, if you can get away
with some stats issues ...

How are you currently working around this issue? Did you disable
external checks? I'd assume failing checks have negative impact on
production systems also.


Willy, in issue #141 it sounds like you already have an idea how this
could be fixed, is there a patch that we can ask Veiko to try for
this?

cheers,
lukas



Re: 1.9 external health checks fail suddenly

2019-07-09 Thread Veiko Kukk

On 2019-07-08 16:06, Lukas Tribus wrote:

> The bug you may be affected by is:
> https://github.com/haproxy/haproxy/issues/141
>
> Can you check what happens with:
> nbthread 1


I'm afraid I can't because those are production systems that won't be 
able to service with single thread, they have relatively high ssl 
termination load.


Veiko



Re: 1.9 external health checks fail suddenly

2019-07-08 Thread Lukas Tribus
Hello,

On Mon, 1 Jul 2019 at 12:27, Lukas Tribus wrote:
> > > Sometimes (infrequently) all external checks hang and time out:
> > > * Has happened with versions 1.9.4 and 1.9.8 on multiple servers with
> > > nbproc 1 and nbthread set to (4-12) depending on server.
> > > * Happens infrequently, last one happened after 10 days of uptime.
> > > * External checks are written in python and write errors into their own
> > > log file directly. When hanging happens, nothing is logged by external
> > > check.
> > > * Only external checks fail, common 'option httpcheck' does not fail at
> > > the same time.
> >
> > External checks are not thread-safe, that's a bug.
> >
> > Could you try the suggested patch in:
> >
> > https://github.com/haproxy/haproxy/issues/140#issuecomment-507119534
>
> Sorry, I think I got confused here with something else ... the bug is
> about signals being blocked, which is not your problem.

The bug you may be affected by is:
https://github.com/haproxy/haproxy/issues/141

Can you check what happens with:
nbthread 1



cheers,
lukas



Re: 1.9 external health checks fail suddenly

2019-07-01 Thread Veiko Kukk

On 2019-07-01 10:11, Veiko Kukk wrote:

> Hi
>
> Sometimes (infrequently) all external checks hang and time out:
> * Has happened with versions 1.9.4 and 1.9.8 on multiple servers with
> nbproc 1 and nbthread set to (4-12) depending on server.
> * Happens infrequently, last one happened after 10 days of uptime.
> * External checks are written in python and write errors into their
> own log file directly. When hanging happens, nothing is logged by
> external check.
> * Only external checks fail, common 'option httpcheck' does not fail
> at the same time.


It might be useful to add that a reload gets us past the problem: external
health checks start working again afterwards.




Re: 1.9 external health checks fail suddenly

2019-07-01 Thread Lukas Tribus
Hello,

On Mon, 1 Jul 2019 at 12:14, Lukas Tribus wrote:
>
> Hello Veiko,
>
>
> On Mon, 1 Jul 2019 at 12:12, Veiko Kukk wrote:
> >
> > Hi
> >
> > Sometimes (infrequently) all external checks hang and time out:
> > * Has happened with versions 1.9.4 and 1.9.8 on multiple servers with
> > nbproc 1 and nbthread set to (4-12) depending on server.
> > * Happens infrequently, last one happened after 10 days of uptime.
> > * External checks are written in python and write errors into their own
> > log file directly. When hanging happens, nothing is logged by external
> > check.
> > * Only external checks fail, common 'option httpcheck' does not fail at
> > the same time.
>
> External checks are not thread-safe, that's a bug.
>
> Could you try the suggested patch in:
>
> https://github.com/haproxy/haproxy/issues/140#issuecomment-507119534

Sorry, I think I got confused here with something else ... the bug is
about signals being blocked, which is not your problem.



Lukas



Re: 1.9 external health checks fail suddenly

2019-07-01 Thread Lukas Tribus
Hello Veiko,


On Mon, 1 Jul 2019 at 12:12, Veiko Kukk wrote:
>
> Hi
>
> Sometimes (infrequently) all external checks hang and time out:
> * Has happened with versions 1.9.4 and 1.9.8 on multiple servers with
> nbproc 1 and nbthread set to (4-12) depending on server.
> * Happens infrequently, last one happened after 10 days of uptime.
> * External checks are written in python and write errors into their own
> log file directly. When hanging happens, nothing is logged by external
> check.
> * Only external checks fail, common 'option httpcheck' does not fail at
> the same time.

External checks are not thread-safe, that's a bug.

Could you try the suggested patch in:

https://github.com/haproxy/haproxy/issues/140#issuecomment-507119534


Thanks,
Lukas