Re: master-worker no-exit-on-failure with SO_REUSEPORT and a port being already in use

2019-11-21 Thread Christian Ruppert

On 2019-11-20 11:05, William Lallemand wrote:
> On Wed, Nov 20, 2019 at 10:19:20AM +0100, Christian Ruppert wrote:
> > Hi William,
> >
> > thanks for the patch. I'll test it later today.  What I actually wanted to
> > achieve is: https://cbonte.github.io/haproxy-dconv/2.0/management.html#4 Then
> > HAProxy tries to bind to all listening ports. If some fatal errors happen
> > (eg: address not present on the system, permission denied), the process quits
> > with an error. If a socket binding fails because a port is already in use,
> > then the process will first send a SIGTTOU signal to all the pids specified
> > in the "-st" or "-sf" pid list. This is what is called the "pause" signal. It
> > instructs all existing haproxy processes to temporarily stop listening to
> > their ports so that the new process can try to bind again. During this time,
> > the old process continues to process existing connections. If the binding
> > still fails (because for example a port is shared with another daemon), then
> > the new process sends a SIGTTIN signal to the old processes to instruct them
> > to resume operations just as if nothing happened. The old processes will then
> > restart listening to the ports and continue to accept connections. Note that
> > this mechanism is system
> >
> > In my test case though it failed to do so.
>
> Well, it only works with HAProxy processes, not with other processes. There is
> no mechanism to ask a process to release a port when it is neither an haproxy
> process nor a process which uses SO_REUSEPORT.
>
> With HAProxy processes it will bind with SO_REUSEPORT, and will only use the
> SIGTTOU/SIGTTIN signals if it fails to do so.
>
> This part of the documentation is for HAProxy without master-worker mode.
> In master-worker mode, once the master is launched successfully, it is never
> supposed to quit upon a reload (kill -USR2).
>
> During a reload in master-worker mode, the master will do a -sf with the PIDs
> of its workers. If the reload fails for any reason (bad configuration, unable
> to bind, etc.), the behavior is to keep the previous workers. It only tries to
> kill the workers if the reload succeeds. So this is the default behavior.


Your patch seems to fix the issue. The master process won't exit 
anymore. Fallback seems to work during my initial tests. Thanks!


--
Regards,
Christian Ruppert



Re: master-worker no-exit-on-failure with SO_REUSEPORT and a port being already in use

2019-11-20 Thread William Lallemand
On Wed, Nov 20, 2019 at 10:19:20AM +0100, Christian Ruppert wrote:
> Hi William,
> 
> thanks for the patch. I'll test it later today.  What I actually wanted to
> achieve is: https://cbonte.github.io/haproxy-dconv/2.0/management.html#4 Then
> HAProxy tries to bind to all listening ports. If some fatal errors happen
> (eg: address not present on the system, permission denied), the process quits
> with an error. If a socket binding fails because a port is already in use,
> then the process will first send a SIGTTOU signal to all the pids specified
> in the "-st" or "-sf" pid list. This is what is called the "pause" signal. It
> instructs all existing haproxy processes to temporarily stop listening to
> their ports so that the new process can try to bind again. During this time,
> the old process continues to process existing connections. If the binding
> still fails (because for example a port is shared with another daemon), then
> the new process sends a SIGTTIN signal to the old processes to instruct them
> to resume operations just as if nothing happened. The old processes will then
> restart listening to the ports and continue to accept connections. Note that
> this mechanism is system
> 
> In my test case though it failed to do so.

Well, it only works with HAProxy processes, not with other processes. There is
no mechanism to ask a process to release a port when it is neither an haproxy
process nor a process which uses SO_REUSEPORT.

With HAProxy processes it will bind with SO_REUSEPORT, and will only use the
SIGTTOU/SIGTTIN signals if it fails to do so.

This part of the documentation is for HAProxy without master-worker mode.
In master-worker mode, once the master is launched successfully, it is never
supposed to quit upon a reload (kill -USR2).

During a reload in master-worker mode, the master will do a -sf with the PIDs
of its workers. If the reload fails for any reason (bad configuration, unable
to bind, etc.), the behavior is to keep the previous workers. It only tries to
kill the workers if the reload succeeds. So this is the default behavior.

-- 
William Lallemand



Re: master-worker no-exit-on-failure with SO_REUSEPORT and a port being already in use

2019-11-20 Thread Christian Ruppert

Hi William,

thanks for the patch. I'll test it later today.
What I actually wanted to achieve is:
https://cbonte.github.io/haproxy-dconv/2.0/management.html#4
Then HAProxy tries to bind to all listening ports. If some fatal errors happen
(eg: address not present on the system, permission denied), the process quits
with an error. If a socket binding fails because a port is already in use, then
the process will first send a SIGTTOU signal to all the pids specified in the
"-st" or "-sf" pid list. This is what is called the "pause" signal. It instructs
all existing haproxy processes to temporarily stop listening to their ports so
that the new process can try to bind again. During this time, the old process
continues to process existing connections. If the binding still fails (because
for example a port is shared with another daemon), then the new process sends a
SIGTTIN signal to the old processes to instruct them to resume operations just
as if nothing happened. The old processes will then restart listening to the
ports and continue to accept connections. Note that this mechanism is system


In my test case though it failed to do so.
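
For readers following along, the pause/retry/resume sequence described in the
documentation excerpt above can be sketched roughly as follows. This is an
illustration only, not HAProxy's implementation: bind_with_pause(), the
callback types and the two stubs are hypothetical, and the real process may
retry differently. Callbacks are used so the sketch runs without real sockets
or real processes; in actual use kill(2) has the right signature for the
signal callback.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical callback types so the sketch needs no real sockets. */
typedef int (*bind_fn)(void *ctx);            /* 0 on success, -1 if port busy */
typedef int (*signal_fn)(pid_t pid, int sig); /* same contract as kill(2) */

/* Send sig to every PID in the list through the given sender. */
static void signal_all(const pid_t *pids, int n, int sig, signal_fn send)
{
	int i;

	for (i = 0; i < n; i++)
		send(pids[i], sig);
}

/* Try to bind. If the port is busy, pause the old processes (SIGTTOU) and
 * retry once. If the retry also fails, resume them (SIGTTIN).
 * Returns 0 when a bind attempt succeeded, -1 otherwise. */
static int bind_with_pause(bind_fn try_bind, void *ctx,
                           const pid_t *old_pids, int n, signal_fn send)
{
	if (try_bind(ctx) == 0)
		return 0;
	signal_all(old_pids, n, SIGTTOU, send);
	if (try_bind(ctx) == 0)
		return 0;
	signal_all(old_pids, n, SIGTTIN, send);
	return -1;
}

/* Illustration-only stubs: a bind that always fails (port owned by a foreign
 * daemon) and a sender that records signals instead of delivering them. */
static int sent_log[8];
static int sent_count;

static int always_busy(void *ctx)
{
	(void)ctx;
	return -1;
}

static int record_signal(pid_t pid, int sig)
{
	(void)pid;
	if (sent_count < 8)
		sent_log[sent_count++] = sig;
	return 0;
}
```

With always_busy as the bind callback, the old processes receive SIGTTOU and
then SIGTTIN and the call reports failure, which matches the "port is shared
with another daemon" case in the excerpt.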

On 2019-11-19 17:27, William Lallemand wrote:
> On Tue, Nov 19, 2019 at 04:19:26PM +0100, William Lallemand wrote:
> > > I then add another bind for port 80, which is in use by squid already
> > > and try to reload HAProxy. It takes some time until it fails:
> > >
> > > Nov 19 14:39:21 894a0f616fec haproxy[2978]: [WARNING] 322/143921 (2978)
> > > : Reexecuting Master process
> > > ...
> > > Nov 19 14:39:28 894a0f616fec haproxy[2978]: [ALERT] 322/143922 (2978) :
> > > Starting frontend somefrontend: cannot bind socket [0.0.0.0:80]
> > > ...
> > > Nov 19 14:39:28 894a0f616fec systemd[1]: haproxy.service: Main process
> > > exited, code=exited, status=1/FAILURE
> > >
> > > The reload itself is still running (systemd) and will timeout after
> > > about 90s. After that, because of the Restart=always, I guess, it ends
> > > up in a restart loop.
> > >
> > > So I would have expected that the master process will fall back to the
> > > old process and proceed with the old child until the problem has been
> > > fixed.
> > >
>
> The patch in attachment fixes a bug where haproxy could reexecute itself in
> waitpid mode with -sf -1.
>
> I'm not sure this is your bug, but if this is the case you should see haproxy
> in waitpid mode, then the master exiting with the usage message in your logs.


--
Regards,
Christian Ruppert



Re: master-worker no-exit-on-failure with SO_REUSEPORT and a port being already in use

2019-11-19 Thread William Lallemand
On Tue, Nov 19, 2019 at 04:19:26PM +0100, William Lallemand wrote:
> > I then add another bind for port 80, which is in use by squid already
> > and try to reload HAProxy. It takes some time until it fails:
> > 
> > Nov 19 14:39:21 894a0f616fec haproxy[2978]: [WARNING] 322/143921 (2978) 
> > : Reexecuting Master process
> > ...
> > Nov 19 14:39:28 894a0f616fec haproxy[2978]: [ALERT] 322/143922 (2978) : 
> > Starting frontend somefrontend: cannot bind socket [0.0.0.0:80]
> > ...
> > Nov 19 14:39:28 894a0f616fec systemd[1]: haproxy.service: Main process 
> > exited, code=exited, status=1/FAILURE
> > 
> > The reload itself is still running (systemd) and will timeout after 
> > about 90s. After that, because of the Restart=always, I guess, it ends 
> > up in a restart loop.
> > 
> > So I would have expected that the master process will fall back to the
> > old process and proceed with the old child until the problem has been
> > fixed.
> > 

The patch in attachment fixes a bug where haproxy could reexecute itself in
waitpid mode with -sf -1.

I'm not sure this is your bug, but if this is the case you should see haproxy
in waitpid mode, then the master exiting with the usage message in your logs.

-- 
William Lallemand
From 481a3c62a622974587c731b1bdc1478538fd6527 Mon Sep 17 00:00:00 2001
From: William Lallemand 
Date: Tue, 19 Nov 2019 17:04:18 +0100
Subject: [PATCH] BUG/MEDIUM: mworker: don't fill the -sf argument with -1
 during the reexec

Upon a reexec_on_failure, if the process tried to exit after the
initialization of the process structure but before it was filled with a
PID, the PID in the mworker_proc structure is set to -1.

In this particular case the -sf argument is filled with -1 and haproxy
will exit with the usage message because of that argument.

Should be backported in 2.0.
---
 src/haproxy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/haproxy.c b/src/haproxy.c
index a0e630dfa..1d4771e64 100644
--- a/src/haproxy.c
+++ b/src/haproxy.c
@@ -673,7 +673,7 @@ void mworker_reload()
 		next_argv[next_argc++] = "-sf";
 
 		list_for_each_entry(child, &proc_list, list) {
-			if (!(child->options & (PROC_O_TYPE_WORKER|PROC_O_TYPE_PROG)))
+			if (!(child->options & (PROC_O_TYPE_WORKER|PROC_O_TYPE_PROG)) || child->pid <= -1 )
 				continue;
 			next_argv[next_argc] = memprintf(&msg, "%d", child->pid);
 			if (next_argv[next_argc] == NULL)
-- 
2.21.0
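
For readers who want to see the effect of the one-line change in isolation,
here is a minimal standalone sketch of the same filtering idea. It assumes a
simplified, hypothetical struct child_proc in place of haproxy's mworker_proc
and a hypothetical build_sf_args() helper with caller-provided buffers; it is
not the code from src/haproxy.c. Like the patched loop, it emits "-sf" and then
one PID per usable child, skipping entries whose PID is still -1.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for haproxy's mworker_proc. */
struct child_proc {
	int is_worker;  /* non-zero for worker/program entries */
	int pid;        /* -1 until the child has actually been forked */
};

/* Fill argv with "-sf" followed by one PID string per usable child.
 * argv and bufs must each have room for n + 1 entries; bufs provides the
 * storage for the PID strings. Returns the number of arguments written. */
static int build_sf_args(const struct child_proc *children, int n,
                         const char **argv, char bufs[][16])
{
	int argc = 0;
	int i;

	argv[argc++] = "-sf";
	for (i = 0; i < n; i++) {
		/* Skip non-workers and children whose PID was never filled in;
		 * passing "-1" here is what made the re-exec fail. */
		if (!children[i].is_worker || children[i].pid <= -1)
			continue;
		snprintf(bufs[argc], 16, "%d", children[i].pid);
		argv[argc] = bufs[argc];
		argc++;
	}
	return argc;
}
```

Given three children where one is a worker with PID 2978, one is a worker with
PID -1 and one is not a worker, the resulting argument list is "-sf" followed
by "2978" only.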



Re: master-worker no-exit-on-failure with SO_REUSEPORT and a port being already in use

2019-11-19 Thread William Lallemand
On Tue, Nov 19, 2019 at 03:45:09PM +0100, Christian Ruppert wrote:
> Hi list,
> 

Hello,

> I'm facing some issues with already in use ports and the fallback
> feature, during a reload. SO_REUSEPORT already makes it easier/better
> but not perfect, as there are still cases where it fails.
> In my test case I've got a Squid running on port 80 and a HAProxy with 
> "master-worker no-exit-on-failure".

The "no-exit-on-failure" option is only useful when you don't want the master
to kill all the HAProxy processes when one of the workers was killed by
something other than the master (segv, OOM, bug, ...). In this case you still need
another worker available to do the job. It's mostly used with a configuration
with nbproc > 1.

> I am using the shipped (2.0.8)
> systemd unit file and start HAProxy with some frontend and a bind on
> a port like 1337.
> I then add another bind for port 80, which is in use by squid already
> and try to reload HAProxy. It takes some time until it fails:
> 
> Nov 19 14:39:21 894a0f616fec haproxy[2978]: [WARNING] 322/143921 (2978) 
> : Reexecuting Master process
> ...
> Nov 19 14:39:28 894a0f616fec haproxy[2978]: [ALERT] 322/143922 (2978) : 
> Starting frontend somefrontend: cannot bind socket [0.0.0.0:80]
> ...
> Nov 19 14:39:28 894a0f616fec systemd[1]: haproxy.service: Main process 
> exited, code=exited, status=1/FAILURE
> 
> The reload itself is still running (systemd) and will timeout after 
> about 90s. After that, because of the Restart=always, I guess, it ends 
> up in a restart loop.
> 
> So I would have expected that the master process will fall back to the
> old process and proceed with the old child until the problem has been
> fixed.
> 
> Can anybody confirm that? Is that intended?
> 
> https://cbonte.github.io/haproxy-dconv/2.0/management.html#4
> https://cbonte.github.io/haproxy-dconv/2.0/configuration.html#3.1-master-worker
>

Looks like a bug to me, the master should have fallen back to the "waitpid mode"
in this case.

Maybe we don't send the sd_notify OK when we are in waitpid mode and systemd
kills the process after the reload timeout.
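
As background for the hypothesis above: according to the sd_notify(3)
documentation, a readiness notification is a short datagram such as "READY=1"
sent to the UNIX socket named by the NOTIFY_SOCKET environment variable. The
sketch below hand-rolls that step to show what "sending the sd_notify OK"
amounts to. notify_state() is a hypothetical helper written for this
illustration; it is not HAProxy's code and says nothing about what HAProxy
does in waitpid mode. A real service would normally call sd_notify() from
libsystemd instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Send one state string (e.g. "READY=1") to the datagram socket named by
 * $NOTIFY_SOCKET. Returns 1 if sent, 0 if the variable is unset or empty,
 * -1 on error. */
static int notify_state(const char *state)
{
	const char *path = getenv("NOTIFY_SOCKET");
	struct sockaddr_un addr;
	socklen_t len;
	ssize_t sent;
	int fd;

	if (!path || !*path)
		return 0;
	if (strlen(path) >= sizeof(addr.sun_path))
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, path);
	len = offsetof(struct sockaddr_un, sun_path) + strlen(path);
	if (addr.sun_path[0] == '@')
		addr.sun_path[0] = '\0';  /* Linux abstract socket namespace */
	else
		len += 1;                 /* include the trailing NUL for paths */

	fd = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;
	sent = sendto(fd, state, strlen(state), 0,
	              (struct sockaddr *)&addr, len);
	close(fd);
	return sent == (ssize_t)strlen(state) ? 1 : -1;
}
```

If a service manager is waiting for such a message and never receives it, the
start or reload job stays pending until its timeout expires, which is the kind
of behavior suspected here.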

I'll do some tests to check what's going on. 

-- 
William Lallemand