Re: HAProxy signal queue not working correctly

2015-03-24 Thread Ha Quan Le
Thanks. I sent you a request earlier, but I have since resolved it.
Ha. 

RE: HAProxy signal queue not working correctly

2015-03-24 Thread Alan Fitton
Hi,

I've been trying out this logging, and a few variations of my own, on one of 
the RHEL5 (2.6.18-371.9.1.el5) systems that was exhibiting the problem more 
frequently.

I am seeing what you saw: signals are queued and processed without any issue.
It's strange. I can't figure out why the problem isn't reproducible with a
large number of signals sent from the command line, yet happens every few days
through occasional use of the reload command.

I will keep looking, probably by deploying a patched version which logs the
below and some more to syslog, so I can watch for the problem in a more
"realistic" setting.

Since deploying my workaround patch which traverses the signal_queue instead of 
checking signal_state[sig].count, the problem hasn't been seen in our test 
environment.

Agree it's probably a bug which needs fixing from a signal handling point of 
view, and I'm definitely up for trying to find the cause. My thoughts about 
queueing reloads on the automation side were because it seems unnecessary and 
not optimal (for us anyway) for the automation to potentially reload HAProxy 
more than once every few seconds.

while true; do kill -QUIT 28292; kill -TTOU 28292; kill -QUIT 28292; kill -TTIN 28292; done

leave loop: signal_queue_len=3, count[SIGTTOU]=0, count[SIGTTIN]=0, count[SIGQUIT]=0
signal handler: signal_queue_len=0, count[SIGTTOU]=0, count[SIGTTIN]=0, count[SIGQUIT]=0
signal handler: signal_queue_len=0, count[SIGTTOU]=0, count[SIGTTIN]=0, count[SIGQUIT]=0
signal handler: signal_queue_len=0, count[SIGTTOU]=0, count[SIGTTIN]=0, count[SIGQUIT]=0
enter loop: signal_queue_len=3, count[SIGTTOU]=1, count[SIGTTIN]=1, count[SIGQUIT]=1
leave loop: signal_queue_len=3, count[SIGTTOU]=0, count[SIGTTIN]=0, count[SIGQUIT]=0

Regards,
Alan

Re: HAProxy signal queue not working correctly

2015-03-19 Thread Willy Tarreau
Hi Alan,

On Thu, Mar 19, 2015 at 10:56:35AM +, Alan Fitton wrote:
> Hi Willy,
> 
> Thank you for your reply and your work on HAProxy. I will add some
> instrumentation and hopefully be able to demonstrate your theory. I agree,
> it's the one I had arrived at too :) It seemed unlikely at first since the
> signals are masked inside __signal_process_queue, but it's still possible if
> timing goes against you.

I don't think we're sensitive to timing here. I've done this to try to
reproduce the issue :

diff --git a/src/signal.c b/src/signal.c
index e9301ed..241feac 100644
--- a/src/signal.c
+++ b/src/signal.c
@@ -37,6 +37,7 @@ int signal_pending = 0; /* non-zero if t least one signal remains unprocessed */
  */
 void signal_handler(int sig)
 {
+    fprintf(stderr, "received signal %d\n", sig);
     if (sig < 0 || sig >= MAX_SIGNAL) {
         /* unhandled signal */
         signal(sig, SIG_IGN);
@@ -77,6 +78,7 @@ void __signal_process_queue()
      * handler. That allows real signal handlers to redistribute signals
      * to tasks subscribed to signal zero.
      */
+    fprintf(stderr, "enter loop: signal_queue_len=%d, count[%d]=%d\n", signal_queue_len, SIGUSR1, signal_state[SIGUSR1].count);
     for (cur_pos = 0; cur_pos < signal_queue_len; cur_pos++) {
         sig  = signal_queue[cur_pos];
         desc = &signal_state[sig];
@@ -90,7 +92,9 @@ void __signal_process_queue()
             }
             desc->count = 0;
         }
+        sleep(1);
     }
+    fprintf(stderr, "leave loop: signal_queue_len=%d, count[%d]=%d\n", signal_queue_len, SIGUSR1, signal_state[SIGUSR1].count);
     signal_queue_len = 0;
 
     /* restore signal delivery */

Then I'm doing that :

$ killall -QUIT haproxy; killall -1 haproxy; kill -USR1 $(pidof haproxy) \
    $(pidof haproxy) $(pidof haproxy) $(pidof haproxy) $(pidof haproxy) \
    $(pidof haproxy) $(pidof haproxy) $(pidof haproxy);

(SIGQUIT sends a large response, and SIGHUP talks as well). The sleep(1)
is here to ensure that signals get delivered and queued before we
re-enable signals. I'm clearly seeing my messages being queued :

received signal 3
enter loop: signal_queue_len=1, count[10]=0
Dumping pools usage. Use SIGQUIT to flush them.
  - Pool pipe (32 bytes) : 5 allocated (160 bytes), 5 used, 4 users [SHARED]
  - Pool hlua_com (48 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool capture (64 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool task (112 bytes) : 2 allocated (224 bytes), 2 used, 1 users [SHARED]
  - Pool uniqueid (128 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool connection (336 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool hdr_idx (416 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool requri (1024 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool session (1072 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool buffer (8064 bytes) : 3 allocated (24192 bytes), 0 used, 1 users [SHARED]
Total: 10 pools, 24576 bytes allocated, 384 used.
leave loop: signal_queue_len=1, count[10]=0
received signal 10
received signal 1
enter loop: signal_queue_len=2, count[10]=1
[WARNING] 077/155017 (7008) : Stopping frontend foo in 0 ms.
received signal 0
[WARNING] 077/155017 (7008) : SIGHUP received, dumping servers states.
[WARNING] 077/155017 (7008) : SIGHUP: Proxy foo has no servers. Conn: act(FE+BE): 0+0, 0 pend (0 unass), tot(FE+BE): 0+0.
leave loop: signal_queue_len=3, count[10]=0
[WARNING] 077/155020 (7008) : Proxy foo stopped (FE: 0 conns, BE: 0 conns).

As you can see, the signals are properly queued and delivered. I suspect
that the signals may not be properly blocked on your platform. If you play
with the patch above, you should be able to observe whether blocking works
or not. If it does not, then maybe we'll have to work on a workaround.

> We have a large number of microservices and are using automation where app
> servers get registered and deregistered with an Apache ZooKeeper instance (for
> autoscaling). The HAProxy configuration is generated based on changes to
> ZooKeeper and reloaded. We don't have reloads in "loops" as such; they are
> usually done infrequently. But during a large release of applications I
> think it's quite possible that haproxy will be sent a few reload commands in
> a very short space of time.

I tried to do that as well but couldn't trigger the bug unfortunately :-/

> Maybe I will try to queue these up a little on the automation side.

No, there is no reason, we need to fix the bug if it's one, or to work
around the issue if the problem is in your kernel for example.

Best regards,
Willy




RE: HAProxy signal queue not working correctly

2015-03-19 Thread Alan Fitton
Hi Willy,

Thank you for your reply and your work on HAProxy. I will add some 
instrumentation and hopefully be able to demonstrate your theory. I agree, it's 
the one I had arrived at too :) It seemed unlikely at first since the signals 
are masked inside __signal_process_queue, but it's still possible if timing 
goes against you.

We have a large number of microservices and are using automation where app
servers get registered and deregistered with an Apache ZooKeeper instance (for
autoscaling). The HAProxy configuration is generated based on changes to
ZooKeeper and reloaded. We don't have reloads in "loops" as such; they are
usually done infrequently. But during a large release of applications I think
it's quite possible that haproxy will be sent a few reload commands in a very
short space of time. Maybe I will try to queue these up a little on the
automation side.

Regards,
Alan

Re: HAProxy signal queue not working correctly

2015-03-19 Thread Willy Tarreau
Hi Alan,

On Wed, Mar 18, 2015 at 01:11:32PM +, Alan Fitton wrote:
> Basically the signal_queue isn't being updated with a reference to SIGTTOU,
> because signal_state[SIGTTOU].count is > 0. I guess there's an assumption in
> the code that if any given signal already has events counted up in
> signal_state, then it must have updated signal_queue so they will get
> processed soon.

This is indeed what the code does :

    if (!signal_state[sig].count) {
        /* signal was not queued yet */
        if (signal_queue_len < MAX_SIGNAL)
            signal_queue[signal_queue_len++] = sig;
        else
            qfprintf(stderr, "Signal %d : signal queue is unexpectedly full.\n", sig);
    }

    signal_state[sig].count++;

So there's theoretically no way to have a non-zero count value
with a zero signal_queue_len, unless one of these gets corrupted
at some point.

Also, __signal_process_queue() seems to properly count these :

    for (cur_pos = 0; cur_pos < signal_queue_len; cur_pos++) {
        sig  = signal_queue[cur_pos];
        desc = &signal_state[sig];
        if (desc->count) {
            struct sig_handler *sh, *shb;
            list_for_each_entry_safe(sh, shb, &desc->handlers, list) {
                if ((sh->flags & SIG_F_TYPE_FCT) && sh->handler)
                    ((void (*)(struct sig_handler *))sh->handler)(sh);
                else if ((sh->flags & SIG_F_TYPE_TASK) && sh->handler)
                    task_wakeup(sh->handler, sh->arg | TASK_WOKEN_SIGNAL);
            }
            desc->count = 0;
        }
    }
    signal_queue_len = 0;

> But from what I see below, this doesn't seem to be the case
> always, and then all events of a particular signal can end up getting "lost".
> I think there is some timing or logic issue here.
> 
> (22 = SIGTTOU)
> 
> /* Break on SIGTTOU. There are 805 events in the
> Program received signal SIGTTOU, Stopped (tty output).
> 0x2b369ab6a373 in __epoll_wait_nocancel () from /lib64/libc.so.6
> (gdb) print signal_state[22]
> $16 = {count = 805, handlers = {n = 0xe1efa80, p = 0xe1efa80}}
> (gdb) print signal_queue_len
> $17 = 0

That clearly demonstrates a bug! Well, thinking about it now, there would
be a possibility : if the signal is delivered while we're in
__signal_process_queue(), what you observe could indeed happen, because
we'd miss the desc->count and clear signal_queue_len afterwards.
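
The scenario above can be modelled without real signals. The following is only a deterministic sketch, not HAProxy code: the arrays stand in for signal_queue and signal_state[sig].count, and the function names and the late_sig parameter are invented for the illustration. It assumes a signal is delivered after the processing loop has finished but before signal_queue_len is reset, which should only be possible if signal blocking does not work as intended.

```c
#include <assert.h>
#include <string.h>

#define MAX_SIGNAL 256

/* Simplified stand-ins for the variables discussed in this thread. */
static int signal_queue_len;
static int signal_queue[MAX_SIGNAL];
static int signal_count[MAX_SIGNAL]; /* plays the role of signal_state[sig].count */
static int handled[MAX_SIGNAL];      /* how many times each signal was processed */

/* Same queuing rule as the handler quoted above: enqueue only when count is 0. */
static void sim_signal_handler(int sig)
{
    assert(sig >= 0 && sig < MAX_SIGNAL);
    if (!signal_count[sig]) {
        if (signal_queue_len < MAX_SIGNAL)
            signal_queue[signal_queue_len++] = sig;
    }
    signal_count[sig]++;
}

/* Same shape as the processing loop quoted above. If late_sig >= 0, one
 * signal is "delivered" after the loop but before the queue length is reset.
 */
static void sim_process_queue(int late_sig)
{
    int cur_pos;

    for (cur_pos = 0; cur_pos < signal_queue_len; cur_pos++) {
        int sig = signal_queue[cur_pos];
        if (signal_count[sig]) {
            handled[sig]++;
            signal_count[sig] = 0;
        }
    }
    if (late_sig >= 0)
        sim_signal_handler(late_sig); /* hypothetical unblocked delivery */
    signal_queue_len = 0;
}

/* Returns 1 if the stuck state (count > 0 with an empty queue) is reached
 * and three later deliveries of the same signal are never processed.
 */
static int sim_lost_signal(int sig)
{
    int i;

    signal_queue_len = 0;
    memset(signal_count, 0, sizeof(signal_count));
    memset(handled, 0, sizeof(handled));

    sim_signal_handler(sig);  /* first delivery: queued normally */
    sim_process_queue(sig);   /* processed once, then redelivered in the window */
    for (i = 0; i < 3; i++) { /* later deliveries only bump the counter */
        sim_signal_handler(sig);
        sim_process_queue(-1);
    }
    return signal_queue_len == 0 && signal_count[sig] == 4 && handled[sig] == 1;
}
```

In this model sim_lost_signal(22) returns 1: the counter keeps growing while the queue stays empty, which has the same shape as the gdb output quoted above (count = 805, signal_queue_len = 0). It shows that the window is sufficient to produce that state, not that it is what happens on the affected systems.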

Could you please try to instrument this function to confirm if the issue
is there ? If so we need to use a different set of variables to process
this and protect the loop.

I'll try to do something about it. I've already got a report of a reload
not working once in a while but had no info around it so I attributed it
to a PEBKAC-style issue. If you could share a reproducer, it would really
help. Given your sig count, I guess you send signals in loops ?

Thanks,
Willy




RE: HAProxy signal queue not working correctly

2015-03-18 Thread Alan Fitton
As a workaround I've patched haproxy to check if a signal is present in the 
signal queue, rather than rely on the signal_state[sig].count to determine 
whether to add it.

--- a/src/signal.c 2015-02-01 06:54:32.0 +
+++ b/src/signal.c 2015-03-18 11:30:19.683413000 +
@@ -35,6 +35,17 @@
  * Signal number zero has a specific status, as it cannot be delivered by the
  * system, any function may call it to perform asynchronous signal delivery.
  */
+
+static int is_signal_queued(int sig)
+{
+  int i;
+  for (i = 0; i < signal_queue_len; i++) {
+    if (sig == signal_queue[i])
+      return 1;
+  }
+  return 0;
+}
+
void signal_handler(int sig)
{
   if (sig < 0 || sig >= MAX_SIGNAL) {
@@ -44,7 +55,7 @@
  return;
   }
-   if (!signal_state[sig].count) {
+   if (!is_signal_queued(sig)) {
  /* signal was not queued yet */
  if (signal_queue_len < MAX_SIGNAL)
  signal_queue[signal_queue_len++] = sig;
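
The idea behind this workaround, deciding whether to enqueue by scanning the queue rather than by trusting the counter, can also be sketched in isolation. This is a standalone model and not the patch itself: the arrays are simplified stand-ins, enqueue_signal and recover_from_stuck_state are hypothetical names, and the scan here covers indices 0 to signal_queue_len - 1. It starts from the stuck state seen in gdb (a non-zero count with an empty queue) and checks that the next delivery gets queued again, without creating duplicate entries.

```c
#include <assert.h>

#define MAX_SIGNAL 256

/* Simplified stand-ins for the HAProxy variables named in this thread. */
static int signal_queue_len;
static int signal_queue[MAX_SIGNAL];
static int signal_count[MAX_SIGNAL]; /* plays the role of signal_state[sig].count */

/* Linear scan of the pending queue, indices 0 .. signal_queue_len - 1. */
static int is_signal_queued(int sig)
{
    int i;

    for (i = 0; i < signal_queue_len; i++) {
        if (signal_queue[i] == sig)
            return 1;
    }
    return 0;
}

/* Queue the signal based on queue membership, not on the counter. */
static void enqueue_signal(int sig)
{
    assert(sig >= 0 && sig < MAX_SIGNAL);
    if (!is_signal_queued(sig) && signal_queue_len < MAX_SIGNAL)
        signal_queue[signal_queue_len++] = sig;
    signal_count[sig]++;
}

/* Start from the stuck state (stale count, empty queue), deliver the signal
 * twice, and return the resulting queue length.
 */
static int recover_from_stuck_state(int sig, int stale_count)
{
    signal_queue_len = 0;
    signal_count[sig] = stale_count;
    enqueue_signal(sig); /* queued again despite the non-zero count */
    enqueue_signal(sig); /* already queued: no duplicate entry */
    return signal_queue_len;
}
```

With these definitions, recover_from_stuck_state(22, 805) returns 1 and leaves signal 22 in slot 0 of the queue. The scan is linear in the queue length and runs in the signal handler, which seems acceptable only because the queue is bounded by MAX_SIGNAL.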


From: Alan Fitton [mailto:alan.fit...@ig.com]
Sent: 17 March 2015 16:02
To: haproxy@formilux.org
Subject: HAProxy signal queue not working correctly

Hello,

We are in the process of deploying HAProxy to replace our existing internal 
load balancers, 41 installations in our test environment. Backends will be 
added and removed from the configuration automatically (maybe a few times an 
hour) and then the "reload" functionality used.

Every few days, I find that 2 to 4 have ended up in a state where the reload 
function doesn't work. More specifically, the SIGTTOU is ignored by the 
existing HAProxy process, so the new one is unable to bind to its port.

I've been looking at the way HAProxy does signal handling and inspecting the 
process using gdb. I think I can see why the signal is ignored, but am unsure 
how exactly it ends up in this state.

Basically the signal_queue isn't being updated with a reference to SIGTTOU, 
because signal_state[SIGTTOU].count is > 0. I guess there's an assumption in 
the code that if any given signal already has events counted up in 
signal_state, then it must have updated signal_queue so they will get processed 
soon. But from what I see below, this doesn't seem to be the case always, and 
then all events of a particular signal can end up getting "lost". I think there 
is some timing or logic issue here.

(22 = SIGTTOU)

/* Break on SIGTTOU. There are 805 events in the
Program received signal SIGTTOU, Stopped (tty output).
0x2b369ab6a373 in __epoll_wait_nocancel () from /lib64/libc.so.6
(gdb) print signal_state[22]
$16 = {count = 805, handlers = {n = 0xe1efa80, p = 0xe1efa80}}
(gdb) print signal_queue_len
$17 = 0
(gdb) c
Continuing.
Program received signal SIGTTIN, Stopped (tty input).
0x2b369aac5320 in sigaction () from /lib64/libc.so.6
(gdb) print signal_queue_len
$18 = 0
(gdb) print signal_state[22]
$19 = {count = 806, handlers = {n = 0xe1efa80, p = 0xe1efa80}} <-- the signal has been counted, but it never gets processed
(gdb) c
Continuing.

This is on RHEL5. Reload functionality is the reason we chose haproxy so it's 
really important to us that it works correctly :) Please let me know if any 
more details would be useful.

Thanks and Best Regards,
The information contained in this email is strictly confidential and for the 
use of the addressee only, unless otherwise indicated. If you are not the 
intended recipient, please do not read, copy, use or disclose to others this 
message or any attachment. Please also notify the sender by replying to this 
email or by telephone (+44(020 7896 0011) and then delete the email and any 
copies of it. Opinions, conclusion (etc) that do not relate to the official 
business of this company shall be understood as neither given nor endorsed by 
it. IG is a trading name of IG Markets Limited (a company registered in England 
and Wales, company number 04008957) and IG Index Limited (a company registered 
in England and Wales, company number 01190902). Registered address at Cannon 
Bridge House, 25 Dowgate Hill, London EC4R 2YA. Both IG Markets Limited 
(register number 195355) and IG Index Limited (register number 114059) are 
authorised and regulated by the Financial Conduct Authority.