Hi,

I've been trying out this logging, and a few variations of my own, on one of 
the RHEL5 (2.6.18-371.9.1.el5) systems that was exhibiting the problem more 
frequently.

I am seeing what you saw: signals queued and processed without any issue. It's 
strange; I can't figure out why the problem isn't reproducible by firing a large 
number of signals from the command line, yet it happens every few days through 
occasional use of the reload command.

I will keep looking, probably by deploying a patched version which logs the 
output below (and a bit more) to syslog, so I can watch for the problem in a 
more "realistic" setting.

Since deploying my workaround patch which traverses the signal_queue instead of 
checking signal_state[sig].count, the problem hasn't been seen in our test 
environment.
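For reference, the shape of that workaround is roughly the following (a 
simplified sketch rather than the exact diff; process_sig() is a hypothetical 
stand-in for whatever invokes the registered handlers for <sig> inside 
__signal_process_queue()):

	/* Workaround sketch: act on every entry in the queue instead of gating
	 * on signal_state[sig].count, so a queued signal whose counter is
	 * somehow observed as zero still gets processed. */
	for (cur_pos = 0; cur_pos < signal_queue_len; cur_pos++) {
		sig  = signal_queue[cur_pos];
		desc = &signal_state[sig];
		process_sig(sig, desc); /* hypothetical: run the handlers registered for <sig> */
		desc->count = 0;
	}
	signal_queue_len = 0;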

Agreed, it's probably a bug which needs fixing from a signal handling point of 
view, and I'm definitely up for trying to find the cause. My thinking about 
queueing reloads on the automation side was only that it seems unnecessary, and 
not optimal for us anyway, for the automation to potentially reload HAProxy 
more than once every few seconds.

For reference, here is the stress loop I ran (with a variation of your logging 
patch that prints the counters for all three signals), and a sample of the 
output:

while true; do kill -QUIT 28292; kill -TTOU 28292; kill -QUIT 28292; kill -TTIN 28292; done

leave loop: signal_queue_len=3, count[SIGTTOU]=0, count[SIGTTIN]=0, count[SIGQUIT]=0
signal handler: signal_queue_len=0, count[SIGTTOU]=0, count[SIGTTIN]=0, count[SIGQUIT]=0
signal handler: signal_queue_len=0, count[SIGTTOU]=0, count[SIGTTIN]=0, count[SIGQUIT]=0
signal handler: signal_queue_len=0, count[SIGTTOU]=0, count[SIGTTIN]=0, count[SIGQUIT]=0
enter loop: signal_queue_len=3, count[SIGTTOU]=1, count[SIGTTIN]=1, count[SIGQUIT]=1
leave loop: signal_queue_len=3, count[SIGTTOU]=0, count[SIGTTIN]=0, count[SIGQUIT]=0

Regards,
Alan

-----Original Message-----
From: Willy Tarreau [mailto:w...@1wt.eu]
Sent: 19 March 2015 15:17
To: Alan Fitton
Cc: haproxy@formilux.org
Subject: Re: HAProxy signal queue not working correctly

Hi Alan,

On Thu, Mar 19, 2015 at 10:56:35AM +0000, Alan Fitton wrote:
> Hi Willy,
>
> Thank you for your reply and your work on HAProxy. I will add some
> instrumentation and hopefully be able to demonstrate your theory. I agree,
> it's the one I had arrived at too :) It seemed unlikely at first since the
> signals are masked inside __signal_process_queue, but it's still possible if
> timing goes against you.

I don't think we're sensitive to timing here. I've done this to try to
reproduce the issue:

diff --git a/src/signal.c b/src/signal.c
index e9301ed..241feac 100644
--- a/src/signal.c
+++ b/src/signal.c
@@ -37,6 +37,7 @@ int signal_pending = 0; /* non-zero if t least one signal remains unprocessed */
  */
 void signal_handler(int sig)
 {
+       fprintf(stderr, "received signal %d\n", sig);
        if (sig < 0 || sig >= MAX_SIGNAL) {
                /* unhandled signal */
                signal(sig, SIG_IGN);
@@ -77,6 +78,7 @@ void __signal_process_queue()
         * handler. That allows real signal handlers to redistribute signals
         * to tasks subscribed to signal zero.
         */
+       fprintf(stderr, "enter loop: signal_queue_len=%d, count[%d]=%d\n", signal_queue_len, SIGUSR1, signal_state[SIGUSR1].count);
        for (cur_pos = 0; cur_pos < signal_queue_len; cur_pos++) {
                sig  = signal_queue[cur_pos];
                desc = &signal_state[sig];
@@ -90,7 +92,9 @@ void __signal_process_queue()
                        }
                        desc->count = 0;
                }
+               sleep(1);
        }
+       fprintf(stderr, "leave loop: signal_queue_len=%d, count[%d]=%d\n", signal_queue_len, SIGUSR1, signal_state[SIGUSR1].count);
        signal_queue_len = 0;

        /* restore signal delivery */

Then I'm doing this:

$ killall -QUIT haproxy;killall -1 haproxy; kill -USR1 $(pidof haproxy) $(pidof haproxy) $(pidof haproxy) $(pidof haproxy) $(pidof haproxy) $(pidof haproxy) $(pidof haproxy) $(pidof haproxy);

(SIGQUIT sends a large response, and SIGHUP talks as well.) The sleep(1)
is there to ensure that signals get delivered and queued before we
re-enable signals. I'm clearly seeing my messages being queued:

received signal 3
enter loop: signal_queue_len=1, count[10]=0
Dumping pools usage. Use SIGQUIT to flush them.
  - Pool pipe (32 bytes) : 5 allocated (160 bytes), 5 used, 4 users [SHARED]
  - Pool hlua_com (48 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool capture (64 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool task (112 bytes) : 2 allocated (224 bytes), 2 used, 1 users [SHARED]
  - Pool uniqueid (128 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool connection (336 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool hdr_idx (416 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool requri (1024 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool session (1072 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool buffer (8064 bytes) : 3 allocated (24192 bytes), 0 used, 1 users [SHARED]
Total: 10 pools, 24576 bytes allocated, 384 used.
leave loop: signal_queue_len=1, count[10]=0
received signal 10
received signal 1
enter loop: signal_queue_len=2, count[10]=1
[WARNING] 077/155017 (7008) : Stopping frontend foo in 0 ms.
received signal 0
[WARNING] 077/155017 (7008) : SIGHUP received, dumping servers states.
[WARNING] 077/155017 (7008) : SIGHUP: Proxy foo has no servers. Conn: act(FE+BE): 0+0, 0 pend (0 unass), tot(FE+BE): 0+0.
leave loop: signal_queue_len=3, count[10]=0
[WARNING] 077/155020 (7008) : Proxy foo stopped (FE: 0 conns, BE: 0 conns).

As you can see, the signals are properly queued and delivered. I suspect the
signals might not be properly blocked on your platform. If you play with the
patch above, you should be able to observe whether the blocking works or not.
If that turns out to be the case, then maybe we'll have to work on a
workaround.

> We have a large number of microservices and are using automation where app
> servers get registered and deregistered from Apache ZooKeeper (for
> autoscaling). The HAProxy configuration is generated based on changes in
> ZooKeeper and reloaded. We don't have reloads in "loops" as such; they are
> usually done infrequently. But during a large release of applications I
> think it's quite possible that haproxy will be sent a few reload commands in
> a very short space of time.

I tried that as well but couldn't trigger the bug, unfortunately :-/

> Maybe I will try to queue these up a little on the automation side.

No, there's no reason to; we need to fix the bug if there is one, or work
around the issue if the problem is in your kernel, for example.

Best regards,
Willy

