Hi Willy,

Thank you for your reply and your work on HAProxy. I will add some 
instrumentation and hopefully be able to demonstrate your theory. I agree, it's 
the one I had arrived at too :) It seemed unlikely at first since the signals 
are masked inside __signal_process_queue, but it's still possible if timing 
goes against you.

We have a large number of microservices and are using automation where app 
servers get registered and deregistered from an apache zookeeper (for 
autoscaling). HAProxy configuration is generated based off changes to zookeeper 
and reloaded. We don't have reloads in "loops" as such, they are usually done 
infrequently. But, during a large release of applications I think it's quite 
possible that haproxy will be send a few reload commands in a very short space 
of time. Maybe I will try to queue these up a little on the automation side.

Regards,
Alan

-----Original Message-----
From: Willy Tarreau [mailto:w...@1wt.eu]
Sent: 19 March 2015 07:27
To: Alan Fitton
Cc: haproxy@formilux.org
Subject: Re: HAProxy signal queue not working correctly

Hi Alan,

On Wed, Mar 18, 2015 at 01:11:32PM +0000, Alan Fitton wrote:
> Basically the signal_queue isn't being updated with a reference to SIGTTOU,
> because signal_state[SIGTTOU].count is > 0. I guess there's an assumption in
> the code that if any given signal already has events counted up in
> signal_state, then it must have updated signal_queue so they will get
> processed soon.

This is indeed what the code does :

        if (!signal_state[sig].count) {
                /* signal was not queued yet */
                if (signal_queue_len < MAX_SIGNAL)
                        signal_queue[signal_queue_len++] = sig;
                else
                        qfprintf(stderr, "Signal %d : signal queue is 
unexpectedly full.\n", sig);
        }

        signal_state[sig].count++;

So there's theorically no way to have a non-zero count value
with a zero signal_queue_len, unless one of these gets corrupted
at some point.

Also, __signal_process_queue() seems to properly count these :

        for (cur_pos = 0; cur_pos < signal_queue_len; cur_pos++) {
                sig  = signal_queue[cur_pos];
                desc = &signal_state[sig];
                if (desc->count) {
                        struct sig_handler *sh, *shb;
                        list_for_each_entry_safe(sh, shb, &desc->handlers, 
list) {
                                if ((sh->flags & SIG_F_TYPE_FCT) && sh->handler)
                                        ((void (*)(struct sig_handler 
*))sh->handler)(sh);
                                else if ((sh->flags & SIG_F_TYPE_TASK) && 
sh->handler)
                                        task_wakeup(sh->handler, sh->arg | 
TASK_WOKEN_SIGNAL);
                        }
                        desc->count = 0;
                }
        }
        signal_queue_len = 0;

> But from what I see below, this doesn't seem to be the case
> always, and then all events of a particular signal can end up getting "lost".
> I think there is some timing or logic issue here.
>
> (22 = SIGTTOU)
>
> /* Break on SIGTTOU. There are 805 events in the
> Program received signal SIGTTOU, Stopped (tty output).
> 0x00002b369ab6a373 in __epoll_wait_nocancel () from /lib64/libc.so.6
> (gdb) print signal_state[22]
> $16 = {count = 805, handlers = {n = 0xe1efa80, p = 0xe1efa80}}
> (gdb) print signal_queue_len
> $17 = 0

That clearly demonstrates a bug! Well, thinking about it now, there would
be a possibility : if the signal is delivered while we're in
__signal_process_queue(), what you observe could indeed happen, because
we'd miss the desc->count and clear signal_queue_len afterwards.

Could you please try to instrument this function to confirm if the issue
is there ? If so we need to use a different set of variables to process
this and protect the loop.

I'll try to do something about it. I've already got a report of a reload
not working once in a while but had no info around it so I attributed it
to a PEBKAC-style issue. If you could share a reproducer, it would really
help. Given your sig count, I guess you send signals in loops ?

Thanks,
Willy

The information contained in this email is strictly confidential and for the 
use of the addressee only, unless otherwise indicated. If you are not the 
intended recipient, please do not read, copy, use or disclose to others this 
message or any attachment. Please also notify the sender by replying to this 
email or by telephone (+44(020 7896 0011) and then delete the email and any 
copies of it. Opinions, conclusion (etc) that do not relate to the official 
business of this company shall be understood as neither given nor endorsed by 
it. IG is a trading name of IG Markets Limited (a company registered in England 
and Wales, company number 04008957) and IG Index Limited (a company registered 
in England and Wales, company number 01190902). Registered address at Cannon 
Bridge House, 25 Dowgate Hill, London EC4R 2YA. Both IG Markets Limited 
(register number 195355) and IG Index Limited (register number 114059) are 
authorised and regulated by the Financial Conduct Authority.

Reply via email to