Hi Alan,

On Wed, Mar 18, 2015 at 01:11:32PM +0000, Alan Fitton wrote:
> Basically the signal_queue isn't being updated with a reference to SIGTTOU,
> because signal_state[SIGTTOU].count is > 0. I guess there's an assumption in
> the code that if any given signal already has events counted up in
> signal_state, then it must have updated signal_queue so they will get
> processed soon.

This is indeed what the code does :

    if (!signal_state[sig].count) {
        /* signal was not queued yet */
        if (signal_queue_len < MAX_SIGNAL)
            signal_queue[signal_queue_len++] = sig;
        else
            qfprintf(stderr, "Signal %d : signal queue is unexpectedly full.\n", sig);
    }
    signal_state[sig].count++;

So there's theoretically no way to have a non-zero count value with a zero
signal_queue_len, unless one of these gets corrupted at some point. Also,
__signal_process_queue() seems to properly count these :

    for (cur_pos = 0; cur_pos < signal_queue_len; cur_pos++) {
        sig  = signal_queue[cur_pos];
        desc = &signal_state[sig];
        if (desc->count) {
            struct sig_handler *sh, *shb;
            list_for_each_entry_safe(sh, shb, &desc->handlers, list) {
                if ((sh->flags & SIG_F_TYPE_FCT) && sh->handler)
                    ((void (*)(struct sig_handler *))sh->handler)(sh);
                else if ((sh->flags & SIG_F_TYPE_TASK) && sh->handler)
                    task_wakeup(sh->handler, sh->arg | TASK_WOKEN_SIGNAL);
            }
            desc->count = 0;
        }
    }
    signal_queue_len = 0;

> But from what I see below, this doesn't seem to be the case
> always, and then all events of a particular signal can end up getting "lost".
> I think there is some timing or logic issue here.
>
> (22 = SIGTTOU)
>
> /* Break on SIGTTOU. There are 805 events in the
> Program received signal SIGTTOU, Stopped (tty output).
> 0x00002b369ab6a373 in __epoll_wait_nocancel () from /lib64/libc.so.6
> (gdb) print signal_state[22]
> $16 = {count = 805, handlers = {n = 0xe1efa80, p = 0xe1efa80}}
> (gdb) print signal_queue_len
> $17 = 0

That clearly demonstrates a bug!

Well, thinking about it now, there would be a possibility : if the signal is
delivered while we're in __signal_process_queue(), what you observe could
indeed happen, because we'd miss the desc->count and clear signal_queue_len
afterwards. Could you please try to instrument this function to confirm
whether the issue is there ? If so, we need to use a different set of
variables to process this and protect the loop. I'll try to do something
about it.
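
To make the suspected window concrete: as far as I can tell, a signal
delivered after the loop's last test of cur_pos but before signal_queue_len
is reset gets its count incremented while its queue entry is thrown away, so
every later delivery of that signal skips the queueing step. Below is a
minimal, self-contained sketch of one possible way to protect the loop,
assuming we simply block signal delivery with sigprocmask() while the queue
is walked. All sketch_* names are hypothetical, the handler lists are
replaced by a plain per-signal counter, and this is only a model of the idea,
not the actual fix.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

#define SKETCH_MAX_SIGNAL 256  /* hypothetical bound, stands in for MAX_SIGNAL */

/* Hypothetical stand-ins for signal_state[].count, signal_queue[] and
 * signal_queue_len. */
static volatile sig_atomic_t sketch_count[SKETCH_MAX_SIGNAL];
static int sketch_queue[SKETCH_MAX_SIGNAL];
static volatile sig_atomic_t sketch_queue_len;

/* Events handed to the (omitted) handlers, per signal, for observation. */
static int sketch_delivered[SKETCH_MAX_SIGNAL];

/* Same queueing logic as the quoted handler: queue the signal only on the
 * first pending event, then count every event. */
static void sketch_signal_handler(int sig)
{
    if (sig < 0 || sig >= SKETCH_MAX_SIGNAL)
        return;
    if (!sketch_count[sig]) {
        /* signal was not queued yet */
        if (sketch_queue_len < SKETCH_MAX_SIGNAL)
            sketch_queue[sketch_queue_len++] = sig;
    }
    sketch_count[sig]++;
}

/* Walk the queue with all signals blocked, so that no handler can touch
 * the counters or the queue between the loop and the final reset. Signals
 * raised meanwhile stay pending and are delivered once the previous mask
 * is restored. */
static void sketch_process_queue(void)
{
    sigset_t all, old;
    int pos;

    sigfillset(&all);
    sigprocmask(SIG_SETMASK, &all, &old);

    for (pos = 0; pos < sketch_queue_len; pos++) {
        int sig = sketch_queue[pos];

        assert(sig >= 0 && sig < SKETCH_MAX_SIGNAL);
        if (sketch_count[sig]) {
            /* the real code would run the registered handlers here */
            sketch_delivered[sig] += sketch_count[sig];
            sketch_count[sig] = 0;
        }
    }
    sketch_queue_len = 0;

    sigprocmask(SIG_SETMASK, &old, NULL);
}
```

The trade-off of this approach is two extra system calls per pass over the
queue; processing a private copy of the queue and counters would be another
option, but the copy itself would still need protecting.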

I've already got a report of a reload not working once in a while, but had no
info around it, so I attributed it to a PEBKAC-style issue. If you could share
a reproducer, it would really help. Given your sig count, I guess you send
signals in loops ?

Thanks,
Willy