Lutz Vieweg <[EMAIL PROTECTED]> wrote:
I'm currently investigating the following problem, which seems to indicate
a misbehaviour of the kernel:
A server program we implemented has been sporadically "hanging" in a select()
call since we upgraded from kernel 2.4 to (currently) 2.6.9 (we have to wait
for 2.6.12 before we can upgrade again, because of the
shared-mem-not-dumped-into-core-files problem addressed there).
What's suspicious is that whenever we attach with gdb to such a hanging process,
we can see that a pipe, whose file-descriptor is definitely included in the
fd_set "readfds" (and "n" is also high enough) has a byte in it available for
reading - and just leaving gdb again is enough to let the server continue just
fine.
We use that pipe, which is known only to this one process, to make select()
return immediately when a signal (SIGUSR1) has been delivered to the process
(by another process). The installed signal handler does nothing but a
(non-blocking) write of 1 byte to the writing end of the pipe.
This mechanism worked fine before kernel 2.6, and it is still working in 99.99%
of the cases, but under heavy load, every few hours, we'll see the hanging
select() as mentioned above.
Following up on my own post (yes, we are still using kernel 2.6.9 and will try
2.6.12 later, but I wanted to share the latest results of my investigation
nevertheless):
We found that when the server process hangs inside the select() call, the
kernel's signal flags for the process indicate a situation in which select()
should indeed return:
The result of
> ps -eo cmd,pid,sig_pend,sig_block,sig_catch,sig_ignore
for the hanging process is:
CMD              PID    SIGNAL            BLOCKED           CATCHED           IGNORED
./csn io_child   10972  0000000000000200  0000000000000000  000000001181764b  0000000000000000
which means that SIGUSR1 is known to be pending: bit 9 of the mask (0x200)
stands for signal 10, which is SIGUSR1 on x86 Linux. SIGUSR1 is of course also
caught, as there's a signal handler installed as described above.
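As a quick check of how these hex masks are read, here is a small sketch. It assumes the usual Linux convention that bit (n - 1) of the mask stands for signal number n, and that SIGUSR1 is signal 10 (true on x86, not on every architecture). The helper name mask_has_signal is hypothetical.

```c
#include <stdint.h>

/* Returns 1 if signal number signo (1..64) is set in a mask as printed
 * by ps, assuming bit (signo - 1) represents signal signo. */
int mask_has_signal(uint64_t mask, int signo)
{
    if (signo < 1 || signo > 64)
        return 0;
    return (int)((mask >> (signo - 1)) & 1u);
}
```

Applied to the values above, mask_has_signal(0x200, 10) and mask_has_signal(0x1181764b, 10) are both 1, that is, signal 10 is both pending and caught.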
Correct me if I'm wrong, but isn't it a clear sign of something being wrong
with select() if it does not return in this situation?
Sending the hanging process another "kill -s SIGUSR1 10972" does not change the
situation, the process keeps hanging and the values printed above do not change.
Sending a different signal or attaching/detaching gdb causes select() to return,
with the pending value returning to 0 as expected.
So my suspicion is that there's a race condition where select() goes to sleep
even though SIGUSR1 has just arrived.
I will follow up once we have been able to upgrade to 2.6.12 or have
significant news. I'm thankful for any ideas on this issue at any time.
Regards,
Lutz Vieweg