Hi All, A handful of times now our bind server (currently at 9.5.0, tracking ubuntu intrepid), has stopped responding to requests. We notice this, try to connect by ssh, which takes an abomidably long time, then then su/sudo to root, which also take a long time. My best guess is that these commands are doing DNS lookups, either directly or indirectly, eg. via syslog(), with signals blocked.
Attempting to gracefully restart bind via initscripts, even as root, doesn't work, since the RNDC channel is also slow. Load on the machine is low (~0.1) and IO seems to be minimal. That's running 2.6.24-21 with preempt, and libc6-2.8. Today, when I finally got a root shell, I saw this: # ps -efL |grep 7134 bind 7134 1 7134 0 5 Nov03 ? 00:00:00 /usr/sbin/named -u bind bind 7134 1 7135 0 5 Nov03 ? 00:01:14 /usr/sbin/named -u bind bind 7134 1 7136 0 5 Nov03 ? 00:01:14 /usr/sbin/named -u bind bind 7134 1 7137 0 5 Nov03 ? 00:00:01 /usr/sbin/named -u bind bind 7134 1 7138 0 5 Nov03 ? 00:00:46 /usr/sbin/named -u bind root 13906 13776 13906 0 1 15:42 pts/2 00:00:00 grep 7134 # ls -l /proc/7134/fd total 0K lrwx------ 1 bind bind 1K 2008-11-04 14:36 0 -> /dev/null lrwx------ 1 bind bind 1K 2008-11-04 14:36 1 -> /dev/null lrwx------ 1 bind bind 1K 2008-11-04 14:36 2 -> /dev/null lrwx------ 1 bind bind 1K 2008-11-04 14:36 20 -> socket:[14669] lrwx------ 1 bind bind 1K 2008-11-04 14:36 21 -> socket:[14674] lrwx------ 1 bind bind 1K 2008-11-04 14:36 22 -> socket:[14676] lrwx------ 1 bind bind 1K 2008-11-04 14:36 23 -> socket:[14678] lrwx------ 1 bind bind 1K 2008-11-04 14:36 24 -> socket:[14680] lrwx------ 1 bind bind 1K 2008-11-04 15:42 25 -> socket:[597051] lrwx------ 1 bind bind 1K 2008-11-04 14:36 3 -> socket:[14662] lrwx------ 1 bind bind 1K 2008-11-04 14:36 4 -> /dev/null lr-x------ 1 bind bind 1K 2008-11-04 14:36 5 -> pipe:[14668] lrwx------ 1 bind bind 1K 2008-11-04 14:36 512 -> socket:[14673] lrwx------ 1 bind bind 1K 2008-11-04 14:36 513 -> socket:[14675] lrwx------ 1 bind bind 1K 2008-11-04 14:36 514 -> socket:[14677] lrwx------ 1 bind bind 1K 2008-11-04 14:36 515 -> socket:[14679] l-wx------ 1 bind bind 1K 2008-11-04 14:36 7 -> pipe:[14668] lr-x------ 1 bind bind 1K 2008-11-04 14:36 8 -> /dev/random # strace -p 7134 Process 7134 attached - interrupt to quit futex(0xb753cbd8, FUTEX_WAIT, 7135, NULL # strace -p 7137 Process 7137 attached - interrupt to quit restart_syscall(<... resuming interrupted call ...> <unfinished ...> # strace -p 7138 Process 7138 attached - interrupt to quit select(626, [5 20 21 22 515], [], NULL, NULL <unfinished ...> Ultimately I killed -9 the process, at which point it restarted with no complications. This *may* happen shortly after I run an rndc reload command (just now, I got rid of a category queries/ channel query pair; I seem to recall the first time this happened I'd done an rndc reload to effect a small zonefile update). Is that pid7138 select() an indication of the file descriptor problem? 626 seems very high. Any ideas what I can do to diagnose that? Thanks, Justin
