Hi,

We recently upgraded our caching servers to 9.4.3-P4/P3 (two of them run 9.4.3-P4 and two run 9.4.3-P3). A few days ago I noticed something strange: when a server is under load, some queries randomly fail with SERVFAIL. Only queries whose answers are NOT cached seem to be affected. I've verified with host/dig and tcpdump that this is not a network issue (no unanswered packets).

Digging deeper, I found that the failures appear when the number of sockets used by named approaches roughly 1024 (checked with netstat/lsof). The weirdest part is that if I run "rndc reconfig", named is suddenly able to use more than 1024 sockets (I've seen it using roughly 4000-5000), and the problem goes away for about an hour.
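For anyone who wants to watch the same threshold, here is a minimal sketch of how the socket count can be sampled. It assumes a Linux /proc filesystem (on FreeBSD, lsof or sockstat would be needed instead) and enough privileges to read named's fd directory. count_sockets is a hypothetical helper name, not something shipped with BIND.

```shell
#!/bin/sh
# Count the open file descriptors of a process that are sockets.
# Assumption: Linux-style /proc/<pid>/fd is available and readable.
# count_sockets is a hypothetical helper name used for illustration only.
count_sockets() {
    pid="$1"
    count=0
    for fd in /proc/"$pid"/fd/*; do
        # Each entry is a symlink; sockets resolve to "socket:[inode]".
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            socket:*) count=$((count + 1)) ;;
        esac
    done
    echo "$count"
}

# Example (not run automatically): sample named every 5 seconds.
# pidof is an assumption; substitute however you look up the named PID.
# while sleep 5; do echo "$(date +%T) $(count_sockets "$(pidof named)")"; done
```

Logging this next to the SERVFAIL timestamps should show whether the failures really line up with the count reaching roughly 1024.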
If I downgrade to 9.4.2-P2, the problem goes away.

I used the following command to reproduce the problem:

    for i in {1..100000}; do dig mx www.cnn.com @localhost | grep status | grep -v NOERROR; done

My servers run RHEL 5.4 (2.6.18-164.9.1.el5) and FreeBSD 7.0 (the problem is seen on both). They are split across two unrelated networks in two separate physical locations.

I compiled BIND from the vanilla ISC sources using the following configure command:

    ./configure --enable-threads --enable-largefile --prefix=/usr/local

Because I was seeing the error "general: error: socket: file descriptor exceeds limit (4096/4096)" a couple of days ago, I also tried the following (after raising the OS limits, of course):

    STD_CDEFINES="-DISC_SOCKET_FDSETSIZE=1048576" ./configure --enable-threads --enable-largefile --prefix=/usr/local

My best guess is that the problem is related to the recent move to epoll.

Any ideas on how I should proceed from here?

_______________________________________________
bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users