Sorry for the cross-post, but I wasn't sure which audience should get this. I run a web/email/name server that pushes about 0.5-1.0 Mbps. Memory usage is always below 50%, and CPU usage is minimal (below 10%). I have a custom script that feeds data into rrdtool, so I can see my CPU/memory usage up to within 5 minutes of the crash.
However, every morning I run a Perl script that processes all of my Apache logs into webalizer. I believe it is this script that causes several errors in /var/log/daemon.log (cron.daily runs at 06:25:00):

May 1 06:32:40 a-web inetd[17351]: getpwnam: mail: No such user
May 1 06:32:42 a-web inetd[17354]: getpwnam: cyrus: No such user
<-- several logs removed, just named information -->
May 1 08:45:55 a-web inetd[13332]: execv /usr/sbin/exim: Too many open files in system
May 1 08:45:56 a-web inetd[13336]: execv /usr/sbin/exim: Too many open files in system
May 1 08:46:24 a-web pop3d[13433]: connect from x.x.x.x
May 1 08:46:24 a-web pop3d[13433]: error: cannot execute /usr/sbin/pop3d: Too many open files in system
May 1 08:48:23 a-web proftpd[13841]: connect from x.x.x.x
May 1 08:48:23 a-web proftpd[13841]: error: cannot execute /usr/sbin/proftpd: Too many open files in system
May 1 08:49:26 a-web pop3d[14036]: connect from x.x.x.x
May 1 08:49:26 a-web pop3d[14036]: error: cannot execute /usr/sbin/pop3d: Too many open files in system
May 1 08:49:35 a-web pop3d[14068]: connect from x.x.x.x
May 1 08:49:35 a-web pop3d[14068]: error: cannot execute /usr/sbin/pop3d: Too many open files in system
May 1 08:50:05 a-web pop3d[14164]: connect from x.x.x.x
May 1 08:50:26 a-web pop3d[14225]: connect from x.x.x.x
May 1 08:50:26 a-web pop3d[14225]: error: cannot execute /usr/sbin/pop3d: Too many open files in system
May 1 08:51:05 a-web pop3d[14346]: connect from x.x.x.x
May 1 08:51:05 a-web pop3d[14346]: error: cannot execute /usr/sbin/pop3d: Too many open files in system
May 1 15:51:14 a-web inetd[14372]: getpwnam: mail: No such user
May 1 15:51:27 a-web inetd[14393]: getpwnam: cyrus: No such user
May 1 15:51:31 a-web inetd[14400]: getpwnam: cyrus: No such user
May 1 15:51:44 a-web inetd[14419]: getpwnam: cyrus: No such user
May 1 15:51:50 a-web inetd[14430]: getpwnam: cyrus: No such user

During this time, websites are still available, but FTP, SSH, and email are down.
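For what it's worth, my understanding is that "Too many open files in system" (ENFILE) refers to the kernel-wide file table, governed by /proc/sys/fs/file-max, which is separate from the per-process open-files limit that ulimit -n reports. A sketch of checking (and, as root, raising) it; the value 8192 is just an example:

```shell
# System-wide file-table usage: the kernel reports three numbers in
# file-nr (allocated handles, free handles, system-wide maximum).
cat /proc/sys/fs/file-nr

# The system-wide ceiling itself:
cat /proc/sys/fs/file-max

# Raising it takes effect immediately, no reboot or recompile needed
# (must be root; 8192 is an example value, not a recommendation):
# echo 8192 > /proc/sys/fs/file-max
```

If the allocated count in file-nr climbs toward file-max during the cron.daily run, that would explain the ENFILE errors regardless of the per-process ulimit.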
I don't know why there is a 7-hour jump in the logs (I did not remove anything between 08:51 and 15:51). When I got the machine rebooted, it wasn't even 9 am (and this message will be sent by 10:30 am). Later this morning, named says:

May 1 09:10:51 a-web named[234]: limit files set to fdlimit (1024)

However, I have changed both:

/usr/src/(kernel version)/include/linux/limits.h
/usr/include/linux/limits.h

to have:

#define NR_OPEN 2048

I then packaged up the kernel using make-kpkg kernel_image and installed it. I have also changed my /etc/security/limits.conf to have:

* soft nofile 2048
* hard nofile 2048

My ulimit -a reports:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 2048
pipe size (512 bytes, -p) 8
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 7168
virtual memory (kbytes, -v) unlimited

So according to ulimit I have effectively doubled my open-file limit (now 2048, where the default is 1024)... but you can see that at least named still thinks the limit is 1024. Anyways, even after changing this limit, it continues to crash every morning. If I run the webalizer script from a shell, it does not crash. In fact, if I run the cron.daily scripts one at a time, the server doesn't crash either.

Info that might be helpful:

Debian version: stable woody 3.0
kernel version: 2.4.18
libc6 version: 2.2.5-11.5
apache version: 1.3.26-0woody3
bind version: 8.3.3-2.0woody
webalizer version: 2.01.10-2
cronolog version: 1.6.1-0.1
ram: 1 gigabyte
swap: 1 gigabyte
cpu: Intel Pentium 3 1.3 GHz

So my questions are:

1) How do I fix this situation? :)
2) Is there a way to see the current number of files that are really open? lsof reports all open sockets, etc., so I'm not sure how. I'm thinking that if I could capture the processes and their open files every couple of seconds, maybe I could determine the problem.
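Since running the cron.daily scripts one at a time doesn't crash the box, one idea is to run them that way while recording the kernel's file-table usage around each, to pin any descriptor growth on a particular script. A sketch (trace_scripts is just an illustrative name, not an existing tool, and it assumes procfs is mounted):

```shell
#!/bin/sh
# Run each executable script in a directory one at a time, logging the
# kernel's open-file counts (allocated/free/max from file-nr) before
# and after each, so a descriptor leak shows up next to one script.
trace_scripts() {
    dir=$1
    for script in "$dir"/*; do
        [ -x "$script" ] || continue
        before=$(cat /proc/sys/fs/file-nr)
        "$script"
        after=$(cat /proc/sys/fs/file-nr)
        echo "$script: before=[$before] after=[$after]"
    done
}

# Usage (as root): trace_scripts /etc/cron.daily
```

If the allocated count jumps after one script and never comes back down, that script (or something it spawns) is the likely leak.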
I'm guessing the proper way is:

lsof | grep REG | wc -l
1809

which is a little odd, because the limit used to be 1024 and we didn't have problems during the day, so that count is probably not the right one.

Thanks,
Matthew Walkup
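P.S. Grepping lsof for REG also misses sockets and pipes, which consume file-table slots too; the kernel's own count in /proc/sys/fs/file-nr may answer question 2 more directly. A minimal sampler to run from a root cron job every minute or two across the 06:25 window (the log path is just a guess on my part):

```shell
#!/bin/sh
# Append one timestamped sample of the kernel file-table usage
# (allocated, free, maximum handles) to a log. The default path is an
# assumption; somewhere under /var/log would be typical for root.
LOG=${LOG:-/tmp/file-nr.log}
echo "$(date '+%b %e %T') $(cat /proc/sys/fs/file-nr)" >> "$LOG"
```

Comparing the samples before and after the crash window should show whether the allocated count really hits the system-wide maximum.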