Are you talking about running iostat and sending it to a remote syslog periodically?
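If that is the idea, here is a minimal sketch in Python of what I understand it to mean: run iostat on a timer and forward each output line to a syslog server on another machine, so the data survives even if this box hangs. The log host name `loghost.example.edu`, the 60-second interval, and the function names are all placeholders of mine, and it assumes Python 3.7+ plus the sysstat `iostat` on the server.

```python
import logging
import logging.handlers
import subprocess
import time


def make_remote_logger(host, port=514, name="iostat"):
    """Return a logger that forwards records to a remote syslog over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("iostat: %(message)s"))
    logger.addHandler(handler)
    return logger


def report_lines(output):
    """Split command output into non-empty lines, one syslog message each."""
    return [line.rstrip() for line in output.splitlines() if line.strip()]


def sample(command=("iostat", "-x", "1", "2")):
    """Run the command once and return its stdout as text."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def run_forever(host="loghost.example.edu", interval=60):
    """Send one iostat sample to the remote syslog every `interval` seconds."""
    logger = make_remote_logger(host)
    while True:
        for line in report_lines(sample()):
            logger.info(line)
        time.sleep(interval)
```

Calling `run_forever()` from a small wrapper script would start the loop. UDP syslog is fire-and-forget, so a few messages may be lost under load, but it does not block the sender when the network is slow.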
sar shows 31% of the CPU was used for I/O for one minute, 5 minutes before it stopped recording, but the last I/O record (the Average line) shows 1.82%.

Here is the "sar -u" output for the time before the crash:

Linux 2.6.18-92.1.17.el5 (xxxxxxx)      04/29/2009

12:00:01 AM   CPU   %user   %nice  %system  %iowait  %steal   %idle
12:01:01 AM   all   26.00    0.00     1.34     0.13    0.00   72.53
12:02:01 AM   all   25.93    0.00     1.19     0.00    0.00   72.88
12:03:01 AM   all   24.78    0.00     1.01     0.00    0.00   74.20
12:04:01 AM   all   24.67    0.00     0.95     0.00    0.00   74.38
12:05:02 AM   all   25.30    0.00     0.93     0.03    0.00   73.74
12:06:01 AM   all   25.51    0.00     1.06     0.04    0.00   73.39
12:07:01 AM   all   25.45    0.00     1.32     0.00    0.00   73.23
12:08:01 AM   all   26.11    0.00     1.04     0.03    0.00   72.82
12:09:01 AM   all   25.35    0.00     0.98     0.00    0.00   73.66
12:10:01 AM   all   26.89    0.00     2.63     1.09    0.00   69.39
12:11:01 AM   all   26.86    0.00     1.66     0.47    0.00   71.01
12:12:01 AM   all   26.16    0.00     1.42     0.04    0.00   72.38
12:13:01 AM   all   25.88    0.00     1.33     0.00    0.00   72.79
12:14:01 AM   all   26.52    0.00     1.97     0.40    0.00   71.12
12:15:01 AM   all   27.35    0.00     2.18     0.25    0.00   70.22
12:16:01 AM   all   25.17    0.00     1.17     0.05    0.00   73.61
12:17:01 AM   all   26.24    0.00     1.75     0.03    0.00   71.98
12:18:01 AM   all   25.37    0.00     1.43     0.13    0.00   73.07
12:19:01 AM   all   26.60    0.00     1.65     0.02    0.00   71.73
12:20:01 AM   all   26.66    0.00     1.87     0.59    0.00   70.89
12:21:01 AM   all   25.16    0.00     1.25     1.21    0.00   72.38
12:22:01 AM   all   28.26    0.00     1.26     0.42    0.00   70.07
12:23:01 AM   all   26.54    0.00     1.46     1.02    0.00   70.99
12:24:01 AM   all   25.56    0.00     1.64     0.30    0.00   72.50
12:25:01 AM   all   24.87    0.00     9.23    31.85    0.00   34.04
12:26:01 AM   all   28.32    0.00     2.84    15.70    0.00   53.14
12:27:01 AM   all   24.97    0.00     1.17     0.07    0.00   73.80
12:28:02 AM   all   26.20    0.00     1.27     0.23    0.00   72.30
12:29:01 AM   all   27.37    0.00     2.50     0.18    0.00   69.95
12:30:01 AM   all   31.04    0.00     2.65     0.15    0.00   66.16
Average:      all   26.24    0.00     1.80     1.82    0.00   70.14

08:20:06 AM   LINUX RESTART

Ganglia shows the number of running processes spiking sharply to a maximum of 30+. I had to power-cycle the boxes to recover.

Thanks,
Jason

solarflow99 wrote:
> How are the I/O wait states at the time?
> Can you try a netdump to a remote syslog?
>
> On Thu, Apr 30, 2009 at 3:20 PM, Jason Edgecombe
> <ja...@rampaginggeek.com> wrote:
>
> > Hi everyone,
> >
> > I administer Linux servers for a university. Two of our servers have
> > become unresponsive three times (twice on one server) in the past week.
> > These servers are general-purpose timesharing machines and were under
> > a steady load of around 8. We have students running compute jobs for
> > last-minute homework assignments. I know that some students are
> > working on an intro to threading class. The most telling data is that
> > Ganglia shows a load spike of 50 before one of the outages.
> >
> > The servers are Dell PowerEdge 860s with 8GB of RAM and a single
> > quad-core Xeon CPU. The OS is RHEL 5.2 64-bit Desktop.
> >
> > I have the following limits in place:
> > core file size          (blocks, -c) 0
> > data seg size           (kbytes, -d) unlimited
> > scheduling priority             (-e) 0
> > file size               (blocks, -f) unlimited
> > pending signals                 (-i) 16367
> > max locked memory       (kbytes, -l) 32
> > max memory size         (kbytes, -m) unlimited
> > open files                      (-n) 1024
> > pipe size            (512 bytes, -p) 8
> > POSIX message queues     (bytes, -q) 819200
> > real-time priority              (-r) 0
> > stack size              (kbytes, -s) 10240
> > cpu time               (seconds, -t) unlimited
> > max user processes              (-u) 200
> > virtual memory          (kbytes, -v) 2057564
> > file locks                      (-x) unlimited
> >
> > I'm recording sar data once per minute. The only notable things are a
> > peak in context switches before the outage and that the interrupts
> > all go to core 0.
> >
> > How can I prevent the servers from becoming unresponsive even under
> > heavy load?
> >
> > What can I do to troubleshoot further?
> >
> > Thanks,
> > Jason

_______________________________________________
rhelv5-list mailing list
rhelv5-list@redhat.com
https://www.redhat.com/mailman/listinfo/rhelv5-list