Are you talking about running iostat and sending it to a remote syslog
periodically?
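
If that is the idea, here is a rough sketch of how I picture it; treat it as an illustration, not a tested setup. It assumes a Python 3 interpreter on the box, that iostat (sysstat) is installed, and that the remote syslog daemon accepts UDP on port 514. The logger name, the function names and the 60-second interval are my own placeholders.

```python
import logging
import logging.handlers
import subprocess
import time


def make_remote_logger(host, port=514):
    """Return a logger that sends each record to a remote syslog daemon over UDP."""
    logger = logging.getLogger("iostat-forwarder")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.handlers.SysLogHandler(address=(host, port))
        handler.setFormatter(logging.Formatter("iostat: %(message)s"))
        logger.addHandler(handler)
    return logger


def forward_lines(text, logger):
    """Log every non-blank line of text; return the number of lines sent."""
    sent = 0
    for line in text.splitlines():
        if line.strip():
            logger.info(line.rstrip())
            sent += 1
    return sent


def run_forever(host, interval=60):
    """Every `interval` seconds, take one iostat sample and forward it."""
    logger = make_remote_logger(host)
    while True:
        # "iostat -x 1 2" prints the since-boot summary, then one 1-second sample.
        result = subprocess.run(["iostat", "-x", "1", "2"],
                                capture_output=True, text=True, check=False)
        forward_lines(result.stdout, logger)
        time.sleep(interval)
```

Calling run_forever("loghost.example.com") (a made-up host name) would keep the last samples on another machine even if the local disk stops accepting writes before the hang.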

sar shows 31% of the CPU time was spent waiting on I/O for one minute, 5
minutes before it stopped recording, but the last %iowait line (the
average) shows only 1.82%.

Here is the "sar -u" output for the time before the crash:
Linux 2.6.18-92.1.17.el5 (xxxxxxx)      04/29/2009

12:00:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
12:01:01 AM       all     26.00      0.00      1.34      0.13      0.00     72.53
12:02:01 AM       all     25.93      0.00      1.19      0.00      0.00     72.88
12:03:01 AM       all     24.78      0.00      1.01      0.00      0.00     74.20
12:04:01 AM       all     24.67      0.00      0.95      0.00      0.00     74.38
12:05:02 AM       all     25.30      0.00      0.93      0.03      0.00     73.74
12:06:01 AM       all     25.51      0.00      1.06      0.04      0.00     73.39
12:07:01 AM       all     25.45      0.00      1.32      0.00      0.00     73.23
12:08:01 AM       all     26.11      0.00      1.04      0.03      0.00     72.82
12:09:01 AM       all     25.35      0.00      0.98      0.00      0.00     73.66
12:10:01 AM       all     26.89      0.00      2.63      1.09      0.00     69.39
12:11:01 AM       all     26.86      0.00      1.66      0.47      0.00     71.01
12:12:01 AM       all     26.16      0.00      1.42      0.04      0.00     72.38
12:13:01 AM       all     25.88      0.00      1.33      0.00      0.00     72.79
12:14:01 AM       all     26.52      0.00      1.97      0.40      0.00     71.12
12:15:01 AM       all     27.35      0.00      2.18      0.25      0.00     70.22
12:16:01 AM       all     25.17      0.00      1.17      0.05      0.00     73.61
12:17:01 AM       all     26.24      0.00      1.75      0.03      0.00     71.98
12:18:01 AM       all     25.37      0.00      1.43      0.13      0.00     73.07
12:19:01 AM       all     26.60      0.00      1.65      0.02      0.00     71.73
12:20:01 AM       all     26.66      0.00      1.87      0.59      0.00     70.89
12:21:01 AM       all     25.16      0.00      1.25      1.21      0.00     72.38
12:22:01 AM       all     28.26      0.00      1.26      0.42      0.00     70.07
12:23:01 AM       all     26.54      0.00      1.46      1.02      0.00     70.99
12:24:01 AM       all     25.56      0.00      1.64      0.30      0.00     72.50
12:25:01 AM       all     24.87      0.00      9.23     31.85      0.00     34.04
12:26:01 AM       all     28.32      0.00      2.84     15.70      0.00     53.14
12:27:01 AM       all     24.97      0.00      1.17      0.07      0.00     73.80
12:28:02 AM       all     26.20      0.00      1.27      0.23      0.00     72.30
12:29:01 AM       all     27.37      0.00      2.50      0.18      0.00     69.95
12:30:01 AM       all     31.04      0.00      2.65      0.15      0.00     66.16
Average:          all     26.24      0.00      1.80      1.82      0.00     70.14
08:20:06 AM       LINUX RESTART
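
To pick the spike out of output like this without reading every row, something along these lines might help. It is only a sketch: it assumes each sar row sits on a single line as sar prints it, the function name and the 10% threshold are arbitrary choices of mine, and the sample rows are copied from the output above.

```python
SAMPLE = """\
12:24:01 AM       all     25.56      0.00      1.64      0.30      0.00     72.50
12:25:01 AM       all     24.87      0.00      9.23     31.85      0.00     34.04
12:26:01 AM       all     28.32      0.00      2.84     15.70      0.00     53.14
12:27:01 AM       all     24.97      0.00      1.17      0.07      0.00     73.80
"""


def iowait_spikes(text, threshold=10.0):
    """Return (timestamp, %iowait) for each 'all' row at or above threshold."""
    spikes = []
    for line in text.splitlines():
        fields = line.split()
        # Expected row: HH:MM:SS AM|PM all %user %nice %system %iowait %steal %idle
        # The header row (CPU in column 3) and the 8-field Average row are skipped.
        if len(fields) != 9 or fields[2] != "all":
            continue
        iowait = float(fields[6])
        if iowait >= threshold:
            spikes.append((fields[0] + " " + fields[1], iowait))
    return spikes


if __name__ == "__main__":
    for stamp, iowait in iowait_spikes(SAMPLE):
        print(stamp, iowait)
```

On the rows above it reports only the 12:25:01 and 12:26:01 samples, which are the two minutes where %iowait jumped.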

Ganglia shows the number of running processes spiking sharply to a max of 30+.

I had to power-cycle the boxes to recover.

Thanks,
Jason

solarflow99 wrote:
> How are the I/O wait states at the time? Can you try a netdump to a
> remote syslog?
>
>
> On Thu, Apr 30, 2009 at 3:20 PM, Jason Edgecombe
> <ja...@rampaginggeek.com> wrote:
>
>     Hi everyone,
>
>     I administer Linux servers for a university. Two of our servers have
>     become unresponsive a total of three times (twice on one server) in the
>     past week. These servers are general-purpose timesharing machines and
>     were under a steady load of around 8. We have students running compute
>     jobs for last-minute homework assignments. I know that some students
>     are working on an intro to threading class. The most telling data is
>     that Ganglia shows a load spike of 50 before one of the outages.
>
>     The servers are Dell PowerEdge 860s with 8GB of RAM and a single
>     quad-core Xeon CPU. The OS is RHEL 5.2 64-bit Desktop.
>
>     I have the following limits in place:
>     core file size          (blocks, -c) 0
>     data seg size           (kbytes, -d) unlimited
>     scheduling priority             (-e) 0
>     file size               (blocks, -f) unlimited
>     pending signals                 (-i) 16367
>     max locked memory       (kbytes, -l) 32
>     max memory size         (kbytes, -m) unlimited
>     open files                      (-n) 1024
>     pipe size            (512 bytes, -p) 8
>     POSIX message queues     (bytes, -q) 819200
>     real-time priority              (-r) 0
>     stack size              (kbytes, -s) 10240
>     cpu time               (seconds, -t) unlimited
>     max user processes              (-u) 200
>     virtual memory          (kbytes, -v) 2057564
>     file locks                      (-x) unlimited
>
>     I'm recording sar data once per minute. The only notable things are a
>     peak of context switches before the outage and that the interrupts all
>     go to core 0.
>
>     How can I prevent the servers from becoming unresponsive even under
>     heavy load?
>
>     What can I do to troubleshoot further?
>
>     Thanks,
>     Jason
>
>     _______________________________________________
>     rhelv5-list mailing list
>     rhelv5-list@redhat.com
>     https://www.redhat.com/mailman/listinfo/rhelv5-list