Hi everyone, I administer Linux servers for a university. Two of our servers have become unresponsive a total of three times (twice on one server) in the past week. These servers are general-purpose timesharing machines and were under a steady load of around 8. We have students running compute jobs for last-minute homework assignments, and I know that some students are working on an intro to threading class. The most telling data is that Ganglia shows a load spike of 50 before one of the outages.
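One way to check whether runaway threads line up with a load spike like this is to log the 1-minute load next to per-user thread counts. The following is only a minimal sketch, not something already running on these servers: it assumes a Python 3 interpreter is available (RHEL 5 ships Python 2.4, so it would need adapting there), it assumes the usual Linux /proc layout, and the function names parse_status, threads_per_user and load_average are made up for illustration.

```python
import os
from collections import Counter


def parse_status(text):
    """Return (real_uid, thread_count) from the text of /proc/<pid>/status."""
    uid, threads = None, 0
    for line in text.splitlines():
        if line.startswith("Uid:"):
            uid = int(line.split()[1])      # first value is the real UID
        elif line.startswith("Threads:"):
            threads = int(line.split()[1])
    return uid, threads


def threads_per_user(proc_root="/proc"):
    """Sum thread counts per UID over every process under proc_root."""
    totals = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue                        # skip non-PID entries
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                uid, threads = parse_status(fh.read())
        except OSError:
            continue                        # process exited while we were scanning
        if uid is not None:
            totals[uid] += threads
    return totals


def load_average(proc_root="/proc"):
    """Return the 1-minute load average from proc_root/loadavg."""
    with open(os.path.join(proc_root, "loadavg")) as fh:
        return float(fh.read().split()[0])


if __name__ == "__main__":
    import pwd                              # Unix-only; used to map UIDs to names

    print("1-minute load:", load_average())
    for uid, count in threads_per_user().most_common(10):
        try:
            name = pwd.getpwuid(uid).pw_name
        except KeyError:
            name = str(uid)                 # UID with no passwd entry
        print(f"{name:<16} {count}")
```

Run from cron once a minute with the output appended to a file, this would show which accounts held the most threads in the minutes before a hang.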
The servers are Dell PowerEdge 860s with 8GB of RAM and a single quad-core Xeon CPU each. The OS is RHEL 5.2 64-bit Desktop. I have the following limits in place:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 16367
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 200
virtual memory          (kbytes, -v) 2057564
file locks                      (-x) unlimited

I'm recording sar data once per minute. The only notable things are a peak in context switches before the outage and that the interrupts all go to core 0.

How can I prevent the servers from becoming unresponsive even under heavy load? What can I do to troubleshoot further?

Thanks,
Jason

_______________________________________________
rhelv5-list mailing list
rhelv5-list@redhat.com
https://www.redhat.com/mailman/listinfo/rhelv5-list