Hi,

We had the same problems with a cluster of 40 nodes. The motherboard has problems under heavy I/O. We have some test programs that use only the CPU and do little or no I/O, and these run and run without trouble. But with a program like Gaussian, which does a lot of I/O, this can happen. In the end we replaced the motherboards with the S2882.

J. Kabelitz
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of stephen mulcahy
Sent: Wednesday, 21 November 2007 18:28
To: [email protected]
Subject: [Beowulf] Tips for diagnosing intermittent problems on a small cluster

Hi,

As I mentioned in my previous posting, the 20 node Tyan S2891 Dual Opteron dual core Debian cluster (1 NFS providing head node, 19 diskless compute nodes) is currently experiencing 2 intermittent problems which I'm trying to diagnose. After a few days of testing and digging through system logs I'm pretty much stumped as to what may be causing these.

There are 2 separate problems. Anyone's opinions on how to go about diagnosing these problems, or on things I might have missed, would be most welcome.

Problem #1

Over the last 6 months, 3 different nodes have been found in a powered down state. The nodes seem to have powered off during a run of the model. There are no interesting messages in the system logs coinciding with the time of these shutdowns. My first suspect was the power supply to the cluster, but the UPS power system has logged no errors coinciding with these failures.

I've run a bunch of stress testers on the systems that failed, including cpuburn and cpustress, in the hope that a failing component such as a PSU or processor would be triggered again, but all the systems happily ran 24 hours of tests without any problems. 2 of the 3 failing systems are logging some MCE messages, but they seem to be standard memory errors which are being corrected by the system.

Any suggestions on where to go next?

Problem #2

On 2 occasions over the last 6 months, one of the 2 oceanographic models we run on this cluster (ROMS, the other being SWAN) has gone into a state where it is running significantly slower than usual. This seems to have been preceded by us running the other model, but we can't reproducibly get the system into this state.
Looking at various process stats, when the model is in the slowed-down state it goes from about 30% system CPU time and 60% user CPU time to about 60% system CPU time and 30% user CPU time. Again, there is nothing unusual in the logs, nor in the gigabit switch logs. A quick strace of one of the running model processes didn't show anything significantly unusual (although I don't normally sit there watching straces of the model during normal operation, so I could well have missed all sorts of things here).

Again, any suggestions on where to go next on this would be welcome. I'm wondering if I'm seeing some strange kernel-level or MPI-level problem which only manifests under certain conditions, but I can't even guess at this stage what those conditions might be.

Thanks,

-stephen

--
Stephen Mulcahy, Applepie Solutions Ltd., Innovation in Business Center, GMIT, Dublin Rd, Galway, Ireland.
+353.91.751262 http://www.aplpi.com
Registered in Ireland, no. 289353 (5 Woodlands Avenue, Renmore, Galway)
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
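
The shift between user and system CPU time described under Problem #2 could be logged continuously on each node instead of being spotted by hand. The following is only a minimal sketch, assuming a Linux /proc/stat whose first line is the aggregate "cpu" line with counters in the usual order (user, nice, system, idle, iowait, irq, softirq, steal). The function names and the 5 second interval are illustrative and do not come from the thread.

```python
import time

# Counter order on the aggregate "cpu" line of /proc/stat (assumed layout).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Older kernels report fewer columns; treat the missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_split(before, after):
    """Percentage of CPU time per category between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = {k: b[k] - a[k] for k in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("no CPU time elapsed between samples")

    def pct(ticks):
        return 100.0 * ticks / total

    return {
        "user": pct(delta["user"] + delta["nice"]),
        "system": pct(delta["system"] + delta["irq"] + delta["softirq"]),
        "iowait": pct(delta["iowait"]),
        "idle": pct(delta["idle"]),
    }


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (the aggregate over all cores)."""
    with open(path) as f:
        return f.readline()


def sample_once(interval=5.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart and return the split."""
    before = read_cpu_line(path)
    time.sleep(interval)
    return cpu_split(before, read_cpu_line(path))
```

Calling sample_once() periodically on every compute node and writing the result to a log with a timestamp would show when the system share starts to climb, which might help tie the slowdown to whatever ran just before it.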
