From: Howell Silverman; December 30, 2003 2:03 PM
>
> >What distro and release are you using? There was a RH release, maybe 7.3
> >or 8, that had a problem like this. IIRC, upgrading the kernel corrected
> >the problem.
>
> >Regardless, please supply the distro release and Oscar version you are
> >using.
>
> ANS:We're using OSCAR version 2.3.1 with RH Linux 8. Obviously we'll try
> an upgrade to the latest.
>
> What is your recommendation - Upgrade OS first or upgrade OSCAR first?

First try to do some diagnostics on the system to see if you can discover a specific source for the problem. If all else fails, try updating the kernel on the headnode and perhaps the nfs-utils.

You can get updated glibc and kernel rpms from any of

Red Hat
ftp://ftp.redhat.com/pub/redhat/linux/updates/8.0/en/os/i686/

Speakeasy (Seattle, WA)
ftp://speakeasy.rpmfind.net/linux/redhat/updates/8.0/en/os/i686/

RPMFind (Cambridge, MA)
ftp://rpmfind.net/linux/redhat/updates/8.0/en/os/i686/

INRIA Rhone-Alpes (Grenoble, France)
ftp://fr.rpmfind.net/linux/redhat/updates/8.0/en/os/i686/

You can get the rest of the updates by changing the "i686" to "i386" in any of the above. I recommend one of the "rpmfind" mirrors, as they are far more responsive. The cc's have a cuny address, so the Boston location may be best, otherwise try speakeasy.

> Is there a procedure for upgrading either or both that will maintain the
> integrity of the cluster?

You only need to update the headnode.

> >Also, what do the entries in the headnode's log look like around the time
> >of the not responding messages in the compute nodes?
>
> ANS:Can you point me to where the log is maintained? My original post
> contained some log messages although I don't know from what log they came
> from as it was provided to me by someone else in direct contact with the
> machine. Do those help?

The extracts don't really help per se, as the server timeout could be a symptom of another problem and the filtering easily masks that by dropping important info.

The first thing is to look at the logs in detail to see if anything else is going on, like problems with the headnode's NICs (running out of resources?) or really high load on the headnode as was suggested below, e.g., is there a huge amount of NFS activity by the other nodes or in toto?

The various log files are in /var/log, specifically, /var/log/messages.
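
To look at the headnode log around the time a compute node reported "not responding", something along these lines could work. This is only a rough Python sketch: the function names, the two-minute window, and the example timestamp are placeholders of mine, and it assumes the classic syslog line format ("Mon DD HH:MM:SS host ...") with English month names, which carries no year.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Return the leading 'Mon DD HH:MM:SS' stamp of a syslog line, or None.

    Classic syslog lines omit the year, so the caller has to supply it.
    Month names are parsed with %b, which assumes an English/C locale.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def entries_near(lines, when, window_seconds=120, year=2003):
    """Collect the log lines stamped within window_seconds of 'when'.

    Nothing is filtered by keyword on purpose: the point is to see
    everything the headnode logged around the timeout, not just NFS lines.
    """
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        stamp = parse_syslog_time(line, year)
        if stamp is not None and abs(stamp - when) <= window:
            hits.append(line.rstrip("\n"))
    return hits


# Possible use on the headnode (reading the log usually requires root).
# The timestamp is a placeholder; take the real one from a compute node's
# "not responding" message.
#
# with open("/var/log/messages") as log:
#     for entry in entries_near(log, datetime(2003, 12, 30, 14, 3, 0)):
#         print(entry)
```

Printing every entry in the window, rather than grepping for a keyword, avoids the filtering problem mentioned above, where the interesting line is the one that got dropped.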

> >Nature of the computational problem and the architecture of the cluster
> >would have a significant bearing on the solution to your problem I think.
> >I would guess offhand it's a bandwidth issue since the NFS does serve
> >requests eventually. If you have a really large cluster or a very file
> >writing intensive computational problem you can tie up the NFS server
> >fairly quickly.
>
> >About how many nodes are you using?
>
> ANS:Master plus 7 nodes.
>
> >Did you check the network switch?
>
> ANS:The switch seems to be operating normally. Cables intact etc.

What about the network traffic?

--
David N. Lombard

My comments represent my opinions, not those of Intel Corporation.

_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users
