From: Howell Silverman; December 30, 2003 2:03 PM
> 
> >What distro and release are you using?  There was a RH release, maybe 7.3
> or 8, that had a problem like this.  IIRC, upgrading the kernel corrected
> the problem.
> 
> >Regardless, please supply the distro release and Oscar version you are
> using.
> 
> ANS:We're using OSCAR version 2.3.1 with RH Linux 8.  Obviously we'll try
> an upgrade to the latest.
> 
> What is your recommendation - Upgrade OS first or upgrade OSCAR first?

First try to do some diagnostics on the system to see if you can
discover a specific source for the problem. If all else fails, try
updating the kernel on the headnode and perhaps the nfs-utils.  You can
get updated glibc and kernel rpms from any of:

Red Hat
 ftp://ftp.redhat.com/pub/redhat/linux/updates/8.0/en/os/i686/


Speakeasy (Seattle, WA)
 ftp://speakeasy.rpmfind.net/linux/redhat/updates/8.0/en/os/i686/

RPMFind (Cambridge, MA)
 ftp://rpmfind.net/linux/redhat/updates/8.0/en/os/i686/

INRIA Rhone-Alpes (Grenoble, France)
 ftp://fr.rpmfind.net/linux/redhat/updates/8.0/en/os/i686/

You can get the rest of the updates by changing the "i686" to "i386" in
any of the above.  I recommend one of the "rpmfind" mirrors, as they are
far more responsive.  The cc's have a CUNY address, so the Boston-area
mirror (rpmfind.net in Cambridge, MA) may be best; otherwise try Speakeasy.
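
As a small illustration of the "i686" to "i386" substitution described
above, here is a Python sketch that derives the i386 directory from each
of the listed i686 URLs.  The names I686_MIRRORS and i386_url are made up
for this example; only the URLs come from the list above.

```python
# Mirror URLs for the i686 updates, copied from the list above.
I686_MIRRORS = [
    "ftp://ftp.redhat.com/pub/redhat/linux/updates/8.0/en/os/i686/",
    "ftp://speakeasy.rpmfind.net/linux/redhat/updates/8.0/en/os/i686/",
    "ftp://rpmfind.net/linux/redhat/updates/8.0/en/os/i686/",
    "ftp://fr.rpmfind.net/linux/redhat/updates/8.0/en/os/i686/",
]


def i386_url(i686_url):
    """Return the matching i386 directory for an i686 updates URL."""
    # Split on the last "/i686/" so only the architecture directory changes.
    head, sep, tail = i686_url.rpartition("/i686/")
    if not sep:
        raise ValueError("not an i686 updates URL: " + i686_url)
    return head + "/i386/" + tail


if __name__ == "__main__":
    for url in I686_MIRRORS:
        print(i386_url(url))
```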

> Is there a procedure for upgrading either or both that will maintain the
> integrity of the cluster?

You only need to update the headnode.

> >Also, what do the entries in the headnode's log look like around the time
> of the not responding messages in the compute nodes?
> 
> ANS:Can you point me to where the log is maintained?  My original post
> contained some log messages although I don't know what log they came
> from as it was provided to me by someone else in direct contact with the
> machine. Do those help?

The extracts don't really help per se, as the server timeout could be a
symptom of another problem and the filtering easily masks that by
dropping important info.  The first thing is to look at the logs in
detail to see if anything else is going on, like problems with the
headnode's NICs (running out of resources?) or really high load on the
headnode as was suggested below, e.g., is there a huge amount of NFS
activity by the other nodes or in toto?

The various log files are in /var/log, specifically, /var/log/messages.
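
One way to read the log in context rather than as filtered extracts is
sketched below in Python.  It prints every line within a few lines of a
match, so whatever surrounds the event stays visible.  The function name
lines_with_context, the default search text "not responding", and the
context size of 5 are assumptions for illustration; on the headnode a
more useful search string may be a timestamp prefix copied from the
compute nodes' messages.  Reading /var/log/messages normally requires
root.

```python
import sys


def lines_with_context(lines, needle="not responding", context=5):
    """Return the lines within `context` lines of any line containing needle."""
    keep = set()
    for i, line in enumerate(lines):
        if needle in line:
            first = max(0, i - context)
            last = min(len(lines), i + context + 1)
            keep.update(range(first, last))
    # Emit each kept line once, in original order.
    return [lines[i] for i in sorted(keep)]


if __name__ == "__main__":
    # Usage (hypothetical script name): python logctx.py [needle] [logfile]
    needle = sys.argv[1] if len(sys.argv) > 1 else "not responding"
    path = sys.argv[2] if len(sys.argv) > 2 else "/var/log/messages"
    with open(path, errors="replace") as log:
        for line in lines_with_context(log.read().splitlines(), needle):
            print(line)
```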

> >Nature of the computational problem and the architecture of the cluster
> >would have a significant bearing on the solution to your problem I think.
> >I would guess offhand it's a bandwidth issue since the NFS does serve
> >requests eventually.  If you have a really large cluster or a very file
> >writing intensive computational problem you can tie up the NFS server
> >fairly quickly.
> 
> >About how many nodes are you using?
> 
> ANS:Master plus 7 nodes.
> 
> >Did you check the network switch?
> 
> ANS:The switch seems to be operating normally. Cables intact etc.

What about the network traffic?
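
For a first look at traffic and NIC errors on the headnode, the
per-interface counters in /proc/net/dev are one place to start.  The
Python sketch below parses that file; it assumes the usual Linux layout
(two header lines, then eight receive and eight transmit counters per
interface), and the function name parse_net_dev is made up for this
example.  Non-zero and growing errs or drop counts would point at the
NIC or driver rather than at NFS itself.

```python
def parse_net_dev(text):
    """Parse the text of /proc/net/dev into {interface: counters}."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        # Some kernels print "eth0:1234" with no space after the colon.
        name, data = line.split(":", 1)
        fields = [int(x) for x in data.split()]
        stats[name.strip()] = {
            "rx_bytes": fields[0], "rx_packets": fields[1],
            "rx_errs": fields[2], "rx_drop": fields[3],
            "tx_bytes": fields[8], "tx_packets": fields[9],
            "tx_errs": fields[10], "tx_drop": fields[11],
        }
    return stats


if __name__ == "__main__":
    with open("/proc/net/dev") as f:
        for name, c in sorted(parse_net_dev(f.read()).items()):
            print(name, c)
```

Running it twice a few seconds apart and comparing the byte counters
gives a rough idea of how much traffic each interface is carrying.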

-- 
David N. Lombard

My comments represent my opinions, not those of Intel Corporation.

