Bugs item #1081608, was opened at 2004-12-08 15:33
Message generated for change (Comment added) made by jsquyres
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=109368&aid=1081608&group_id=9368

Category: Installation
Group: 4.1
>Status: Closed
>Resolution: Fixed
Priority: 9
Submitted By: Bernard Li (bernardli)
Assigned to: Jeff Squyres (jsquyres)
Summary: MPICH test fails first time but passes second time

Initial Comment:
On a freshly built OSCAR install, you execute the 'Test 
OSCAR Cluster' step, MPICH test will fail during the initial 
run but will pass subsequently.

It appears that LAM is the default MPI implementation 
during the initial MPICH test and that seems to be 
causing the problem.

In both LAM & MPICH's test_user script in testing 
directory, changes were made to switcher but I believe 
the switcher_reload command should be executed.

----------------------------------------------------------------------

>Comment By: Jeff Squyres (jsquyres)
Date: 2005-02-22 18:55

Message:
Logged In: YES 
user_id=11722

I added horrid, horrid spin checking in the test_user scripts for both LAM 
and MPICH that will repeatedly [essentially] cexec a switcher command out 
to all nodes, checking to see if the right default has been set.  If it hasn't, 
it'll keep spinning until it is set properly.

This seems to fix the problem.  Ugh.  Stupid NFS.

Committed on both the trunk and the branch.
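The spin check described above can be sketched roughly like this.  This is a 
minimal shell sketch, not the actual test_user code: the real query would be 
something along the lines of cexec'ing a switcher command to the nodes, which 
is replaced here by a generic command argument.

```shell
# Hypothetical sketch of the spin check: re-run a query command until its
# output matches the expected default MPI, up to a bounded number of tries.
# In the real test_user scripts the query would be a cexec'd switcher
# command; here it is whatever is passed in as "$@".
wait_for_default() {
  expected="$1"; shift
  tries=0
  while [ "$tries" -lt 30 ]; do
    actual=$("$@")
    if [ "$actual" = "$expected" ]; then
      return 0            # the right default is visible; stop spinning
    fi
    tries=$((tries + 1))
    sleep 1               # give NFS a moment to propagate the change
  done
  return 1                # never converged; let the caller report an error
}

# Trivial usage: the query here trivially agrees on the first try.
wait_for_default mpich echo mpich && echo "default is set"
```

The bounded retry count matters: without it, a node that never sees the update 
would hang the test forever instead of failing it.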

----------------------------------------------------------------------

Comment By: Jeff Squyres (jsquyres)
Date: 2005-02-15 08:16

Message:
Logged In: YES 
user_id=11722

As discussed on the call yesterday, this is an NFS issue.  Here's what's 
happening...

On the head node, the control test script executes a switcher command to 
change the MPI implementation to MPICH/LAM/whatever.  This changes 
the file $HOME/.switcher.ini (for the user oscartst, in this case).  Since this 
is executed on the head node (which doubles as the NFS server), we're 
quite sure that the server is aware of the change.

The problem is that this change does *not* necessarily propagate out to 
the client nodes immediately.  In the specific case we're seeing here, it 
may even be the difference between the file existing and not existing 
(e.g., if this was the first MPI test, the file $HOME/.switcher.ini may not yet 
exist).  In cases like this, switcher will fall back to its system default 
(whatever the user chose when they configured switcher) -- which may 
be the "wrong" MPI for the specific test.

Specifically, it happens like this:

- switcher command is executed on the head node
- $HOME/.switcher.ini is changed on the head node
- PBS launches a job on some set of cluster nodes
- a script starts running on the first cluster node in the set, triggering 
shell setup scripts (e.g., $HOME/.bashrc).
- this triggers a complicated sequence of events which eventually invokes 
switcher and queries $HOME/.switcher.ini, in this case, to find out which 
MPI to set up
- if $HOME/.switcher.ini is stale (or does not yet exist on the client 
node(s)), switcher may choose the wrong MPI implementation
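The "stale or does not yet exist" case in the last step can be tested 
mechanically.  A minimal sketch under an assumption: in reality the client's 
copy would be read via cexec/rsh from each node, whereas here both paths are 
just local files.

```shell
# Sketch: decide whether a client node's view of .switcher.ini is stale by
# comparing checksums against the head node's authoritative copy (the head
# node doubles as the NFS server, so its copy is the truth).
# A missing client copy counts as stale as well.
is_stale() {
  ref="$1"     # head node's copy
  client="$2"  # the file as seen from a client node
  [ -f "$client" ] || return 0                       # not there yet: stale
  [ "$(md5sum < "$ref")" != "$(md5sum < "$client")" ]
}
```

A checksum comparison catches both failure modes at once: the file that has 
not appeared yet and the file whose cached contents lag the server.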

So this really isn't a switcher issue at all -- it's an NFS issue.

There does not appear to be a good way to force NFS to never have stale 
files other than completely disabling caching on the clients.  Hence, it's 
pretty hard to guarantee that *all* client nodes have the most recent 
$HOME/.switcher.ini when a given test is run (if any of them don't, it can 
cause the test to fail).

I'll see what I can do in the PBS scripts to at least verify that the 
$HOME/.switcher.ini file is not stale on the nodes before proceeding with 
the test.  And if it's stale, print out a more descriptive error message, or 
somesuch.

----------------------------------------------------------------------

Comment By: Jeff Squyres (jsquyres)
Date: 2005-02-12 18:08

Message:
Logged In: YES 
user_id=11722

Fernando graciously lent me the use of one of his servers to test this 
out on (more specifically, he made the offer 2-3 weeks ago and I just got 
around to taking him up on it today :-( ).

I found that the problem on his machine was that the nodes had /tmp 
permissions of 0755 instead of 01777.  Once I changed this, modules (and 
therefore switcher) started working properly.
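For reference, the permission check and fix look like this.  A sketch only: 
`stat -c '%a'` is GNU coreutils syntax, and the function name is made up here.

```shell
# Sketch: verify a directory has mode 01777 (world-writable with the sticky
# bit), which modules -- and therefore switcher -- rely on for /tmp.
needs_fix() {
  [ "$(stat -c '%a' "$1")" != "1777" ]
}

# On the affected nodes the fix was simply:
#   chmod 1777 /tmp
```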

However, this doesn't feel like the solution to this problem.  Is anyone else 
still having this problem?   And perchance, do they have a machine that I 
can login to in order to see this problem?

----------------------------------------------------------------------

Comment By: Bernard Li (bernardli)
Date: 2004-12-13 17:31

Message:
Logged In: YES 
user_id=879102

- doc fix for 4.0
- added to release notes for all distros
- should be fixed for 4.1

----------------------------------------------------------------------

_______________________________________________
Oscar-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-devel
