Bugs item #1081608, was opened at 2004-12-08 15:33
Message generated for change (Comment added) made by jsquyres
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=109368&aid=1081608&group_id=9368
Category: Installation
Group: 4.1
>Status: Closed
>Resolution: Fixed
Priority: 9
Submitted By: Bernard Li (bernardli)
Assigned to: Jeff Squyres (jsquyres)
Summary: MPICH test fails first time but passes second time

Initial Comment:
On a freshly built OSCAR install, when you execute the 'Test OSCAR Cluster' step, the MPICH test fails during the initial run but passes subsequently. It appears that LAM is the default MPI implementation during the initial MPICH test, and that seems to be causing the problem. In both LAM's and MPICH's test_user scripts in the testing directory, changes were made to switcher, but I believe the switcher_reload command should also be executed.

----------------------------------------------------------------------

>Comment By: Jeff Squyres (jsquyres)
Date: 2005-02-22 18:55

Message:
Logged In: YES
user_id=11722

I added horrid, horrid spin checking in the test_user scripts for both LAM and MPICH that will repeatedly [essentially] cexec a switcher command out to all nodes, checking to see if the right default has been set. If it hasn't, it'll keep spinning until it is set properly. This seems to fix the problem. Ugh. Stupid NFS.

Committed on both the trunk and the branch.

----------------------------------------------------------------------

Comment By: Jeff Squyres (jsquyres)
Date: 2005-02-15 08:16

Message:
Logged In: YES
user_id=11722

As discussed on the call yesterday, this is an NFS issue. Here's what's happening...

On the head node, the control test script executes a switcher command to change the MPI implementation to MPICH/LAM/whatever. This changes the file $HOME/.switcher.ini (for the user oscartst, in this case). Since this is executed on the head node (which doubles as the NFS server), we're quite sure that the server is aware of the change. The problem is that this change does *not* necessarily propagate out to the client nodes immediately.
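[Editor's sketch: the spin check described in the 2005-02-22 comment above -- repeatedly polling the nodes until switcher reports the right default -- could look roughly like the following. The real test_user scripts use OSCAR's C3 "cexec" and "switcher" tools; the helper function, variable names, and the polled command shown here are assumptions for illustration, not the actual OSCAR code.]

```shell
#!/bin/sh
# wait_for_output CMD WANT TRIES
#   Re-run CMD (a shell command string) until its stdout equals WANT,
#   sleeping one second between attempts, up to TRIES attempts.
#   Returns 0 on success, 1 if WANT never appeared.
wait_for_output() {
    _cmd="$1"; _want="$2"; _tries="$3"
    while [ "$_tries" -gt 0 ]; do
        _got=`eval "$_cmd"`
        if [ "$_got" = "$_want" ]; then
            return 0
        fi
        _tries=`expr "$_tries" - 1`
        [ "$_tries" -gt 0 ] && sleep 1
    done
    return 1
}

# In the real test_user scripts, the polled command would be something
# along the lines of (cexec/switcher invocation assumed, not verbatim):
#
#   wait_for_output "cexec -p switcher mpi --show | sort -u" \
#                   "default=mpich" 60
```

The point of the loop is only to outwait NFS attribute-cache lag; it gives up after a bounded number of attempts rather than hanging the test forever.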
In the specific case we're seeing here, it may even be the difference between the file existing and not existing (e.g., if this was the first MPI test, the file $HOME/.switcher.ini may not yet exist). In cases like this, switcher will fall back to its system default (whatever the user chose when they configured switcher) -- which may be the "wrong" MPI for the specific test.

Specifically, it happens like this:

- switcher command is executed on the head node
- $HOME/.switcher.ini is changed on the head node
- PBS launches a job on some set of cluster nodes
- a script starts running on the first cluster node in the set, triggering shell setup scripts (e.g., $HOME/.bashrc)
- this triggers a complicated sequence of events which eventually invokes switcher and queries $HOME/.switcher.ini, in this case, to find out which MPI to set up
- if $HOME/.switcher.ini is stale (or does not yet exist on the client node(s)), switcher may choose the wrong MPI implementation

So this really isn't a switcher issue at all -- it's an NFS issue. There does not appear to be a good way to force NFS to never have stale files other than completely disabling caching on the clients. Hence, it's pretty hard to guarantee that *all* client nodes have the most recent $HOME/.switcher.ini when a given test is run (if any of them don't, it can cause the test to fail).

I'll see what I can do in the PBS scripts to at least verify that the $HOME/.switcher.ini file is not stale on the nodes before proceeding with the test. And if it's stale, print out a more descriptive error message, or somesuch.

----------------------------------------------------------------------

Comment By: Jeff Squyres (jsquyres)
Date: 2005-02-12 18:08

Message:
Logged In: YES
user_id=11722

Fernando graciously lent me the use of one of his servers to test this out on (more specifically, he made the offer 2-3 weeks ago and I just got around to taking him up on it today :-( ).
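[Editor's sketch: the pre-flight staleness check proposed in the 2005-02-15 comment could compare a checksum of the head node's $HOME/.switcher.ini against each client's view of the same file. The helper below shows the comparison for two local paths; on a real cluster the second checksum would be gathered per node with something like "cexec md5sum $HOME/.switcher.ini" (C3 invocation assumed). This is illustrative, not the actual OSCAR code.]

```shell
#!/bin/sh
# same_checksum FILE_A FILE_B
#   Return 0 iff the two files exist and have identical md5 checksums,
#   i.e., the client's cached copy matches the server's copy.
same_checksum() {
    _a=`md5sum "$1" 2>/dev/null | awk '{ print $1 }'`
    _b=`md5sum "$2" 2>/dev/null | awk '{ print $1 }'`
    [ -n "$_a" ] && [ "$_a" = "$_b" ]
}
```

If any node disagrees (or the file does not exist there yet), the test script could print a descriptive "stale NFS cache on node X" error instead of letting the MPI test fail mysteriously.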
I found that the problem on his machine was that the nodes had /tmp permissions of 0755 instead of 01777. Once I changed this, modules (and therefore switcher) started working properly. However, this doesn't feel like the solution to this problem. Is anyone else still having this problem? And perchance, do they have a machine that I can log in to in order to see this problem?

----------------------------------------------------------------------

Comment By: Bernard Li (bernardli)
Date: 2004-12-13 17:31

Message:
Logged In: YES
user_id=879102

- doc fix for 4.0
- added to release notes for all distros
- should be fixed for 4.1

----------------------------------------------------------------------

_______________________________________________
Oscar-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-devel
