I found the problem - the orted wasn't whacking any lingering session directories when it exited. Missing one line...sigh.
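For readers unfamiliar with this class of bug, here is a minimal sketch of the general pattern: a process that creates a scratch directory should remove it on every exit path, including an interrupt. This is not the orted code or the submitted patch; `run_with_session_dir` and the `demo-sessions-` prefix are hypothetical names chosen for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_session_dir(work, prefix="demo-sessions-"):
    """Create a scratch directory, run work(path), and always remove the directory.

    Hypothetical helper for illustration; not part of Open MPI.
    """
    session_dir = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        return work(session_dir)
    finally:
        # The cleanup step. It runs on normal return and when work() is
        # interrupted (for example by KeyboardInterrupt from CTRL-C).
        shutil.rmtree(session_dir, ignore_errors=True)
```

Leaving out the `finally` clause here would reproduce the symptom in miniature: the directory would stay behind whenever the job was interrupted.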
Rolf: I have submitted a patch for the 1.4 branch. Can you please review? It is a trivial fix.

David: Thanks for bringing it to my attention. Sorry for the problem.

Ralph

On Mar 1, 2010, at 2:34 PM, Rolf Vandevaart wrote:

> On 03/01/10 11:51, Ralph Castain wrote:
>> On Mar 1, 2010, at 8:41 AM, David Turner wrote:
>>> On 3/1/10 1:51 AM, Ralph Castain wrote:
>>>> Which version of OMPI are you using? We know that the 1.2 series was
>>>> unreliable about removing the session directories, but 1.3 and above
>>>> appear to be quite good about it. If you are having problems with the 1.3
>>>> or 1.4 series, I would definitely like to know about it.
>>> Oops; sorry! OMPI 1.4.1, compiled with PGI 10.0 compilers,
>>> running on Scientific Linux 5.4, ofed 1.4.2.
>>>
>>> The session directories are *frequently* left behind. I have
>>> not really tried to characterize under what circumstances they
>>> are removed. But please confirm: they *should* be removed by
>>> OMPI.
>> Most definitely - they should always be removed by OMPI. This is the first
>> report we have had of them -not- being removed in the 1.4 series, so it is
>> disturbing.
>> What environment are you running under? Does this happen under normal
>> termination, or under abnormal failures (the more you can tell us, the
>> better)?
>
> Hi Ralph:
>
> It turns out that I am seeing session directories left behind as well with
> v1.4 (r22713) I have not tested any other versions. I believe there are two
> elements that make this reproducible.
> 1. Run across 2 or more nodes.
> 2. CTRL-C out of the MPI job.
>
> Then take a look at the remote nodes and you may see a leftover session
> directory. The mpirun node seems to be clean.
>
> Here is an example using two nodes. I also added some sleeps to the ring_c
> program to slow things down so I could hit CTRL-C.
>
> First, tmp directories are empty:
> [rolfv@burl-ct-x2200-6 ~/examples]$ ls -lt /tmp/openmpi-sessions-rolfv*
> ls: No match.
> [rolfv@burl-ct-x2200-7 ~]$ ls -lt /tmp/openmpi-sessions-rolfv*
> ls: No match.
>
> Now run test:
> [rolfv@burl-ct-x2200-6 ~/examples]$ mpirun -np 4 -host burl-ct-x2200-6,burl-ct-x2200-6,burl-ct-x2200-7,burl-ct-x2200-7 ring_slow_c
> Process 0 sending 10 to 1, tag 201 (4 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> mpirun: killing job...
>
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 3002 on node burl-ct-x2200-6
> exited on signal 0 (Unknown signal 0).
> --------------------------------------------------------------------------
> 4 total processes killed (some possibly by mpirun during cleanup)
> mpirun: clean termination accomplished
>
> [burl-ct-x2200-6:02990] 2 more processes have sent help message
> help-mpi-btl-openib.txt / default subnet prefix
>
> Now check tmp directories:
> [rolfv@burl-ct-x2200-6 ~/examples]$ ls -lt /tmp/openmpi-sessions-rolfv*
> ls: No match.
> [rolfv@burl-ct-x2200-7 ~]$ ls -lt /tmp/openmpi-sessions-rolfv*
> total 8
> drwx------ 3 rolfv hpcgroup 4096 Mar 1 17:27 20007/
>
> Rolf
>
> --
>
> =========================
> rolf.vandeva...@sun.com
> 781-442-3043
> =========================
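The `ls -lt /tmp/openmpi-sessions-rolfv*` check in the quoted report can be scripted so it is easy to repeat on each node after a CTRL-C. The following is a rough sketch, assuming the session directories live under `/tmp` and start with `openmpi-sessions-<user>` as in the listing above; the function name `leftover_session_dirs` and its parameters are hypothetical.

```python
import getpass
import glob
import os


def leftover_session_dirs(user=None, tmp_root="/tmp"):
    """Return directories under tmp_root whose names start with
    'openmpi-sessions-<user>' (the prefix shown in the report above)."""
    user = user or getpass.getuser()
    # Escape the fixed parts so only the trailing '*' acts as a wildcard.
    pattern = os.path.join(
        glob.escape(tmp_root), "openmpi-sessions-" + glob.escape(user) + "*"
    )
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


if __name__ == "__main__":
    # Print one leftover directory per line; no output means the node is clean.
    for path in leftover_session_dirs():
        print(path)
```

Run on each node used by the job, it should print nothing on a clean node. In the report above, that would be the expected result on the mpirun node, with a leftover entry showing up on the remote node.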