I found the problem - the orted wasn't whacking any lingering session 
directories when it exited. Missing one line...sigh.

Rolf: I have submitted a patch for the 1.4 branch. Can you please review? It is 
a trivial fix.

David: Thanks for bringing it to my attention. Sorry for the problem.
Ralph

On Mar 1, 2010, at 2:34 PM, Rolf Vandevaart wrote:

> On 03/01/10 11:51, Ralph Castain wrote:
>> On Mar 1, 2010, at 8:41 AM, David Turner wrote:
>>> On 3/1/10 1:51 AM, Ralph Castain wrote:
>>>> Which version of OMPI are you using? We know that the 1.2 series was 
>>>> unreliable about removing the session directories, but 1.3 and above 
>>>> appear to be quite good about it. If you are having problems with the 1.3 
>>>> or 1.4 series, I would definitely like to know about it.
>>> Oops; sorry!  OMPI 1.4.1, compiled with PGI 10.0 compilers,
>>> running on Scientific Linux 5.4, ofed 1.4.2.
>>> 
>>> The session directories are *frequently* left behind.  I have
>>> not really tried to characterize under what circumstances they
>>> are removed. But please confirm:  they *should* be removed by
>>> OMPI.
>> Most definitely - they should always be removed by OMPI. This is the first 
>> report we have had of them -not- being removed in the 1.4 series, so it is 
>> disturbing.
>> What environment are you running under? Does this happen under normal 
>> termination, or under abnormal failures (the more you can tell us, the 
>> better)?
> 
> Hi Ralph:
> 
> It turns out that I am seeing session directories left behind as well with 
> v1.4 (r22713). I have not tested any other versions. I believe there are two 
> elements that make this reproducible.
> 1. Run across 2 or more nodes.
> 2. CTRL-C out of the MPI job.
> 
> Then take a look at the remote nodes and you may see a leftover session 
> directory.  The mpirun node seems to be clean.
> 
> Here is an example using two nodes.  I also added some sleeps to the ring_c 
> program to slow things down so I could hit CTRL-C.
> 
> First, tmp directories are empty:
> [rolfv@burl-ct-x2200-6 ~/examples]$ ls -lt /tmp/openmpi-sessions-rolfv*
> ls: No match.
> [rolfv@burl-ct-x2200-7 ~]$ ls -lt /tmp/openmpi-sessions-rolfv*
> ls: No match.
> 
> Now run test:
> [rolfv@burl-ct-x2200-6 ~/examples]$ mpirun -np 4 -host 
> burl-ct-x2200-6,burl-ct-x2200-6,burl-ct-x2200-7,burl-ct-x2200-7 ring_slow_c
> Process 0 sending 10 to 1, tag 201 (4 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> mpirun: killing job...
> 
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 3002 on node burl-ct-x2200-6 
> exited on signal 0 (Unknown signal 0).
> --------------------------------------------------------------------------
> 4 total processes killed (some possibly by mpirun during cleanup)
> mpirun: clean termination accomplished
> 
> [burl-ct-x2200-6:02990] 2 more processes have sent help message 
> help-mpi-btl-openib.txt / default subnet prefix
> 
> Now check tmp directories:
> [rolfv@burl-ct-x2200-6 ~/examples]$ ls -lt /tmp/openmpi-sessions-rolfv*
> ls: No match.
> [rolfv@burl-ct-x2200-7 ~]$ ls -lt /tmp/openmpi-sessions-rolfv*
> total 8
> drwx------ 3 rolfv hpcgroup 4096 Mar  1 17:27 20007/
> 
> Rolf
> 
> -- 
> 
> =========================
> rolf.vandeva...@sun.com
> 781-442-3043
> =========================
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
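
The manual check in the quoted message (listing /tmp/openmpi-sessions-rolfv* 
on each node after the CTRL-C) could be scripted roughly as follows. This is 
only a sketch: the function name find_session_dirs is made up for 
illustration, and the /tmp location and the openmpi-sessions-<user> prefix are 
taken from the transcript above, not from any Open MPI API. It checks the 
local node only, so it would have to be run on each remote node separately.

```python
import getpass
import glob
import os


def find_session_dirs(user, tmp_root="/tmp"):
    """Return sorted directories matching <tmp_root>/openmpi-sessions-<user>*.

    The naming pattern is an assumption based on the paths shown in the
    thread; adjust it if your installation uses a different temp location.
    """
    pattern = os.path.join(
        glob.escape(tmp_root),
        "openmpi-sessions-" + glob.escape(user) + "*",
    )
    # Keep directories only; stray files with the same prefix are ignored.
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


if __name__ == "__main__":
    leftovers = find_session_dirs(getpass.getuser())
    if leftovers:
        print("Leftover session directories:")
        for path in leftovers:
            print("  " + path)
    else:
        print("No leftover session directories found.")
```

Run it before and after the job on every node involved; a directory that 
appears only after the CTRL-C corresponds to the leftover seen on 
burl-ct-x2200-7 above.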

