On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> 
wrote:

> i faced a slightly different problem, but it is 100% reproducible:
> - i launch mpirun (no batch manager) from a node with one IB port
> - i use -host node01,node02, where node01 and node02 both have two IB ports
>   on the same subnet

FWIW: 2 IB ports on the same subnet?  That's not a good idea.

> by default, this will hang.

...but it still shouldn't hang.  I wonder if it's somehow related to 
https://svn.open-mpi.org/trac/ompi/ticket/4442...?

> if this is a "feature" (e.g. openmpi does not support this kind of 
> configuration) i am fine with it.
> 
> when i run mpirun --mca btl_openib_if_exclude mlx4_1 and the application 
> completes successfully, it works just fine.
> 
> if the application calls MPI_Abort() /* and even if all tasks call 
> MPI_Abort() */ then it will hang 100% of the time.
> i do not see that as a feature but as a bug.

Yes, OMPI should never hang upon a call to MPI_Abort.

Can you get some stack traces to show where the hung process(es) are stuck?  
That would help Ralph pin down where things aren't working down in ORTE.

> in another thread, Jeff mentioned that the usnic btl is doing stuff even if 
> there is no usnic hardware (this will be fixed shortly).
> Do you still see intermittent hangs without listing usnic as a btl ?

Yeah, there's a definite race in the usnic BTL ATM.  If you care, here's what's 
happening:

- the usnic BTL fires off its connectivity checker, even if there is no usnic 
hardware present
- during the connectivity checker init:
    - local rank 0 on each server will establish a named socket
    - non-local-rank-0 will wait for that named socket to exist

The race is that the local rank 0 may establish the socket (which completes its 
connectivity checker setup), and then realize that there is no usnic hardware, 
so it exits/closes the usnic BTL -- which destroys the named socket.  Hence, if 
the non-local-rank-0's are late to the party, they never see the named socket 
(it has already been created and destroyed) and wait forever for it.  Result: 
hang.

Patch coming today that fixes both of these things:

1. connectivity checker won't be launched unless there is usnic hardware present
2. non-local-rank-0's won't wait indefinitely for the named socket
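
To illustrate the idea behind fix 2, here is a rough sketch in Python.  It is 
not the actual usnic BTL code (which is C).  The names wait_for_socket and 
cc.sock and the timeout values are made up for illustration, and a plain file 
stands in for the named socket.  It only shows a waiter giving up after a 
deadline instead of polling forever:

```python
import os
import tempfile
import time


def wait_for_socket(path, timeout_s=1.0, poll_s=0.01):
    """Poll until `path` exists, but give up after `timeout_s` seconds.

    Returns True if the path showed up in time, False otherwise.
    (Hypothetical helper; not an Open MPI function.)
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_s)
    # One last check so a path created right at the deadline still counts.
    return os.path.exists(path)


def demo():
    with tempfile.TemporaryDirectory() as tmpdir:
        # Assumption: a regular file stands in for the named socket.
        path = os.path.join(tmpdir, "cc.sock")

        # "Local rank 0" creates the rendezvous point, finds no hardware,
        # and destroys it again before anyone else has looked.
        open(path, "w").close()
        os.unlink(path)

        # A late "non-local-rank-0" never sees it.  With an unbounded wait
        # this would hang; with a deadline it returns False instead.
        late = wait_for_socket(path, timeout_s=0.2)

        # If the rendezvous point is still there, the waiter finds it.
        open(path, "w").close()
        on_time = wait_for_socket(path, timeout_s=0.2)

        return late, on_time
```

Calling demo() runs both cases: the late waiter times out rather than 
hanging, and the on-time waiter finds the rendezvous point.  What the caller 
does after a timeout (e.g., disable the BTL) is a separate decision that this 
sketch does not cover.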

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
