On Jun 2, 2014, at 5:03 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> i faced a slightly different problem, but one that is 100% reproducible :
> - i launch mpirun (no batch manager) from a node with one IB port
> - i use -host node01,node02 where node01 and node02 both have two IB ports on
> the same subnet

FWIW: 2 IB ports on the same subnet?  That's not a good idea.

> by default, this will hang.

...but it still shouldn't hang.  I wonder if it's somehow related to https://svn.open-mpi.org/trac/ompi/ticket/4442...?

> if this is a "feature" (e.g. openmpi does not support this kind of
> configuration) i am fine with it.
>
> when i run mpirun --mca btl_openib_if_exclude mlx4_1, then if the application
> is a success, then it works just fine.
>
> if the application calls MPI_Abort() /* and even if all tasks call
> MPI_Abort() */ then it will hang 100% of the time.
> i do not see that as a feature but as a bug.

Yes, OMPI should never hang upon a call to MPI_Abort.

Can you get some stack traces to show where the hung process(es) are stuck?  That would help Ralph pin down where things aren't working down in ORTE.

> in another thread, Jeff mentioned that the usnic btl is doing stuff even if
> there is no usnic hardware (this will be fixed shortly).
> Do you still see intermittent hang without listing usnic as a btl ?

Yeah, there's a definite race in the usnic BTL ATM.  If you care, here's what's happening:

- the usnic BTL fires off its connectivity checker, even if there is no usnic hardware present
- during the connectivity checker init:
   - local rank 0 on each server will establish a named socket
   - non-local-rank-0 will wait for that named socket to exist

The race is that the local rank 0 may establish the socket (which completes its connectivity checker setup), and then realize that there is no usnic hardware, so it exits/closes the usnic BTL -- which destroys the named socket.  Hence, if the non-local-rank-0's are late to the party, they never see the named socket get created and wait forever for it.  Result: hang.
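To make the race concrete, here is a rough Python sketch. It is not the usnic BTL code: a plain file stands in for the named socket, and `wait_for_path`, `simulate_late_waiter`, `simulate_on_time_waiter`, and the `rendezvous` path are hypothetical names made up for illustration. It only shows why an unbounded wait hangs when the creator tears down the rendezvous point early, and how a deadline turns that hang into a detectable failure.

```python
import os
import tempfile
import time


def wait_for_path(path, timeout_s, poll_s=0.01):
    """Poll until `path` exists; give up after `timeout_s` seconds.

    Returns True if the path appeared, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def simulate_late_waiter(timeout_s=0.2):
    """Replay the race: 'local rank 0' creates and removes the rendezvous
    path before the late peer starts looking for it."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical path; a plain file stands in for the named socket.
        path = os.path.join(tmp, "rendezvous")
        # Local rank 0: set up, find no hardware, tear down.
        open(path, "w").close()
        os.remove(path)
        # Late peer: a wait with no deadline would spin forever here;
        # the bounded wait returns False instead.
        return wait_for_path(path, timeout_s)


def simulate_on_time_waiter(timeout_s=0.2):
    """Control case: the rendezvous path still exists when the peer looks."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "rendezvous")
        open(path, "w").close()
        return wait_for_path(path, timeout_s)
```

In the late-waiter case the caller gets a timeout result it can act on (for example, by disabling the component) rather than blocking indefinitely.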
Patch coming today that fixes both of these things:

1. connectivity checker won't be launched unless there is usnic hardware present
2. non-local-rank-0's won't wait indefinitely for the named socket

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/