>> 1: After a reboot of two nodes I ran again, and the inter-node freeze didn't
>happen until the third iteration. I take that to mean that the basic
>communication works, but that something is saturating. Is there some notion
>of buffer size somewhere in the MPI system that could explain this?
>
>Hmm.  This is not a good sign; it somewhat indicates a problem with your OS.
>Based on this email and your prior emails, I'm guessing you're using TCP for
>communication, and that the problem is based on inter-node communication
>(e.g., the problem would occur even if you only run 1 process per machine,
>but does not occur if you run all N processes on a single machine, per your #4,
>below).
>

I agree with Jeff here.  Open MPI establishes connections lazily and 
round-robins through the available interfaces.
So the first few communications can work because they happen to use interfaces 
that can reach the other node, but the third iteration picks an interface that 
for some reason cannot establish the connection.
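
If an interface does turn out to be the culprit, you can tell the TCP BTL which 
interfaces to use (or avoid) with the btl_tcp_if_include / btl_tcp_if_exclude 
MCA parameters.  For example (the interface names here are only placeholders; 
substitute whatever ifconfig reports on your nodes):

mpirun --mca btl_tcp_if_include eth0 connectivity_c

or, going the other way, exclude the ones that cannot reach the other node:

mpirun --mca btl_tcp_if_exclude lo,eth1 connectivity_c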

One flag that may help is --mca btl_base_verbose 20, like this:

mpirun --mca btl_base_verbose 20 connectivity_c

It will dump out a bunch of stuff, but there will be a few lines that look like 
this:

[...snip...]
[dt:09880] btl: tcp: attempting to connect() to [[58627,1],1] address 10.20.14.101 on port 1025
[...snip...]
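
Once you see which address and port a rank is trying to reach, you can 
sanity-check that path directly from the node where the connect() is hanging.  
Something along these lines (the address and port below are just taken from the 
example output above; the actual port changes from run to run, so use the one in 
your own verbose output):

ping -c 1 10.20.14.101
nc -z 10.20.14.101 1025    # or: telnet 10.20.14.101 1025, if nc is not installed

If the ping succeeds but the TCP connection does not, that usually points to a 
firewall (e.g. iptables) on one of the nodes blocking the ports Open MPI is 
listening on.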

Rolf

