On 03/17/2014 10:52 AM, Jeff Squyres (jsquyres) wrote:
To add on to what Ralph said:

1. There are two different message passing paths in OMPI:
    - "OOB" (out of band): used for control messages
    - "BTL" (byte transfer layer): used for MPI traffic
    (there are actually others, but these seem to be the relevant 2 for your setup)

2. If you don't specify which OOB interfaces to use, OMPI will (basically)
just pick one. It doesn't really matter much which one it uses; the OOB
channel doesn't use much bandwidth, and is mostly active only during startup
and shutdown.

The one exception to this is stdout/stderr routing.
If your MPI app writes to stdout/stderr, this also uses the OOB path.
So if you output a LOT to stdout, then the OOB interface choice might matter.

Hi All

Without meaning to hijack Jianyu's very interesting and informative thread, I have two questions and one note about it.
I promise to shut up after this.

Is the interface that OOB picks somehow related to how the host/node names
listed in a "hostfile"
(or in the mpiexec -host option,
or in the Torque/SGE/Slurm node file)
are resolved into IP addresses (via /etc/hosts, DNS, or some other mechanism)?

In other words, does OOB pick the interface associated with the IP address
that the node name resolves to, or does OOB have a will of its own and
pick whatever interface it likes?

At some early point during startup, I suppose mpiexec
needs to touch base with each node for the first time,
and I would guess the node's IP address
(and the corresponding interface) plays a role then.
Does OOB piggyback on that same interface to do its job?


3. If you don't specify which MPI interfaces to use, OMPI will basically find
the "best" set of interfaces and use those. IP interfaces are always rated
lower than OS-bypass interfaces (e.g., verbs/IB).


On a node outfitted with more than one InfiniBand interface,
can one choose which one OMPI will use (say, if one wants to
reserve the other IB interface for I/O)?

In other words, are there verbs/rdma syntax equivalent to

--mca btl_tcp_if_include

and to

--mca oob_tcp_if_include  ?

[Perhaps something like --mca btl_openib_if_include ...?]

Forgive me if this question doesn't make sense;
maybe deep down verbs/RDMA already has a greedy policy of using everything available, but I don't know anything about it.
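For what it's worth, the openib BTL does have an if_include parameter analogous to the TCP one. A sketch of its use follows; the device/port name mlx4_0:1 is just a placeholder (list your actual HCAs with ibv_devices), and the application name is made up:

```shell
# Restrict the openib BTL to port 1 of the HCA named mlx4_0
# (device and port names here are assumptions; substitute your own)
mpirun --mca btl openib,sm,self \
       --mca btl_openib_if_include mlx4_0:1 \
       ./my_mpi_app
```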


Or, as you noted, you can give a comma-delimited list of BTLs to use.
OMPI will then use at most exactly those BTLs, and definitely no others. Each BTL typically has one or more additional parameters that specify which interfaces to use for that BTL's network type.
For example, btl_tcp_if_include tells the TCP BTL which interface(s) to use.
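Putting those two pieces together, a typical invocation might look like the sketch below (eth0 is an assumption; substitute whichever interface you want MPI traffic on, and the application name is made up):

```shell
# Use only the TCP, shared-memory, and self BTLs, and restrict
# the TCP BTL's MPI traffic to the eth0 interface
mpirun --mca btl tcp,sm,self \
       --mca btl_tcp_if_include eth0 \
       ./my_mpi_app
```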

Also, note that you seem to have missed a BTL: sm (shared memory).
sm is the preferred BTL to use for same-server communication.

This may be because several FAQs skip the sm BTL, even when it would
be an appropriate/recommended choice to include in the BTL list.
For instance:

http://www.open-mpi.org/faq/?category=all#selecting-components
http://www.open-mpi.org/faq/?category=all#tcp-selection

The command line examples with an ellipsis "..." don't actually exclude
the use of "sm", but IMHO are too vague and somewhat misleading.

I think this issue was reported/discussed on the list before,
but somehow the FAQ was never fixed.

Thank you,
Gus Correa

It is much faster than both the TCP loopback device
(which OMPI excludes by default, BTW, which is probably
why you got reachability errors when you specified
"--mca btl tcp,self") and the verbs (i.e., "openib")
BTL for same-server communication.

4. If you don't specify anything, OMPI usually picks the best thing for you.
In your case, it'll probably be equivalent to:

  mpirun --mca btl openib,sm,self ...

And the control messages will flow across one of your IP interfaces.

5. If you want to be specific about which one it uses,
you can specify oob_tcp_if_include.  For example:

   mpirun --mca oob_tcp_if_include eth0 ...

Make sense?



On Mar 15, 2014, at 1:18 AM, Jianyu Liu <jerry_...@msn.com> wrote:

On Mar 14, 2014, at 10:16:34 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:

On Mar 14, 2014, at 10:11 AM, Ralph Castain <rhc_at_[hidden]> wrote:

1. If '--mca btl tcp,self' is specified, which interface will the application
use: the GigE adapter, or the OpenFabrics interface in IP-over-IB mode (which
acts just like a high-performance GigE adapter)?

Both: IP over IB looks just like an Ethernet adapter.


To be clear: the TCP BTL will use all TCP interfaces (regardless of the
underlying physical transport). Your GigE adapter and your IB adapter both
present IP interfaces to the OS, and both support TCP. So the TCP BTL will
use them both, because it just sees the TCP/IP interfaces.

Thanks for your kind input.

Please see if I have understood correctly.

Assume there are two networks:
   Gigabit Ethernet

     eth0-renamed : 192.168.[1-22].[1-14] / 255.255.192.0

   InfiniBand network

     ib0 :  172.20.[1-22].[1-4] / 255.255.0.0


1. If '--mca btl tcp,self' is specified:

     The control information (such as setup and teardown) is routed over
Gigabit Ethernet in TCP/IP mode.
     The MPI messages are routed over the InfiniBand network in IP-over-IB
mode.
     On the same machine, the TCP loopback device will be used for passing
control and MPI messages.

2. If '--mca btl tcp,self --mca btl_tcp_if_include ib0' is specified:

     Both the control information (such as setup and teardown) and the MPI
messages are routed over the InfiniBand network in IP-over-IB mode.
     On the same machine, the TCP loopback device will be used for passing
control and MPI messages.


3. If '--mca btl openib,self' is specified:

     The control information (such as setup and teardown) is routed over
the InfiniBand network in IP-over-IB mode.
     The MPI messages are routed over the InfiniBand network in RDMA mode.
     On the same machine, the TCP loopback device will be used for passing
control and MPI messages.


4. If no 'btl' MCA parameters are specified:

     The control information (such as setup and teardown) is routed over
Gigabit Ethernet in TCP/IP mode.
     The MPI messages are routed over the InfiniBand network in RDMA mode.
     On the same machine, the shared memory (sm) BTL will be used for
passing control and MPI messages.
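One way to check which BTLs and endpoints were actually selected in each of these cases is to raise the BTL framework's verbosity (the level 30 here is just a suggestion; the amount and format of the selection output varies by Open MPI version, and the application name is made up):

```shell
# Print BTL selection/connection details during startup
mpirun --mca btl_base_verbose 30 -np 2 ./my_mpi_app
```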


Appreciating your kind input,

Jianyu                                  
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


