On Apr 20, 2017, at 12:16 AM, Marc Cooper <marccooper2...@gmail.com> wrote:
> 
> I am trying to understand how connections are established among MPI ranks. 
> Pardon for the list of questions.
> 
> 1) Is there a global data structure that creates and stores rank to network 
> address (uri or port number) mapping

It's quite a bit more complicated than that.

First off, it is a common (but incorrect) colloquialism to refer to MPI processes 
as "ranks".  It's a minor pet peeve of mine (ask any of the other Open MPI 
developers ;-) ).

Per MPI-3.1:2.7, MPI defines "processes".  Each MPI process contains at least 2 
communicators immediately after MPI_INIT: MPI_COMM_WORLD and MPI_COMM_SELF.  
The process has a rank identity in each communicator.  If the application makes 
more communicators, then processes will have even more rank identities.  The 
point is that a rank is meaningless outside the context of a specific 
communicator.  It is more correct to refer to MPI processes, and if you mention 
a rank, you have to also mention the specific communicator to which that rank 
is associated.
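
For example, in the following (standard MPI) program, one and the same process 
has three different rank identities, one per communicator:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank, self_rank, sub_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_rank(MPI_COMM_SELF, &self_rank);    /* always 0 */

        /* Split MPI_COMM_WORLD into "even" and "odd" sub-communicators */
        MPI_Comm sub;
        MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, 0, &sub);
        MPI_Comm_rank(sub, &sub_rank);

        printf("rank %d in MPI_COMM_WORLD, %d in MPI_COMM_SELF, %d in sub\n",
               world_rank, self_rank, sub_rank);

        MPI_Comm_free(&sub);
        MPI_Finalize();
        return 0;
    }

Saying "rank 3" by itself is therefore ambiguous; saying "rank 3 in 
MPI_COMM_WORLD" (i.e., a specific MPI process) is not.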

----

A lot of what you ask very much depends on the underlying system and 
configuration:
- what runtime system is used
- what network type and API is used

I'll give one example below ("mpirun --hostfile foo -np 4 my_mpi_app" using no 
scheduler, ssh to launch on remote servers, and TCP for MPI communications).  
But Open MPI is all about its plugins; these plugins can vary the exact 
procedure for much of what I describe below.

----

In Open MPI, we have a runtime system layer (ORTE -- the Open Run Time 
Environment) which abstracts away whatever the underlying runtime system is.

    Sidenote: the ORTE layer is getting more and more subsumed by the PMIx 
project, but that's a whole different story.

At the beginning of time, mpirun uses ORTE to launch 4 instances of my_mpi_app. 
 ORTE identifies each of these processes by a name made up of a job ID and a 
VPID.  The job ID is a unique ID representing the entire set of processes that 
were launched together, or "job", as we refer to it in ORTE (that's an 
OMPI/ORTE term, not an MPI term).  The VPID ("virtual PID") is a unique ID for 
each process within the job; VPIDs range from 0 to (n-1).

During MPI_INIT, each MPI process discovers its ORTE name and VPID.  The VPID 
directly correlates to its rank in MPI_COMM_WORLD.

An MPI_Comm contains an MPI_Group, and the group contains an array of proc 
pointers (we have a few different kinds of procs).  Each proc pointer points to 
a data structure representing a peer process (or the local process itself).  If 
no dynamic MPI processes are used, MPI_COMM_WORLD contains a group that 
contains an array of pointers to procs representing all the processes launched 
by mpirun.

The creation of all other communicators in the app flows from MPI_COMM_WORLD, 
so ranks in each derived communicator are basically created by subset-like 
operations (MPI dynamic operations like spawn and connect/accept are a bit more 
complicated -- I'll skip that for now).  No new procs are created; the groups 
just contain different arrays of proc pointers (but all pointing to the same 
back-end proc structures).
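
In very rough, illustrative terms (hypothetical names; the real types are 
ompi_communicator_t, ompi_group_t, and ompi_proc_t, and they contain much more 
than this), the relationship looks like:

    /* Illustrative sketch only -- not the actual OMPI definitions */
    struct example_proc;              /* one per known MPI process; peer
                                         identity and network/endpoint info
                                         hang off of this */

    struct example_group {
        int                   size;
        struct example_proc **procs;  /* rank i in this group is procs[i] */
    };

    struct example_comm {
        struct example_group *group;  /* plus context ID, attributes, etc. */
    };

Two different communicators (say, MPI_COMM_WORLD and a communicator created 
with MPI_Comm_split) have different groups -- different arrays, possibly with 
different sizes and orderings -- but the entries in those arrays point to the 
same underlying proc structures.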

Open MPI uses (at least) two different subsystems for network communication:
- ORTE/runtime
- MPI

Let's skip the runtime layer for the moment.  The ORTE/PMIx layers are 
continually evolving, and how network peer discovery is performed has quite a 
complicated answer (one that changes over time, especially as we work to start 
larger and larger scale-out jobs with thousands, tens of thousands, and 
hundreds of thousands of MPI processes).

In the MPI layer, let's take one case of using the ob1 PML plugin (there are 
other PMLs, too, and they can work differently).  The ob1 Point to Point 
Messaging Layer is used for MPI_SEND and friends.  It has several plugins to 
support different underlying network types.  In this case, we'll discuss the 
TCP BTL (byte transfer layer) plugin.
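
(If you want to force this particular combination for experimentation, 
something like "mpirun --mca pml ob1 --mca btl tcp,self -np 4 my_mpi_app" 
should select the ob1 PML and restrict it to the TCP and self BTLs.)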

Also during MPI_INIT, ob1 loads up all available BTL plugins.  Each BTL queries 
the local machine to see if it is suitable to be used (e.g., if there are local 
network endpoints of the correct type).  If it can be used, it will be.  If it 
can't be used (e.g., if you have no OpenFabrics network interfaces on your 
machine), it will disqualify itself and effectively close itself out of the MPI 
process (that's what the openib BTL would do in that example).  The TCP BTL 
module will discover all the local IP interfaces (excluding loopback-type 
interfaces, by default).  For each IP interface it uses, the TCP BTL creates a 
listener socket on that interface.
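
At the POSIX level, "create a listener socket on that IP interface" boils down 
to something like the following (a bare-bones sketch; the real TCP BTL handles 
errors, IPv6, port ranges, non-blocking sockets, etc.):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>

    /* iface_addr is the (network-byte-order) IPv4 address of one discovered
       local interface.  Returns the listening fd and reports the
       kernel-chosen port -- that (address, port) pair is what gets
       published. */
    static int open_listener(uint32_t iface_addr, uint16_t *port_out)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in sin;
        memset(&sin, 0, sizeof(sin));
        sin.sin_family      = AF_INET;
        sin.sin_addr.s_addr = iface_addr;
        sin.sin_port        = 0;                 /* let the kernel pick a port */
        bind(fd, (struct sockaddr *) &sin, sizeof(sin));
        listen(fd, 64);

        socklen_t len = sizeof(sin);
        getsockname(fd, (struct sockaddr *) &sin, &len);
        *port_out = ntohs(sin.sin_port);
        return fd;
    }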

Each process's collection of listener socket IP addresses / ports is published 
in what we call the "modex" (module exchange).  The modex information is 
*effectively* distributed to all MPI processes.

I say *effectively* because over the last many years/Open MPI releases, we have 
distributed the modex information between all the MPI processes in a variety of 
different ways.  Each release series basically optimizes the algorithms used 
and allows Open MPI to scale to launching larger numbers of processes.  That's 
a whole topic in itself...

Regardless, the basic abstraction is: if an Open MPI plugin needs information, 
it can ask the modex for it, and it will get the info.  
*How* that happens is what I said above -- it has changed many times over the 
years.  Current mechanisms are more-or-less "lazy", in that information is 
generally sent when it is asked for [vs. preemptively exchanging everything 
during MPI_INIT] -- or, in some cases, modex information exchange can be wholly 
avoided, which is especially important when launching 10's or 100's of 
thousands of processes.  Like I said, the runtime side of this stuff is a whole 
topic in itself... let's get back to MPI.
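
If it helps, here's the conceptual shape of it (hypothetical names and 
signatures -- the real interfaces live down in OPAL/PMIx and have changed 
across releases): the modex behaves like a key/value store where each entry is 
scoped by the publishing process.

    /* Hypothetical sketch of the modex *concept*; these are not the real
       OMPI/OPAL/PMIx function names or signatures. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint32_t jobid;
        uint32_t vpid;
    } example_name_t;

    /* During MPI_INIT: "here are my TCP listener addresses/ports" */
    int example_modex_put(const char *key, const void *blob, size_t len);

    /* Later, possibly lazily (e.g., at first send to that peer):
       "give me what *that* process published under this key" */
    int example_modex_get(example_name_t peer, const char *key,
                          void **blob, size_t *len);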

Hence, the first time an MPI process goes to MPI send (of any flavor) to a 
peer, the corresponding BTL module asks for all the modex data for that peer.  
For each local IP endpoint, the TCP BTL looks through the peer's modex info to 
find a corresponding remote IP endpoint (sidenote: the TCP BTL uses some 
heuristics to determine which remote endpoint is a suitable peer -- the usNIC 
BTL does a much better job of checking reachability between local and remote 
endpoints by actually checking the kernel routing table).  TCP sockets are 
created between appropriate local IP endpoints and the peer's remote IP 
endpoints.
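
As a toy example of the kind of heuristic involved (this is not the TCP BTL's 
actual algorithm, just an illustration): a local and a remote IPv4 endpoint 
might be considered a usable pair if they fall on the same subnet.

    #include <stdbool.h>
    #include <stdint.h>

    /* Toy heuristic only: "same IPv4 subnet" check.  Addresses and the
       netmask are in host byte order here for simplicity. */
    static bool same_subnet(uint32_t local_addr, uint32_t remote_addr,
                            uint32_t netmask)
    {
        return (local_addr & netmask) == (remote_addr & netmask);
    }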

The set of TCP sockets, local IP endpoints, and their corresponding remote IP 
endpoints is then cached, indexed by the proc of the peer.

Remember: each proc has at least 2 rank IDs (and possibly many more).  Hence, 
we index connection information based on the *proc*, not any given 
communicator's *rank*.

Hence, when we go to MPI_Send, the PML takes the communicator and rank, looks 
up the proc, and uses *that* to address the messages.  The BTL gets the message 
to send and the proc destination (indeed, BTLs know nothing about communicators 
or ranks -- only network addresses and procs).  The BTL can then look up its 
cached information for the peer based on the proc, and then go from there.
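
Stitching that together in illustrative C (hypothetical names again; this 
compiles, but it is a conceptual sketch, not the real ob1/BTL code -- the 
earlier illustrative structs are repeated so it stands alone):

    struct example_proc;
    struct example_endpoint;                      /* cached per-(proc, BTL)
                                                     connection state */

    struct example_group { int size; struct example_proc **procs; };
    struct example_comm  { struct example_group *group; };

    /* hypothetical: the BTL's endpoint cache, keyed by proc */
    struct example_endpoint *example_btl_endpoint_for(struct example_proc *peer);

    static struct example_endpoint *
    example_resolve(struct example_comm *comm, int rank)
    {
        /* communicator + rank -> proc (the PML's job) */
        struct example_proc *peer = comm->group->procs[rank];

        /* proc -> cached network endpoint(s) (the BTL's job) */
        return example_btl_endpoint_for(peer);
    }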

> 2) How is the location of an MPI rank determined in point-to-point 
> operations such as MPI_Send.

What "location" are you referring to?  Network address?  CPU binding?  ...?

> Logically, given a rank,

Process.

:-)

> there has to be a look-up to get its address and pass this information to 
> lower levels to use a send operation with this destination address using 
> appropriate network protocol.

Communicator -> group -> proc -> BTL information hanging off that proc.

ob1 inherently uses all available BTL connections between peers (e.g., if you 
have multiple IP networks).  It will fragment large messages, split the sends / 
receives over all the available endpoint connections, etc.
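
A toy illustration of that fragmentation/striping idea (not ob1's actual 
scheduling logic, which also has to deal with completion, acknowledgement, 
per-BTL limits, etc.):

    #include <stddef.h>

    struct example_endpoint;
    void example_send_fragment(struct example_endpoint *ep,
                               const char *buf, size_t len);

    /* Round-robin a large message across all usable endpoints in
       fixed-size fragments (toy version). */
    static void example_stripe(struct example_endpoint **eps, int num_eps,
                               const char *msg, size_t msg_len,
                               size_t frag_size)
    {
        size_t off = 0;
        for (int i = 0; off < msg_len; ++i) {
            size_t len = msg_len - off < frag_size ? msg_len - off : frag_size;
            example_send_fragment(eps[i % num_eps], msg + off, len);
            off += len;
        }
    }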

> 3) Are the connections among ranks pre-established when a communicator is 
> created or are they established on-demand.

It depends on the network (in ob1's case, it depends on the BTL).  Most BTLs 
fall into one of three categories:

1. Make connections on-demand (works much better for large scale MPI 
applications, because it's unusual for an MPI process to talk to more than a 
few peers in a job).

2. The network itself is connectionless, so only local endpoints need to be 
opened and information placed in the modex.

3. Setup all-to-all communication during MPI_INIT.  These days, only the two 
shared memory BTLs (sm and vader) do this (i.e., they alloc big slabs of shared 
memory and setup designated senders/receivers for each part of the shared 
memory, etc.).  For all other networks, it's basically too expensive 
(especially at scale) to setup all-to-all communication during MPI_INIT.

Note: each BTL can really do whatever it wants -- it just happens that all of 
the current BTLs fall into one of these three categories.

> Where exactly in the source code is this connection created (I'm tracking 
> ompi_mpi_init(), so I think it must be in orte_init(), not sure though).

For lack of a longer explanation, ORTE only has to do with launching processes, 
monitoring them, and terminating them.  ORTE has its own communication system, 
but that's not used by the MPI layer for passing MPI messages.

In this email, I've been talking about the BTLs, which are used by the ob1 PML. 
 Each BTL does its own connection management however it wants to (as described 
above).

Note that the other PMLs do things slightly differently, too.  That can be 
another topic for another time; I don't have time to describe that now.  ;-)

> 4) Does the communicator attribute (c_keyhash) store this mapping 
> information, or is this totally different?

Totally different.

We really don't store any internal-to-OMPI stuff in communicator attributes.  
All that stuff is solely used by applications for MPI_COMM_SET_ATTR / 
MPI_COMM_GET_ATTR and friends.
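
For reference, communicator attributes are the purely application-level 
mechanism that looks like this (standard MPI-3 calls):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Create a keyval and hang an application-defined value off of
           MPI_COMM_WORLD -- nothing here is used internally by Open MPI. */
        int keyval;
        MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                               &keyval, NULL);

        static int my_value = 42;
        MPI_Comm_set_attr(MPI_COMM_WORLD, keyval, &my_value);

        int *got, flag;
        MPI_Comm_get_attr(MPI_COMM_WORLD, keyval, &got, &flag);
        if (flag) {
            printf("attribute value: %d\n", *got);
        }

        MPI_Comm_free_keyval(&keyval);
        MPI_Finalize();
        return 0;
    }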

-- 
Jeff Squyres
jsquy...@cisco.com
