Thanks so much Jeff, appreciate your detailed response. I will follow up
with reading the source code for better understanding and get back if I
have further questions.

On 20 April 2017 at 11:40, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
wrote:

> On Apr 20, 2017, at 12:16 AM, Marc Cooper <marccooper2...@gmail.com>
> wrote:
> >
> > I am trying to understand how connections are established among MPI
> ranks. Pardon for the list of questions.
> >
> > 1) Is there a global data structure that creates and stores rank to
> network address (uri or port number) mapping
>
> It's quite a bit more complicated than that.
>
> First off, it is a common incorrect colloquialization to refer to MPI
> processes as "ranks".  It's a minor pet peeve of mine (ask any of the other
> Open MPI developers ;-) ).
>
> Per MPI-3.1:2.7, MPI defines "processes".  Each MPI process contains at
> least 2 communicators immediately after MPI_INIT: MPI_COMM_WORLD and
> MPI_COMM_SELF.  The process has a rank identity in each communicator.  If
> the application makes more communicators, then processes will have even
> more rank identities.  The point is that a rank is meaningless outside the
> context of a specific communicator.  It is more correct to refer to MPI
> processes, and if you mention a rank, you have to also mention the specific
> communicator to which that rank is associated.
>
> ----
>
> A lot of what you ask very much depends on the underlying system and
> configuration:
> - what runtime system is used
> - what network type and API is used
>
> I'll give one example below ("mpirun --hostfile foo -np 4 my_mpi_app"
> using no scheduler, ssh to launch on remote servers, and TCP for MPI
> communications).  But Open MPI is all about its plugins; these plugins can
> vary the exact procedure for much of what I describe below.
>
> ----
>
> In Open MPI, we have a runtime system layer (ORTE -- the Open Run Time
> Environment) which abstracts away whatever the underlying runtime system is.
>
>     Sidenote: the ORTE layer is getting more and more subsumed by the PMIx
> project, but that's a whole different story.
>
> At the beginning of time, mpirun uses ORTE to launch 4 instances of
> my_mpi_app.  ORTE identifies these processes by name and VPID.  The name is
> basically a unique ID representing the entire set of processes that were
> launched together, or "job", as we refer to it in ORTE (that's an OMPI/ORTE
> term, not an MPI term).  The name includes a unique ID for each process in
> the job (the VPID, or "virtual PID").  VPIDs range from 0 to (n-1).
>
> During MPI_INIT, each MPI process discovers its ORTE name and VPID.  The
> VPID directly correlates to its rank in MPI_COMM_WORLD.
>
> An MPI_Comm contains an MPI_Group, and the Group contains an array of proc
> pointers (we have a few different kinds of procs).  Each proc points to a
> data structure representing a peer process (or the local process itself).
> If no dynamic MPI processes are used, MPI_COMM_WORLD contains a group that
> contains an array of pointers to procs representing all the processes
> launched by mpirun.
>
> The creation of all other communicators in the app flows from
> MPI_COMM_WORLD, so ranks in each derived communicator are basically created
> by subset-like operations (MPI dynamic operations like spawn and
> connect/accept are a bit more complicated -- I'll skip that for now).  No
> new procs are created; the groups just contain different arrays of proc
> pointers (but all pointing to the same back-end proc structures).
>
> Open MPI uses (at least) two different subsystems for network
> communication:
> - ORTE/runtime
> - MPI
>
> Let's skip the runtime layer for the moment.  The ORTE/PMIx layers are
> continually evolving, and how network peer discovery is performed is quite
> a complicated answer (and changes over time, especially to start larger and
> larger scale-out jobs with thousands, tens of thousands, and hundreds of
> thousands of MPI processes).
>
> In the MPI layer, let's take one case of using the ob1 PML plugin (there
> are other PMLs, too, and they can work differently).  The ob1 Point to
> Point Messaging Layer is used for MPI_SEND and friends.  It has several
> plugins to support different underlying network types.  In this case, we'll
> discuss the TCP BTL (byte transfer layer) plugin.
>
> Also during MPI_INIT, ob1 loads up all available BTL plugins.  Each BTL
> will query the local machine to see if it is suitable to be used (e.g., if
> there are local network endpoints of the correct type).  If it can be used,
> it will be.  If it can't be used (e.g., if you have no OpenFabrics network
> interfaces on your machine, the openib BTL will disqualify itself and
> effectively close itself out of the MPI process).  For the TCP BTL module,
> it will discover all the local IP interfaces (excluding loopback type
> interfaces, by default).  For each used IP interface, the TCP BTL will
> create a listener socket on that IP interface.
>
> Each process's collection of listener socket IP addresses / ports is
> published in what we call the "modex" (module exchange).  The modex
> information is *effectively* distributed to all MPI processes.
>
> I say *effectively* because over the last many years/Open MPI releases, we
> have distributed the modex information between all the MPI processes in a
> variety of different ways.  Each release series basically optimizes the
> algorithms used and allows Open MPI to scale to launching larger numbers of
> processes.  That's a whole topic in itself...
>
> Regardless, basically you can abstract the idea that if an Open MPI plugin
> needs information, it can ask for it from the modex, and it will get the
> info.  *How* that happens is what I said above -- it has changed many times
> over the years.  Current mechanisms are more-or-less "lazy", in that
> information is generally sent when it is asked for [vs. preemptively
> exchanging everything during MPI_INIT] -- or, in some cases, modex
> information exchange can be wholly avoided, which is especially important
> when launching 10's or 100's of thousands of processes.  Like I said, the
> runtime side of this stuff is a whole topic in itself... let's get back to
> MPI.
>
> Hence, the first time an MPI process goes to MPI send (of any flavor) to a
> peer, the corresponding BTL module asks for all the modex data for that
> peer.  For each local IP endpoint, the TCP BTL looks through the peer's
> modex info to find a corresponding remote IP endpoint (sidenote: the TCP
> BTL uses some heuristics to determine which remote endpoint is a suitable
> peer -- the usNIC BTL does a much better job of checking reachability
> between local and remote endpoints by actually checking the kernel routing
> table).  TCP sockets are created between appropriate local IP endpoints and
> the peer's remote IP endpoints.
>
> The set of TCP sockets, local IP endpoints, and their corresponding remote
> IP endpoints are then cached, indexed by the proc of the peer.
>
> Remember: each proc has at least 2 rank IDs (and possibly many more).
> Hence, we index connection information based on the *proc*, not any given
> communicator's *rank*.
>
> Hence, when we go to MPI_Send, the PML takes the communicator and rank,
> looks up the proc, and uses *that* to address the messages.  The BTL gets
> the message to send and the proc destination (indeed, BTLs know nothing
> about communicators or ranks -- only network addresses and procs).  The BTL
> can then look up its cached information for the peer based on the proc, and
> then go from there.
>
> > 2) How is the location of an MPI rank determined in point-to-point
> operations such as MPI_Send.
>
> What "location" are you referring to?  Network address?  CPU binding?  ...?
>
> > Logically, given a rank,
>
> Process.
>
> :-)
>
> > there has to be a look-up to get its address and pass this information
> to lower levels to use a send operation with this destination address using
> appropriate network protocol.
>
> Communicator -> group -> proc -> BTL information hanging off that proc.
>
> ob1 inherently uses all available BTL connections between peers (e.g., if
> you have multiple IP networks).  It will fragment large messages, split the
> sends / receives over all the available endpoint connections, etc.
>
> > 3) Are the connections among ranks pre-established when a communicator
> is created or are they established on-demand.
>
> It depends on the network (in ob1's case, it depends on the BTL).  Most
> BTLs fall into one of three categories:
>
> 1. Make connections on-demand (works much better for large scale MPI
> applications, because it's unusual for an MPI process to talk to more than
> a few peers in a job).
>
> 2. The network itself is connectionless, so only local endpoints need to
> be opened and information placed in the modex.
>
> 3. Setup all-to-all communication during MPI_INIT.  These days, only the
> two shared memory BTLs (sm and vader) do this (i.e., they alloc big slabs
> of shared memory and setup designated senders/receivers for each part of
> the shared memory, etc.).  For all other networks, it's basically too
> expensive (especially at scale) to setup all-to-all communication during
> MPI_INIT.
>
> Note: each BTL can really do whatever it wants -- it just happens that all
> of the current BTLs do it one of the first 2 ways.
>
> > Where exactly in the source code is this connection created (I'm
> tracking ompi_mpi_init(), so think it must be in orte_init(), not sure
> though).
>
> For lack of a longer explanation, ORTE only has to do with launching
> processes, monitoring them, and terminating them.  ORTE has its own
> communication system, but that's not used by the MPI layer for passing MPI
> messages.
>
> In this email, I've been talking about the BTLs, which are used by the ob1
> PML.  Each BTL does its own connection management however it wants to (as
> described above).
>
> Note that the other PMLs do things slightly differently, too.  That can be
> another topic for another time; I don't have time to describe that now.  ;-)
>
> > 4) Does the communicator attribute (c_keyhash) store this mapping
> information, or is this totally different?
>
> Totally different.
>
> We really don't store any internal-to-OMPI stuff in communicator
> attributes.  All that stuff is solely used by applications for
> MPI_COMM_SET_ATTR / MPI_COMM_GET_ATTR and friends.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> _______________________________________________
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>