Thanks so much, Jeff; I appreciate the detailed response. I will follow up by reading the source code for a better understanding, and I'll get back to you if I have further questions.
On 20 April 2017 at 11:40, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> On Apr 20, 2017, at 12:16 AM, Marc Cooper <marccooper2...@gmail.com> wrote:
>
>> I am trying to understand how connections are established among MPI ranks. Pardon for the list of questions.
>>
>> 1) Is there a global data structure that creates and stores a rank-to-network-address (URI or port number) mapping?
>
> It's quite a bit more complicated than that.
>
> First off, it is a common but incorrect colloquialism to refer to MPI processes as "ranks". It's a minor pet peeve of mine (ask any of the other Open MPI developers ;-) ).
>
> Per MPI-3.1:2.7, MPI defines "processes". Each MPI process contains at least 2 communicators immediately after MPI_INIT: MPI_COMM_WORLD and MPI_COMM_SELF. The process has a rank identity in each communicator. If the application makes more communicators, then processes will have even more rank identities. The point is that a rank is meaningless outside the context of a specific communicator. It is more correct to refer to MPI processes, and if you mention a rank, you also have to mention the specific communicator to which that rank belongs.
>
> ----
>
> A lot of what you ask very much depends on the underlying system and configuration:
>
> - what runtime system is used
> - what network type and API is used
>
> I'll give one example below ("mpirun --hostfile foo -np 4 my_mpi_app" using no scheduler, ssh to launch on remote servers, and TCP for MPI communications). But Open MPI is all about its plugins; these plugins can vary the exact procedure for much of what I describe below.
>
> ----
>
> In Open MPI, we have a runtime system layer (ORTE -- the Open Run Time Environment) which abstracts away whatever the underlying runtime system is.
>
> Sidenote: the ORTE layer is getting more and more subsumed by the PMIx project, but that's a whole different story.
>
> At the beginning of time, mpirun uses ORTE to launch 4 instances of my_mpi_app. ORTE identifies these processes by name and VPID. The name is basically a unique ID representing the entire set of processes that were launched together, or "job", as we refer to it in ORTE (that's an OMPI/ORTE term, not an MPI term). The name includes a unique ID for each process in the job (the VPID, or "virtual PID"). VPIDs range from 0 to (n-1).
>
> During MPI_INIT, each MPI process discovers its ORTE name and VPID. The VPID directly correlates to its rank in MPI_COMM_WORLD.
>
> An MPI_Comm contains an MPI_Group, and the group contains an array of proc pointers (we have a few different kinds of procs). Each proc points to a data structure representing a peer process (or the local process itself). If no dynamic MPI processes are used, MPI_COMM_WORLD contains a group that contains an array of pointers to procs representing all the processes launched by mpirun.
>
> The creation of all other communicators in the app flows from MPI_COMM_WORLD, so ranks in each derived communicator are basically created by subset-like operations (MPI dynamic operations like spawn and connect/accept are a bit more complicated -- I'll skip that for now). No new procs are created; the groups just contain different arrays of proc pointers (but all pointing to the same back-end proc structures).
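To make the rank-vs-process point (and the "subset-like operations" described above) concrete, here is a tiny illustrative program using only standard MPI calls; nothing here is Open MPI-specific, and the file and variable names are just for the example. The same process has a different rank identity in MPI_COMM_WORLD than in a communicator derived from it with MPI_Comm_split:

    /* rank_demo.c: the same process has different rank identities in
     * different communicators.  Build with mpicc. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank, world_size;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* Split MPI_COMM_WORLD into "even" and "odd" sub-communicators.
         * Ranks are renumbered from 0 within each new communicator. */
        MPI_Comm sub;
        MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub);

        int sub_rank;
        MPI_Comm_rank(sub, &sub_rank);

        printf("rank %d of %d in MPI_COMM_WORLD, rank %d in sub-communicator\n",
               world_rank, world_size, sub_rank);

        MPI_Comm_free(&sub);
        MPI_Finalize();
        return 0;
    }

Running it with something like "mpirun -np 4 ./rank_demo" shows each process reporting a rank in each communicator; the numbers generally differ, but underneath it is the same process (the same proc, in Open MPI terms).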
> Open MPI uses (at least) two different subsystems for network communication:
>
> - ORTE / runtime
> - MPI
>
> Let's skip the runtime layer for the moment. The ORTE/PMIx layers are continually evolving, and how network peer discovery is performed is quite a complicated answer (and it changes over time, especially to start larger and larger scale-out jobs with thousands, tens of thousands, and hundreds of thousands of MPI processes).
>
> In the MPI layer, let's take one case: using the ob1 PML plugin (there are other PMLs, too, and they can work differently). The ob1 point-to-point messaging layer is used for MPI_SEND and friends. It has several plugins to support different underlying network types. In this case, we'll discuss the TCP BTL (byte transfer layer) plugin.
>
> Also during MPI_INIT, ob1 loads up all available BTL plugins. Each BTL will query the local machine to see if it is suitable to be used (e.g., if there are local network endpoints of the correct type). If it can be used, it will be. If it can't be used (e.g., if you have no OpenFabrics network interfaces on your machine), it will disqualify itself and effectively close itself out of the MPI process -- that's what the openib BTL does in that case. The TCP BTL will discover all the local IP interfaces (excluding loopback-type interfaces, by default). For each IP interface it uses, the TCP BTL creates a listener socket on that interface.
>
> Each process's collection of listener socket IP addresses / ports is published in what we call the "modex" (module exchange). The modex information is *effectively* distributed to all MPI processes.
>
> I say *effectively* because over the last many years / Open MPI releases, we have distributed the modex information among all the MPI processes in a variety of different ways. Each release series basically optimizes the algorithms used and allows Open MPI to scale to launching larger numbers of processes. That's a whole topic in itself...
>
> Regardless, you can basically abstract it as: if an Open MPI plugin needs information, it can ask the modex for it, and it will get the info. *How* that happens is what I said above -- it has changed many times over the years. Current mechanisms are more-or-less "lazy", in that information is generally sent when it is asked for (vs. preemptively exchanging everything during MPI_INIT) -- or, in some cases, modex information exchange can be wholly avoided, which is especially important when launching tens or hundreds of thousands of processes. Like I said, the runtime side of this stuff is a whole topic in itself... let's get back to MPI.
>
> Hence, the first time an MPI process sends (any flavor of MPI send) to a peer, the corresponding BTL module asks for all the modex data for that peer. For each local IP endpoint, the TCP BTL looks through the peer's modex info to find a corresponding remote IP endpoint (sidenote: the TCP BTL uses some heuristics to determine which remote endpoint is a suitable peer -- the usNIC BTL does a much better job of checking reachability between local and remote endpoints by actually checking the kernel routing table). TCP sockets are created between appropriate local IP endpoints and the peer's remote IP endpoints.
>
> The set of TCP sockets, local IP endpoints, and their corresponding remote IP endpoints is then cached, indexed by the proc of the peer.
>
> Remember: each proc has at least 2 rank IDs (and possibly many more). Hence, we index connection information based on the *proc*, not any given communicator's *rank*.
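As a very rough sketch of that lazy, proc-indexed pattern -- with entirely invented type and function names (peer_proc_t, btl_endpoint_t, modex_lookup(), tcp_connect()); these are NOT Open MPI's real data structures or APIs -- the shape of the logic is roughly:

    /* Sketch only: every identifier below is made up for illustration. */

    typedef struct {
        char ip[64];     /* remote listener IP address learned from the modex */
        int  port;       /* remote listener port learned from the modex       */
        int  sockfd;     /* -1 until a TCP connection is actually opened      */
    } btl_endpoint_t;

    typedef struct {
        btl_endpoint_t *endpoints;      /* connection state cached per peer proc, */
        int             num_endpoints;  /* never per communicator-specific rank   */
    } peer_proc_t;

    /* Placeholder for "ask the modex for this peer's published (ip, port) pairs". */
    static int modex_lookup(peer_proc_t *peer) { (void)peer; return -1; }

    /* Placeholder for "open a TCP socket to one of the peer's listeners". */
    static int tcp_connect(btl_endpoint_t *ep) { (void)ep; return -1; }

    /* Called the first time we need to send to this peer: lazy connection setup. */
    static int get_connected_endpoint(peer_proc_t *peer, btl_endpoint_t **out)
    {
        if (peer->num_endpoints == 0 && modex_lookup(peer) != 0) {
            return -1;                 /* no modex info available for this peer */
        }
        for (int i = 0; i < peer->num_endpoints; ++i) {
            btl_endpoint_t *ep = &peer->endpoints[i];
            if (ep->sockfd < 0 && tcp_connect(ep) != 0) {
                continue;              /* try the next candidate endpoint */
            }
            *out = ep;                 /* cached; reused for all later sends */
            return 0;
        }
        return -1;
    }

The only point of the sketch is the shape of the logic: nothing is connected at communicator-creation time; the first send triggers a modex query and a connect, and the result is cached against the peer's proc rather than against any communicator-specific rank.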
> Hence, when we go to MPI_Send, the PML takes the communicator and rank, looks up the proc, and uses *that* to address the messages. The BTL gets the message to send and the proc destination (indeed, BTLs know nothing about communicators or ranks -- only network addresses and procs). The BTL can then look up its cached information for the peer based on the proc, and go from there.
>
>> 2) How is the location of an MPI rank determined in point-to-point operations such as MPI_Send?
>
> What "location" are you referring to? Network address? CPU binding? ...?
>
>> Logically, given a rank,
>
> Process.
>
> :-)
>
>> there has to be a look-up to get its address and pass this information to lower levels to use a send operation with this destination address using the appropriate network protocol.
>
> Communicator -> group -> proc -> BTL information hanging off that proc.
>
> ob1 inherently uses all available BTL connections between peers (e.g., if you have multiple IP networks). It will fragment large messages, split the sends / receives over all the available endpoint connections, etc.
>
>> 3) Are the connections among ranks pre-established when a communicator is created, or are they established on demand?
>
> It depends on the network (in ob1's case, it depends on the BTL). Most BTLs fall into one of three categories:
>
> 1. Make connections on demand (works much better for large-scale MPI applications, because it's unusual for an MPI process to talk to more than a few peers in a job).
>
> 2. The network itself is connectionless, so only local endpoints need to be opened and their information placed in the modex.
>
> 3. Set up all-to-all communication during MPI_INIT. These days, only the two shared-memory BTLs (sm and vader) do this (i.e., they allocate big slabs of shared memory and set up designated senders/receivers for each part of the shared memory, etc.). For all other networks, it's basically too expensive (especially at scale) to set up all-to-all communication during MPI_INIT.
>
> Note: each BTL can really do whatever it wants -- it just happens that all of the current network BTLs do it in one of the first 2 ways.
>
>> Where exactly in the source code is this connection created? (I'm tracking ompi_mpi_init(), so I think it must be in orte_init(), though I'm not sure.)
>
> For lack of a longer explanation, ORTE only has to do with launching processes, monitoring them, and terminating them. ORTE has its own communication system, but that's not used by the MPI layer for passing MPI messages.
>
> In this email, I've been talking about the BTLs, which are used by the ob1 PML. Each BTL does its own connection management however it wants to (as described above).
>
> Note that the other PMLs do things slightly differently, too. That can be another topic for another time; I don't have time to describe that now. ;-)
>
>> 4) Does the communicator attribute (c_keyhash) store this mapping information, or is this totally different?
>
> Totally different.
>
> We really don't store any internal-to-OMPI stuff in communicator attributes. All of that stuff is solely used by applications via MPI_COMM_SET_ATTR / MPI_COMM_GET_ATTR and friends.
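For completeness, the application-facing attribute interface mentioned here (the only thing the c_keyhash machinery serves) is the standard MPI keyval API. A minimal self-contained example using only standard MPI calls; the variable names are just for illustration:

    /* attr_demo.c: communicator attributes hold application data only. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Create an attribute key with no copy/delete callbacks. */
        int keyval;
        MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                               &keyval, NULL);

        static int my_data = 42;                 /* application payload */
        MPI_Comm_set_attr(MPI_COMM_WORLD, keyval, &my_data);

        void *val = NULL;
        int   flag = 0;
        MPI_Comm_get_attr(MPI_COMM_WORLD, keyval, &val, &flag);
        if (flag) {
            printf("attribute value: %d\n", *(int *)val);
        }

        MPI_Comm_delete_attr(MPI_COMM_WORLD, keyval);
        MPI_Comm_free_keyval(&keyval);
        MPI_Finalize();
        return 0;
    }

Nothing internal to Open MPI (proc tables, BTL endpoint caches, etc.) is stored through this interface.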
> --
> Jeff Squyres
> jsquy...@cisco.com

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel