On Apr 20, 2017, at 12:16 AM, Marc Cooper <marccooper2...@gmail.com> wrote:
>
> I am trying to understand how connections are established among MPI ranks.
> Pardon for the list of questions.
>
> 1) Is there a global data structure that creates and stores rank to network
> address (uri or port number) mapping
It's quite a bit more complicated than that.

First off, it is a common incorrect colloquialization to refer to MPI processes as "ranks". It's a minor pet peeve of mine (ask any of the other Open MPI developers ;-) ).

Per MPI-3.1:2.7, MPI defines "processes". Each MPI process contains at least 2 communicators immediately after MPI_INIT: MPI_COMM_WORLD and MPI_COMM_SELF. The process has a rank identity in each communicator. If the application makes more communicators, then processes will have even more rank identities.

The point is that a rank is meaningless outside the context of a specific communicator. It is more correct to refer to MPI processes, and if you mention a rank, you have to also mention the specific communicator to which that rank is associated. (There's a small example program further below.)

----

A lot of what you ask very much depends on the underlying system and configuration:

- what runtime system is used
- what network type and API is used

I'll give one example below ("mpirun --hostfile foo -np 4 my_mpi_app" using no scheduler, ssh to launch on remote servers, and TCP for MPI communications). But Open MPI is all about its plugins; these plugins can vary the exact procedure for much of what I describe below.

----

In Open MPI, we have a runtime system layer (ORTE -- the Open Run Time Environment) which abstracts away whatever the underlying runtime system is.

Sidenote: the ORTE layer is getting more and more subsumed by the PMIx project, but that's a whole different story.

At the beginning of time, mpirun uses ORTE to launch 4 instances of my_mpi_app. ORTE identifies these processes by name and VPID. The name is basically a unique ID representing the entire set of processes that were launched together, or "job", as we refer to it in ORTE (that's an OMPI/ORTE term, not an MPI term). The name includes a unique ID for each process in the job (the VPID, or "virtual PID"). VPIDs range from 0 to (n-1).

During MPI_INIT, each MPI process discovers its ORTE name and VPID. The VPID directly correlates to its rank in MPI_COMM_WORLD.

An MPI_Comm contains an MPI_Group, and the Group contains an array of proc pointers (we have a few different kinds of procs). Each proc points to a data structure representing a peer process (or the local process itself). If no dynamic MPI processes are used, MPI_COMM_WORLD contains a group that contains an array of pointers to procs representing all the processes launched by mpirun. (There's a simplified sketch of these structures further below.)

The creation of all other communicators in the app flows from MPI_COMM_WORLD, so ranks in each derived communicator are basically created by subset-like operations (MPI dynamic operations like spawn and connect/accept are a bit more complicated -- I'll skip that for now). No new procs are created; the groups just contain different arrays of proc pointers (but all pointing to the same back-end proc structures).

Open MPI uses (at least) two different subsystems for network communication:

- ORTE/runtime
- MPI

Let's skip the runtime layer for the moment. The ORTE/PMIx layers are continually evolving, and how network peer discovery is performed is a complicated topic in itself (and changes over time, especially to start larger and larger scale-out jobs with thousands, tens of thousands, and hundreds of thousands of MPI processes).
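To make the "a rank only means something relative to a communicator" point above concrete, here is a small standalone program (my own illustration, not part of the original thread). The same process has a rank identity in MPI_COMM_WORLD, in MPI_COMM_SELF, and in any communicator it later becomes part of:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int world_rank, self_rank, split_rank;
    MPI_Comm split_comm;

    MPI_Init(&argc, &argv);

    /* Same process, three communicators, three rank identities. */
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_rank(MPI_COMM_SELF, &self_rank);   /* always 0 */

    /* Split the world into "even" and "odd" sub-communicators; this
       process gets yet another rank in whichever one it lands in. */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &split_comm);
    MPI_Comm_rank(split_comm, &split_rank);

    printf("world rank %d: rank %d in MPI_COMM_SELF, rank %d in the split comm\n",
           world_rank, self_rank, split_rank);

    MPI_Comm_free(&split_comm);
    MPI_Finalize();
    return 0;
}

Run it with something like "mpirun -np 4 ./a.out": the printed numbers only mean anything when paired with the communicator they came from.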
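To make the communicator -> group -> proc relationship described above concrete, here is a deliberately simplified sketch. These are hypothetical names, NOT the real Open MPI structures (which live under ompi/ in the source tree and have many more members):

#include <stdint.h>

/* Simplified sketch only -- not the actual Open MPI definitions. */

struct toy_proc {
    uint32_t jobid;          /* which "job" (set of processes launched together) */
    uint32_t vpid;           /* unique ID within that job; correlates to the
                                MPI_COMM_WORLD rank when no dynamics are used */
    void    *btl_endpoints;  /* per-network connection info is cached off the proc */
};

struct toy_group {
    int               size;
    struct toy_proc **procs; /* array of pointers to proc structures */
};

struct toy_comm {
    struct toy_group *group; /* the (local) group of this communicator */
};

/* A rank is just an index into a specific communicator's group; different
   communicators can map different ranks onto the same underlying proc. */
static inline struct toy_proc *rank_to_proc(const struct toy_comm *comm, int rank)
{
    return comm->group->procs[rank];
}

The real structures are much richer (several proc flavors, remote groups for inter-communicators, etc.), but this comm -> group -> proc lookup path is what the rest of this mail keeps referring back to.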
In the MPI layer, let's take one case of using the ob1 PML plugin (there are other PMLs, too, and they can work differently). The ob1 Point to Point Messaging Layer is used for MPI_SEND and friends. It has several plugins to support different underlying network types. In this case, we'll discuss the TCP BTL (byte transfer layer) plugin.

Also during MPI_INIT, ob1 loads up all available BTL plugins. Each BTL will query the local machine to see if it is suitable to be used (e.g., if there are local network endpoints of the correct type). If it can be used, it will be. If it can't be used (e.g., if you have no OpenFabrics network interfaces on your machine, the openib BTL will disqualify itself and effectively close itself out of the MPI process).

For the TCP BTL module, it will discover all the local IP interfaces (excluding loopback-type interfaces, by default). For each IP interface used, the TCP BTL will create a listener socket on that IP interface.

Each process's collection of listener socket IP addresses / ports is published in what we call the "modex" (module exchange). The modex information is *effectively* distributed to all MPI processes. I say *effectively* because over the last many years / Open MPI releases, we have distributed the modex information between all the MPI processes in a variety of different ways. Each release series basically optimizes the algorithms used and allows Open MPI to scale to launching larger numbers of processes. That's a whole topic in itself...

Regardless, you can think of it abstractly: if an Open MPI plugin needs information, it can ask the modex for it, and it will get the info. *How* that happens is what I said above -- it has changed many times over the years. Current mechanisms are more-or-less "lazy", in that information is generally sent when it is asked for [vs. preemptively exchanging everything during MPI_INIT] -- or, in some cases, modex information exchange can be wholly avoided, which is especially important when launching 10's or 100's of thousands of processes. Like I said, the runtime side of this stuff is a whole topic in itself... let's get back to MPI.

Hence, the first time an MPI process goes to MPI send (of any flavor) to a peer, the corresponding BTL module asks for all the modex data for that peer. For each local IP endpoint, the TCP BTL looks through the peer's modex info to find a corresponding remote IP endpoint (sidenote: the TCP BTL uses some heuristics to determine which remote endpoint is a suitable peer -- the usNIC BTL does a much better job of checking reachability between local and remote endpoints by actually checking the kernel routing table). TCP sockets are created between appropriate local IP endpoints and the peer's remote IP endpoints.

The set of TCP sockets, local IP endpoints, and their corresponding remote IP endpoints are then cached, indexed on the proc of the peer.

Remember: each proc has at least 2 rank IDs (and possibly many more). Hence, we index connection information based on the *proc*, not any given communicator's *rank*. When we go to MPI_Send, the PML takes the communicator and rank, looks up the proc, and uses *that* to address the message. The BTL gets the message to send and the proc destination (indeed, BTLs know nothing about communicators or ranks -- only network addresses and procs). The BTL can then look up its cached information for the peer based on the proc, and go from there.
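A rough sketch of that on-demand pattern (hypothetical names, drastically simplified -- this is not the actual TCP BTL / ob1 code): the first send to a peer conceptually looks something like this:

#include <stdbool.h>
#include <stddef.h>
#include <netinet/in.h>

struct toy_proc;                 /* the peer process, as in the earlier sketch */

struct toy_endpoint {
    bool connected;
    int  sockfd;                 /* TCP socket to this particular peer */
};

/* Assumed helpers, declared only for the sketch: */
struct toy_endpoint *endpoint_for_proc(struct toy_proc *peer);               /* per-proc cache */
int modex_lookup_tcp_addr(struct toy_proc *peer, struct sockaddr_in *addr);  /* peer's published listener */
int tcp_connect(const struct sockaddr_in *addr);                             /* returns a socket fd */

int toy_btl_tcp_send(struct toy_proc *peer, const void *buf, size_t len)
{
    struct toy_endpoint *ep = endpoint_for_proc(peer);

    if (!ep->connected) {
        /* First message to this peer: ask the modex for the listener address
           it published during MPI_INIT, then connect on demand. */
        struct sockaddr_in addr;
        if (modex_lookup_tcp_addr(peer, &addr) != 0) {
            return -1;
        }
        ep->sockfd = tcp_connect(&addr);
        if (ep->sockfd < 0) {
            return -1;
        }
        ep->connected = true;
    }

    /* ... write buf/len to ep->sockfd (fragmentation, progress, multiple
       endpoint pairs per peer, etc. all omitted) ... */
    (void) buf;
    (void) len;
    return 0;
}

Note the two important properties: the connection is made lazily (on the first send), and the cache is keyed on the proc, never on any communicator's rank.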
> 2) How is the location of an MPI rank is determined in point-to-point
> operations such as MPI_Send.

What "location" are you referring to? Network address? CPU binding? ...?

> Logically, given a rank,

Process.

:-)

> there has to be a look-up to get its address and pass this information to
> lower levels to use a send operation with this destination address using
> appropriate network protocol.

Communicator -> group -> proc -> BTL information hanging off that proc.

ob1 inherently uses all available BTL connections between peers (e.g., if you have multiple IP networks). It will fragment large messages, split the sends / receives over all the available endpoint connections, etc.

> 3) Are the connections among ranks pre-established when a communicator is
> created or are they established on-demand.

It depends on the network (in ob1's case, it depends on the BTL). Most BTLs fall into one of three categories:

1. Make connections on-demand (works much better for large scale MPI applications, because it's unusual for an MPI process to talk to more than a few peers in a job).

2. The network itself is connectionless, so only local endpoints need to be opened and information placed in the modex.

3. Setup all-to-all communication during MPI_INIT. These days, only the two shared memory BTLs (sm and vader) do this (i.e., they alloc big slabs of shared memory and setup designated senders / receivers for each part of the shared memory, etc.). For all other networks, it's basically too expensive (especially at scale) to setup all-to-all communication during MPI_INIT.

Note: each BTL can really do whatever it wants -- it just happens that all of the current BTLs do it one of the first 2 ways.

> Where exactly in the source code is this connection created (I'm tracking
> ompi_mpi_init(), so think it must be in orte_init(), not sure though).

For lack of a longer explanation, ORTE only has to do with launching processes, monitoring them, and terminating them. ORTE has its own communication system, but that's not used by the MPI layer for passing MPI messages.

In this email, I've been talking about the BTLs, which are used by the ob1 PML. Each BTL does its own connection management however it wants to (as described above).

Note that the other PMLs do things slightly differently, too. That can be another topic for another time; I don't have time to describe that now. ;-)

> 4) Does the communicator attribute (c_keyhash) store this mapping
> information, or is this totally different?

Totally different. We really don't store any internal-to-OMPI stuff in communicator attributes. All that stuff is solely used by applications for MPI_COMM_SET_ATTR / MPI_COMM_GET_ATTR and friends.

-- 
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel