Hi folks,

I mentioned this briefly at the Tuesday telecon but didn't explain it well, as there just wasn't adequate time. With the recent updates to the embedded PMIx code, OMPI's mpirun now has the ability to fully support pre-launch network resource assignment for processes. This covers endpoints as well as network coordinates.
In brief, what happens is:

* At startup, the PMIx network support plugins in mpirun obtain their network configuration info. Where a fabric manager (FM) is present, we communicate directly with that FM for the info we need. Where no fabric manager is available, an MCA param can point us to a file containing the info, or the plugin can get it in whatever way the vendor chooses.

* When ORTE launches its daemons, the daemons query their PMIx network support plugins for any network inventory info they would like to communicate back to mpirun. Each plugin (TCP, whatever) is given an opportunity to contribute to that payload. The data is included in the daemon's "phone home" message.

* When the inventory arrives at mpirun, ORTE delivers it to the PMIx network support plugins for processing. As far as ORTE is concerned, it is an opaque "blob" - only the fabric plugin provider knows what is in it and how to process it. In the case of TCP (which I wrote), we store information on both the available static ports on each node and the available NICs (e.g., the subnet each is attached to).

* When mpirun is ready to launch, it passes the process map down to the PMIx network support plugins (again, every plugin gets to see it) so they can assign/allocate network resources to the procs. In the case of TCP, we assign to each process a static socket (or multiple sockets if requested), a prioritized list of the NICs it can use (based on distance), and the network coordinates of those NICs. This all gets bundled into a per-plugin "blob" and passed up to mpirun for inclusion in the launch command sent to the daemons.

* When a daemon receives the launch command, it passes the "blobs" down to the local PMIx network support plugins, which parse them as they see fit. In the case of TCP, we simply store the assignment info in the PMIx datastore for retrieval by the procs when they want to communicate with a peer or compute a topology-aware collective pattern.
The definition of coordinate values for each NIC is up to the network support plugins. The pmix_coord_t struct includes an array of integer coordinates along with a value indicating the number of dimensions and a flag indicating whether it is a "logical" or "physical" view - this is in keeping with the MPI topology WG. Some fabric vendors are writing plugins that provide that info per their own algorithms. In the case of TCP, what I've done is rather simple: I provide a "logical" x,y,z coordinate for each NIC where:

* x represents the relative NIC index on the host where the proc is located - just a simple counter (e.g., this is the third NIC on the host)
* y represents the switch to which that NIC is attached - i.e., if you have the same y-coord as another NIC, you are attached to the same switch
* z represents the subnet - i.e., if you have the same z-coord as another NIC, then that NIC is on the same subnet as you

It is totally up to the plugin - the idea is to give each process information that lets it know its relative location. I'm quite open to modifying the TCP scheme, as it was just done as an example for testing the infrastructure.

You can retrieve coordinate info for any proc using PMIx_Get. You can also retrieve the relative communication cost to any proc - the plugin will compute it for you based on the coordinates, assuming the plugin supports that ability (in the case of my TCP one, it uses the coordinates to compute the number of hops, because I numbered things to support that algorithm).

PRRTE already knows how to do all this - there are a few simple changes required to sync OMPI. If folks are interested in exploring this further, please let me know.

Ralph
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel