I should have reminded everyone of the basics:

* PMIX_NETWORK_ENDPT - gives you an array of network endpoints for the specified proc, one per NIC, ordered from closest to farthest from where that proc is bound.

Similarly, PMIX_NETWORK_COORDINATE provides the array of coordinates for the specified proc, one per NIC, ordered as above.

I'll be posting some example code illustrating the use of all these in the near future and will alert anyone interested when I do.

Ralph

> On Mar 22, 2020, at 11:36 AM, Ralph Castain via devel
> <devel@lists.open-mpi.org> wrote:
>
> I'll be writing a series of notes containing thoughts on how to exploit
> PMIx-provided information, especially covering aspects that might not be
> obvious (e.g., attributes that might not be widely known). This first note
> covers the topic of collective optimization.
>
> PMIx provides network-related information that can be used in construction of
> collectives - in this case, hierarchical collectives that minimize
> cross-switch communications. Several pieces of information that might help
> with construction of such collectives are provided by PMIx at time of process
> execution. These include:
>
> * PMIX_LOCAL_PEERS - the list of local peers (i.e., procs from your nspace)
> sharing your node. This can be used to aggregate the contribution from
> participating procs on the node to (for example) the lowest rank participator
> on that node (call this the "node leader").
>
> * PMIX_SWITCH_PEERS - the list of peers that share the same switch as the
> proc specified in the call to PMIx_Get. Multi-NIC environments will return an
> array of results, each element containing the NIC and the list of peers
> sharing the switch to which that NIC is connected. This can be used to
> aggregate the contribution across switches - e.g., by having the lowest
> ranked participating proc on each switch participate in an allgather, and
> then distribute the results to the participating node leaders for final
> distribution across their nodes.
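
A minimal sketch of the leader-selection step described in the two bullets above, assuming the peer list is already in hand as a comma-delimited string of ranks (the form the PMIx standard documents for PMIX_LOCAL_PEERS). The helper names lowest_rank and is_leader are hypothetical and not part of PMIx. Fetching the string with PMIx_Get and unpacking the per-NIC array returned for PMIX_SWITCH_PEERS are deliberately left out, since their exact layout is not spelled out here.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not a PMIx call): return the lowest rank found in a
 * comma-delimited list such as "4,5,6,7". Returns UINT32_MAX if the list
 * is empty or contains anything other than decimal ranks and commas. */
uint32_t lowest_rank(const char *peers)
{
    assert(peers != NULL);
    uint32_t best = UINT32_MAX;
    const char *p = peers;

    while (*p != '\0') {
        char *end;
        unsigned long r;

        if (!isdigit((unsigned char)*p)) {
            return UINT32_MAX;              /* entry does not start with a digit */
        }
        errno = 0;
        r = strtoul(p, &end, 10);
        if (errno != 0 || r >= UINT32_MAX) {
            return UINT32_MAX;              /* out of range for a rank */
        }
        if ((uint32_t)r < best) {
            best = (uint32_t)r;
        }
        if (*end == ',') {
            p = end + 1;                    /* step over the separator */
        } else if (*end == '\0') {
            p = end;                        /* end of list */
        } else {
            return UINT32_MAX;              /* unexpected character */
        }
    }
    return best;
}

/* Hypothetical helper: nonzero if my_rank is the lowest rank in the list,
 * i.e. this proc would act as the leader for that peer group. */
int is_leader(uint32_t my_rank, const char *peers)
{
    uint32_t low = lowest_rank(peers);
    return low != UINT32_MAX && low == my_rank;
}
```

Under these assumptions, is_leader(my_rank, local_peers) would pick the node leader, and the same call on a single switch's peer list would pick the proc that joins the cross-switch allgather.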
>
> In the case of non-flat fabrics, further information regarding the topology
> of the fabric and the location of each proc within that topology is provided
> to aid in the construction of a collective. These include:
>
> * PMIX_NETWORK_COORDINATE - network coordinate of the specified process in
> the given view type (e.g., logical vs physical), expressed as a pmix_coord_t
> struct that contains both the coordinates and the number of dimensions
> * PMIX_NETWORK_VIEW - Requested view type (e.g., logical vs physical)
> * PMIX_NETWORK_DIMS - Number of dimensions in the specified network plane/view
>
> In addition, there are some values that can aid in interpreting this info
> and/or describing it (e.g., in diagnostic output):
>
> * PMIX_NETWORK_PLANE - string ID of a network plane
> * PMIX_NETWORK_SWITCH - string ID of a network switch
> * PMIX_NETWORK_NIC - string ID of a NIC
> * PMIX_NETWORK_SHAPE - number of interfaces (uint32_t) on each dimension of
> the specified network plane in the requested view
> * PMIX_NETWORK_SHAPE_STRING - network shape expressed as a string (e.g.,
> "10x12x2")
>
> Obviously, the availability of this support depends directly on access to the
> required information. In the case of managed fabrics, this is provided by
> PMIx plugins that directly obtain it from the respective fabric manager. I am
> writing the support for Cray's Slingshot fabric, but any managed fabric can
> be supported should someone wish to do so.
>
> Unmanaged fabrics pose a bit of a challenge (e.g., how does one determine who
> shares your switch?), but I suspect those who understand those environments
> can probably devise a solution should they choose to pursue it.
> Remember, PMIx includes interfaces that allow the daemon-level PMIx servers
> to collect any information the fabric plugins deem useful from either the
> fabric or local node level and roll it up for later use - this allows us,
> for example, to provide the fabric support plugins with information on the
> locality of NICs on each node, which they then use in assigning network
> endpoints.
>
> This support will be appearing in PMIx (and thus, in OMPI) starting this
> summer. You can play with it now, if you like - there are a couple of test
> examples in the PMIx code base (see src/mca/pnet) that provide simulated
> values being used by our early adopters for development. You are welcome to
> use those, or to write your own plugin.
>
> As always, I'm happy to provide advice/help to those interested in utilizing
> these capabilities.
> Ralph
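
For diagnostic output, the PMIX_NETWORK_SHAPE_STRING form quoted above (e.g., "10x12x2") may need to be turned back into per-dimension counts. The following is a rough sketch only, assuming the string is always decimal counts separated by a lowercase 'x' as in that example; parse_shape is a hypothetical helper, not a PMIx function, and where the numeric values are available directly via PMIX_NETWORK_SHAPE that would be the more natural source.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not a PMIx call): parse a shape string such as
 * "10x12x2" into dims[]. Returns the number of dimensions parsed, or -1
 * if the string is malformed or has more than max_dims dimensions.
 * Assumes the 'x'-separated decimal format shown in the example. */
int parse_shape(const char *shape, uint32_t *dims, size_t max_dims)
{
    assert(shape != NULL && dims != NULL);
    size_t n = 0;
    const char *p = shape;

    for (;;) {
        char *end;
        unsigned long v;

        if (!isdigit((unsigned char)*p) || n == max_dims) {
            return -1;                  /* missing count or too many dims */
        }
        errno = 0;
        v = strtoul(p, &end, 10);
        if (errno != 0 || v > UINT32_MAX) {
            return -1;                  /* count does not fit in uint32_t */
        }
        dims[n++] = (uint32_t)v;
        if (*end == '\0') {
            return (int)n;              /* consumed the whole string */
        }
        if (*end != 'x') {
            return -1;                  /* unexpected separator */
        }
        p = end + 1;                    /* move past the 'x' */
    }
}
```

With a four-element array, parse_shape("10x12x2", dims, 4) fills dims with 10, 12, 2 and returns 3; a trailing 'x' or an empty string is rejected.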