I'll be writing a series of notes containing thoughts on how to exploit 
PMIx-provided information, especially covering aspects that might not be 
obvious (e.g., attributes that might not be widely known). This first note 
covers the topic of collective optimization.

PMIx provides network-related information that can be used when constructing 
collectives - in this case, hierarchical collectives that minimize cross-switch 
communication. Several pieces of information that can help with constructing 
such collectives are available from PMIx at process execution time (a usage 
sketch follows the list). These include:

* PMIX_LOCAL_PEERS - the list of local peers (i.e., procs from your nspace) 
sharing your node. This can be used to aggregate the contributions from 
participating procs on the node at (for example) the lowest-ranked participant 
on that node (call this the "node leader").

* PMIX_SWITCH_PEERS - the list of peers that share the same switch as the proc 
specified in the call to PMIx_Get. Multi-NIC environments will return an array 
of results, each element containing a NIC and the list of peers sharing the 
switch to which that NIC is connected. This can be used to aggregate 
contributions across switches - e.g., by having the lowest-ranked participating 
proc on each switch take part in an allgather and then distribute the results 
to the participating node leaders for final distribution across their nodes.
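
As a concrete illustration, here is a minimal sketch of the node-leader 
selection described above. It relies on PMIX_LOCAL_PEERS being returned as a 
comma-delimited string of ranks and assumes that every local peer is 
participating in the collective:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <pmix.h>

    int main(void)
    {
        pmix_proc_t myproc, wildcard;
        pmix_value_t *val;
        pmix_rank_t leader = PMIX_RANK_INVALID;

        if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
            return 1;
        }

        /* PMIX_LOCAL_PEERS is job-level info, so query it against
         * the wildcard rank of our own nspace */
        PMIX_PROC_CONSTRUCT(&wildcard);
        strncpy(wildcard.nspace, myproc.nspace, PMIX_MAX_NSLEN);
        wildcard.rank = PMIX_RANK_WILDCARD;

        if (PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_LOCAL_PEERS,
                                     NULL, 0, &val)) {
            /* comma-delimited string of ranks sharing this node -
             * take the lowest as the "node leader" */
            char *peers = strdup(val->data.string);
            for (char *tok = strtok(peers, ","); NULL != tok;
                 tok = strtok(NULL, ",")) {
                pmix_rank_t r = (pmix_rank_t)strtoul(tok, NULL, 10);
                if (r < leader) {
                    leader = r;
                }
            }
            free(peers);
            PMIX_VALUE_RELEASE(val);
        }

        if (leader == myproc.rank) {
            printf("rank %u: I am the node leader\n", (unsigned)myproc.rank);
        }

        PMIx_Finalize(NULL, 0);
        return 0;
    }

Note that every proc computes the same answer from the same job-level data, so 
the leader is elected without any communication.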

In the case of non-flat fabrics, further information regarding the topology of 
the fabric and the location of each proc within that topology is provided to 
aid in the construction of a collective (a retrieval sketch follows the list). 
This includes:

* PMIX_NETWORK_COORDINATE - network coordinate of the specified process in the 
given view type (e.g., logical vs physical), expressed as a pmix_coord_t struct 
that contains both the coordinates and the number of dimensions
* PMIX_NETWORK_VIEW - requested view type (e.g., logical vs physical)
* PMIX_NETWORK_DIMS - number of dimensions in the specified network plane/view
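
Here is a hedged sketch of retrieving a peer's coordinate. The attribute names 
are the ones listed above; the value layout (a pmix_coord_t reached through 
val->data.coord) and the use of PMIX_NETWORK_VIEW as a string-valued qualifier 
are my assumptions - check the headers in your PMIx tree for the actual types:

    #include <stdio.h>
    #include <pmix.h>

    /* print the coordinate of "peer" in the logical view - a sketch;
     * the qualifier's value type is assumed to be a string here */
    static void print_coord(const pmix_proc_t *peer)
    {
        pmix_info_t qual;
        pmix_value_t *val;

        PMIX_INFO_LOAD(&qual, PMIX_NETWORK_VIEW, "logical", PMIX_STRING);

        if (PMIX_SUCCESS == PMIx_Get(peer, PMIX_NETWORK_COORDINATE,
                                     &qual, 1, &val)) {
            pmix_coord_t *c = val->data.coord;  /* coords + number of dims */
            printf("%s:%u @", peer->nspace, (unsigned)peer->rank);
            for (size_t n = 0; n < c->dims; n++) {
                printf(" %u", (unsigned)c->coord[n]);
            }
            printf("\n");
            PMIX_VALUE_RELEASE(val);
        }
        PMIX_INFO_DESTRUCT(&qual);
    }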

In addition, there are some values that can aid in interpreting this info 
and/or describing it (e.g., in diagnostic output); a brief example follows the 
list:

* PMIX_NETWORK_PLANE - string ID of a network plane
* PMIX_NETWORK_SWITCH - string ID of a network switch
* PMIX_NETWORK_NIC - string ID of a NIC
* PMIX_NETWORK_SHAPE - number of interfaces (uint32_t) on each dimension of the 
specified network plane in the requested view
* PMIX_NETWORK_SHAPE_STRING - network shape expressed as a string (e.g., 
"10x12x2")

Obviously, the availability of this support depends directly on access to the 
required information. In the case of managed fabrics, the information is 
provided by PMIx plugins that obtain it from the respective fabric manager. I 
am writing the support for Cray's Slingshot fabric, but any managed fabric can 
be supported should someone wish to do so.

Unmanaged fabrics pose a bit of a challenge (e.g., how does one determine who 
shares your switch?), but I suspect those who understand those environments can 
probably devise a solution should they choose to pursue it. Remember, PMIx 
includes interfaces that allow the daemon-level PMIx servers to collect 
whatever information the fabric plugins deem useful, from either the fabric or 
the local node, and roll it up for later use. This allows us, for example, to 
provide the fabric support plugins with information on the locality of the NICs 
on each node, which they then use in assigning network endpoints.

This support will be appearing in PMIx (and thus in OMPI) starting this summer. 
You can play with it now, if you like - there are a couple of test examples in 
the PMIx code base (see src/mca/pnet) that provide the simulated values our 
early adopters are using for development. You are welcome to use those, or to 
write your own plugin.

As always, I'm happy to provide advice/help to those interested in utilizing 
these capabilities.
Ralph

