WHAT: Changes to MPI layer modex API
WHY: To make the modex more scalable
WHERE: ompi/mpi/runtime/ompi_module_exchange.* and everywhere that
calls ompi_modex_send() and/or ompi_modex_recv()
TIMEOUT: COB Fri 4 Apr 2008
DESCRIPTION:
Per some of the scalability discussions that have been occurring (some
on-list and some off-list), and per the e-mail I sent out last week
about ongoing work in the openib BTL, Ralph and I put together a loose
proposal this morning to make the modex more scalable. The timeout is
fairly short because Ralph wanted to start implementing in the near
future, and we didn't anticipate that this would be a contentious
proposal.
The theme is to break the modex into two different kinds of data:
- Modex data that is specific to a given proc
- Modex data that is applicable to all procs on a given node
For example, in the openib BTL, the majority of modex data (GIDs, LIDs,
and whatnot) is the same for all processes on a given node. It is much
more efficient to send each process only one copy of such node-specific
data per node than to send it ppn copies (one from every process on
that node, where ppn is the number of processes per node). The
spreadsheet I included in last week's e-mail clearly shows this.
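As a rough illustration of the savings, here is a small C sketch that counts the modex bytes a single process receives under each scheme. The function names, the parameters, and the sizes used in the note below are made up for illustration; they are not taken from the spreadsheet or from the openib BTL. For simplicity the count covers every process in the job, including the receiver itself.

```c
#include <stddef.h>

/* Current scheme: every process sends its own data plus its own copy of
 * the node-level data, so a receiver gets ppn copies of the node data
 * from each node. */
size_t modex_bytes_flat(size_t nodes, size_t ppn,
                        size_t proc_bytes, size_t node_bytes)
{
    return nodes * ppn * (proc_bytes + node_bytes);
}

/* Proposed scheme: per-process data still arrives once per process, but
 * node-level data arrives only once per node. */
size_t modex_bytes_split(size_t nodes, size_t ppn,
                         size_t proc_bytes, size_t node_bytes)
{
    return nodes * ppn * proc_bytes + nodes * node_bytes;
}
```

With hypothetical values of 1024 nodes, 8 processes per node, 8 bytes of per-process data, and 64 bytes of per-node data, the first function gives 589824 bytes per receiving process and the second gives 131072.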
The proposal has three steps:
1. Add new modex API functions. The exact function signatures are
TBD, but they will generally be of the form:
* int ompi_modex_proc_send(...): send modex data that is specific to
this process. It is essentially the same as the current API call
(ompi_modex_send).
* int ompi_modex_proc_recv(...): receive modex data from a specified
peer process (indexed on ompi_proc_t*). It is essentially the same as
the current API call (ompi_modex_recv).
* int ompi_modex_node_send(...): send modex data that is relevant
for all processes in this job on this node. It is intended that only
one process in a job on a node will call this function. If more than
one process in a job on a node calls _node_send(), then only one will
"win" (meaning that the data sent by the others will be overwritten).
* int ompi_modex_node_recv(...): receive modex data that is relevant
for a whole peer node, i.e., the "winning" blob sent by _node_send()
from the source node. We haven't yet decided what the node index will
be; it may be (ompi_proc_t*), in which case _node_recv() would figure
out which node the (ompi_proc_t*) resides on and then give you that
node's data.
2. Make the existing modex API calls (ompi_modex_send,
ompi_modex_recv) wrappers around the new "proc" send/receive calls.
This will provide exactly the same functionality as the current API
(but will be sub-optimal at scale). It will give BTL authors (etc.)
time to update to the new API, potentially taking advantage of data
that is common across multiple processes on the same node. We'll
likely put some opal_output() calls in the wrappers to help identify
code that is still calling the old APIs.
3. Remove the old API calls (ompi_modex_send, ompi_modex_recv) before
v1.3 is released.
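To make steps 1 and 2 a bit more concrete, here is a toy, single-process sketch of the intended semantics. Everything prefixed with toy_ is hypothetical: the real signatures are still TBD, toy_proc_t is only a stand-in for ompi_proc_t, in-memory arrays stand in for the actual exchange, and fprintf() stands in for opal_output(). It shows only the "last _node_send() wins" behavior, a node lookup keyed on the peer process, and the old calls forwarding to the new "proc" calls.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_NODES 4
#define TOY_MAX_PROCS 16
#define TOY_MAX_BLOB  64

/* Hypothetical stand-in for ompi_proc_t: a rank and the node it runs on. */
typedef struct { int rank; int node_id; } toy_proc_t;

typedef struct {
    unsigned char data[TOY_MAX_BLOB];
    size_t len;
    int valid;
} toy_blob_t;

static toy_blob_t toy_proc_store[TOY_MAX_PROCS]; /* one slot per process */
static toy_blob_t toy_node_store[TOY_MAX_NODES]; /* one slot per node */
int toy_legacy_calls = 0;                        /* calls through old API */

static int toy_proc_ok(const toy_proc_t *p)
{
    return p != NULL && p->rank >= 0 && p->rank < TOY_MAX_PROCS &&
           p->node_id >= 0 && p->node_id < TOY_MAX_NODES;
}

/* Store a blob in a slot, overwriting whatever was there before. */
static int toy_put(toy_blob_t *slot, const void *buf, size_t len)
{
    if (buf == NULL || len > TOY_MAX_BLOB) return -1;
    memcpy(slot->data, buf, len);
    slot->len = len;
    slot->valid = 1;
    return 0;
}

/* Hand back a pointer to the stored blob, or -1 if nothing was sent. */
static int toy_get(const toy_blob_t *slot, const void **buf, size_t *len)
{
    if (!slot->valid || buf == NULL || len == NULL) return -1;
    *buf = slot->data;
    *len = slot->len;
    return 0;
}

/* Step 1: per-process data, indexed on the process. */
int toy_modex_proc_send(const toy_proc_t *self, const void *buf, size_t len)
{
    return toy_proc_ok(self) ? toy_put(&toy_proc_store[self->rank], buf, len) : -1;
}

int toy_modex_proc_recv(const toy_proc_t *peer, const void **buf, size_t *len)
{
    return toy_proc_ok(peer) ? toy_get(&toy_proc_store[peer->rank], buf, len) : -1;
}

/* Step 1: per-node data. If several processes on a node send, the last
 * one overwrites the others. */
int toy_modex_node_send(const toy_proc_t *self, const void *buf, size_t len)
{
    return toy_proc_ok(self) ? toy_put(&toy_node_store[self->node_id], buf, len) : -1;
}

/* Indexed on a peer process; resolves the node that process resides on. */
int toy_modex_node_recv(const toy_proc_t *peer, const void **buf, size_t *len)
{
    return toy_proc_ok(peer) ? toy_get(&toy_node_store[peer->node_id], buf, len) : -1;
}

/* Step 2: old entry points become thin wrappers around the "proc" calls
 * and complain once so that remaining callers can be found. */
static void toy_note_legacy(const char *name)
{
    if (toy_legacy_calls++ == 0) {
        fprintf(stderr, "modex: %s() called; please move to the new API\n", name);
    }
}

int toy_modex_send(const toy_proc_t *self, const void *buf, size_t len)
{
    toy_note_legacy("toy_modex_send");
    return toy_modex_proc_send(self, buf, len);
}

int toy_modex_recv(const toy_proc_t *peer, const void **buf, size_t *len)
{
    toy_note_legacy("toy_modex_recv");
    return toy_modex_proc_recv(peer, buf, len);
}
```

In this sketch, if two processes on the same node each call toy_modex_node_send(), a later toy_modex_node_recv() for any process on that node returns the second blob. Warning only once is a choice made here to keep the output quiet; a real wrapper might instead report each calling component separately.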
--
Jeff Squyres
Cisco Systems