WHAT: Changes to MPI layer modex API

WHY: To make the modex more scalable

WHERE: ompi/mpi/runtime/ompi_module_exchange.* and everywhere that calls ompi_modex_send() and/or ompi_modex_recv()

TIMEOUT: COB Fri 4 Apr 2008

DESCRIPTION:

Per some of the scalability discussions that have been occurring (some on-list and some off-list), and per the e-mail I sent out last week about ongoing work in the openib BTL, Ralph and I put together a loose proposal this morning to make the modex more scalable. The timeout is fairly short because Ralph wanted to start implementing in the near future, and we didn't anticipate that this would be a contentious proposal.

The theme is to break the modex into two different kinds of data:

- Modex data that is specific to a given proc
- Modex data that is applicable to all procs on a given node

For example, in the openib BTL, the majority of modex data is applicable to all processes on the same node (GIDs and LIDs and whatnot). It is much more efficient to send only one copy of such node-specific data to each process (vs. sending ppn copies to each process). The spreadsheet I included in last week's e-mail clearly shows this.
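As a rough back-of-the-envelope illustration (these numbers are made up for this e-mail, not taken from the spreadsheet): with 128 nodes and ppn=8 (1024 processes), per-proc modex data means every process ends up receiving 1024 blobs, whereas node-scoped data means it only needs 128, a ppn-fold reduction for that portion of the modex.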

1. Add new modex API functions. The exact function signatures are TBD, but they will generally be of the following form (a rough C sketch follows the list of calls):

* int ompi_modex_proc_send(...): send modex data that is specific to this process. It is essentially the same as the current API call (ompi_modex_send).

* int ompi_modex_proc_recv(...): receive modex data from a specified peer process (indexed on ompi_proc_t*). It is essentially the same as the current API call (ompi_modex_recv).

* int ompi_modex_node_send(...): send modex data that is relevant for all processes in this job on this node. It is intended that only one process in a job on a node will call this function. If more than one process in a job on a node calls _node_send(), then only one will "win" (meaning that the data sent by the others will be overwritten).

* int ompi_modex_node_recv(...): receive modex data that is relevant for a whole peer node, i.e., the "winning" blob sent by _node_send() from the source node. We haven't yet decided what the node index will be; it may be (ompi_proc_t*) (i.e., _node_recv() would figure out what node the (ompi_proc_t*) resides on and then give you the data).
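For illustration only, here is a rough C sketch of what the prototypes might look like. Remember that the exact signatures are TBD; the parameter lists below are guesses modeled on the current ompi_modex_send/recv prototypes, and the types (mca_base_component_t, ompi_proc_t) come from the usual OMPI/OPAL headers:

    /* Per-proc data: same semantics as today's modex calls */
    int ompi_modex_proc_send(mca_base_component_t *source_component,
                             const void *data, size_t size);
    int ompi_modex_proc_recv(mca_base_component_t *component,
                             ompi_proc_t *proc,
                             void **buffer, size_t *size);

    /* Node-scoped data: one "winning" blob per job per node */
    int ompi_modex_node_send(mca_base_component_t *source_component,
                             const void *data, size_t size);
    /* Possibly indexed on an ompi_proc_t*: the call would figure out
       which node that proc lives on and return that node's blob */
    int ompi_modex_node_recv(mca_base_component_t *component,
                             ompi_proc_t *proc,
                             void **buffer, size_t *size);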

2. Make the existing modex API calls (ompi_modex_send, ompi_modex_recv) be thin wrappers around the new "proc" send/receive calls (a sketch follows below). This will provide exactly the same functionality as the current API (but be sub-optimal at scale). It will give BTL authors (etc.) time to update to the new API, potentially taking advantage of common data across multiple processes on the same node. We'll likely put in some opal_output()'s in the wrappers to help identify code that is still calling the old APIs.
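As a sketch of what such a wrapper might look like (again, illustrative only; the real code may differ, and the diagnostic message / verbosity level below are placeholders):

    /* Sketch: old API forwarded to the new per-proc call, with a
       diagnostic to flag components still calling the old API */
    int ompi_modex_send(mca_base_component_t *source_component,
                        const void *data, size_t size)
    {
        opal_output(0, "modex: component %s is calling the deprecated "
                    "ompi_modex_send(); please switch to "
                    "ompi_modex_proc_send()",
                    source_component->mca_component_name);
        return ompi_modex_proc_send(source_component, data, size);
    }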

3. Remove the old API calls (ompi_modex_send, ompi_modex_recv) before v1.3 is released.

--
Jeff Squyres
Cisco Systems
