[OMPI devel] RFC: changes to modex

Jeff Squyres Wed, 2 Apr 2008 10:21:17 -0400

WHAT: Changes to MPI layer modex API

WHY: To be mo' betta scalable

WHERE: ompi/mpi/runtime/ompi_module_exchange.* and everywhere that calls ompi_modex_send() and/or ompi_modex_recv()


TIMEOUT: COB Fri 4 Apr 2008

DESCRIPTION:

Per some of the scalability discussions that have been occurring (some on-list and some off-list), and per the e-mail I sent out last week about ongoing work in the openib BTL, Ralph and I put together a loose proposal this morning to make the modex more scalable. The timeout is fairly short because Ralph wanted to start implementing in the near future, and we didn't anticipate that this would be a contentious proposal.


The theme is to break the modex into two different kinds of data:

- Modex data that is specific to a given proc
- Modex data that is applicable to all procs on a given node

For example, in the openib BTL, the majority of modex data is applicable to all processes on the same node (GIDs and LIDs and whatnot). It is much more efficient to send only one copy of such node-specific data to each process (vs. sending ppn copies to each process). The spreadsheet I included in last week's e-mail clearly shows this.
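To make the "one copy vs. ppn copies" point concrete, here is a rough back-of-the-envelope sketch in C. The function names and the byte sizes in the usage note are made up for illustration (they are not taken from the spreadsheet), and the arithmetic ignores any packing or header overhead.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes a single process receives when every peer process sends its
 * whole blob, i.e. the node-common part is repeated ppn times per node. */
size_t sketch_bytes_per_proc_only(size_t nodes, size_t ppn,
                                  size_t node_bytes, size_t proc_bytes)
{
    assert(nodes > 0 && ppn > 0);
    return nodes * ppn * (node_bytes + proc_bytes);
}

/* Bytes a single process receives when the node-common part is sent
 * once per node and only the proc-specific part is sent per process. */
size_t sketch_bytes_split(size_t nodes, size_t ppn,
                          size_t node_bytes, size_t proc_bytes)
{
    assert(nodes > 0 && ppn > 0);
    return nodes * node_bytes + nodes * ppn * proc_bytes;
}
```

With hypothetical values of 1024 nodes, 8 ppn, 64 bytes of node-common data and 8 bytes of proc-specific data, the first function gives 589824 bytes per receiving process and the second gives 131072.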

1. Add new modex API functions. The exact function signatures are TBD, but they will be generally of the form:

* int ompi_modex_proc_send(...): send modex data that is specific to this process. It is just about exactly the same as the current API call (ompi_modex_send).

* int ompi_modex_proc_recv(...): receive modex data from a specified peer process (indexed on ompi_proc_t*). It is just about exactly the same as the current API call (ompi_modex_recv).

* int ompi_modex_node_send(...): send modex data that is relevant for all processes in this job on this node. It is intended that only one process in a job on a node will call this function. If more than one process in a job on a node calls _node_send(), then only one will "win" (meaning that the data sent by the others will be overwritten).

* int ompi_modex_node_recv(...): receive modex data that is relevant for a whole peer node; receive the ["winning"] blob sent by _node_send() from the source node. We haven't yet decided what the node index will be; it may be (ompi_proc_t*) (i.e., _node_recv() would figure out what node the (ompi_proc_t*) resides on and then give you the data).
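Since the real signatures are TBD, the following is only a toy in-memory model of the four calls, to illustrate the intended semantics (per-proc slots, one slot per node, last _node_send() wins, _node_recv() indexed by a proc). Everything here is hypothetical: the sketch_ names, the sketch_proc_t stand-in for ompi_proc_t, the argument lists, and the fixed-size tables. A real modex would exchange the data across processes rather than keep it in local arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_NODES 4
#define SKETCH_MAX_PROCS 16
#define SKETCH_MAX_BLOB  64

/* Hypothetical stand-in for ompi_proc_t: just a rank and a node index. */
typedef struct { int rank; int node; } sketch_proc_t;

typedef struct {
    unsigned char data[SKETCH_MAX_BLOB];
    size_t len;
    int valid;              /* nonzero once something has been stored */
} sketch_blob_t;

static sketch_blob_t sketch_proc_store[SKETCH_MAX_PROCS];
static sketch_blob_t sketch_node_store[SKETCH_MAX_NODES];

/* Bounds-checked slot lookup; returns NULL for an out-of-range index. */
static sketch_blob_t *sketch_slot(sketch_blob_t *table, int count, int idx)
{
    return (idx >= 0 && idx < count) ? &table[idx] : NULL;
}

/* Store a blob, overwriting whatever was there (hence "last one wins"). */
static int sketch_put(sketch_blob_t *slot, const void *buf, size_t len)
{
    if (slot == NULL || buf == NULL || len > SKETCH_MAX_BLOB) return -1;
    memcpy(slot->data, buf, len);
    slot->len = len;
    slot->valid = 1;
    return 0;
}

static int sketch_get(const sketch_blob_t *slot, const void **buf, size_t *len)
{
    assert(buf != NULL && len != NULL);
    if (slot == NULL || !slot->valid) return -1;
    *buf = slot->data;
    *len = slot->len;
    return 0;
}

/* Proc-specific data: one slot per process. */
int sketch_modex_proc_send(const sketch_proc_t *self, const void *buf, size_t len)
{
    assert(self != NULL);
    return sketch_put(sketch_slot(sketch_proc_store, SKETCH_MAX_PROCS, self->rank), buf, len);
}

int sketch_modex_proc_recv(const sketch_proc_t *peer, const void **buf, size_t *len)
{
    assert(peer != NULL);
    return sketch_get(sketch_slot(sketch_proc_store, SKETCH_MAX_PROCS, peer->rank), buf, len);
}

/* Node-wide data: one slot per node, shared by every proc on that node. */
int sketch_modex_node_send(const sketch_proc_t *self, const void *buf, size_t len)
{
    assert(self != NULL);
    return sketch_put(sketch_slot(sketch_node_store, SKETCH_MAX_NODES, self->node), buf, len);
}

/* Indexed by proc: look up the node the peer resides on, return its blob. */
int sketch_modex_node_recv(const sketch_proc_t *peer, const void **buf, size_t *len)
{
    assert(peer != NULL);
    return sketch_get(sketch_slot(sketch_node_store, SKETCH_MAX_NODES, peer->node), buf, len);
}
```

In this model, if two procs on node 0 both call sketch_modex_node_send(), a later sketch_modex_node_recv() on any proc of node 0 returns whichever blob was stored last, and a recv for a node that never sent returns -1.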

2. Make the existing modex API calls (ompi_modex_send, ompi_modex_recv) be wrappers around the new "proc" send/receive calls. This will provide exactly the same functionality as the current API (but be sub-optimal at scale). It will give BTL authors (etc.) time to update to the new API, potentially taking advantage of common data across multiple processes on the same node. We'll likely put in some opal_output()'s in the wrappers to help identify code that is still calling the old APIs.
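The wrapper idea in step 2 might look roughly like the sketch below. It is not the planned implementation: the compat_ names and argument lists are invented, the "proc" calls are reduced to a single local buffer, and fprintf(stderr, ...) stands in for the opal_output() notice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define COMPAT_MAX_BLOB 64

static unsigned char compat_blob[COMPAT_MAX_BLOB];
static size_t compat_len;
static int compat_have_blob;
static int compat_old_api_calls;   /* how often the old entry points ran */

/* Stand-in for the new per-proc send: keep one blob in a local buffer. */
static int compat_proc_send(const void *buf, size_t len)
{
    if (buf == NULL || len > COMPAT_MAX_BLOB) return -1;
    memcpy(compat_blob, buf, len);
    compat_len = len;
    compat_have_blob = 1;
    return 0;
}

/* Stand-in for the new per-proc recv: copy the blob out to the caller. */
static int compat_proc_recv(void *buf, size_t cap, size_t *len)
{
    assert(len != NULL);
    if (buf == NULL || !compat_have_blob || cap < compat_len) return -1;
    memcpy(buf, compat_blob, compat_len);
    *len = compat_len;
    return 0;
}

/* Stand-in for the opal_output() notice that flags old-API callers. */
static void compat_warn_old_api(const char *func)
{
    compat_old_api_calls++;
    fprintf(stderr, "modex sketch: %s() is an old-API wrapper; "
                    "consider the proc/node variants\n", func);
}

/* Old-style entry points: same behavior, forwarded to the proc calls. */
int compat_modex_send(const void *buf, size_t len)
{
    compat_warn_old_api(__func__);
    return compat_proc_send(buf, len);
}

int compat_modex_recv(void *buf, size_t cap, size_t *len)
{
    compat_warn_old_api(__func__);
    return compat_proc_recv(buf, cap, len);
}
```

Callers of the old names keep working unchanged; the only visible difference in this sketch is the notice on stderr, which is the kind of hint that would help track down code still using the old calls.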

3. Remove the old API calls (ompi_modex_send, ompi_modex_recv) before v1.3 is released.


--
Jeff Squyres
Cisco Systems
