On Apr 2, 2008, at 11:10 AM, Tim Prins wrote:
Is there a reason to rename ompi_modex_{send,recv} to
ompi_modex_proc_{send,recv}? It seems simpler (no more confusing, and less work) to leave those names alone and just add ompi_modex_node_{send,recv}.

If the arguments don't change, I don't have a strong objection to leaving the names alone. I think the rationale for the new names is:

- the arguments may change
- completely clear names, and good symmetry with *_node_* and *_proc_*

If the args change, then I think it is best to use new names so that BTL authors (etc.) have time to adapt. If not, then I mildly prefer the new names, but don't care too much.

Another question: Does the receiving process care that the information
received applies to a whole node? I ask because maybe we could get the
same effect by simply adding a parameter to ompi_modex_send that
specifies whether the data applies to just the proc or to the whole node.

So, if we have ranks 1 & 2 on n1, and rank 3 on n2, then rank 1 would do:
ompi_modex_send("arch", arch, <applies to whole node>);
then rank 3 would do:
ompi_modex_recv(rank 1, "arch");
ompi_modex_recv(rank 2, "arch");
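
A minimal C sketch of that single-call alternative, for concreteness (the
scope enum and the extra argument are hypothetical, not existing API; it is
modeled on the current component-keyed ompi_modex_send/recv calls):

  /* hypothetical scope flag for the single-call alternative */
  typedef enum {
      OMPI_MODEX_SCOPE_PROC,  /* data applies only to the sending proc */
      OMPI_MODEX_SCOPE_NODE   /* data applies to every proc on the node */
  } ompi_modex_scope_t;

  /* rank 1 (on n1) publishes its arch blob once, marked node-wide */
  ompi_modex_send(arch_component, arch_buf, arch_len, OMPI_MODEX_SCOPE_NODE);

  /* rank 3 (on n2) still asks per peer proc; the modex would hand
     back the same node-wide blob for both rank 1 and rank 2 */
  ompi_modex_recv(arch_component, proc_rank1, &buf, &size);
  ompi_modex_recv(arch_component, proc_rank2, &buf, &size);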

I'm not sure I understand what you mean. Proc 3 would get the one blob that was sent from proc 1?

In the openib BTL, I'll likely have both node and proc portions to send.


I don't really care either way, just wanted to throw out the idea.

Tim

Jeff Squyres wrote:
WHAT: Changes to MPI layer modex API

WHY: To be mo' betta scalable

WHERE: ompi/mpi/runtime/ompi_module_exchange.* and everywhere that
calls ompi_modex_send() and/or ompi_modex_recv()

TIMEOUT: COB Fri 4 Apr 2008

DESCRIPTION:

Per some of the scalability discussions that have been occurring (some
on-list and some off-list), and per the e-mail I sent out last week
about ongoing work in the openib BTL, Ralph and I put together a loose proposal this morning to make the modex more scalable. The timeout is
fairly short because Ralph wanted to start implementing in the near
future, and we didn't anticipate that this would be a contentious
proposal.

The theme is to break the modex into two different kinds of data:

- Modex data that is specific to a given proc
- Modex data that is applicable to all procs on a given node

For example, in the openib BTL, the majority of modex data is
applicable to all processes on the same node (GIDs and LIDs and
whatnot).  It is much more efficient to send only one copy of such
node-specific data to each process (vs. sending ppn copies to each
process).  The spreadsheet I included in last week's e-mail clearly
shows this.
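
(To put rough, purely illustrative numbers on it -- not the spreadsheet's
figures: with 64 nodes and 8 procs per node, a 1 KB per-proc LID/GID blob
means each process receives 512 x 1 KB = 512 KB of modex data, almost all
of it duplicated; one node-scoped blob per node cuts that to
64 x 1 KB = 64 KB per process.)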

1. Add new modex API functions.  The exact function signatures are
TBD, but they will be generally of the form:

 * int ompi_modex_proc_send(...): send modex data that is specific to
this process.  It is just about exactly the same as the current API
call (ompi_modex_send).

 * int ompi_modex_proc_recv(...): receive modex data from a specified
peer process (indexed on ompi_proc_t*).  It is just about exactly the
same as the current API call (ompi_modex_recv).

 * int ompi_modex_node_send(...): send modex data that is relevant
for all processes in this job on this node.  It is intended that only
one process in a job on a node will call this function.  If more than
one process in a job on a node calls _node_send(), then only one will
"win" (meaning that the data sent by the others will be overwritten).

 * int ompi_modex_node_recv(...): receive modex data that is relevant
for a whole peer node; receive the ["winning"] blob sent by
_node_send() from the source node.  We haven't yet decided what the
node index will be; it may be (ompi_proc_t*) (i.e., _node_recv() would
figure out what node the (ompi_proc_t*) resides on and then give you
the data).
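
To make the shape concrete, one possible C rendering of the four calls --
purely illustrative since, as noted above, the exact signatures are TBD; it
just mirrors the existing calls' pattern of a component pointer, a blob,
and a size:

  /* hypothetical signatures -- the exact arguments are TBD */
  int ompi_modex_proc_send(mca_base_component_t *source_component,
                           const void *buffer, size_t size);

  int ompi_modex_proc_recv(mca_base_component_t *component,
                           struct ompi_proc_t *peer,
                           void **buffer, size_t *size);

  /* if several procs in a job on a node call _node_send(), one "wins" */
  int ompi_modex_node_send(mca_base_component_t *source_component,
                           const void *buffer, size_t size);

  /* indexed on ompi_proc_t* for now: _node_recv() would map the peer
     proc to its node and return that node's winning blob */
  int ompi_modex_node_recv(mca_base_component_t *component,
                           struct ompi_proc_t *peer_on_node,
                           void **buffer, size_t *size);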

2. Make the existing modex API calls (ompi_modex_send,
ompi_modex_recv) be wrappers around the new "proc" send/receive
calls.  This will provide exactly the same functionality as the
current API (but be sub-optimal at scale).  It will give BTL authors
(etc.) time to update to the new API, potentially taking advantage of
common data across multiple processes on the same node.  We'll likely
put in some opal_output()'s in the wrappers to help identify code that
is still calling the old APIs.
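
A rough sketch of what those wrappers could look like (the opal_output
message text is a guess; the point is only that the old entry points
forward to the new per-proc calls and warn about the old names):

  /* backwards-compatibility shims: old names forward to the per-proc calls */
  int ompi_modex_send(mca_base_component_t *source_component,
                      const void *buffer, size_t size)
  {
      opal_output(0, "modex: ompi_modex_send is deprecated; "
                  "use ompi_modex_proc_send instead");
      return ompi_modex_proc_send(source_component, buffer, size);
  }

  int ompi_modex_recv(mca_base_component_t *component,
                      struct ompi_proc_t *proc,
                      void **buffer, size_t *size)
  {
      opal_output(0, "modex: ompi_modex_recv is deprecated; "
                  "use ompi_modex_proc_recv instead");
      return ompi_modex_proc_recv(component, proc, buffer, size);
  }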

3. Remove the old API calls (ompi_modex_send, ompi_modex_recv) before
v1.3 is released.




--
Jeff Squyres
Cisco Systems
