I've had this conversation independently with several people now, so
I'm sending it to the list rather than continuing to have the same
conversation over and over. :-)
------
As most of you know, Jon and I are working on the new openib
"CPC" (connect pseudo-component) stuff in /tmp-public/openib-cpc2.
There are two main reasons for it:
1. Add support for RDMA CM (they need it for iWarp support)
2. Add support for IB CM (which will hopefully be a more scalable way
to make IB QPs than the current RML/OOB-based method)
When complete, there will be 4 CPCs: RDMA CM, IB CM, OOB, and XOOB
(same as OOB but with ConnectX XRC extensions).
RDMA CM has some known scaling issues, and at least some known
workarounds -- I won't discuss the merits/drawbacks of RDMA CM here.
IB CM has unknown scaling characteristics, but looks good on paper
(e.g., it uses a UD-based 3-way handshake to make an IB QP).
On the trunk, it's a per-MPI-process decision as to which CPC you'll
use. Per ticket #1191, one of the goals of the /tmp-public branch is
to make the CPC decision a per-openib-BTL-module decision. So you can
mix iWarp and IB hardware in a single host, for example. This fits in
quite well with the "mpirun should work out of the box" philosophy of
Open MPI.
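Very roughly, the per-module selection amounts to something like the
following (the types and names here are purely illustrative, not the
actual code in the branch):

#include <stddef.h>

#define MAX_CPCS 4

typedef struct cpc {
    const char *name;
    /* returns 0 if this CPC can service a port of the given transport
       type (e.g., IB vs. iWARP), nonzero otherwise */
    int (*query)(int device_transport);
} cpc_t;

typedef struct btl_module {
    int device_transport;              /* transport of this HCA/NIC port */
    const cpc_t *usable_cpcs[MAX_CPCS];
    size_t num_usable_cpcs;
} btl_module_t;

/* Ask every registered CPC whether it can handle this module's port
   and remember the ones that can. */
static void select_cpcs(btl_module_t *module,
                        const cpc_t *all_cpcs, size_t num_cpcs)
{
    module->num_usable_cpcs = 0;
    for (size_t i = 0; i < num_cpcs; ++i) {
        if (0 == all_cpcs[i].query(module->device_transport)) {
            module->usable_cpcs[module->num_usable_cpcs++] = &all_cpcs[i];
        }
    }
}

The point is that the usable-CPC list is computed once per BTL module
(i.e., per port), not once per MPI process.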
In the openib BTL, each BTL module is paired with a specific HCA/NIC
(verbs) port. And depending on the interface hardware and software,
one or more CPCs may be available for each. Hence, for each BTL
module in each MPI process, we may send one or more CPC connect
information blobs in the modex (note that the oob and xoob CPCs don't
need to send anything additional in the modex).
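To give a feel for the shape of that data, here's a toy packing
routine (purely illustrative: the branch goes through OMPI's normal
modex/packing machinery, and the actual layout may differ):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Toy layout: a 1-byte CPC id, a 2-byte length, then 'len' bytes of
   CPC-specific connect info.  Appends one such blob to 'buf' at
   'offset' and returns the new offset.  (Real code would use the
   project's packing routines and handle endianness there.) */
static size_t pack_cpc_blob(uint8_t *buf, size_t offset,
                            uint8_t cpc_id, const void *data, uint16_t len)
{
    buf[offset++] = cpc_id;
    buf[offset++] = (uint8_t)(len & 0xff);
    buf[offset++] = (uint8_t)(len >> 8);
    memcpy(buf + offset, data, len);
    return offset + len;
}

In this picture, the oob/xoob CPCs would simply contribute zero-length
blobs (or nothing at all).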
Jon and I are actually getting closer to completion on the branch,
and it seems to be working.
In conjunction with several other scalability discussions that are
ongoing right now, several of us have toyed with two basic ideas to
improve scalability of job launch / startup:
1. the possibility of eliminating the modex altogether (e.g., have
ORTE dump enough information to each MPI process to figure out/
calculate/locally lookup [in local files?] BTL addressing information
for all peers in MPI_COMM_WORLD, etc.), a la Portals.
2. reducing the amount of data in the modex.
One obvious idea for #2 is to have only one process on each host send
all/the majority of openib BTL modex information for that host. The
rationale here is that all MPI processes on a single host will share
much of the same BTL addressing information, so why send it N times?
Local rank 0 can modex send all/the majority of the modex for the
openib BTL modules; local ranks 1-N can either send nothing or a
[very] small piece of differentiating information (e.g., IBCM service
ID).
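A rough sketch of that idea (placeholder names and sizes only; the
real host-wide info would be the per-port/per-CPC data described
above):

#include <stdint.h>
#include <stddef.h>

/* Stand-in for the real modex publish call; the name here is just a
   placeholder for illustration. */
static int modex_send(const void *data, size_t size)
{
    (void)data; (void)size;
    return 0;
}

struct full_host_info  { char data[512]; };           /* ports, LIDs, CPC blobs, ... */
struct small_diff_info { uint64_t ibcm_service_id; }; /* per-process differentiator */

static int send_openib_modex(int my_local_rank,
                             const struct full_host_info *full,
                             const struct small_diff_info *diff)
{
    if (0 == my_local_rank) {
        /* local rank 0 publishes the full host-wide openib info once */
        return modex_send(full, sizeof(*full));
    }
    /* everyone else publishes only a small differentiator (e.g., an
       IBCM service ID), or possibly nothing at all */
    return modex_send(diff, sizeof(*diff));
}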
This effectively makes the modex info for the openib BTL scale with
the number of nodes, not the number of processes. This can be a big
win in terms of overall modex size that needs to be both gathered and
bcast.
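To make that concrete with some purely made-up numbers (these are not
measurements from the branch): take 1024 nodes, 8 processes per node,
~400 bytes of openib modex data per process today, and an 8-byte
differentiator for each non-local-rank-0 process afterward:

  today: 1024 nodes * 8 ppn * 400 bytes       ~= 3.3 MB
  after: 1024 * 400 + 1024 * 7 * 8 bytes      ~= 0.47 MB

That's roughly a 7x reduction, and the dominant term now grows with
the number of nodes rather than the number of processes.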
I worked up a spreadsheet showing the current size of the modex in
the openib-cpc2 branch (using some "somewhat" contrived machine
size/ppn/port combinations), and then compared it to the size after
implementing the #2 idea shown above (see attached PDF).
I also included a 3rd comparison for the case where Jon and I are
able to reduce the CPC modex blob sizes -- we don't know yet whether
that will work. But the numbers show that reducing the blobs by a few
bytes clearly has [much] less of an impact than the "eliminate
redundant modex information" idea, so we'll work on that one first.
Additionally, reducing the modex size, paired with other ongoing ORTE
scalability efforts, may obviate the need to eliminate the modex (at
least for now...). Or, more specifically, the effort to eliminate the
modex can be pushed out beyond v1.3.
Of course, the same ideas can apply to other BTLs. We're only
working on the openib BTL for now.