I've had this conversation independently with several people now, so
I'm sending it to the list rather than continuing to have the same
conversation over and over. :-)
------
As most of you know, Jon and I are working on the new openib
"CPC" (connect pseudo-component) stuff in /tmp-public/openib-cpc2.
There are two main reasons for it:
1. Add support for RDMA CM (they need it for iWarp support)
2. Add support for IB CM (which will hopefully be a more scalable way
to make IB QPs than the current RML/OOB-based method)
When complete, there will be 4 CPCs: RDMA CM, IB CM, OOB, and XOOB
(same as OOB but with ConnectX XRC extensions).
RDMA CM has some known scaling issues, and at least some known
workarounds -- I won't discuss the merits/drawbacks of RDMA CM here.
IB CM has unknown scaling characteristics, but looks good on paper
(e.g., it uses a UD-based 3-way handshake to make an IB QP).
On the trunk, it's a per-MPI-process decision as to which CPC you'll
use. Per ticket #1191, one of the goals of the /tmp-public branch is
to make the CPC decision a per-openib-BTL-module decision. So you can
mix iWarp and IB hardware in a single host, for example. This fits in
quite well with the "mpirun should work out of the box" philosophy of
Open MPI.
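Very roughly, the per-module selection amounts to something like the
following (the types and names here are purely illustrative, not the
actual code in the branch):

#include <stddef.h>

#define MAX_CPCS 4

typedef struct cpc {
    const char *name;
    /* returns 0 if this CPC can service a port of the given transport
       type (e.g., IB vs. iWARP), nonzero otherwise */
    int (*query)(int device_transport);
} cpc_t;

typedef struct btl_module {
    int device_transport;              /* transport of this HCA/NIC port */
    const cpc_t *usable_cpcs[MAX_CPCS];
    size_t num_usable_cpcs;
} btl_module_t;

/* Ask every registered CPC whether it can handle this module's port
   and remember the ones that can. */
static void select_cpcs(btl_module_t *module,
                        const cpc_t *all_cpcs, size_t num_cpcs)
{
    module->num_usable_cpcs = 0;
    for (size_t i = 0; i < num_cpcs; ++i) {
        if (0 == all_cpcs[i].query(module->device_transport)) {
            module->usable_cpcs[module->num_usable_cpcs++] = &all_cpcs[i];
        }
    }
}

The point is that the usable-CPC list is computed once per BTL module
(i.e., per port), not once per MPI process.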
In the openib BTL, each BTL module is paired with a specific HCA/NIC
(verbs) port. And depending on the interface hardware and software,
one or more CPCs may be available for each. Hence, for each BTL
module in each MPI process, we may send one or more CPC connect
information blobs in the modex (note that the oob and xoob CPCs don't
need to send anything additional in the modex).
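To give a feel for the shape of that data, here's a toy packing
routine (purely illustrative: the branch goes through OMPI's normal
modex/packing machinery, and the actual layout may differ):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Toy layout: a 1-byte CPC id, a 2-byte length, then 'len' bytes of
   CPC-specific connect info.  Appends one such blob to 'buf' at
   'offset' and returns the new offset.  (Real code would use the
   project's packing routines and handle endianness there.) */
static size_t pack_cpc_blob(uint8_t *buf, size_t offset,
                            uint8_t cpc_id, const void *data, uint16_t len)
{
    buf[offset++] = cpc_id;
    buf[offset++] = (uint8_t)(len & 0xff);
    buf[offset++] = (uint8_t)(len >> 8);
    memcpy(buf + offset, data, len);
    return offset + len;
}

In this picture, the oob/xoob CPCs would simply contribute zero-length
blobs (or nothing at all).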
Jon and I are actually getting closer to completion on the branch,
and it seems to be working.
In conjunction with several other scalability discussions that are
ongoing right now, several of us have toyed with two basic ideas to
improve scalability of job launch / startup:
1. the possibility of eliminating the modex altogether (e.g., have
ORTE dump enough information to each MPI process to figure out/
calculate/locally lookup [in local files?] BTL addressing information
for all peers in MPI_COMM_WORLD, etc.), a la Portals.
2. reducing the amount of data in the modex.
One obvious idea for #2 is to have only one process on each host send
all/the majority of openib BTL modex information for that host. The
rationale here is that all MPI processes on a single host will share
much of the same BTL addressing information, so why send it N times?
Local rank 0 can modex send all/the majority of the modex for the
openib BTL modules; local ranks 1-N can either send nothing or a
[very] small piece of differentiating information (e.g., IBCM service
ID).
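A rough sketch of that idea (placeholder names and sizes only; the
real host-wide info would be the per-port/per-CPC data described
above):

#include <stdint.h>
#include <stddef.h>

/* Stand-in for the real modex publish call; the name here is just a
   placeholder for illustration. */
static int modex_send(const void *data, size_t size)
{
    (void)data; (void)size;
    return 0;
}

struct full_host_info  { char data[512]; };           /* ports, LIDs, CPC blobs, ... */
struct small_diff_info { uint64_t ibcm_service_id; }; /* per-process differentiator */

static int send_openib_modex(int my_local_rank,
                             const struct full_host_info *full,
                             const struct small_diff_info *diff)
{
    if (0 == my_local_rank) {
        /* local rank 0 publishes the full host-wide openib info once */
        return modex_send(full, sizeof(*full));
    }
    /* everyone else publishes only a small differentiator (e.g., an
       IBCM service ID), or possibly nothing at all */
    return modex_send(diff, sizeof(*diff));
}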
This effectively makes the modex info for the openib BTL scale with
the number of nodes, not the number of processes. This can be a big
win in terms of overall modex size that needs to be both gathered and
bcast.
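To make that concrete with some purely made-up numbers (these are not
measurements from the branch): take 1024 nodes, 8 processes per node,
~400 bytes of openib modex data per process today, and an 8-byte
differentiator for each non-local-rank-0 process afterward:

  today: 1024 nodes * 8 ppn * 400 bytes       ~= 3.3 MB
  after: 1024 * 400 + 1024 * 7 * 8 bytes      ~= 0.47 MB

That's roughly a 7x reduction, and the dominant term now grows with
the number of nodes rather than the number of processes.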
I worked up a spreadsheet showing the current size of the modex in
the openib-cpc2 branch (using some "somewhat" contrived machine
size/ppn/port combinations), and then compared it to the size after
implementing the #2 idea shown above (see attached PDF).
I also included a 3rd comparison for the case where Jon and I are
able to reduce the CPC modex blob sizes -- we don't know yet whether
that will work. But the numbers show that reducing the blobs by a few
bytes clearly has [much] less of an impact than the "eliminate
redundant modex information" idea, so we'll work on that one first.
Additionally, reducing the modex size, paired with other ongoing ORTE
scalability efforts, may obviate the need to eliminate the modex (at
least for now...). Or, more specifically, the effort to eliminate the
modex can be pushed out beyond v1.3.
Of course, the same ideas can apply to other BTLs. We're only
working on the openib BTL for now.