I've had this conversation independently with several people now, so I'm sending it to the list rather than continuing to have the same conversation over and over. :-)

------

As most of you know, Jon and I are working on the new openib "CPC" (connect pseudo-component) stuff in /tmp-public/openib-cpc2. There are two main reasons for it:

1. Add support for RDMA CM (it's needed for iWARP support)
2. Add support for IB CM (which will hopefully be a more scalable connect system as compared to the current RML/OOB-based method of making IB QPs)

When complete, there will be 4 CPCs: RDMA CM, IB CM, OOB, and XOOB (same as OOB but with ConnectX XRC extensions).

RDMA CM has some known scaling issues, and at least some known workarounds -- I won't discuss the merits/drawbacks of RDMA CM here. IB CM has unknown scaling characteristics, but looks good on paper (e.g., it uses UD for a 3-way handshake to set up an IB QP).

On the trunk, it's a per-MPI-process decision as to which CPC you'll use. Per ticket #1191, one of the goals of the /tmp-public branch is to make the CPC decision a per-openib-BTL-module decision, so you can mix iWARP and IB hardware in a single host, for example. This fits in quite well with the "mpirun should work out of the box" philosophy of Open MPI.
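Just to make the per-module idea concrete, here's a toy sketch of the kind of selection logic involved (all of the names and the "query" scheme are made up for illustration -- this is not the actual code on the branch): each module asks every CPC whether it can run on that module's port, and keeps the ones that say yes.

#include <stdbool.h>
#include <stdio.h>

typedef enum { PORT_IB, PORT_IWARP } port_type_t;

typedef struct {
    const char *name;
    bool      (*usable)(port_type_t);   /* per-CPC "can I run on this port?" query */
} cpc_t;

static bool rdmacm_ok(port_type_t t) { (void)t; return true; }  /* IB and iWARP */
static bool ibcm_ok(port_type_t t)   { return t == PORT_IB; }   /* IB only */
static bool oob_ok(port_type_t t)    { return t == PORT_IB; }   /* IB only */

static const cpc_t all_cpcs[] = {
    { "rdmacm", rdmacm_ok },
    { "ibcm",   ibcm_ok   },
    { "oob",    oob_ok    },
};

int main(void)
{
    /* Pretend this host has one IB port and one iWARP port, i.e. two
     * openib BTL modules in the same MPI process. */
    port_type_t ports[] = { PORT_IB, PORT_IWARP };

    for (int p = 0; p < 2; ++p) {
        printf("module %d usable CPCs:", p);
        for (size_t i = 0; i < sizeof(all_cpcs) / sizeof(all_cpcs[0]); ++i) {
            if (all_cpcs[i].usable(ports[p])) {
                printf(" %s", all_cpcs[i].name);
            }
        }
        printf("\n");
    }
    return 0;
}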

In the openib BTL, each BTL module is paired with a specific HCA/NIC (verbs) port. And depending on the interface hardware and software, one or more CPCs may be available for each. Hence, for each BTL module in each MPI process, we may send one or more CPC connect information blobs in the modex (note that the oob and xoob CPCs don't need to send anything additional in the modex).
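For illustration, here's a rough sketch of what a per-module modex entry could look like (struct names, type encodings, and sizes are all hypothetical -- this is not the actual wire format on the branch). The point is just that each module contributes its port addressing info plus zero or more CPC blobs, and that oob/xoob contribute essentially nothing beyond a marker:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    uint8_t  cpc_type;   /* e.g. 0=oob, 1=xoob, 2=ibcm, 3=rdmacm (assumed encoding) */
    uint8_t  blob_len;   /* length of the CPC-specific data that follows */
    uint8_t  blob[64];   /* CPC-specific connect data (service ID, IP, ...) */
} cpc_blob_t;

typedef struct {
    uint16_t   port_lid;   /* addressing info shared by all CPCs on this port */
    uint8_t    num_cpcs;
    cpc_blob_t cpcs[4];    /* at most one blob per CPC: rdmacm, ibcm, oob, xoob */
} module_modex_t;

/* Pack one module's entry into a flat buffer; returns bytes written. */
static size_t pack_module(const module_modex_t *m, uint8_t *buf)
{
    size_t off = 0;
    memcpy(buf + off, &m->port_lid, sizeof(m->port_lid)); off += sizeof(m->port_lid);
    buf[off++] = m->num_cpcs;
    for (int i = 0; i < m->num_cpcs; ++i) {
        buf[off++] = m->cpcs[i].cpc_type;
        buf[off++] = m->cpcs[i].blob_len;
        memcpy(buf + off, m->cpcs[i].blob, m->cpcs[i].blob_len);
        off += m->cpcs[i].blob_len;
    }
    return off;   /* an oob/xoob entry contributes only 2 bytes: type + zero length */
}

int main(void)
{
    module_modex_t m = { .port_lid = 42, .num_cpcs = 2 };
    m.cpcs[0] = (cpc_blob_t){ .cpc_type = 2, .blob_len = 8 };   /* ibcm: service ID */
    m.cpcs[1] = (cpc_blob_t){ .cpc_type = 0, .blob_len = 0 };   /* oob: nothing extra */
    uint8_t buf[256];
    printf("packed %zu bytes for this module\n", pack_module(&m, buf));
    return 0;
}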

Jon and I are actually getting closer to completion on the branch, and it seems to be working.

In conjunction with several other scalability discussions that are ongoing right now, several of us have toyed with two basic ideas to improve scalability of job launch / startup:

1. the possibility of eliminating the modex altogether (e.g., have ORTE dump enough information to each MPI process to figure out / calculate / locally look up [in local files?] BTL addressing information for all peers in MPI_COMM_WORLD, etc.), a la Portals.

2. reducing the amount of data in the modex.

One obvious idea for #2 is to have only one process on each host send all/the majority of openib BTL modex information for that host. The rationale here is that all MPI processes on a single host will share much of the same BTL addressing information, so why send it N times? Local rank 0 can modex send all/the majority of the modex for the openib BTL modules; local ranks 1-N can either send nothing or a [very] small piece of differentiating information (e.g., IBCM service ID).

This effectively makes the modex info for the openib BTL scale with the number of nodes, not the number of processes. This can be a big win in terms of overall modex size that needs to be both gathered and bcast.
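A quick back-of-the-envelope sketch of why this matters (the byte counts below are made up purely for illustration -- the real numbers are in the attached PDF):

#include <stdio.h>

int main(void)
{
    const int nodes          = 1024;
    const int ppn            = 8;      /* processes per node */
    const int per_host_bytes = 256;    /* assumed shared per-host openib/CPC info */
    const int per_proc_bytes = 8;      /* assumed per-process differentiator, e.g. an IBCM service ID */

    /* Today: every process sends the full per-host info. */
    long before = (long)nodes * ppn * per_host_bytes;

    /* Idea #2: only local rank 0 sends the full info; local ranks 1-N send
     * the small per-process piece, so the total scales with nodes, not
     * processes. */
    long after = (long)nodes * (per_host_bytes + (ppn - 1) * per_proc_bytes);

    printf("before: %ld bytes, after: %ld bytes\n", before, after);
    return 0;
}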

I worked up a spreadsheet showing the current size of the modex in the openib-cpc2 branch (using some "somewhat" contrived machine size/ppn/port combinations), and then compared it to the size after implementing idea #2 above (see the attached PDF).

I also included a 3rd comparison for the case where Jon and I are able to reduce the CPC modex blob sizes -- we don't know yet whether that will work. But the numbers show that shaving a few bytes off the blobs clearly has [much] less of an impact than the "eliminate redundant modex information" idea, so we'll work on that one first.

Additionally, reducing the modex size, paired with other ongoing ORTE scalability efforts, may obviate the need to eliminate the modex (at least for now...). Or, more specifically, efforts for eliminating the modex can be pushed to beyond v1.3.

Of course, the same ideas can apply to other BTLs. We're only working on the openib BTL for now.

--
Jeff Squyres
Cisco Systems

Attachment: modex-sizes.pdf

