On Apr 3, 2008, at 8:52 AM, Gleb Natapov wrote:
It'll increase it compared to the optimization that we're about to
make.  But it will certainly be a large decrease compared to what
we're doing today.

Maybe I don't understand something in what you propose then. Currently,
when I run two procs on the same node and each proc uses a different HCA,
each one of them sends a message that describes the HCA in use by the
proc. The message is of the form <mtu, subnet, lid, apm_lid, cpc>.
Each proc sends one of those, so there are two messages total on the wire.
You propose that one of them should send a description of both
available ports (that is, one of them sends two messages of the form
above) and then each proc sends an additional message with the index of the
HCA that it is going to use. And this is more data on the wire after
the proposed optimization than we have now.

I guess what I'm trying to address is optimizing the common case. What I perceive the common case to be is:

- high PPN values (4, 8, 16, ...)
- PPN is larger than the number of verbs-capable ports
- homogeneous openfabrics network

Yes, you will definitely find other cases. But I'd guess that this is, by far, the most common case (especially at scale). I don't want to penalize the common case for the sake of some one-off installations.

I'm basing this optimization on the assumption that PPNs will be larger than the number of available ports, so there is guaranteed to be duplication in the modex message. Removing that duplication is the main goal of this optimization.

(see the spreadsheet that I sent last week).
I've looked at it but I could not decipher it :( I don't understand
where all these numbers come from.

Why didn't you ask?  :-)

The size of the openib modex is explained in btl_openib_component.c in the branch. It's a packed message now; we don't just blindly copy an entire struct. Here's the comment:

    /* The message is packed into multiple parts:
     * 1. a uint8_t indicating the number of modules (ports) in the message
     * 2. for each module:
     *    a. the common module data
     *    b. a uint8_t indicating how many CPCs follow
     *    c. for each CPC:
     *       a. a uint8_t indicating the index of the CPC in the all[]
     *          array in btl_openib_connect_base.c
     *       b. a uint8_t indicating the priority of this CPC
     *       c. a uint8_t indicating the length of the blob to follow
     *       d. a blob that is only meaningful to that CPC
     */

The common module data is what I sent in the other message.
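
To make that layout concrete, here is a minimal sketch of a packer that follows the comment step by step. It is not the code from the branch: the struct names, `put()`, and `pack_modex()` are hypothetical, and the common module data is treated as an opaque, already-packed byte string because its exact contents are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not the real openib BTL types. */
typedef struct {
    uint8_t index;       /* index of the CPC in the all[] array */
    uint8_t priority;    /* priority of this CPC */
    uint8_t blob_len;    /* length of the blob that follows */
    const void *blob;    /* only meaningful to that CPC */
} cpc_entry_t;

typedef struct {
    const void *common;  /* common module data, assumed already packed */
    size_t common_len;
    uint8_t num_cpcs;
    const cpc_entry_t *cpcs;
} module_entry_t;

/* Append len bytes at *off; return 0 on success, -1 if buf would overflow. */
static int put(uint8_t *buf, size_t cap, size_t *off,
               const void *src, size_t len)
{
    if (len > cap - *off) return -1;
    if (len > 0) memcpy(buf + *off, src, len);
    *off += len;
    return 0;
}

/* Pack num_mods modules into buf; return bytes written, or -1 on overflow. */
long pack_modex(uint8_t *buf, size_t cap,
                const module_entry_t *mods, uint8_t num_mods)
{
    size_t off = 0;
    assert(buf != NULL);

    /* 1. number of modules (ports) in the message */
    if (put(buf, cap, &off, &num_mods, 1)) return -1;

    /* 2. for each module */
    for (uint8_t m = 0; m < num_mods; m++) {
        const module_entry_t *mod = &mods[m];
        /* a. the common module data */
        if (put(buf, cap, &off, mod->common, mod->common_len)) return -1;
        /* b. how many CPCs follow */
        if (put(buf, cap, &off, &mod->num_cpcs, 1)) return -1;
        /* c. for each CPC: index, priority, blob length, blob */
        for (uint8_t c = 0; c < mod->num_cpcs; c++) {
            const cpc_entry_t *cpc = &mod->cpcs[c];
            if (put(buf, cap, &off, &cpc->index, 1)) return -1;
            if (put(buf, cap, &off, &cpc->priority, 1)) return -1;
            if (put(buf, cap, &off, &cpc->blob_len, 1)) return -1;
            if (put(buf, cap, &off, cpc->blob, cpc->blob_len)) return -1;
        }
    }
    return (long)off;
}
```

The point of the sketch is only that each CPC costs three bytes of framing plus its blob, so the per-module overhead beyond the common data is small.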

I guess I don't see the problem...?
I like things to be simple. KISS principle I guess.

I can see your point that this is getting fairly complicated. :-\ See below.

And I do care about
heterogeneous include/exclude too.

How much? I still think we can support it just fine; I just want to make [what I perceive to be] the common case better.

I looked at what kind of data we send during openib modex and I created a file with 10000 openib modex messages. The mtu, subnet id and cpc list were
the same in each message but lid/apm_lid were different; this is
a pretty close approximation of the data that is sent from the HN to each
process. The uncompressed file size is 489K; the compressed file size is 43K.
More than 10 times smaller.


Was this the full modex message, or just the openib part?

Those are promising sizes (43k), though; how long does it take to compress/uncompress this data in memory? That also must be factored into the overall time.

How about a revised and combined proposal:

- openib: Use a simplified "send all ACTIVE ports" per-host message (i.e., before include/exclude and carto are applied)
- openib: Send a small bitmap for each proc indicating which ports each btl module will use
- modex: Compress the result (probably only if it's larger than some threshold size?) when sending, decompress upon receive

This keeps it simple -- no special cases for heterogeneous include/exclude, etc. And if compression is cheap (can you do some experiments to find out?), perhaps we can link against libz (I see the libz in at least RHEL4 is BSD licensed, so there's no issue there) and de/compress in memory.
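
As a rough sketch of the per-proc bitmap and of what it does to the byte count, assuming at most 8 ports per host so that one `uint8_t` per proc is enough (that limit, the type name, and all function names below are assumptions for illustration, not anything in the branch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PORTS 8             /* assumption: bitmap fits in one byte */

typedef uint8_t port_bitmap_t;  /* bit i set => proc uses host port i */

/* Return bm with the bit for the given port set. */
port_bitmap_t bitmap_set(port_bitmap_t bm, unsigned port)
{
    assert(port < MAX_PORTS);
    return (port_bitmap_t)(bm | (1u << port));
}

/* Return nonzero if the given port is marked as used in bm. */
int bitmap_test(port_bitmap_t bm, unsigned port)
{
    assert(port < MAX_PORTS);
    return (bm >> port) & 1u;
}

/* Count how many ports are marked as used in bm. */
unsigned bitmap_count(port_bitmap_t bm)
{
    unsigned n = 0;
    for (unsigned p = 0; p < MAX_PORTS; p++)
        n += (bm >> p) & 1u;
    return n;
}

/* Bytes per host when every proc sends its own port description
 * (assuming one port per proc, as in the two-proc example). */
size_t per_proc_bytes(size_t ppn, size_t port_desc_len)
{
    return ppn * port_desc_len;
}

/* Bytes per host when the ports are described once and each proc
 * adds only its bitmap. */
size_t per_host_bytes(size_t ppn, size_t nports, size_t port_desc_len)
{
    return nports * port_desc_len + ppn * sizeof(port_bitmap_t);
}
```

With a purely illustrative 24-byte port description, 8 procs sharing 2 ports come to 192 bytes per-proc versus 56 bytes per-host, while 2 procs on 2 ports come to 48 versus 50. The second case is the one Gleb describes, where the scheme costs slightly more.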

--
Jeff Squyres
Cisco Systems
