Re: [OMPI devel] RFC: changes to modex

Jeff Squyres Thu, 3 Apr 2008 11:16:28 -0400

On Apr 3, 2008, at 8:52 AM, Gleb Natapov wrote:

It'll increase it compared to the optimization that we're about to
make.  But it will certainly be a large decrease compared to what
we're doing today.

Maybe I don't understand something in what you propose, then. Currently,
when I run two procs on the same node and each proc uses a different HCA,
each one of them sends a message that describes the HCA in use by the
proc. The message is of the form <mtu, subnet, lid, apm_lid, cpc>.
Each proc sends one of those, so there are two messages total on the wire.
You propose that one of them should send a description of both
available ports (that is, one of them sends two messages of the form
above) and then each proc sends an additional message with the index of
the HCA that it is going to use. And this is more data on the wire after
the proposed optimization than we have now.

I guess what I'm trying to address is optimizing the common case. What I perceive the common case to be is:


- high PPN values (4, 8, 16, ...)
- PPN is larger than the number of verbs-capable ports
- homogeneous openfabrics network

Yes, you will definitely find other cases. But I'd guess that this is, by far, the most common case (especially at scale). I don't want to penalize the common case for the sake of some one-off installations.

I'm basing this optimization on the assumption that PPNs will be larger than the number of available ports, so there is guaranteed to be duplication in the modex message. Removing that duplication is the main goal of this optimization.

(see the spreadsheet that I sent last week).

I've looked at it but I could not decipher it :( I don't understand
where all these numbers come from.


Why didn't you ask?  :-)

The size of the openib modex is explained in btl_openib_component.c in the branch. It's a packed message now; we don't just blindly copy an entire struct. Here's the comment:


    /* The message is packed into multiple parts:
     * 1. a uint8_t indicating the number of modules (ports) in the
     *    message
     * 2. for each module:
     *    a. the common module data
     *    b. a uint8_t indicating how many CPCs follow
     *    c. for each CPC:
     *       a. a uint8_t indicating the index of the CPC in the all[]
     *          array in btl_openib_connect_base.c
     *       b. a uint8_t indicating the priority of this CPC
     *       c. a uint8_t indicating the length of the blob to follow
     *       d. a blob that is only meaningful to that CPC
     */

The common module data is what I sent in the other message.
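
As a rough illustration of the layout that comment describes, the packing could be sketched like this. The struct and function names (cpc_entry_t, module_entry_t, pack_modex) are hypothetical and are not the names used in btl_openib_component.c; the common module data is treated as an opaque byte string because its exact fields are not listed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory form of one CPC entry. */
typedef struct {
    uint8_t index;        /* index of the CPC in the all[] array */
    uint8_t priority;     /* priority of this CPC */
    uint8_t blob_len;     /* length of the blob that follows */
    const uint8_t *blob;  /* only meaningful to that CPC */
} cpc_entry_t;

/* Hypothetical in-memory form of one module (port). */
typedef struct {
    const uint8_t *common;  /* common module data, opaque here */
    size_t common_len;
    uint8_t num_cpcs;
    const cpc_entry_t *cpcs;
} module_entry_t;

/* Number of bytes the packed message will occupy. */
size_t packed_size(const module_entry_t *mods, uint8_t num_mods)
{
    size_t n = 1;                                /* module count */
    for (uint8_t i = 0; i < num_mods; ++i) {
        n += mods[i].common_len + 1;             /* common data + CPC count */
        for (uint8_t j = 0; j < mods[i].num_cpcs; ++j)
            n += 3 + mods[i].cpcs[j].blob_len;   /* index, priority, length, blob */
    }
    return n;
}

/* Packs the message into buf; returns bytes written, or 0 if buf is too small. */
size_t pack_modex(uint8_t *buf, size_t cap,
                  const module_entry_t *mods, uint8_t num_mods)
{
    assert(buf != NULL && (mods != NULL || num_mods == 0));
    if (packed_size(mods, num_mods) > cap)
        return 0;

    uint8_t *p = buf;
    *p++ = num_mods;
    for (uint8_t i = 0; i < num_mods; ++i) {
        if (mods[i].common_len > 0) {
            memcpy(p, mods[i].common, mods[i].common_len);
            p += mods[i].common_len;
        }
        *p++ = mods[i].num_cpcs;
        for (uint8_t j = 0; j < mods[i].num_cpcs; ++j) {
            const cpc_entry_t *c = &mods[i].cpcs[j];
            *p++ = c->index;
            *p++ = c->priority;
            *p++ = c->blob_len;
            if (c->blob_len > 0) {
                memcpy(p, c->blob, c->blob_len);
                p += c->blob_len;
            }
        }
    }
    return (size_t)(p - buf);
}
```

Under these assumptions, one module with 4 bytes of common data and one CPC carrying a 2-byte blob packs into 1 + 4 + 1 + 3 + 2 = 11 bytes.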

I guess I don't see the problem...?

I like things to be simple. KISS principle I guess.

I can see your point that this is getting fairly complicated. :-\  See below.

And I do care about
heterogeneous include/exclude too.

How much? I still think we can support it just fine; I just want to make [what I perceive to be] the common case better.

I looked at what kind of data we send during the openib modex and I
created a file with 10000 openib modex messages. The mtu, subnet id and
cpc list were the same in each message but lid/apm_lid were different;
this is a pretty close approximation of the data that is sent from the HN
to each process. The uncompressed file size is 489K; the compressed file
size is 43K. More than 10 times smaller.



Was this the full modex message, or just the openib part?

Those are promising sizes (43k), though; how long does it take to compress/uncompress this data in memory? That also must be factored into the overall time.


How about a revised and combined proposal:

- openib: Use a simplified "send all ACTIVE ports" per-host message
  (i.e., before include/exclude and carto is applied)
- openib: Send a small bitmap for each proc indicating which ports
  each btl module will use
- modex: Compress the result (probably only if it's larger than some
  threshold size?) when sending, decompress upon receive

This keeps it simple -- no special cases for heterogeneous include/exclude, etc. And if compression is cheap (can you do some experiments to find out?), perhaps we can link against libz (I see the libz in at least RHEL4 is BSD licensed, so there's no issue there) and de/compress in memory.
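
The per-proc bitmap could be as small as one byte. Here is a minimal sketch, assuming at most 8 ACTIVE ports per host and that bit i refers to port i in the per-host message; the helper names and the MAX_PORTS limit are hypothetical.

```c
#include <stdint.h>

#define MAX_PORTS 8  /* assumed limit: one uint8_t bitmap per proc */

/* Returns bitmap with the bit for the given port set
 * (unchanged if the port is out of range). */
uint8_t port_bitmap_set(uint8_t bitmap, unsigned port)
{
    if (port >= MAX_PORTS)
        return bitmap;
    return (uint8_t)(bitmap | (1u << port));
}

/* Returns 1 if the proc uses the given port, 0 otherwise. */
int port_bitmap_test(uint8_t bitmap, unsigned port)
{
    return port < MAX_PORTS && ((bitmap >> port) & 1u) != 0;
}

/* Returns how many ports the proc uses. */
unsigned port_bitmap_count(uint8_t bitmap)
{
    unsigned n = 0;
    for (; bitmap != 0; bitmap >>= 1)
        n += bitmap & 1u;
    return n;
}
```

A proc that uses ports 0 and 2 of the per-host list would send the single byte 0x05; hosts with more than 8 ports would need a wider bitmap.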


--
Jeff Squyres
Cisco Systems
