Re: [OMPI devel] RFC: revamp topo framework

Luigi Scorzato Tue, 3 Nov 2009 03:40:19 -0500


On 30 Oct 2009, at 20:28, Jeff Squyres wrote:

What George is describing is the Right answer, but it may take you a little time.
FWIW: the complexity of a topo component is actually pretty low. It's essentially a bunch of glue code (that I can probably mostly provide) and your mapping algorithms about how to reorder the communicator ranks.
To be clear: topo components are *ONLY* about re-ordering ranks in a communicator -- the back-end of MPI_CART_CREATE and friends.
The BTL components that George is talking about are Byte Transfer Layer components; essentially the brains behind MPI_SEND and friends. Open MPI has a per-device list of BTLs that can service each peer MPI process. Hence, if you're sending to another MPI process on the same host, the first BTL in the list will be the shared memory BTL. If you're sending to an MPI process on a different server that you're connected to via ethernet, the TCP BTL may be at the top of the list. And so on.
It sounds like you actually want to make *two* components:

- topo: for reordering ranks during MPI_CART_CREATE and friends
- btl: use the underlying network primitives for sending when possible
As George indicated, the BTL module in each MPI process can determine during startup which MPI process peers it can talk to. It can then tell the upper-layer routing algorithm "I can talk to peer processes X, Y, and Z -- I cannot talk to peer processes A, B, and C". The upper-layer router (the PML module) will then put your BTL at the top of the list for peer processes X, Y, and Z, and will not put your BTL on the list for peer processes A, B, and C. For A, B, and C, other BTLs will be used (e.g., TCP).
Does that make sense?
To answer your question from a prior mail: the unity topo component is used for the remapping of ranks in MPI_CART_CREATE. Look in ompi/mca/topo/unity/.

Thanks to everybody for the clarifications. The function I was looking for is mca_topo_base_cart_create() in ompi/mca/topo/base/topo_base_cart_create.c. More precisely, I needed the loop:


    p = topo_data->mtc_dims_or_index;
    coords = topo_data->mtc_coords;
    dummy_rank = *new_rank;
    for (i = 0;
         (i < topo_data->mtc_ndims_or_nnodes && i < ndims);
         ++i, ++p) {
        dim = *p;
        nprocs /= dim;
        *coords++ = dummy_rank / nprocs;
        dummy_rank %= nprocs;
    }

This defines the precise relation between ranks and coordinates. Once I know this, I do not even need to write a topo component, because I can assign the ranks of my computing nodes in a rankfile so that they get the coordinates they physically need.
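To make the rank-to-coordinate relation concrete, here is a minimal standalone sketch of the same arithmetic. It assumes that nprocs starts out as the product of all dimension sizes, which the excerpt does not show. The names rank_to_coords() and coords_to_rank() are hypothetical helpers for illustration, not Open MPI functions.

```c
#include <assert.h>

/* Hypothetical helper: map a rank to Cartesian coordinates using the
 * same arithmetic as the quoted loop. The last dimension varies fastest.
 * Assumption: the grid holds exactly dims[0] * ... * dims[ndims-1] ranks. */
static void rank_to_coords(int rank, int ndims, const int *dims, int *coords)
{
    int nprocs = 1;
    for (int i = 0; i < ndims; ++i) {
        assert(dims[i] > 0);
        nprocs *= dims[i];
    }
    assert(rank >= 0 && rank < nprocs);

    for (int i = 0; i < ndims; ++i) {
        nprocs /= dims[i];          /* ranks spanned by one step in dimension i */
        coords[i] = rank / nprocs;  /* coordinate along dimension i */
        rank %= nprocs;             /* remainder is resolved by later dimensions */
    }
}

/* Hypothetical inverse: the rank that receives a given coordinate tuple.
 * This is the value one would look up when writing a rankfile by hand. */
static int coords_to_rank(int ndims, const int *dims, const int *coords)
{
    int rank = 0;
    for (int i = 0; i < ndims; ++i) {
        assert(coords[i] >= 0 && coords[i] < dims[i]);
        rank = rank * dims[i] + coords[i];
    }
    return rank;
}
```

With dims = {2, 3, 4}, for instance, rank 17 gets coordinates (1, 1, 1), and coords_to_rank() maps (1, 1, 1) back to 17. Knowing the inverse is what makes the rankfile approach workable: for each physical node position, compute the rank that will receive those coordinates and bind that rank to the node.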

A different issue is the BTL component. This is actually where my approaches 1 and 2 differ (my previous distinction was confusing, because I had not understood the difference between topo and btl components).

In the 1st approach I would redefine some MPI functions that are crucial for my code, so that they call the low-level torus primitives when the communication occurs between nearest neighbors, and fall back to the Open MPI functions otherwise.

The 2nd approach would be to develop our own torus BTL. The fact that one can choose a "priority list of networks" is definitely great and dissipates my worries about the feasibility of the 2nd approach in my case. The only remaining question is whether I can get familiar with the btl stuff fast enough. What would you suggest I read in order to learn quickly how to create a BTL component?
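For the 1st approach, the decision "is the peer a nearest neighbor on the torus?" could be sketched roughly as follows. This is only an illustration under stated assumptions: ranks are laid out as in the loop above, every dimension wraps around, and the names rank_to_coords(), torus_is_nearest_neighbor() and TORUS_MAX_DIMS are hypothetical, not part of Open MPI or of any torus library.

```c
#include <assert.h>
#include <stdbool.h>

#define TORUS_MAX_DIMS 8  /* hypothetical upper bound on the torus dimensionality */

/* Hypothetical helper: same rank-to-coordinate arithmetic as the quoted loop. */
static void rank_to_coords(int rank, int ndims, const int *dims, int *coords)
{
    int nprocs = 1;
    for (int i = 0; i < ndims; ++i) {
        nprocs *= dims[i];
    }
    for (int i = 0; i < ndims; ++i) {
        nprocs /= dims[i];
        coords[i] = rank / nprocs;
        rank %= nprocs;
    }
}

/* Returns true when ranks a and b differ by exactly one step (with
 * wrap-around) in exactly one dimension of the torus. */
static bool torus_is_nearest_neighbor(int a, int b, int ndims, const int *dims)
{
    int ca[TORUS_MAX_DIMS], cb[TORUS_MAX_DIMS];
    int hops = 0;

    assert(ndims > 0 && ndims <= TORUS_MAX_DIMS);
    rank_to_coords(a, ndims, dims, ca);
    rank_to_coords(b, ndims, dims, cb);

    for (int i = 0; i < ndims; ++i) {
        /* Distance in the positive direction, modulo the dimension size. */
        int d = (cb[i] - ca[i] + dims[i]) % dims[i];
        if (d == 0) {
            continue;               /* same coordinate in this dimension */
        }
        if (d == 1 || d == dims[i] - 1) {
            ++hops;                 /* one step forward or backward */
        } else {
            return false;           /* more than one step away */
        }
    }
    return hops == 1;
}
```

A wrapper around a send call would use such a test to pick the torus primitive for neighbors and the regular MPI path for everything else. The MPI standard's profiling interface (the PMPI_ entry points) is one possible way to provide the fallback, but that wiring is not shown here.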


Many thanks and best regards, Luigi
