On 30 Oct 2009, at 20:28, Jeff Squyres wrote:

What George is describing is the Right answer, but it may take you a little time.

FWIW: the complexity of a topo component is actually pretty low. It's essentially a bunch of glue code (that I can probably mostly provide) and your mapping algorithms about how to reorder the communicator ranks.

To be clear: topo components are *ONLY* about re-ordering ranks in a communicator -- the back-end of MPI_CART_CREATE and friends.

The BTL components that George is talking about are Byte Transfer Layer components; essentially the brains behind MPI_SEND and friends. Open MPI has a per-device list of BTLs that can service each peer MPI process. Hence, if you're sending to another MPI process on the same host, the first BTL in the list will be the shared memory BTL. If you're sending to an MPI process on a different server that you're connected to via Ethernet, the TCP BTL may be at the top of the list. And so on.

It sounds like you actually want to make *two* components:

- topo: for reordering ranks during MPI_CART_CREATE and friends
- btl: use the underlying network primitives for sending when possible

As George indicated, the BTL module in each MPI process can determine during startup which MPI process peers it can talk to. It can then tell the upper-layer routing algorithm "I can talk to peer processes X, Y, and Z -- I cannot talk to peer processes A, B, and C". The upper-layer router (the PML module) will then put your BTL at the top of the list for peer processes X, Y, and Z, and will not put your BTL on the list for peer processes A, B, and C. For A, B, and C, other BTLs will be used (e.g., TCP).
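
As a toy illustration of that per-peer selection, here is a minimal sketch in plain C. It is not Open MPI's actual BTL or PML interface: toy_btl, torus_can_reach, tcp_can_reach and build_peer_list are all made-up names, and the "torus" is assumed to be a simple ring of 8 ranks in which only adjacent ranks are reachable.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BTLS 4

/* Hypothetical stand-in for a BTL: a name plus a reachability test. */
typedef struct {
    const char *name;
    bool (*can_reach)(int my_rank, int peer_rank);
} toy_btl;

/* Assumed rule: the torus BTL only reaches the two ring neighbors
 * of a rank, in a ring of 8 ranks. */
bool torus_can_reach(int me, int peer)
{
    int d = (me - peer + 8) % 8;
    return d == 1 || d == 7;
}

/* Assumed rule: TCP can reach every peer. */
bool tcp_can_reach(int me, int peer)
{
    (void)me;
    (void)peer;
    return true;
}

/* Build the ordered BTL list for one peer. The input array is taken to be
 * in priority order; BTLs that cannot reach the peer are left off the list.
 * Returns the number of entries written to out. */
int build_peer_list(const toy_btl *btls, int nbtls,
                    int me, int peer, const toy_btl **out)
{
    assert(nbtls >= 0 && nbtls <= MAX_BTLS);
    int n = 0;
    for (int i = 0; i < nbtls; ++i) {
        if (btls[i].can_reach(me, peer)) {
            out[n++] = &btls[i];
        }
    }
    return n;
}
```

With the torus entry listed ahead of TCP, rank 0 would get the torus entry first for peers 1 and 7, and only the TCP entry for peer 4.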

Does that make sense?

To answer your question from a prior mail: the unity topo component is used for the remapping of ranks in MPI_CART_CREATE. Look in ompi/mca/topo/unity/.


Thanks to everybody for the clarifications. The function I was looking for is mca_topo_base_cart_create() in ompi/mca/topo/base/topo_base_cart_create.c. More precisely, I needed the loop:

    p = topo_data->mtc_dims_or_index;
    coords = topo_data->mtc_coords;
    dummy_rank = *new_rank;
    for (i = 0;
         (i < topo_data->mtc_ndims_or_nnodes && i < ndims);
         ++i, ++p) {
        dim = *p;
        nprocs /= dim;
        *coords++ = dummy_rank / nprocs;
        dummy_rank %= nprocs;
    }

This defines the precise relation between ranks and coordinates. Once I know this, I do not even need to write a topo component, because I can assign the ranks of my computing nodes in a rankfile so that each node gets the coordinates that match its physical position.
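
To make the relation concrete, here is a standalone sketch of the same row-major mapping, together with its inverse, which is what one would need when working out which rank to give a node at known coordinates. The function names rank_to_coords and coords_to_rank are mine, not Open MPI's, and the sketch assumes that ranks are not reordered when the Cartesian communicator is created.

```c
#include <assert.h>

/* Row-major rank -> Cartesian coordinates, following the loop quoted above.
 * dims[i] is the extent of dimension i; the rank must lie in
 * [0, product of dims). */
void rank_to_coords(int rank, int ndims, const int *dims, int *coords)
{
    int nprocs = 1;
    for (int i = 0; i < ndims; ++i) {
        assert(dims[i] > 0);
        nprocs *= dims[i];
    }
    assert(rank >= 0 && rank < nprocs);

    for (int i = 0; i < ndims; ++i) {
        nprocs /= dims[i];           /* ranks spanned by one step in dim i */
        coords[i] = rank / nprocs;
        rank %= nprocs;
    }
}

/* Inverse mapping: Cartesian coordinates -> row-major rank. */
int coords_to_rank(int ndims, const int *dims, const int *coords)
{
    int rank = 0;
    for (int i = 0; i < ndims; ++i) {
        assert(coords[i] >= 0 && coords[i] < dims[i]);
        rank = rank * dims[i] + coords[i];
    }
    return rank;
}
```

For example, with dims = {2, 3, 4}, rank 17 maps to coordinates (1, 1, 1), and coords_to_rank maps (1, 1, 1) back to 17.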

A different issue is the BTL component. This is actually where my approach 1 and 2 differ (my previous distinction was confusing, due to my lack of understanding of the distinction between topo and btl components).

In the 1st approach I would redefine some MPI functions that are crucial for my code so that they call the low-level torus primitives when the communication is between nearest neighbors, and fall back to the Open MPI functions otherwise. The 2nd approach would be to develop our own torus BTL. The fact that one can choose a "priority list of networks" is definitely great and dispels my worries about the feasibility of the 2nd approach in my case. The only remaining question is whether I can get familiar with the BTL code fast enough. What would you suggest I read to learn quickly how to create a BTL component?
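
For the 1st approach, the dispatch decision boils down to a nearest-neighbor test on the torus. A minimal sketch of such a test follows, assuming the row-major rank-to-coordinate relation from the loop above and periodic boundaries in every dimension. The function name torus_nearest_neighbors is hypothetical, and the torus primitives themselves are not shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if ranks a and b are exactly one hop apart on a torus
 * with extents dims[0..ndims-1], using row-major rank ordering and
 * wraparound in every dimension. */
bool torus_nearest_neighbors(int a, int b, int ndims, const int *dims)
{
    int nprocs = 1;
    for (int i = 0; i < ndims; ++i) {
        assert(dims[i] > 0);
        nprocs *= dims[i];
    }
    assert(a >= 0 && a < nprocs && b >= 0 && b < nprocs);

    int hops = 0;
    for (int i = 0; i < ndims; ++i) {
        nprocs /= dims[i];
        int ca = a / nprocs;          /* coordinate of a in dimension i */
        int cb = b / nprocs;          /* coordinate of b in dimension i */
        a %= nprocs;
        b %= nprocs;

        int d = ca > cb ? ca - cb : cb - ca;
        if (d > dims[i] - d) {
            d = dims[i] - d;          /* the shorter way around the ring */
        }
        hops += d;
    }
    return hops == 1;
}
```

A wrapper around a send call could use this test to choose between the torus primitive and the regular MPI path. On a 4 x 4 torus, for instance, rank 0 is a nearest neighbor of ranks 1, 3, 4 and 12, but not of rank 5.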

Many thanks and best regards, Luigi



