On 30 Oct 2009, at 20:28, Jeff Squyres wrote:
What George is describing is the Right answer, but it may take you
a little time.
FWIW: the complexity of a topo component is actually pretty low.
It's essentially a bunch of glue code (that I can probably mostly
provide) and your mapping algorithms about how to reorder the
communicator ranks.
To be clear: topo components are *ONLY* about re-ordering ranks in
a communicator -- the back-end of MPI_CART_CREATE and friends.
The BTL components that George is talking about are Byte Transfer
Layer components; essentially the brains behind MPI_SEND and
friends. Open MPI has a per-device list of BTLs that can service
each peer MPI process. Hence, if you're sending to another MPI
process on the same host, the first BTL in the list will be the
shared memory BTL. If you're sending to an MPI process on a
different server that you're connected to via Ethernet, the TCP BTL
may be at the top of the list. And so on.
It sounds like you actually want to make *two* components:
- topo: for reordering ranks during MPI_CART_CREATE and friends
- btl: use the underlying network primitives for sending when possible
As George indicated, the BTL module in each MPI process can
determine during startup which MPI process peers it can talk to.
It can then tell the upper-layer routing algorithm "I can talk to
peer processes X, Y, and Z -- I cannot talk to peer processes A, B,
and C". The upper-layer router (the PML module) will then put your
BTL at the top of the list for peer processes X, Y, and Z, and will
not put your BTL on the list for peer processes A, B, and C. For
A, B, and C, other BTLs will be used (e.g., TCP).
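To make the per-peer list idea concrete, here is a toy model in plain C. It is only a sketch: `toy_btl`, `build_peer_list`, the numeric priorities, and the 8-node ring used for the "torus" reachability test are all hypothetical and are not Open MPI's real data structures or API. It only shows the selection logic: each transport reports whether it can reach a peer, and the reachable ones are ordered by preference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified transport descriptor (NOT Open MPI's BTL struct). */
typedef struct {
    const char *name;
    int priority;                          /* higher value = preferred */
    int (*reachable)(int self, int peer);  /* nonzero if peer can be reached */
} toy_btl;

/* Assumed toy network: 8 nodes on a ring, only direct neighbors reachable. */
static int torus_reachable(int self, int peer)
{
    int d = ((peer - self) % 8 + 8) % 8;
    return d == 1 || d == 7;
}

/* TCP is assumed to reach every peer. */
static int tcp_reachable(int self, int peer)
{
    (void)self;
    (void)peer;
    return 1;
}

/* Fill out[] with the transports that can reach `peer`, highest priority
 * first, and return how many were stored. out[] must hold nbtls entries. */
static size_t build_peer_list(const toy_btl *btls, size_t nbtls,
                              int self, int peer, const toy_btl **out)
{
    size_t n = 0;
    for (size_t i = 0; i < nbtls; ++i) {
        if (!btls[i].reachable(self, peer)) {
            continue;  /* this transport stays off the peer's list */
        }
        /* Insertion sort by descending priority. */
        size_t j = n++;
        while (j > 0 && out[j - 1]->priority < btls[i].priority) {
            out[j] = out[j - 1];
            --j;
        }
        out[j] = &btls[i];
    }
    return n;
}
```

With a TCP entry at priority 10 and a torus entry at priority 50, the list for a ring neighbor comes out as torus then TCP, while the list for a non-neighbor contains TCP only.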
Does that make sense?
To answer your question from a prior mail: the unity topo component
is used for the remapping of ranks in MPI_CART_CREATE. Look in
ompi/mca/topo/unity/.
Thanks to everybody for the clarifications. The function I was
looking for is mca_topo_base_cart_create() in
ompi/mca/topo/base/topo_base_cart_create.c. More precisely, I needed
the loop:
    p = topo_data->mtc_dims_or_index;
    coords = topo_data->mtc_coords;
    dummy_rank = *new_rank;
    for (i = 0;
         (i < topo_data->mtc_ndims_or_nnodes && i < ndims);
         ++i, ++p) {
        dim = *p;
        nprocs /= dim;
        *coords++ = dummy_rank / nprocs;
        dummy_rank %= nprocs;
    }
This defines the precise relation between ranks and coordinates. Once
I know this, I do not even need to write a topo component, because I
can assign the ranks of my computing nodes in a rankfile so that each
node gets the coordinates that match its physical position.
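As a sanity check on that relation, here is a standalone sketch of the same arithmetic. It assumes that nprocs starts out as the product of all dimensions (the excerpt does not show its initialization); the function names `rank_to_coords` and `coords_to_rank` are mine, not Open MPI's. The inverse function is what one would use to decide which rank to give a node at a known physical position.

```c
#include <assert.h>

/* Rank -> coordinates, same arithmetic as the quoted loop:
 * row-major order, i.e. the last dimension varies fastest.
 * Assumes 0 <= rank < dims[0] * ... * dims[ndims-1]. */
static void rank_to_coords(int rank, int ndims, const int *dims, int *coords)
{
    int nprocs = 1;
    for (int i = 0; i < ndims; ++i) {
        assert(dims[i] > 0);
        nprocs *= dims[i];   /* assumed starting value: total process count */
    }
    assert(rank >= 0 && rank < nprocs);

    for (int i = 0; i < ndims; ++i) {
        nprocs /= dims[i];
        coords[i] = rank / nprocs;
        rank %= nprocs;
    }
}

/* Coordinates -> rank, the inverse of the mapping above. */
static int coords_to_rank(int ndims, const int *dims, const int *coords)
{
    int rank = 0;
    for (int i = 0; i < ndims; ++i) {
        assert(coords[i] >= 0 && coords[i] < dims[i]);
        rank = rank * dims[i] + coords[i];
    }
    return rank;
}
```

For example, with dims = {2, 3, 4}, rank 17 maps to coordinates (1, 1, 1), and coords_to_rank() maps (1, 1, 1) back to 17.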
A different issue is the BTL component. This is actually where my
approaches 1 and 2 differ (my previous distinction was confusing, due
to my lack of understanding of the distinction between topo and btl
components).
In the 1st approach I would redefine some MPI functions that are
crucial for my code so that they call the low-level torus primitives
when the communication occurs between nearest neighbors, and fall
back to the Open MPI functions otherwise.
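The decision "is this peer a nearest neighbor?" could be sketched as below. This is only an illustration under my own assumptions: the torus is periodic in every dimension, ranks map to coordinates in the row-major order of the loop above, and the names `torus_nearest_neighbors` and `MAX_DIMS` are hypothetical. The actual torus primitives and the MPI wrappers are not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_DIMS 8  /* hypothetical upper bound on the torus dimensionality */

/* Row-major rank -> coordinates (last dimension varies fastest). */
static void rank_to_coords(int rank, int ndims, const int *dims, int *coords)
{
    int nprocs = 1;
    for (int i = 0; i < ndims; ++i) {
        nprocs *= dims[i];
    }
    for (int i = 0; i < ndims; ++i) {
        nprocs /= dims[i];
        coords[i] = rank / nprocs;
        rank %= nprocs;
    }
}

/* True if ranks a and b are exactly one hop apart on a torus that is
 * periodic in every dimension. A wrapper around a send routine could use
 * this to choose between the torus primitives and the regular MPI path. */
static bool torus_nearest_neighbors(int a, int b, int ndims, const int *dims)
{
    int ca[MAX_DIMS], cb[MAX_DIMS];
    int hops = 0;

    assert(ndims > 0 && ndims <= MAX_DIMS);
    rank_to_coords(a, ndims, dims, ca);
    rank_to_coords(b, ndims, dims, cb);

    for (int i = 0; i < ndims; ++i) {
        int d = abs(ca[i] - cb[i]);
        int wrapped = dims[i] - d;   /* distance going around the ring */
        hops += (wrapped < d) ? wrapped : d;
    }
    return hops == 1;
}
```

On a 4x4 torus, rank 0 is a nearest neighbor of ranks 1, 3, 4, and 12 (the last via wrap-around), but not of rank 5 or of itself.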
The 2nd approach would be to develop our torus-btl. The fact that one
can choose a "priority list of networks" is definitely great and
dissipates my worries about the feasibility of the 2nd approach in my
case. The only remaining question is whether I can get familiar with
btl stuff fast enough. What would you suggest I read in order to
learn quickly how to create a BTL component?
Many thanks and best regards, Luigi