On Oct 22, 2007, at 3:06 PM, Brad Penoff wrote:

We had some questions about the best way to make use of Open MPI's
features for a new BTL...  the general theme is making use of the
opal_event's versus a btl_progress function.  When is it best to do
one versus the other?

In our Paris engineering meeting, we had a lengthy discussion about a related topic. A few things will come out of that conversation:

- We'll be updating libevent in the not-distant future (see previous mail today about that)
- After updating libevent, we'll be updating to use the more modern epoll (and friends) interfaces. They're manually disabled [with good reason] in our libevent for reasons that are too boring to describe (but I can if you care).
- BTLs with a device under them are free to use libevent for fd-based progress and/or a progress function. Software layers without underlying devices should not use progress functions.
- We'll eventually be adding a blocking interface to the BTLs. More info TBD on that.

We are working on several designs for an SCTP BTL for Open MPI.  The
familiar one is to use "TCP-style" one-to-one sockets, which have a
socket per endpoint pair, just like the TCP BTL does now.  However, a
more unfamiliar one is to use a single "UDP-style" one-to-many socket
per BTL.  To illustrate, pretend you have 3 processes... each process
only has one socket upon which connections are established, messages
are sent, and messages are received to/from the other two processes.
It is this design that we currently have some questions about.

So far, we have not been implementing our own btl_progress function.
This means that within opal_progress(), poll() is called based on the
opal events registered within the BTL.  Like TCP, for example, when an
MPI_Send happens, the endpoint_send_event is added and POLLOUT is
added for this socket for a given endpoint.  Since MPI_Send is
blocking, it doesn't really matter that this socket is used for other
btl_endpoints because it is the only endpoint with an opal event for
sending added.  However, this is not the case with non-blocking...

When we have multiple outstanding non-blocking requests to different
endpoints, we have to queue them since the endpoints share the same
one-to-many socket and events are associated with a single
btl_endpoint.

From proc C, say we have this pseudo code running:
iSend(proc A)
iSend(proc B)
Waitall()

Within Waitall, our current design using opal events lets the iSend to
proc A eventually complete, but the iSend to proc B can't start until
proc A's is done.  We currently queue the endpoints waiting for the
poll() POLLOUT event and dequeue from this queue when the event from
proc A's endpoint is deleted (and then add proc B's endpoint to the
POLLOUT event).
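
To make the serialization concrete, here is a minimal sketch of the queueing scheme described above. It is not Open MPI code: every name in it (shared_socket_t, endpoint_t, start_send, on_writable) is hypothetical, and the "socket" is reduced to a byte count so the logic can be read on its own. Only one endpoint owns the send event at a time; the others wait in a FIFO.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only; not Open MPI structures. */
#define MAX_WAITING 8

typedef struct {
    int id;            /* which peer this endpoint talks to */
    size_t remaining;  /* bytes of the current send still unsent */
} endpoint_t;

typedef struct {
    endpoint_t *waiting[MAX_WAITING]; /* FIFO of endpoints waiting for POLLOUT */
    size_t head, count;
    endpoint_t *active;               /* the one endpoint owning the send event */
} shared_socket_t;

/* Start a send: take the send event if it is free, otherwise wait in line.
 * Returns 1 if the send started immediately, 0 if it was queued. */
static int start_send(shared_socket_t *s, endpoint_t *ep, size_t len)
{
    ep->remaining = len;
    if (s->active == NULL) {
        s->active = ep;
        return 1;
    }
    assert(s->count < MAX_WAITING);
    s->waiting[(s->head + s->count) % MAX_WAITING] = ep;
    s->count++;
    return 0;
}

/* Called when the socket reports writable; nbytes is how much it accepted.
 * Returns the id of the endpoint whose send completed, or -1 if none did. */
static int on_writable(shared_socket_t *s, size_t nbytes)
{
    endpoint_t *ep = s->active;
    int done;

    if (ep == NULL)
        return -1;
    if (nbytes > ep->remaining)
        nbytes = ep->remaining;
    ep->remaining -= nbytes;
    if (ep->remaining > 0)
        return -1;                    /* still sending to the same peer */

    /* Send finished: release the event and hand it to the next waiter. */
    done = ep->id;
    s->active = NULL;
    if (s->count > 0) {
        s->active = s->waiting[s->head];
        s->head = (s->head + 1) % MAX_WAITING;
        s->count--;
    }
    return done;
}
```

With sends to A and then B started back to back, B's endpoint makes no progress until on_writable() reports A's send complete, which is exactly the restriction in question.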

Can you think of a way using the existing framework to eliminate the
restriction of the send to proc B having to complete prior to the send
to proc B starting?

I assume you meant "send to proc *A* having to complete..."

We were trying to use the existing framework but for our case, it may make more sense to implement our own btl_progress function since poll() doesn't really make sense for a single socket anyway... Do you think that would be best?

I guess I don't quite understand -- are you saying that you can have 2 concurrent writes occurring on the same socket to 2 different destinations?

If so, and if libevent doesn't match the SCTP paradigm, then I say: sure, write your own progress function.
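
For what it's worth, here is a rough sketch of what such a progress function might look like. It is only an illustration under assumptions: sketch_progress, pending_send_t and FRAG_SIZE are made-up names, it uses plain send() with MSG_DONTWAIT rather than any SCTP-specific call, and it ignores how a real one-to-many socket would name the destination of each message. The idea is simply to check the single socket once without blocking and then give every endpoint with pending data a turn, instead of tying the socket to one endpoint's event.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#define FRAG_SIZE 64   /* illustrative fragment size, not an Open MPI value */

/* Hypothetical per-peer send state. */
typedef struct {
    const char *buf;   /* data still to be sent to this peer */
    size_t remaining;  /* bytes left in buf */
} pending_send_t;

/* Hypothetical progress function for one shared socket.
 * Returns the number of sends completed in this call, or -1 on a hard error. */
static int sketch_progress(int fd, pending_send_t *sends, size_t n)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int completed = 0;
    size_t i;

    /* Zero timeout: a progress function must never block. */
    if (poll(&pfd, 1, 0) <= 0 || !(pfd.revents & POLLOUT))
        return 0;

    /* Give each peer with pending data one fragment per call. */
    for (i = 0; i < n; i++) {
        pending_send_t *p = &sends[i];
        size_t len;
        ssize_t rc;

        if (p->remaining == 0)
            continue;
        len = p->remaining < FRAG_SIZE ? p->remaining : FRAG_SIZE;
        rc = send(fd, p->buf, len, MSG_DONTWAIT);
        if (rc < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
                break;                /* socket is full for now; retry later */
            return -1;
        }
        p->buf += rc;
        p->remaining -= (size_t)rc;
        if (p->remaining == 0)
            completed++;
    }
    return completed;
}
```

In this sketch a short send to B can finish while a longer send to A is still in flight, because neither one holds the socket exclusively.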

George: can you confirm / deny?

We noticed that mca_bml_r2_progress calls btl_progress[i]() which is
set in mca_bml_r2_add_procs if NULL !=
btl->btl_component->btl_progress.  Is there an example of a btl that
implements its own btl_progress function?  I just want to make sure
this is even a possibility before traveling down this path...  and
maybe learn from others prior.

The openib btl has its own progress function.
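
As a simplified picture of the mechanism you describe (not the actual r2 code; progress_table_t, register_progress and run_progress are hypothetical names, and the duplicate check is an assumption of this sketch), a component's progress function is recorded only when it is non-NULL, and the progress loop then calls each recorded function in turn:

```c
#include <assert.h>
#include <stddef.h>

/* A progress function returns the number of events it handled. */
typedef int (*progress_fn_t)(void);

#define MAX_PROGRESS 16

/* Hypothetical table of registered progress functions. */
typedef struct {
    progress_fn_t fns[MAX_PROGRESS];
    size_t count;
} progress_table_t;

/* Record fn unless it is NULL or already registered. */
static void register_progress(progress_table_t *t, progress_fn_t fn)
{
    size_t i;

    if (fn == NULL)
        return;                       /* component has no progress function */
    for (i = 0; i < t->count; i++)
        if (t->fns[i] == fn)
            return;                   /* already registered */
    assert(t->count < MAX_PROGRESS);
    t->fns[t->count++] = fn;
}

/* Call every registered progress function once and sum the results. */
static int run_progress(const progress_table_t *t)
{
    int total = 0;
    size_t i;

    for (i = 0; i < t->count; i++)
        total += t->fns[i]();
    return total;
}

/* Stand-in for a device-backed component's progress function. */
static int example_device_progress(void)
{
    return 2;                         /* pretend two completions were reaped */
}
```

A component that leaves its progress pointer NULL is never called from this loop and relies on event-based progress alone.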

--
Jeff Squyres
Cisco Systems
