Ok, I think we're mostly converged on a solution. This might not get implemented immediately (got some other pending v1.3 stuff to bug fix, etc.), but it'll happen for v1.3.

- endpoint creation will mpool alloc/register a small buffer for the handshake
- the CPC does not need to call _post_recvs(); instead, it can just post the single small buffer (the one on the endpoint) on each BSRQ QP
- the CPC will call _connected() (in the main thread, not the CPC progress thread) when all BSRQ QPs are connected
- if _post_recvs() was previously called, do the normal "finish setting up" stuff and declare the endpoint CONNECTED
  - if _post_recvs() was not previously called, then:
    - call _post_recvs()
    - send a short CTS message on the 1st BSRQ QP
    - wait for CTS from peer
- when the peer's CTS has arrived *and* we have sent our own CTS, declare the endpoint CONNECTED (rough sketch of this flow below)
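
Roughly, the _connected() path could look something like this -- a minimal
sketch only; the struct, helpers, and field names below are placeholders I
made up for illustration, not the actual openib BTL symbols:

#include <stdbool.h>

/* Hypothetical, minimal endpoint state for illustration; the real
   openib BTL endpoint struct is of course much larger. */
enum { STATE_CONNECTING, STATE_CONNECTED };

typedef struct {
    bool recvs_posted;   /* has _post_recvs() already been called? */
    bool cts_sent;       /* have we sent our CTS? */
    bool cts_received;   /* has the peer's CTS arrived? */
    int  state;
} endpoint_t;

/* Placeholders for the helpers discussed above. */
void post_recvs(endpoint_t *ep);
void send_cts_on_first_bsrq_qp(endpoint_t *ep);
void progress_pending_frags(endpoint_t *ep);

/* Invoked in the main thread by the CPC once all BSRQ QPs are up. */
void endpoint_cpc_complete(endpoint_t *ep)
{
    if (ep->recvs_posted) {
        /* _post_recvs() already ran: normal "finish setting up" path. */
        ep->state = STATE_CONNECTED;
        progress_pending_frags(ep);
        return;
    }

    /* Deferred path: post the real buffers, then send our CTS on the
       first (smallest) BSRQ QP. */
    post_recvs(ep);
    ep->recvs_posted = true;
    send_cts_on_first_bsrq_qp(ep);
    ep->cts_sent = true;

    /* Only CONNECTED once the peer's CTS has also arrived (handled in
       the receive path). */
    if (ep->cts_received) {
        ep->state = STATE_CONNECTED;
        progress_pending_frags(ep);
    }
}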

Doing it this way adds no overhead to OOB/XOOB (who don't need this extra handshake). I think the code can be factored nicely to make this not too complicated.

I'll work on this once I figure out the memory corruption I'm seeing in the receive_queues patch...

Note that this addresses the wireup multi-threading issues -- not iWarp SRQ issues. We'll tackle those separately, and possibly not for the initial v1.3.0 release.


On May 20, 2008, at 6:02 AM, Gleb Natapov wrote:

On Mon, May 19, 2008 at 01:38:53PM -0400, Jeff Squyres wrote:
5. ...?
What about moving the posting of receive buffers into the main thread?
With SRQ it is easy: don't post anything in the CPC thread. The main
thread will prepost buffers automatically after the first fragment is
received on the endpoint (in btl_openib_handle_incoming()). With PPRQ
it's more complicated. What if we prepost dummy buffers (not from a
free list) during the IBCM connection stage and run another three-way
handshake protocol using those buffers, but from the main thread? We
would need to prepost one buffer on the active side and two buffers on
the passive side.


This is probably the most viable alternative -- it would be easiest if
we did this for all CPCs, not just for IBCM:

- for PPRQ: CPCs only post a small number of receive buffers, suitable
for another handshake that will run in the upper-level openib BTL
- for SRQ: CPCs don't post anything (because the SRQ already "belongs"
to the upper-level openib BTL) -- rough sketch below
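
A minimal sketch of that posting rule; the enum, the helper, and the loop
below are all illustrative names (the real CPCs would get this information
from the parsed receive_queues specification):

/* Illustrative only: what a CPC would post per BSRQ QP under this scheme. */
typedef enum { QP_PPRQ, QP_SRQ } qp_kind_t;

/* Placeholder: posts the one small preallocated handshake buffer. */
void post_single_handshake_recv(int qp_index);

void cpc_post_initial_recvs(const qp_kind_t *kinds, int num_qps)
{
    for (int i = 0; i < num_qps; ++i) {
        if (QP_PPRQ == kinds[i]) {
            /* PPRQ: just the small handshake buffer */
            post_single_handshake_recv(i);
        }
        /* SRQ: post nothing -- the SRQ belongs to the upper-level
           openib BTL */
    }
}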

Do we have a BSRQ restriction that there *must* be at least one PPRQ?
No. We don't have such a restriction, and I wouldn't want to add one.

If so, we could always run the upper-level openib BTL
really-post-the-buffers handshake over the smallest-buffer-size BSRQ RC
PPRQ (i.e., have the CPC post a single receive on this QP -- see
below), which would make things much easier.  If we don't already have
this restriction, would we mind adding it? We have one PPRQ in our
default receive_queues value, anyway.
If there is no PPRQ, then we can rely on the RNR/retransmit logic in
case there are not enough buffers in the SRQ. We do that anyway in the
openib BTL code.


With this rationale, once the CPC says "ok, all BSRQ QPs are
connected", then _endpoint.c can run a CTS handshake to post the
"real" buffers, where each side does the following:

- the CPC calls _endpoint_connected() to tell the upper-level BTL that
it is fully connected (the function is invoked in the main thread)
- _endpoint_connected() posts all the "real" buffers to all the BSRQ
QPs on the endpoint
- _endpoint_connected() then sends a CTS control message to the remote
peer via the smallest RC PPRQ
- upon receipt of CTS (rough sketch below):
  - release the buffer (***)
  - set the endpoint state to CONNECTED and let all pending messages
flow... (as it happens today)
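
The receipt side would then be roughly the following -- again, every name
here is a placeholder for illustration, not the real openib BTL API:

typedef struct endpoint endpoint_t;   /* opaque here; illustrative */

/* Placeholders for the steps listed above. */
void release_handshake_buffer(endpoint_t *ep);          /* the (***) step */
void set_connected_and_progress_pending(endpoint_t *ep);

/* Called from the receive path (e.g. btl_openib_handle_incoming())
   when the peer's CTS arrives on the smallest RC PPRQ. */
void on_cts_received(endpoint_t *ep)
{
    release_handshake_buffer(ep);
    set_connected_and_progress_pending(ep);   /* endpoint is CONNECTED;
                                                 pending sends flow as today */
}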

So it actually doesn't even have to be a handshake -- it's just an
additional CTS sent over the newly-created RC QP.  Since it's RC, we
don't have to do much -- just wait for the CTS to know that the remote
side has actually posted all the receives that we expect it to have.
Since the CTS flows over a PPRQ, there's no issue about receiving the
CTS on an SRQ (because the SRQ may not have any buffers posted at any
given time).
Correct. A full handshake is not needed. The trick is to allocate those
initial buffers in a smart way. IMO the initial buffer should be very
small (a couple of bytes only) and be preallocated at endpoint creation.
This will solve the locking problem.
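
A minimal sketch of that preallocation at endpoint-creation time; the real
code would presumably go through the openib mpool rather than a raw
malloc()/ibv_reg_mr(), and the struct and size below are only illustrative:

#include <infiniband/verbs.h>
#include <stdlib.h>

#define CTS_BUF_SIZE 8   /* "a couple of bytes" is enough for the CTS */

struct cts_buffer {
    void          *addr;
    struct ibv_mr *mr;
};

/* Done in the main thread when the endpoint is created, so the CPC
   thread never has to touch the free lists or the mpool. */
int alloc_cts_buffer(struct ibv_pd *pd, struct cts_buffer *cts)
{
    cts->addr = malloc(CTS_BUF_SIZE);
    if (NULL == cts->addr) {
        return -1;
    }

    cts->mr = ibv_reg_mr(pd, cts->addr, CTS_BUF_SIZE,
                         IBV_ACCESS_LOCAL_WRITE);
    if (NULL == cts->mr) {
        free(cts->addr);
        return -1;
    }
    return 0;
}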

--
                        Gleb.


--
Jeff Squyres
Cisco Systems
