More like trying to work around the race condition that exists: the
server side sends an rdma message first, thus violating the iwarp
protocol.  For those who want the gory details: when the server sends
first -and- that rdma message arrives at the client _before_ the client
transitions into rdma mode, that rdma message gets passed up to the
host driver as streaming mode data.  So when I originally ran OMPI/udapl
on chelsio's rnic, the client actually received the mpa start response
-and- the first fpdu from the server while still in streaming mode.
This results in a connection abort because it violates the mpa startup
protocol...


I would recommend making a state transition diagram of the lazy connection establishment. I did this during implementation of the Open IB BTL. Of course, I threw it out as soon as the code was committed :-). This shouldn't take very long and then you can simply modify the state diagram to prevent these race conditions, perhaps identifying an existing state where you can post any buffers that you need. Don't forget to throw away the state diagram as soon as the code is committed ;-).

- Galen
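
The state diagram suggested above can also be written down as a transition table. The following is only a sketch: the state and event names are hypothetical and are not taken from the udapl or Open IB BTL sources. It assumes the ordering proposed later in this thread, where receive buffers are posted before connecting or accepting.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical states for one lazily established connection.
    CLOSED = auto()
    RECVS_POSTED = auto()
    CONNECTING = auto()
    ESTABLISHED = auto()
    EXCHANGED = auto()


# (current state, event) -> next state. Anything not listed is illegal.
TRANSITIONS = {
    (State.CLOSED, "post_recvs"): State.RECVS_POSTED,
    (State.RECVS_POSTED, "connect_or_accept"): State.CONNECTING,
    (State.CONNECTING, "established_event"): State.ESTABLISHED,
    (State.ESTABLISHED, "addr_exchange_done"): State.EXCHANGED,
}


def step(state, event):
    """Return the next state, or raise if the event is illegal here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"illegal event {event!r} in state {state.name}"
        ) from None


def run(events, state=State.CLOSED):
    """Feed a sequence of events through the table and return the end state."""
    for event in events:
        state = step(state, event)
    return state
```

With a table like this, an attempt to connect before the recvs are posted shows up as an illegal transition instead of as a race on the wire.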



By causing the server to sleep just after accepting the connection, I
give the client time to transition into rdma mode...

It was just a hack to continue testing OMPI/udapl/chelsio.  And it
revealed the problem being discussed in this thread:  OMPI udapl btl
doesn't pre-post the recvs for the sendrecv exchange.


Neither the client nor the server side of the udapl btl
connection setup pre-post RECV buffers before connecting.
This can allow a SEND to arrive before a RECV buffer is
available.  I _think_ IB will handle this issue by
retransmitting the SEND.  Chelsio's iWARP device, however,
TERMINATEs the connection.  My sleep() makes this condition
happen every time.




A compliant DAPL program also ensures that there are adequate
receive buffers in place before the remote peer Sends. It is
explicitly noted that failure to follow this rule will invoke
a transport/device dependent penalty. It may be that the sendq
will be fenced, or it may be that the connection will be terminated.

So any RDMA BTL should pre-post recv buffers before initiating or
accepting a connection.
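
A toy model of why the pre-post matters is given below. It is not DAPL code: `Endpoint`, `post_recv`, `incoming_send` and the `on_missing_recv` policy are hypothetical names. The "terminate" policy stands in for the behaviour reported for the Chelsio device in this thread, and "retry" stands in for the guess that IB retransmits the SEND.

```python
from collections import deque


class Endpoint:
    """Toy receive side of a connection; not a real DAPL endpoint."""

    def __init__(self, on_missing_recv="terminate"):
        self.recv_queue = deque()               # sizes of posted RECV buffers
        self.connected = True
        self.on_missing_recv = on_missing_recv  # hypothetical transport policy

    def post_recv(self, size):
        self.recv_queue.append(size)

    def incoming_send(self, payload):
        """Model a SEND arriving from the peer and report what happens."""
        if not self.connected:
            return "dropped"
        if not self.recv_queue:
            # Transport/device dependent penalty for a missing RECV buffer.
            if self.on_missing_recv == "terminate":
                self.connected = False
                return "terminated"
            return "retry"
        self.recv_queue.popleft()               # the buffer is consumed
        return "delivered"
```

Posting the buffer before the connection exists means the "terminated" branch cannot be reached by the first message, whichever side sends it.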



I know of no udapl restriction saying a recv must be posted before a send.
And yes, we do pre-post recv buffers, but since the BTL creates 2
connections per peer, one for eager size messages and one for max size
messages, the BTL needs to know which connection the current endpoint is
to service so that it can post the proper size recv buffer.
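
The dependency described above (buffer size depends on which of the two connections the endpoint serves) can be sketched as follows. The sizes and names here are made up for illustration and are not the BTL's actual parameters or defaults.

```python
# Hypothetical sizes, for illustration only.
EAGER_SIZE = 8 * 1024
MAX_SIZE = 64 * 1024

RECV_SIZE_BY_CONNECTION = {"eager": EAGER_SIZE, "max": MAX_SIZE}


def recv_size_for(connection_kind):
    """Return the recv buffer size for an 'eager' or 'max' connection."""
    try:
        return RECV_SIZE_BY_CONNECTION[connection_kind]
    except KeyError:
        raise ValueError(
            f"unknown connection kind: {connection_kind!r}"
        ) from None


def prepost(connection_kind, count):
    """Return the buffer sizes that would be posted on this endpoint."""
    return [recv_size_for(connection_kind)] * count
```

The point of the sketch is only that the connection kind has to be known at the time of the post; until then the size of the buffer is undetermined.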

Also, I agree that in theory the btl could post the recv, which
currently occurs in mca_btl_udapl_sendrecv, before the connect or
accept, but I think in practice we had issues doing this and had to wait
until a DAT_CONNECTION_EVENT_ESTABLISHED was received.


Well I'm trying it now.  It should work.  If it doesn't, then dapl or
the underlying providers are broken.




From what I can tell, the udapl btl exchanges memory info as a first
order of business after connection establishment
(mca_btl_udapl_sendrecv()).  The RECV buffer post for this
exchange, however, should really be done _before_ the
dat_ep_connect() on the active side, and _before_ the
dat_cr_accept() on the server side.
Currently it's done after the ESTABLISHED event is dequeued,
thus allowing the race condition.

I believe the rules are the ULP must ensure that a RECV is
posted before the client can post a SEND for that buffer.
And further, the ULP must enforce flow control somehow so
that a SEND never arrives without a RECV buffer being available.
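
One common way to enforce that kind of flow control is a credit scheme, sketched below. This is a generic technique and an assumption on my part, not something this thread says the udapl BTL does; `CreditSender` and its methods are hypothetical names. Each credit stands for one RECV buffer known to be posted at the peer.

```python
class CreditSender:
    """Sender that only puts a message on the wire when it holds a credit."""

    def __init__(self, initial_credits):
        self.credits = initial_credits  # RECV buffers known posted at the peer
        self.pending = []               # messages waiting for a credit
        self.wire = []                  # messages actually sent

    def send(self, msg):
        """Send now if a credit is available, otherwise queue the message."""
        if self.credits > 0:
            self.credits -= 1
            self.wire.append(msg)
            return True
        self.pending.append(msg)
        return False

    def credit_returned(self, n=1):
        """The peer reposted n RECV buffers; drain queued messages in order."""
        self.credits += n
        while self.credits > 0 and self.pending:
            self.credits -= 1
            self.wire.append(self.pending.pop(0))
```

With the initial credit count equal to the number of pre-posted buffers, a SEND is never issued unless a buffer is known to be waiting for it.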


Maybe this is a rule iwarp imposes on its ULPs but not uDAPL.

Perhaps this is just a bug and I opened it up with my sleep()

Or is the uDAPL btl assuming the transport will deal with
lack of RECV buffer at the time a SEND arrives?


There may be a race condition here but you really have to try hard to
see it.


I agree it's a small race condition that I made very large by my
sleep! :-) But I can kill 2 birds with one stone here: I'm testing now
a change to the sendrecv exchange to:

1) prepost the recvs just after dat_ep_create

2) force the side that issues the dat_ep_connect to send its addr-data
first

3) force the side that issues the dat_cr_accept to wait for the send
from the peer, then post its addr-data send

This will plug both race conditions.  I'll post the patch once I debug
it and we can discuss whether you think it's a good solution and/or
whether it works for solaris udapl.
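
The three steps above can be sketched as a toy model. This is not the patch itself: `Side`, `deliver` and `wire_up` are hypothetical names, and the model only checks the ordering, namely that both recvs exist before the connection and that the connecting side's addr-data goes first.

```python
class Side:
    """One end of the connection in the toy model."""

    def __init__(self, name, addr_data):
        self.name = name
        self.addr_data = addr_data
        self.posted_recvs = 0
        self.peer_addr = None

    def prepost_recv(self):
        self.posted_recvs += 1

    def deliver(self, data):
        """A SEND from the peer arrives; it needs a posted RECV."""
        if self.posted_recvs == 0:
            raise RuntimeError(f"{self.name}: SEND arrived with no RECV posted")
        self.posted_recvs -= 1
        self.peer_addr = data


def wire_up(connector, acceptor):
    """Run the proposed exchange and return the order of events."""
    trace = []
    # 1) Both sides prepost their recv just after endpoint creation.
    for side in (connector, acceptor):
        side.prepost_recv()
        trace.append(f"{side.name} prepost recv")
    trace.append("established")
    # 2) The side that connected sends its addr-data first.
    acceptor.deliver(connector.addr_data)
    trace.append(f"{connector.name} send addr-data")
    # 3) The accepting side replies only after receiving the peer's send.
    connector.deliver(acceptor.addr_data)
    trace.append(f"{acceptor.name} send addr-data")
    return trace
```

In this model the exchange still uses one message in each direction; only the ordering and the point at which the recvs are posted change.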

From Steve previously:
"Also: Given there is a message exchange _always_ after connection
setup, then we can change that exchange to support the
client-must-send-first issue..."

I agree I am sure we can do something but if it includes an additional
message we should consider a mca parameter to govern this because the
connection wireup is already costly enough.


It won't add an additional message.  Stay tuned for a patch!

Steve.

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
