On Thu, 2007-05-10 at 23:10 -0400, Donald Kerr wrote: > > Caitlin Bestler wrote: > > >devel-boun...@open-mpi.org wrote: > > > > > >>>There are two new issues so far: > >>> > >>>1) this has uncovered a connection migration issue in the Chelsio > >>>driver/firmware. We are developing and testing a fix for this now. > >>>Should be ready tomorrow hopefully. > >>> > >>> > >>> > >>I have a fix for the above issue and I can continue with OMPI testing. > >> > >>To work around the client-must-send issue, I put a nice fat > >>sleep in the udapl btl right after it calls dat_cr_accept(), > >>in mca_btl_udapl_accept_connect(). This, however, exposes > >>another issue with the udapl btl: > >> > >> > sleeping after accept? What are you trying to do here force a race > condition? >
More like trying to work around the race condition that exists: The server side sends an rdma message first thus violating the iwarp protocol. For those who want the gory details: when the server sends first -and- that rdma message arrives at the client _before_ the client transitions into rdma mode, then that rdma message gets passed up to the host driver as streaming mode data. So when I originally ran OMPI/udapl on chelsio's rnic, the client actually received the mpa start response -and- the first fpdu from the server while still in streaming mode. This results in a connection abort because it violates the mpa startup protocol... By causing the server to sleep just after accepting the connection, I give the client time to transition into rdma mode... It was just a hack to continue testing OMPI/udapl/chelsio. And it revealed the problem being discussed in this thread: OMPI udapl btl doesn't pre-post the recvs for the sendrecv exchange. > >>Neither the client nor the server side of the udapl btl > >>connection setup pre-post RECV buffers before connecting. > >>This can allow a SEND to arrive before a RECV buffer is > >>available. I _think_ IB will handle this issue by > >>retransmitting the SEND. Chelsio's iWARP device, however, > >>TERMINATEs the connection. My sleep() makes this condition > >>happen every time. > >> > >> > >> > > > >A compliant DAPL program also ensures that there are adequate > >receive buffers in place before the remote peer Sends. It is > >explicitly noted that failure to follow this real will invoke > >a transport/device dependent penalty. It may be that the sendq > >will be fenced, or it may be that the connection will be terminated. > > > >So any RDMA BTL should pre-post recv buffers before initiating or > >accepting a connection. > > > > > > > I know of no udapl restiction saying a recv must be posted before a send. > And yes we do pre post recv buffers but since the BTL creates 2 > connections per peer, one for eager size messages and one for max size > messages the BTL needs to know which connection the current endpoint is > to service so that it can post the proper size recv buffer. > > Also, I agree in theory the btl could potentially post the recv which > currently occurs in mca_btl_udapl_sendrecv before the connect or accept > but I think in practise we had issue doing this and we had to wait until > a DAT_CONNECTION_EVENT_ESTABLISHED was received. > Well I'm trying it now. It should work. If it doesn't, then dapl or the underlying providers are broken. > > > > > > > >>>From what I can tell, the udapl btl exchanges memory info as a first > >>> > >>> > >>order of business after connection establishment > >>(mba_btl_udapl_sendrecv(). The RECV buffer post for this > >>exchange, however, should really be done _before_ the > >>dat_ep_connect() on the active side, and _before_ the > >>dat_cr_accept() on the server side. > >>Currently its done after the ESTABLISHED event is dequeued, > >>thus allowing the race condition. > >> > >>I believe the rules are the ULP must ensure that a RECV is > >>posted before the client can post a SEND for that buffer. > >>And further, the ULP must enforce flow control somehow so > >>that a SEND never arrives without a RECV buffer being available. > >> > >> > maybe this is a rule iwarp imposes on its ULPs but not uDAPL. > > >>Perhaps this is just a bug and I opened it up with my sleep() > >> > >>Or is the uDAPL btl assuming the transport will deal with > >>lack of RECV buffer at the time a SEND arrives? > >> > >> > There may be a race condition here but you really have to try hard to > see it. > I agree its a small race condition that I made very large by my sleep! :-) But I can kill 2 birds with one stone here: I'm testing now a change to the sendrecv exchange to: 1) prepost the recvs just after dat_ep_create 2) force the side that issues the dat_ep_connect to send its addr-data first 3) force the side that issues the dat_cr_accept to wait for the send from the peer, then post its addr-data send This will plug both race conditions. I'll post the patch once I debug it and we can discuss if you thinks a good solution and/or if it work for solaris udapl. > From Steve previously. > "Also: Given there is a message exchange _always_ after connection > setup, then we can change that exchange to support the > client-must-send-first issue..." > > I agree I am sure we can do something but if it includes an additional > message we should consider a mca parameter to govern this because the > connection wireup is already costly enough. > It won't add an additional message. Stay tuned for a patch! Steve.