[EMAIL PROTECTED] wrote on Wed, 14 Jun 2006 15:21 +0200:

> If we did this within BMI, we would be paying an extra round trip
> latency time for each large TCP message, which we should probably try to
> avoid.
>
> I vote for just changing the ordering of sys-io.sm so that it does not
> post write flows until a positive write ack is received from the server.
> That is basically equivalent to performing one handshake (or
> rendezvous) for the whole flow rather than one per BMI message.
>
> The sys-io.sm state machine already does that for reads. Read flows do
> not get posted until an ack is received from the server.

Agree this is probably the best thing to do. But it may slow things down.

Unfortunately with the stream semantics and no rendezvous protocol in bmi_tcp, you're mostly stuck with that. You could go with two sockets per connection: data + control, but that seems like hacking around the problem. You could also switch to SCTP and hope that gets widely adopted someday. :)

Some IB insight: because it is legal in BMI to post a send before the receiver has posted a receive, even for expected messages, and because IB will break the connection if data arrives with no buffers posted, we have to implement rendezvous and credit-based flow control. And to do big transfers well, we have to exchange RDMA addresses before the transfer can go anyway.

So we won't see this problem with IB, and can actually overlap the write ack part with the RDMA address exchange that follows with nothing breaking. (There is a limit of 20 messages per connection in flight, including RDMA address exchange and unexpecteds and whatnot, but we're not approaching that.)

I haven't timed the gains from this overlap, but once I see the sys-io patch you need, I'll run some tests and see if there's any detectable difference.

-- Pete
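
For anyone not familiar with credit-based flow control as mentioned in the IB paragraph, here is a rough toy model of the sender side. It is only a sketch in Python, not the bmi_ib code: the class and method names (CreditedLink, post_send, credit_returned) are made up for illustration, and the only number taken from the message is the limit of 20 in-flight messages per connection. The idea is that a send consumes one credit, a send with no credit available is queued instead of being put on the wire, and each buffer the receiver reposts hands one credit back.

```python
from collections import deque


class CreditedLink:
    """Toy sender-side credit accounting (hypothetical, not the bmi_ib code)."""

    def __init__(self, max_in_flight=20):
        # 20 mirrors the per-connection in-flight limit quoted in the message.
        self.credits = max_in_flight
        self.pending = deque()  # sends waiting for a credit
        self.wire = []          # messages actually handed to the network

    def post_send(self, msg):
        """Send now if a credit is available, otherwise queue the message."""
        if self.credits > 0:
            self.credits -= 1
            self.wire.append(msg)
            return True
        self.pending.append(msg)
        return False

    def credit_returned(self):
        """The receiver reposted a buffer: release a queued send or bank the credit."""
        if self.pending:
            self.wire.append(self.pending.popleft())
        else:
            self.credits += 1


# Example with a small limit so the queueing is visible.
link = CreditedLink(max_in_flight=2)
for name in ("a", "b", "c"):
    link.post_send(name)   # "c" is queued because both credits are in use
link.credit_returned()     # one buffer reposted, so "c" goes out
```

Because a message never reaches the wire without a credit, the receiver always has a buffer posted for whatever arrives, which is the condition IB needs to keep the connection up.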