I talked with Steve a bunch on the phone about this.

1. This "connector must RDMA first" issue is an iWARP restriction -- it's not specific to udapl or verbs. For example, if you try to use udapl with iWARP on Solaris, you'll have the same issue (I have no idea whether you have iWARP drivers in Solaris or not).

2. Per his prior e-mail (which I didn't fully grok until I talked to him), using the RDMA CM in the openib BTL will not magically fix this issue for us.

3. So for any of the BTLs to support iWARP -- regardless of underlying protocol or OS -- they are going to have to obey this restriction.

4. Luckily, in iWARP, the restriction can be met by either send/receive semantics *or* RDMA semantics; you don't have to specifically use RDMA verbs semantics. This is good because of the way that OMPI works (the first fragment that will be transmitted is pretty much guaranteed to be a send/receive fragment, not an RDMA fragment) -- it makes the logistics slightly simpler.

Galen Shipman and I talked about this a bit and suggest the following:

- During the connection dance (probably for both the udapl and openib BTLs), whichever peer ends up being the connection initiator can send its pending fragment immediately. Steve is right that there will always be a pending fragment, because OMPI doesn't make a connection until the first send. Don't forget about the race condition where 2 peers may simultaneously decide to initiate -- this case is handled properly in the OMPI code, but make sure you modify the side that ends up being the actual initiator.

- The other peer (the receiver of the connection) must wait to send its pending fragment(s) until it receives the first frag from the connection initiator. This can be accomplished either with another flag on the OMPI module struct or perhaps by making it part of the connection protocol (i.e., don't transition the endpoint to CONNECTED until the first fragment is received). Either approach can be used to queue up fragments on the receiver until the first fragment arrives from the initiator. I'd have to look deeper in the code, but I'm *guessing* that it might be best to use the already-existing state flag (i.e., checking for CONNECTED) because then you won't be introducing any more conditionals in the critical path.
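
To make the second option concrete, here is a rough sketch in C of the "don't go CONNECTED until the first frag arrives" idea. Everything in it is hypothetical: the type names, the extra EP_WAIT_FIRST_FRAG state, and post_to_wire() are made up for illustration and are not the actual openib or udapl BTL code. It only shows that the send path can keep a single "is the endpoint CONNECTED?" check while the passive side queues its fragments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names here are hypothetical stand-ins, not real OMPI BTL types. */
typedef enum {
    EP_CLOSED,
    EP_CONNECTING,
    EP_WAIT_FIRST_FRAG, /* wire is up; passive side has not yet heard from the initiator */
    EP_CONNECTED
} ep_state_t;

typedef struct frag {
    struct frag *next;
    int id;
} frag_t;

typedef struct {
    ep_state_t state;
    frag_t *pending_head;
    frag_t *pending_tail;
    int sent_count;   /* stand-in for "posted to the network" */
    int last_sent_id;
} endpoint_t;

/* Placeholder for the real post-send call; it only records what was sent. */
static void post_to_wire(endpoint_t *ep, frag_t *f)
{
    ep->sent_count++;
    ep->last_sent_id = f->id;
}

/* Drain the pending queue in FIFO order. */
static void flush_pending(endpoint_t *ep)
{
    while (ep->pending_head != NULL) {
        frag_t *f = ep->pending_head;
        ep->pending_head = f->next;
        f->next = NULL;
        post_to_wire(ep, f);
    }
    ep->pending_tail = NULL;
}

/* Send path: the only test is the existing CONNECTED check. */
void ep_send(endpoint_t *ep, frag_t *f)
{
    assert(ep != NULL && f != NULL);
    if (ep->state == EP_CONNECTED) {
        post_to_wire(ep, f);
        return;
    }
    /* Not connected (or still waiting for the initiator): queue it. */
    f->next = NULL;
    if (ep->pending_tail != NULL) {
        ep->pending_tail->next = f;
    } else {
        ep->pending_head = f;
    }
    ep->pending_tail = f;
}

/* Called once the connection handshake completes on this side. */
void ep_connection_established(endpoint_t *ep, bool initiator)
{
    if (initiator) {
        /* The initiator may send right away. */
        ep->state = EP_CONNECTED;
        flush_pending(ep);
    } else {
        /* The passive side holds its fragments until the initiator speaks. */
        ep->state = EP_WAIT_FIRST_FRAG;
    }
}

/* Called when any fragment arrives from the peer. */
void ep_recv(endpoint_t *ep)
{
    if (ep->state == EP_WAIT_FIRST_FRAG) {
        ep->state = EP_CONNECTED;
        flush_pending(ep);
    }
}
```

The extra state check lives in the receive and connection-setup paths, so the send path is left with the one CONNECTED test it presumably already has. Whether that maps cleanly onto the real endpoint state machine is something I'd have to verify in the code.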




On May 9, 2007, at 4:45 PM, Donald Kerr wrote:

I guess I have not read enough about iwarp yet but if iwarp is sitting
below ib verbs or udapl in the stack and is trying to impose
restrictions which ib verbs or udapl do not adhere to then maybe iwarp
is in the wrong place in the ofed stack.

Having said that I do agree the OMPI community needs to consider where
iwarp plays in its own stack. If it has not already.

Steve Wise wrote:

On Wed, 2007-05-09 at 16:27 -0400, Donald Kerr wrote:


So then I agree with Andrew, I think you are trying to impose
restrictions on uDAPL which are not part of the Spec.




true, but if you want a single btl for IB and IW, then you'll need to
address this issue in some way...


_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems
