Re: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
On May 9, 2007, at 1:37 AM, Or Gerlitz wrote:

> Doing a bit of zoom out from the "how to make ofed's udapl work for ompi" thread, my thinking is that the ompi udapl btl enablement is actually only the first step, where for production/longterm/etc you want to have an rdmacm btl.

I think this is a bit of a misunderstanding.

The "BTL" in Open MPI is a byte transfer layer; it is a point-to-point abstraction for moving bytes between two processes. BTL components (read: plugins) are typically distinguished by the underlying protocols used. For example, we have an RC verbs-based BTL and we have a separate uDAPL-based BTL. Andrew is also working on a research-quality UD verbs-based BTL.

Hence, how a particular BTL component makes connections between process peers is really a side-effect of moving bytes around, and not the focus of the BTL. So having a "rdmacm" BTL doesn't really make sense.

If both the RC and UD verbs-based BTLs someday use the RDMA CM for connections, we might abstract the connection management out to a common piece of code between the two. But that's a different issue. If we end up having a mixed BTL someday that uses both RC and UD, then the need for the common code may go away. But that's in the future.

> Reasoning here is made of many arguments, among them the quickest i can make are:
>
> A) it seems that ompi would want to use not only RC but rather also UD multicast and unicast, which are not covered by udapl
>
> B) there's actually no real justification to maintain two APIs (namely udapl vs libibverbs/librdmacm), so down the road, only one of them would survive (udapl is implemented ***over*** libibverbs/librdmacm so if the latter dies, so does udapl). Specifically, I hear here and there that the OFED stack is now on its way to be deployed all over the place, specifically in commercial Unix OSs (which want modern! code that supports IPoIB-CM, RDS, SRP, iSER, etc., you name it) so eventually the rdmacm btl can be used also over Solaris et al.

I think that's not quite the point.

1. A piece of history: the uDAPL BTL was originally developed by a grad student just as an excuse to learn the BTL interface and OMPI internals. We already had an RC verbs-based BTL at the time.

2. When Sun joined Open MPI, they took over the development and maintenance of the uDAPL BTL because uDAPL is the only high performance stack on Solaris.

3. It's fine that Sun will someday support the same verbs interface that OFED does. But *today*, they don't. So for their current customers, they need to support uDAPL. As such, we have done little/no testing of uDAPL on OFED since Sun took over the uDAPL BTL -- all testing since that point has been on Solaris uDAPL. All of our Linux/OFED efforts have been on the verbs interface.

4. The Open MPI focus on uDAPL over OFED at the moment is simply to jump-start iWARP testing. Both NetEffect and Chelsio have chimed in to say that they will do the RDMA CM work for Open MPI, but uDAPL can serve as a temporary workaround that is usable [effectively] immediately while they get up to speed on the Open MPI code base and do the RDMA CM work.

--
Jeff Squyres
Cisco Systems
[OMPI devel] OMPI over ofed udapl - bugs opened
606 opened to track the udapl change.

607 opened to track the ompi change to remove the port number stashing hack.

Status: I have a patch from Arlin to test today. I will test with that patch and with the OMPI port hack removed. Stay tuned...

Steve.

On Tue, 2007-05-08 at 15:47 -0700, Arlin Davis wrote:
> Steve Wise wrote:
>
> > I would like the group to consider including changes needed to OMPI and/or ofa udapl to get OMPI working again on udapl for ofed-1.2.
> >
> > This will provide OMPI support over iwarp devices via udapl until we can get rdma-cm support added to OMPI.
> >
> > Steve.
>
> Steve, can you open a bug to track this?
Re: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
On Wed, 2007-05-09 at 08:37 +0300, Or Gerlitz wrote:
> Andrew Friedley wrote:
> > Jeff Squyres wrote:
> > > FWIW, yes, adding RDMA CM support has actually been on my to-do list for a while, but it keeps getting bumped by higher priority items. It would be *much* better if some iWARP companies got involved in Open MPI...
> >
> > Hmm I'm interested. I've already done some work switching over to RDMA CM for some research stuff I've been doing; it's not publicly accessible w/o the 3rd party agreement. I can help answer questions on what exactly needs to change, and do some testing.
>
> Doing a bit of zoom out from the "how to make ofed's udapl work for ompi" thread, my thinking is that the ompi udapl btl enablement is actually only the first step, where for production/longterm/etc you want to have an rdmacm btl. Reasoning here is made of many arguments, among them the quickest i can make are:
>
> A) it seems that ompi would want to use not only RC but rather also UD multicast and unicast, which are not covered by udapl
>
> B) there's actually no real justification to maintain two APIs (namely udapl vs libibverbs/librdmacm), so down the road, only one of them would survive (udapl is implemented ***over*** libibverbs/librdmacm so if the latter dies, so does udapl). Specifically, I hear here and there that the OFED stack is now on its way to be deployed all over the place, specifically in commercial Unix OSs (which want modern! code that supports IPoIB-CM, RDS, SRP, iSER, etc., you name it) so eventually the rdmacm btl can be used also over Solaris et al.

Agreed. Enabling udapl will get OMPI over iwarp immediately (and hopefully in ofed-1.2). Post ofed-1.2, I think OMPI _should_ create a rdma-cm btl.

That's the plan...

Steve.
Re: [OMPI devel] OMPI over ofed udapl - bugs opened
Although as Boris pointed out, perhaps the hack in OMPI is no longer needed at all...

On Wed, 2007-05-09 at 08:41 -0500, Steve Wise wrote:
> 606 opened to track the udapl change.
>
> 607 opened to track the ompi change to remove the port number stashing hack.
>
> Status: I have a patch from Arlin to test today. I will test with that patch and with the OMPI port hack removed. Stay tuned...
>
> Steve.
>
> On Tue, 2007-05-08 at 15:47 -0700, Arlin Davis wrote:
> > Steve Wise wrote:
> >
> > > I would like the group to consider including changes needed to OMPI and/or ofa udapl to get OMPI working again on udapl for ofed-1.2.
> > >
> > > This will provide OMPI support over iwarp devices via udapl until we can get rdma-cm support added to OMPI.
> > >
> > > Steve.
> >
> > Steve, can you open a bug to track this?
Re: [OMPI devel] OMPI over ofed udapl - bugs opened
FWIW, I would marginally prefer if this bug is tracked in the Open MPI trac ticket system, not the OFA bugzilla (Steve W. will have write access there as soon as Chelsio submits their OMPI 3rd party contribution agreement). We've traditionally [mostly] tracked OMPI bugs in the OMPI bug system and OFED-specific OMPI packaging problems in the OFA bugzilla.

It's a gray area, I admit. But since I'm not the uDAPL maintainer in Open MPI, moving the bug over there will allow the Right people to see it (some OMPI developers are cross subscribed to the OFA general list, but not all). For example, this udapl problem is likely related to the existing OMPI trac ticket 890 (https://svn.open-mpi.org/trac/ompi/ticket/890).

On May 9, 2007, at 10:37 AM, Steve Wise wrote:
> Although as Boris pointed out, perhaps the hack in OMPI is no longer needed at all...
>
> On Wed, 2007-05-09 at 08:41 -0500, Steve Wise wrote:
> > 606 opened to track the udapl change.
> >
> > 607 opened to track the ompi change to remove the port number stashing hack.
> >
> > Status: I have a patch from Arlin to test today. I will test with that patch and with the OMPI port hack removed. Stay tuned...
> >
> > Steve.
> >
> > On Tue, 2007-05-08 at 15:47 -0700, Arlin Davis wrote:
> > > Steve Wise wrote:
> > > > I would like the group to consider including changes needed to OMPI and/or ofa udapl to get OMPI working again on udapl for ofed-1.2.
> > > >
> > > > This will provide OMPI support over iwarp devices via udapl until we can get rdma-cm support added to OMPI.
> > > >
> > > > Steve.
> > >
> > > Steve, can you open a bug to track this?

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM
On May 9, 2007, at 10:30 AM, Steve Wise wrote:

> Agreed. enabling udapl will get OMPI over iwarp immediately (and hopefully in ofed-1.2). Post ofed-1.2, I think OMPI _should_ create a rdma-cm btl.
>
> That's the plan...

Yes and no. Please see my other reply about an "rdma cm" BTL...

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] OMPI over ofed udapl - bugs opened
I agree OMPI trac ticket #890 should cover this. I will test the suggested fix, just removing that one line from btl_udapl.c, on Solaris. I am still not set up on Linux so hopefully Steve can confirm there.

-DON

Jeff Squyres wrote:
> FWIW, I would marginally prefer if this bug is tracked in the Open MPI trac ticket system, not the OFA bugzilla (Steve W. will have write access there as soon as Chelsio submits their OMPI 3rd party contribution agreement). We've traditionally [mostly] tracked OMPI bugs in the OMPI bug system and OFED-specific OMPI packaging problems in the OFA bugzilla.
>
> It's a gray area, I admit. But since I'm not the uDAPL maintainer in Open MPI, moving the bug over there will allow the Right people to see it (some OMPI developers are cross subscribed to the OFA general list, but not all). For example, this udapl problem is likely related to the existing OMPI trac ticket 890 (https://svn.open-mpi.org/trac/ompi/ticket/890).
>
> On May 9, 2007, at 10:37 AM, Steve Wise wrote:
> > Although as Boris pointed out, perhaps the hack in OMPI is no longer needed at all...
> >
> > On Wed, 2007-05-09 at 08:41 -0500, Steve Wise wrote:
> > > 606 opened to track the udapl change.
> > >
> > > 607 opened to track the ompi change to remove the port number stashing hack.
> > >
> > > Status: I have a patch from Arlin to test today. I will test with that patch and with the OMPI port hack removed. Stay tuned...
> > >
> > > Steve.
> > >
> > > On Tue, 2007-05-08 at 15:47 -0700, Arlin Davis wrote:
> > > > Steve Wise wrote:
> > > > > I would like the group to consider including changes needed to OMPI and/or ofa udapl to get OMPI working again on udapl for ofed-1.2.
> > > > >
> > > > > This will provide OMPI support over iwarp devices via udapl until we can get rdma-cm support added to OMPI.
> > > > >
> > > > > Steve.
> > > >
> > > > Steve, can you open a bug to track this?
Re: [OMPI devel] OMPI over ofed udapl - bugs opened
On Wed, 2007-05-09 at 11:42 -0400, Donald Kerr wrote:
> I agree OMPI trac ticket #890 should cover this. I will test the suggested fix, just removing that one line from btl_udapl.c, on Solaris. I am still not set up on Linux so hopefully Steve can confirm there.

All,

First, I haven't tested Arlin's dat_ep_query() fix yet as we have determined it's not needed. The OMPI udapl btl never calls dat_ep_query()...

So running OMPI with the suggested fix (removing the overwriting of the hca_addr port field in btl_udapl.c) over ofed udapl on chelsio's iwarp rnic still doesn't work. There are two new issues so far:

1) This has uncovered a connection migration issue in the Chelsio driver/firmware. We are developing and testing a fix for this now. Should be ready tomorrow hopefully.

2) OMPI is not adhering to the iwarp protocol requirement that the ULP, in this case OMPI, initiating the iwarp connection (the side issuing the dat_ep_connect() or rdma_connect()) _MUST_ be the first to send an RDMA message. So if an OMPI process _accepts_ an rdma connection, then it cannot send on that connection until it receives some sort of rdma operation from the client process. It appears the current OMPI connection setup model doesn't enforce this. This combined with the bug above causes an immediate connection failure on chelsio's rnic. After I fix #1 above, things might get slightly better but my guess is we will still have connection setup problems if the server side sends before the client side finishes the streaming->rdma mode transition.

There has been a series of discussions on the ofa general list about this issue, and the conclusion to date is that it cannot be resolved in the rdma-cm or iwarp-cm code of the linux rdma stack, mainly because sending an RDMA message involves the ULP's work queue and completion queue, so the CM cannot do this under the covers in a manner that doesn't affect the application. Thus, the applications must deal with this.

Here is a possible solution:

I assume in OMPI that connections are only initiated when the mpi application does a send operation. Given that, the udapl btl must ensure that if a given rank accepts a connection, it cannot send anything until the rank at the other end of the connection sends first. Since the other side initiated the connection, it will have pending data to send...

I haven't looked into how painful this will be to implement.

Thoughts?

FYI: IETF Draft requiring this behavior:

http://www.ietf.org/internet-drafts/draft-ietf-rddp-mpa-08.txt

See section 7 for specifics.

Steve.
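
[Editorial sketch] The ordering rule in the message above can be written as a check over the events seen on the accepting side of one connection. This is a minimal illustration only: the event names and the `first_early_send` helper are hypothetical and are not part of Open MPI, uDAPL, or the verbs API.

```python
from typing import Optional, Sequence


def first_early_send(acceptor_events: Sequence[str]) -> Optional[int]:
    """Return the index of the first rule-breaking event, or None.

    `acceptor_events` is the ordered list of "send" / "recv" events on the
    side that *accepted* the connection. The rule modelled here: that side
    may not send until it has received something from the initiator.
    """
    has_received = False
    for index, event in enumerate(acceptor_events):
        if event == "recv":
            has_received = True
        elif event == "send" and not has_received:
            return index
    return None
```

For example, `first_early_send(["send", "recv"])` flags index 0, while `first_early_send(["recv", "send"])` reports no violation.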
[OMPI devel] Nightly trunk tarball AC/AM change
Hi all -

After a minor hiccup last night, nightly tarballs for the trunk (and eventually v1.3 branch) are now made with AC 2.61, AM 1.10, and LT 2.1a. Don't forget the mandatory update of AC and AM for the trunk coming Saturday morning!

Brian
Re: [OMPI devel] OMPI over ofed udapl - bugs opened
Steve Wise wrote:
> There have been a series of discussions on the ofa general list about this issue, and the conclusion to date is that it cannot be resolved in the rdma-cm or iwarp-cm code of the linux rdma stack. Mainly because sending an RDMA message involves the ULP's work queue and completion queue, so the CM cannot do this under the covers in a manner that doesn't affect the application. Thus, the applications must deal with this.

Why can't uDAPL deal with this? As a uDAPL user, I really don't care what API uDAPL is using under the hood to move data from one place to another, nor the quirks of that API. The whole point of uDAPL is to form a network-agnostic abstraction layer. AFAIK, the uDAPL spec doesn't enforce any such requirement on RDMA communication either. In my opinion, exposing such behavior above uDAPL is incorrect and is part of why uDAPL has seen limited adoption -- every single uDAPL implementation behaves in different ways, making it extremely difficult to write an application to work on any uDAPL implementation. Sorry if this sounds harsh, but this comes from many hours of banging my head on the wall due to working around these sorts of problems :)

> Here is a possible solution:
>
> I assume in OMPI that connections are only initiated when the mpi application does a send operation. Given that, then udapl btl must ensure that if a given rank accepts a connection, it cannot send anything until the rank at the other end of the connection sends first. Since the other side initiated the connection, it will have pending data to send...
>
> I haven't looked into how painful this will be to implement.
>
> Thoughts?

Following on what I wrote above, I think Open MPI is the wrong place to be dealing with this. There are enough of these hacks as it is; I'm not interested in seeing more get added.

Andrew
Re: [OMPI devel] OMPI over ofed udapl - bugs opened
I am missing some context here. Where are you plugging iwarp and OMPI together?

Steve Wise wrote:
> On Wed, 2007-05-09 at 11:42 -0400, Donald Kerr wrote:
> > I agree OMPI trac ticket #890 should cover this. I will test the suggested fix, just removing that one line from btl_udapl.c, on Solaris. I am still not set up on Linux so hopefully Steve can confirm there.
>
> All,
>
> First, I haven't tested Arlin's dat_ep_query() fix yet as we have determined it's not needed. The OMPI udapl btl never calls dat_ep_query()...
>
> So running OMPI with the suggested fix (removing the overwriting of the hca_addr port field in btl_udapl.c) over ofed udapl on chelsio's iwarp rnic still doesn't work. There are two new issues so far:
>
> 1) This has uncovered a connection migration issue in the Chelsio driver/firmware. We are developing and testing a fix for this now. Should be ready tomorrow hopefully.
>
> 2) OMPI is not adhering to the iwarp protocol requirement that the ULP, in this case OMPI, initiating the iwarp connection (the side issuing the dat_ep_connect() or rdma_connect()) _MUST_ be the first to send an RDMA message. So if an OMPI process _accepts_ an rdma connection, then it cannot send on that connection until it receives some sort of rdma operation from the client process. It appears the current OMPI connection setup model doesn't enforce this. This combined with the bug above causes an immediate connection failure on chelsio's rnic. After I fix #1 above, things might get slightly better but my guess is we will still have connection setup problems if the server side sends before the client side finishes the streaming->rdma mode transition.
>
> There have been a series of discussions on the ofa general list about this issue, and the conclusion to date is that it cannot be resolved in the rdma-cm or iwarp-cm code of the linux rdma stack. Mainly because sending an RDMA message involves the ULP's work queue and completion queue, so the CM cannot do this under the covers in a manner that doesn't affect the application. Thus, the applications must deal with this.
>
> Here is a possible solution:
>
> I assume in OMPI that connections are only initiated when the mpi application does a send operation. Given that, then udapl btl must ensure that if a given rank accepts a connection, it cannot send anything until the rank at the other end of the connection sends first. Since the other side initiated the connection, it will have pending data to send...
>
> I haven't looked into how painful this will be to implement.
>
> Thoughts?
>
> FYI: IETF Draft requiring this behavior:
>
> http://www.ietf.org/internet-drafts/draft-ietf-rddp-mpa-08.txt
>
> See section 7 for specifics.
>
> Steve.
Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
On Wed, 2007-05-09 at 16:20 -0400, Donald Kerr wrote:
> I missing some context here. Where are you plugging iwarp and OMPI together?

ofed-1.2 supports iwarp and the chelsio rnic. It can be accessed directly via the ofa verbs and ofa rdma-cm _as well as_ via udapl. I'm attempting to run OMPI over udapl over chelsio's rnic.

Steve.
Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
So then I agree with Andrew, I think you are trying to impose restrictions on uDAPL which are not part of the Spec.

-DON

Steve Wise wrote:
> On Wed, 2007-05-09 at 16:20 -0400, Donald Kerr wrote:
> > I missing some context here. Where are you plugging iwarp and OMPI together?
>
> ofed-1.2 supports iwarp and the chelsio rnic. It can be accessed directly via the ofa verbs and ofa rdma-cm _as well as_ via udapl. I'm attempting to run OMPI over udapl over chelsio's rnic.
>
> Steve.
Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
On Wed, 2007-05-09 at 16:27 -0400, Donald Kerr wrote:
> So then I agree with Andrew, I think you are trying to impose restrictions on uDAPL which are not part of the Spec.

True, but if you want a single btl for IB and IW, then you'll need to address this issue in some way...
Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
I guess I have not read enough about iwarp yet, but if iwarp is sitting below ib verbs or udapl in the stack and is trying to impose restrictions which ib verbs or udapl do not adhere to, then maybe iwarp is in the wrong place in the ofed stack. Having said that, I do agree the OMPI community needs to consider where iwarp plays in its own stack, if it has not already.

Steve Wise wrote:
> On Wed, 2007-05-09 at 16:27 -0400, Donald Kerr wrote:
> > So then I agree with Andrew, I think you are trying to impose restrictions on uDAPL which are not part of the Spec.
>
> true, but if you want a single btl for IB and IW, then you'll need to address this issue in some way...
Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
> 2) OMPI is not adhering to the iwarp protocol requirement that the ULP, in this case OMPI, initiating the iwarp connection (the side issuing the dat_ep_connect() or rdma_connect()) _MUST_ be the first to send an RDMA message. So if a OMPI process _accepts_ an rdma connection, then it cannot send on that connection until it receives some sort of rdma operation from the client process. It appears the current OMPI connection setup model doesn't enforce this.

This is actually an MPA requirement, and according to *protocol* specs having the active side send a zero length RDMA Write should be able to fix the problem. However there is language in the RDMAC verbs that clearly implies that the active side must Send something, and that an RDMA Write is insufficient.

Therefore, the only truly safe thing for an iWARP btl to do (or a udapl btl since that is also an iWARP btl) is to have the active layer send an MPI Layer "nop" of some kind immediately after establishing the connection if there is nothing else to send.
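
[Editorial sketch] The "nop if there is nothing else to send" rule above can be expressed as a small decision function. This is only a sketch under stated assumptions: `NOP` and `fragments_after_connect` are hypothetical names, and a real BTL would post work requests rather than return a list.

```python
from typing import List, Sequence

# Hypothetical placeholder for an MPI-layer no-op fragment.
NOP = "NOP"


def fragments_after_connect(is_active_side: bool,
                            pending: Sequence[str]) -> List[str]:
    """Return what one side should transmit right after connection setup."""
    if not is_active_side:
        # The passive side transmits nothing until the peer has sent first.
        return []
    if pending:
        # Real traffic is already queued, so it serves as the first message.
        return list(pending)
    # Nothing queued: the active side still has to send something.
    return [NOP]
```

As noted elsewhere in the thread, Open MPI only connects on the first send, so the `[NOP]` branch may rarely be reached; the function makes the fallback explicit rather than relying on that.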
Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
I talked with Steve a bunch on the phone about this.

1. This "connector must RDMA first" issue is an iWARP restriction -- it's not specific to udapl or verbs. For example, if you try to use udapl with iWARP on Solaris, you'll have the same issue (I have no idea whether you have iWARP drivers in Solaris or not).

2. Per his prior e-mail (which I didn't fully grok until I talked to him), using the RDMA CM in the openib BTL will not magically fix this issue for us.

3. So for any of the BTLs to support iWARP -- regardless of underlying protocol or OS -- they are going to have to obey this restriction.

4. Luckily, in iWARP, the restriction can be met by either send/receive semantics *or* RDMA semantics. You don't have to specifically use RDMA verbs semantics, for example. This is good because of the way that OMPI works (the first fragment that will be transmitted is pretty much guaranteed to be a send/receive fragment, not an RDMA fragment) -- it makes the logistics slightly simpler.

Galen Shipman and I talked about this a bit and suggest the following:

- During the connection dance (probably for both the udapl and openib BTLs), whichever peer ends up being the connection initiator can send its pending fragment immediately. Steve is right that there will always be a pending fragment, because OMPI doesn't make a connection until the first send. Don't forget about the race condition where 2 peers may simultaneously decide to initiate -- this case is handled properly in the OMPI code; just make sure you modify the side that ends up being the actual initiator.

- The other peer (the receiver of the connection) must wait to send its pending fragment(s) until it receives the first frag from the connection initiator. This can be accomplished either with another flag on the OMPI module struct or perhaps by making it part of the connection protocol (i.e., don't transition the endpoint to be CONNECTED until the first fragment is received). Either approach can be used to queue up fragments on the receiver until the first fragment is received from the initiator. I'd have to look in the code deeper, but I'm *guessing* that it might be best to use the already-existing state flag (i.e., checking for CONNECTED) because then you won't be introducing any more conditionals in the critical path.

On May 9, 2007, at 4:45 PM, Donald Kerr wrote:
> I guess I have not read enough about iwarp yet but if iwarp is sitting below ib verbs or udapl in the stack and is trying to impose restrictions which ib verbs or udapl do not adhere to then maybe iwarp is in the wrong place in the ofed stack. Having said that I do agree the OMPI community needs to consider where iwarp plays in its own stack. If it has not already.
>
> Steve Wise wrote:
> > On Wed, 2007-05-09 at 16:27 -0400, Donald Kerr wrote:
> > > So then I agree with Andrew, I think you are trying to impose restrictions on uDAPL which are not part of the Spec.
> >
> > true, but if you want a single btl for IB and IW, then you'll need to address this issue in some way...

--
Jeff Squyres
Cisco Systems
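
[Editorial sketch] The second option in the message above (do not mark the endpoint CONNECTED until the first fragment from the initiator arrives, and reuse that state check to queue fragments) might look roughly like this. All class, method, and state names are hypothetical and do not come from the Open MPI source; the `wire` list stands in for the network.

```python
from collections import deque
from enum import Enum, auto


class State(Enum):
    CONNECTING = auto()
    CONNECTED = auto()


class Endpoint:
    """Toy endpoint: one state check gates both setup and the send path."""

    def __init__(self, is_initiator: bool, wire: list):
        self.is_initiator = is_initiator
        self.wire = wire                 # stand-in for the network
        self.state = State.CONNECTING
        self.pending = deque()           # fragments queued before CONNECTED

    def send(self, frag) -> None:
        # The only conditional on the send path is the existing state check.
        if self.state is State.CONNECTED:
            self.wire.append(frag)
        else:
            self.pending.append(frag)

    def on_transport_established(self) -> None:
        # Only the initiator may start sending as soon as the link is up.
        if self.is_initiator:
            self._mark_connected()

    def on_first_fragment_received(self) -> None:
        # The accepting side becomes CONNECTED on the peer's first fragment.
        if self.state is not State.CONNECTED:
            self._mark_connected()

    def _mark_connected(self) -> None:
        self.state = State.CONNECTED
        while self.pending:
            self.wire.append(self.pending.popleft())
```

The sketch does not model the simultaneous-initiate race mentioned above; it assumes the roles have already been resolved by the time `is_initiator` is set.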
Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
Jeff Squyres wrote:
> - The other peer (the receiver of the connection) must wait to send its pending fragment(s) until it receives the first frag from the connection initiator. This can be accomplished either with another flag on the OMPI module struct or perhaps making it part of the connection protocol (i.e., don't transition the endpoint to be CONNECTED until the first fragment is received). Either of which can be used to queue up fragments on the receiver until the first fragment is received from the initiator. I'd have to look in the code deeper, but I'm *guessing* that it might be best to use the already-existing state flag (i.e., checking for CONNECTED) because then you won't be introducing any more conditionals in the critical path.

The transport provider has several options for ensuring that the passive side does not put a message on the wire before the first message is received.

What the transport layer cannot do is create the first message from the active side. Because it will have send/recv semantics it will complete a receive work request, which the application layer has to post with that expectation. This nop does not have to be visible above OMPI, but I'm pretty sure OMPI has to generate it.

That isn't exactly fair to the application layer, but the RDMAC verbs are water under the bridge. Assuming OMPI wants to work with *any* iWARP RNIC then it needs to ensure that the active side will send something promptly in all cases.
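
[Editorial sketch] The point above about the nop completing a receive work request can be illustrated with a toy receive queue. Everything here is hypothetical (`NOP`, `ReceiveQueue`, and the `RuntimeError` are illustration only); what a real RNIC does when no receive is posted depends on the transport.

```python
# Hypothetical marker for a BTL-internal no-op message.
NOP = object()


class ReceiveQueue:
    """Toy model of one side's posted receive buffers."""

    def __init__(self, posted: int):
        self.posted = posted      # receive buffers currently posted
        self.delivered = []       # payloads handed to the layer above

    def on_incoming_send(self, payload) -> None:
        # Every incoming send consumes one posted receive, the NOP included.
        if self.posted == 0:
            raise RuntimeError("incoming send with no receive posted")
        self.posted -= 1
        if payload is not NOP:
            # Only real fragments are passed up; the NOP is swallowed here.
            self.delivered.append(payload)
```

With two receives posted, a NOP followed by one data fragment leaves none posted and delivers only the data fragment; with a single receive posted, the data fragment arriving after the NOP finds no buffer. That is why the receiving layer has to post with the extra message in mind.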
Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
Understood, and I agree.

FWIW: note that the CONNECTED state that I referred to is internal to OMPI's endpoint abstraction (not an iwarp/udapl/verbs/etc. state). It's part of our connection dance protocol.

On May 9, 2007, at 5:33 PM, Caitlin Bestler wrote:
> Jeff Squyres wrote:
> > - The other peer (the receiver of the connection) must wait to send its pending fragment(s) until it receives the first frag from the connection initiator. This can be accomplished either with another flag on the OMPI module struct or perhaps making it part of the connection protocol (i.e., don't transition the endpoint to be CONNECTED until the first fragment is received). Either of which can be used to queue up fragments on the receiver until the first fragment is received from the initiator. I'd have to look in the code deeper, but I'm *guessing* that it might be best to use the already-existing state flag (i.e., checking for CONNECTED) because then you won't be introducing any more conditionals in the critical path.
>
> The transport provider has several options on ensuring that the passive side does not put a message on the wire before the first message is received.
>
> What the transport layer cannot do is create the first message from the active side. Because it will have send/recv semantics it will complete a receive work request, which the application layer has to post with that expectation. this nop does not have to be visible above OMPI, but I'm pretty sure OMPI has to generate it.
>
> That isn't exactly fair to the application layer, but the RDMAC verbs are water under the bridge. Assuming OMPI wants to work with *any* iWARP RNIC then it needs to ensure that the active side will send something promptly in all cases.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] OMPI over ofed udapl - bugs opened
On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote:
> Steve Wise wrote:
> > There have been a series of discussions on the ofa general list about this issue, and the conclusion to date is that it cannot be resolved in the rdma-cm or iwarp-cm code of the linux rdma stack. Mainly because sending an RDMA message involves the ULP's work queue and completion queue, so the CM cannot do this under the covers in a manner that doesn't affect the application. Thus, the applications must deal with this.
>
> Why can't uDAPL deal with this? As a uDAPL user, I really don't care what API uDAPL is using under the hood to move data from one place to another, nor the quirks of that API. The whole point of uDAPL is to form a network-agnostic abstraction layer. AFAIK, the uDAPL spec doesn't enforce any such requirement on RDMA communication either. In my opinion, exposing such behavior above uDAPL is incorrect and is part of why uDAPL has seen limited adoption -- every single uDAPL implementation behaves in different ways, making it extremely difficult to write an application to work on any uDAPL implementation. Sorry if this sounds harsh, but this comes from many hours of banging my head on the wall due to working around these sorts of problems :)

I understand your frustration. I think the MPA protocol is deficient in this respect and should have required the necessary "first FPDU" to be sent under the covers by the RNICs. An RTR packet, if you will. To resolve this issue "properly", in my opinion, would involve changing the IETF MPA spec and also breaking all the existing iwarp HW. We can't do that.

The reason it is hard or impossible to solve this in the DAPL layer is that any rdma operation on the QP affects the state of that QP and the associated CQs. In addition, if you use an RDMA send to enforce this you impact the other side by consuming a RECV buffer. So it's hard if not impossible to do this under the covers without affecting the application's resources.

Also, the DAPL specification had a goal to not impose any additional protocol on the wire. If you add this under the covers, then you add such a "protocol" and break interoperability between a connection accessed via DAPL on one end and some other API on the other end.

> > Here is a possible solution:
> >
> > I assume in OMPI that connections are only initiated when the mpi application does a send operation. Given that, then udapl btl must ensure that if a given rank accepts a connection, it cannot send anything until the rank at the other end of the connection sends first. Since the other side initiated the connection, it will have pending data to send...
> >
> > I haven't looked into how painful this will be to implement.
> >
> > Thoughts?
>
> Following on what I wrote above, I think Open MPI is the wrong place to be dealing with this. There's enough of these hacks as it is; I'm not interested in seeing more get added.

Unfortunately, I haven't been able to come up with a solution that works with existing iWARP HW and is interoperable.

Steve.
Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
> Therefore, the only truly safe thing for an iWARP btl to do (or a udapl btl since that is also an iWARP btl) is to have the active layer send an MPI Layer "nop" of some kind immediately after establishing the connection if there is nothing else to send.

This is fine for an iWARP/RDMACM/whatever BTL (or anything else that uses the OFA verbs interface(s)), but my argument is that uDAPL is NOT specifically there to support just iWARP (though it may include it), and that OFED's uDAPL should be adjusted to handle this. Again, uDAPL is a network *independent* abstraction, so requiring network-dependent behavior from the uDAPL consumer is wrong.

A related question -- how does this 'connection initiator must send first' requirement relate to UD?

Andrew
Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
general-boun...@lists.openfabrics.org wrote: >> Therefore, the only truly safe thing for an iWARP btl to do (or a >> udapl btl since that is also an iWARP btl) is to have the active >> layer send an MPI Layer "nop" of some kind immediately after >> establishing the connection if there is nothing else to send. > > This is fine for an iWARP/RDMACM/whatever BTL (or anything > else that uses the OFA verbs interface(s)), but my argument > is that uDAPL is NOT specifically there to support just iWARP > (though it may include it), and that OFED's uDAPL should be > adjusted to handle this. Again, uDAPL is a network > *independent* abstraction, so requiring network-dependent > behavior from the uDAPL consumer is wrong. > DAPL strives to define network independent solutions. In this case the network independent solution is that the active side *always* sends the first message. This works for both iWARP and InfiniBand. And away from the HPC market it is almost a non-requirement (which is why the RDMAC managed to goof on this in its specification. A zero-length RDMA Write is enough to deal with the wire protocol problem, but people implemented to the RDMAC verbs.) > > A related question -- how does this 'connection initiator > must send first' requirement relate to UD? > iWARP UD is called UDP. It has nothing to do with MPA or RDMA. An API that mapped to either IB UD or UDP is definitely feasible, but hasn't been important enough to anyone to draft as of yet.
Re: [OMPI devel] OMPI over ofed udapl - bugs opened
Steve Wise wrote: On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote: Steve Wise wrote: There have been a series of discussions on the ofa general list about this issue, and the conclusion to date is that it cannot be resolved in the rdma-cm or iwarp-cm code of the linux rdma stack. Mainly because sending an RDMA message involves the ULP's work queue and completion queue, so the CM cannot do this under the covers in a manner that doesn't affect the application. Thus, the applications must deal with this. Why can't uDAPL deal with this? As a uDAPL user, I really don't care what API uDAPL is using under the hood to move data from one place to another, nor the quirks of that API. The whole point of uDAPL is to form a network-agnostic abstraction layer. AFAIK, the uDAPL spec doesn't enforce any such requirement on RDMA communication either. In my opinion, exposing such behavior above uDAPL is incorrect and is part of why uDAPL has seen limited adoption -- every single uDAPL implementation behaves in different ways, making it extremely difficult to write an application to work on any uDAPL implementation. Sorry if this sounds harsh, but this comes from many hours of banging my head on the wall due to working around these sorts of problems :) I understand your frustration. I think the MPA protocol is deficient in this respect and should have required the necessary "first FPDU" to be sent under the covers by the RNICs. An RTR packet if you will. To resolve this issue "properly", in my opinion, would involve changing the IETF MPA spec and also breaking all the existing iwarp HW. We can't do that. Understood. The reason it is hard or impossible to solve this in the DAPL layer is that any rdma operation on the QP affects the state of that QP and the associated CQs. In addition, if you use an RDMA send to enforce this you impact the other side by consuming a RECV buffer. 
So it's hard if not impossible to do this under the covers without affecting the application's resources. Is there no way to do this before passing connection established events to the uDAPL consumer? I need to go read up on the uDAPL API to really understand why this wouldn't work. Also, the DAPL specification had a goal to not impose any additional protocol on the wire. If you add this under the covers, then you add such a "protocol" and break interoperability between a connection accessed via DAPL on one end and some other API on the other end. So I guess there's no 'right' solution, at least at the uDAPL level. With RDMACM/OFA verbs, there's at least the argument that you can design the API/semantics however you please, while uDAPL is already standardized. I hope you guys are documenting this in a way that makes this issue extremely clear to both uDAPL and OFA verbs (is this the right naming?) users. Maybe it's been done already, but is it possible to emit some sort of loud warning/error when the accept()'ing side tries to send before a receive? Andrew
Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
On Wed, 2007-05-09 at 17:46 -0700, Andrew Friedley wrote: > > Therefore, the only truly safe thing for an iWARP btl to do (or a > > udapl btl since that is also an iWARP btl) is to have the active > > layer send an MPI Layer "nop" of some kind immediately after > > establishing the connection if there is nothing else to send. > > This is fine for an iWARP/RDMACM/whatever BTL (or anything else that > uses the OFA verbs interface(s)), but my argument is that uDAPL is NOT > specifically there to support just iWARP (though it may include it), and > that OFED's uDAPL should be adjusted to handle this. Again, uDAPL is a > network *independent* abstraction, so requiring network-dependent > behavior from the uDAPL consumer is wrong. > > A related question -- how does this 'connection initiator must send > first' requirement relate to UD? > It doesn't. UD isn't supported in IWARP.
Re: [OMPI devel] OMPI over ofed udapl - bugs opened
devel-boun...@open-mpi.org wrote: > Steve Wise wrote: >> There have been a series of discussions on the ofa general list about >> this issue, and the conclusion to date is that it cannot be resolved >> in the rdma-cm or iwarp-cm code of the linux rdma stack. Mainly >> because sending an RDMA message involves the ULP's work queue and >> completion queue, so the CM cannot do this under the covers in a >> manner that doesn't affect the application. Thus, the applications >> must deal with this. > > Why can't uDAPL deal with this? As a uDAPL user, I really > don't care what API uDAPL is using under the hood to move > data from one place to another, nor the quirks of that API. > The whole point of uDAPL is to form a network-agnostic > abstraction layer. AFAIK, the uDAPL spec doesn't enforce any > such requirement on RDMA communication either. In my > opinion, exposing such behavior above uDAPL is incorrect and > is part of why uDAPL has seen limited adoption -- every > single uDAPL implementation behaves in different ways, making > it extremely difficult to write an application to work on any > uDAPL implementation. Sorry if this sounds harsh, but this > comes from many hours of banging my head on the wall due to > working around these sorts of problems :) > The simple answer is that uDAPL cannot deal with this. The RDMAC verbs specification was overly focused on client/server and therefore did not realize that there was any harm in requiring that the active side did the first send. But given that DAPL could not rewrite either the RDMAC or InfiniBand verbs it had to come up with the best solution that matched the verbs as they were. One of the explicit ground rules was that DAPL MUST support all RDMA devices that were IBTA or RDMAC compliant. Given those rules, if the active side does not send a message the passive side might be held off indefinitely, and sending a message causes consumption of a receive buffer and therefore cannot be transparent to the uDAPL consumer. 
Given those constraints there is literally nothing that can be done to work around this problem by either DAPL or OFA.
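The receive-buffer argument above can be made concrete with a small Python sketch. This is only a toy model of a FIFO receive queue; `ReceiveQueue`, `post_recv`, and `on_incoming_send` are hypothetical names and not part of any verbs or DAPL API. It assumes, as the thread describes, that every incoming SEND consumes the receive buffer at the head of the queue, so a "hidden" first message ends up in a buffer the consumer posted for its own traffic.

```python
from collections import deque


class ReceiveQueue:
    """Toy FIFO receive queue: each incoming SEND consumes the head buffer."""

    def __init__(self):
        self._buffers = deque()

    def post_recv(self, label: str) -> None:
        # In this model, buffers can only be appended at the tail.
        self._buffers.append(label)

    def on_incoming_send(self, payload: str) -> tuple:
        if not self._buffers:
            raise RuntimeError("no receive buffer posted")
        # The oldest posted buffer is the one that gets consumed.
        return (self._buffers.popleft(), payload)


rq = ReceiveQueue()
rq.post_recv("app-buf-0")    # posted by the consumer for its own messages
rq.post_recv("app-buf-1")

# A first message injected "under the covers" takes the application's buffer,
# so the first real message lands one buffer later than the consumer expects.
hidden = rq.on_incoming_send("hidden-nop")
first_real = rq.on_incoming_send("app-msg-0")
print(hidden)        # ('app-buf-0', 'hidden-nop')
print(first_real)    # ('app-buf-1', 'app-msg-0')
```

The consumer would also see a completion for `app-buf-0` that it never expected, which is why the exchange cannot be made invisible in this model.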
Re: [OMPI devel] OMPI over ofed udapl - bugs opened
On Wed, 2007-05-09 at 17:55 -0700, Andrew Friedley wrote: > > Steve Wise wrote: > > On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote: > >> Steve Wise wrote: > >>> There have been a series of discussions on the ofa general list about > >>> this issue, and the conclusion to date is that it cannot be resolved in > >>> the rdma-cm or iwarp-cm code of the linux rdma stack. Mainly because > >>> sending an RDMA message involves the ULP's work queue and completion > >>> queue, so the CM cannot do this under the covers in a manner that > >>> doesn't affect the application. Thus, the applications must deal with > >>> this. > >> Why can't uDAPL deal with this? As a uDAPL user, I really don't care > >> what API uDAPL is using under the hood to move data from one place to > >> another, nor the quirks of that API. The whole point of uDAPL is to > >> form a network-agnostic abstraction layer. AFAIK, the uDAPL spec > >> doesn't enforce any such requirement on RDMA communication either. In > >> my opinion, exposing such behavior above uDAPL is incorrect and is part > >> of why uDAPL has seen limited adoption -- every single uDAPL > >> implementation behaves in different ways, making it extremely difficult > >> to write an application to work on any uDAPL implementation. Sorry if > >> this sounds harsh, but this comes from many hours of banging my head on > >> the wall due to working around these sorts of problems :) > >> > > > > I understand your frustration. I think the MPA protocol is deficient in > > this respect and should have required the necessary "first FPDU" to be > > sent under the covers by the RNICs. An RTR packet if you will. To > > resolve this issue "properly", in my opinion, would involve changing the > > IETF MPA spec and also breaking all the existing iwarp HW. We can't do > > that. > > Understood. 
> > > The reason it is hard or impossible to solve this in the DAPL layer is > > that any rdma operation on the QP affects the state of that QP and the > > associated CQs. In addition, if you use an RDMA send to enforce this you > > impact the other side by consuming a RECV buffer. So it's hard if not > > impossible to do this under the covers without affecting the > > application's resources. > > Is there no way to do this before passing connection established events > to the uDAPL consumer? I need to go read up on the uDAPL API to really > understand why this wouldn't work. > Perhaps the dapl or maybe even an OFA iWARP CM could defer passing up the "established" event on the passive side until an incoming SEND is detected. I know we've discussed this before, but I'm not sure why this was not a workable solution. Perhaps Caitlin or some iwarp folks can recall? > > > > Also, the DAPL specification had a goal to not impose any additional > > protocol on the wire. If you add this under the covers, then you add > > such a "protocol" and break interoperability between a connection > > accessed via DAPL on one end and some other API on the other end. > > So I guess there's no 'right' solution, at least at the uDAPL level. > With RDMACM/OFA verbs, there's at least the argument that you can design > the API/semantics however you please, while uDAPL is already standardized. Yes, but it's still difficult to post a SEND under the covers because it consumes the application resources in the form of QP and CQ space and a RECV buffer. So to date, we have...punted and pushed the problem to the ULP. > > I hope you guys are documenting this in a way that makes this issue > extremely clear to both uDAPL and OFA verbs (is this the right naming?) > users. Maybe it's been done already, but is it possible to emit some > sort of loud warning/error when the accept()'ing side tries to send > before a receive? > The connection comes tumbling down. How's that for loud? 
:) Seriously though, it isn't documented well enough. But we're bleeding edge here. And I'm still hoping somebody will come up with an elegant solution that doesn't break interoperability, applications and/or iwarp hw (i'm a dreamer :). Steve.
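The idea floated in this message, deferring the "established" event on the passive side until an incoming SEND is detected, might look roughly like the following Python sketch. It is a toy event model under stated assumptions: `PassiveConnection` and its method names are hypothetical and do not correspond to the DAPL or RDMA CM APIs.

```python
class PassiveConnection:
    """Toy model: hold back 'established' until the first SEND arrives."""

    def __init__(self):
        self.events = []                # events delivered to the consumer, in order
        self._transport_up = False
        self._first_send_seen = False

    def _maybe_notify(self):
        ready = self._transport_up and self._first_send_seen
        if ready and "ESTABLISHED" not in self.events:
            self.events.append("ESTABLISHED")

    def on_transport_established(self):
        # The wire-level connection is up, but the consumer is not told yet.
        self._transport_up = True
        self._maybe_notify()

    def on_incoming_send(self, payload):
        # The first message from the active side releases the deferred event.
        self._first_send_seen = True
        self._maybe_notify()
        self.events.append(f"RECV:{payload}")

    def can_post_send(self):
        return "ESTABLISHED" in self.events


conn = PassiveConnection()
conn.on_transport_established()
print(conn.events, conn.can_post_send())    # [] False
conn.on_incoming_send("nop")
print(conn.events, conn.can_post_send())    # ['ESTABLISHED', 'RECV:nop'] True
```

Note that this only protects the passive side from sending too early. It still relies on the active side actually sending something, and that message still consumes one of the consumer's receive buffers.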
Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
On Wed, 2007-05-09 at 15:01 -0700, Sean Hefty wrote: > > The reason it is hard or impossible to solve this in the DAPL layer is > > that any rdma operation on the QP affects the state of that QP and the > > associated CQs. In addition, if you use an RDMA send to enforce this you > > impact the other side by consuming a RECV buffer. So it's hard if not > > impossible to do this under the covers without affecting the > > application's resources. > > I agree that this is hard, but I don't believe that it's impossible. > > > Also, the DAPL specification had a goal to not impose any additional > > protocol on the wire. If you add this under the covers, then you add > > such a "protocol" and break interoperability between a connection > > accessed via DAPL on one end and some other API on the other end. > > IMO, this is an unrealized dream. DAPL does generate wire protocol. For > example, when running over IB, DAPL's selection of a service ID and CM > protocol > is visible on the wire. A DAPL that establishes connections using the RDMA > CM > will likely have a different wire protocol than a version of DAPL that > establishes connections talking directly to the IB CM. The two DAPLs will > not > interoperate unless they agree on how they will map to service IDs and, in > the > case of using the RDMA CM, the format of the private data carried in the CM > messages. I wasn't aware of this. > > Even in the case of iWarp, DAPL's selection of a local port number affects > the > data visible on the wire. To communicate, a remote end point must know how > this > mapping occurs. You mean the local port on the active side? The remote end point doesn't need to know this at all... Steve.
Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened
general-boun...@lists.openfabrics.org wrote: > On Wed, 2007-05-09 at 17:55 -0700, Andrew Friedley wrote: >> >> Steve Wise wrote: >>> On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote: Steve Wise wrote: > There have been a series of discussions on the ofa general list > about this issue, and the conclusion to date is that it cannot be > resolved in the rdma-cm or iwarp-cm code of the linux rdma stack. > Mainly because sending an RDMA message involves the ULP's work > queue and completion queue, so the CM cannot do this under the > covers in a manner that doesn't affect the application. > Thus, the > applications must deal with this. Why can't uDAPL deal with this? As a uDAPL user, I really don't care what API uDAPL is using under the hood to move data from one place to another, nor the quirks of that API. The whole point of uDAPL is to form a network-agnostic abstraction layer. AFAIK, the uDAPL spec doesn't enforce any such requirement on RDMA communication either. In my opinion, exposing such behavior above uDAPL is incorrect and is part of why uDAPL has seen limited adoption -- every single uDAPL implementation behaves in different ways, making it extremely difficult to write an application to work on any uDAPL implementation. Sorry if this sounds harsh, but this comes from many hours of banging my head on the wall due to working around these sorts of problems :) >>> >>> I understand your frustration. I think the MPA protocol is >>> deficient in this respect and should have required the necessary >>> "first FPDU" to be sent under the covers by the RNICs. An RTR packet >>> if you will. To resolve this issue "properly", in my opinion, would >>> involve changing the IETF MPA spec and also breaking all the >>> existing iwarp HW. We can't do that. >> >> Understood. >> >>> The reason it is hard or impossible to solve this in the DAPL layer >>> is that any rdma operation on the QP affects the state of that QP >>> and the associated CQs. 
In addition, if you use an RDMA send to >>> enforce this you impact the other side by consuming a RECV buffer. >>> So it's hard if not impossible to do this under the covers without >>> affecting the application's resources. >> >> Is there no way to do this before passing connection established >> events to the uDAPL consumer? I need to go read up on the uDAPL API >> to really understand why this wouldn't work. >> > > Perhaps the dapl or maybe even an OFA iWARP CM could defer > passing up the "established" event on the passive side until > an incoming SEND is detected. I know we've discussed this > before, but I'm not sure why this was not a workable > solution. Perhaps Caitlin or some iwarp folks can recall? > That was what the RNIC-PI flag would have enabled. DAPL could check for that flag in a transport/device independent way, and delay the established event until it was safe to post (but no longer than required; for IB and iWARP NICs that fenced the first transmit, the Established Event could be generated immediately). So yes, the transport layer (OFA or DAPL) CAN hide this on the passive side. But as you point out, that doesn't solve the problem of needing the Send from the active side. Since the Consumer posts RECV buffers *before* indicating whether the QP/EP will be used on the passive or active end, and there are no standard verbs to jam a receive buffer to the head of an RQ, there is no way to hide a send/recv exchange from the application layer. The fact that it can't be made transparent on the active side certainly diminishes the value of making it transparent on the receive side. It's still a good idea, but I don't think it has percolated to the top of anyone's TODO list yet. When it does, the RNIC-PI proposed flag is a simple capability flag that is quite easy for any provider to statically set.
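To illustrate how a statically set capability flag could drive the decision described above, here is a short Python sketch. The actual name and encoding of the proposed RNIC-PI flag are not given in this thread, so `passive_must_wait_for_first_message`, `Device`, and `should_defer_established` are hypothetical stand-ins, and the device names are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Toy provider description with one statically set capability flag."""
    name: str
    # Hypothetical flag: True if the passive side must not transmit
    # before it has received the active side's first message.
    passive_must_wait_for_first_message: bool


def should_defer_established(device: Device, is_passive: bool) -> bool:
    # Only the passive side ever needs to wait, and only on devices
    # that do not fence the first transmit themselves.
    return is_passive and device.passive_must_wait_for_first_message


devices = [
    Device("ib-hca", False),               # InfiniBand: no restriction
    Device("iwarp-rnic", True),            # iWARP RNIC without fencing
    Device("iwarp-rnic-fenced", False),    # iWARP NIC that fences the first transmit
]
for dev in devices:
    print(dev.name, should_defer_established(dev, is_passive=True))
```

The point of the flag is that the check stays transport independent: the layer above reads one boolean and never needs to know whether it is running over IB or iWARP.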
Re: [OMPI devel] OMPI over ofed udapl - bugs opened
Steve Wise wrote: I hope you guys are documenting this in a way that makes this issue extremely clear to both uDAPL and OFA verbs (is this the right naming?) users. Maybe it's been done already, but is it possible to emit some sort of loud warning/error when the accept()'ing side tries to send before a receive? The connection comes tumbling down. How's that for loud? :) works :) Seriously though, it isn't documented well enough. But we're bleeding edge here. And I'm still hoping somebody will come up with an elegant solution that doesn't break interoperability, applications and/or iwarp hw (i'm a dreamer :). Well, if documenting it once saves someone a headache and a few hours of their time, it's probably worth it. Seems like everyone understands now what the problem is, that it sucks, and it can't be fixed lower down the stack :) Thanks for explaining Caitlin/Steve. As Jeff wrote, dealing with it in the BTLs really won't be that hard, just makes things a little more complicated to maintain. Andrew