Re: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM

2007-05-09 Thread Jeff Squyres

On May 9, 2007, at 1:37 AM, Or Gerlitz wrote:

> Doing a bit of zoom out from the "how to make ofed's udapl work for
> ompi" thread, my thinking is that the ompi udapl btl enablement is
> actually only the first step, where for production/longterm/etc you
> want to have an rdmacm btl.


I think this is a bit of a misunderstanding.  The "BTL" in Open MPI  
is a byte transfer layer; it is a point-to-point abstraction for  
moving bytes between two processes.  BTL components (read: plugins)  
are typically distinguished by the underlying protocols used.  For  
example, we have an RC verbs-based BTL and we have a separate uDAPL- 
based BTL.  Andrew is also working on a research-quality UD verbs- 
based BTL.


Hence, how a particular BTL component makes connections between  
process peers is really a side-effect of moving bytes around, and not  
the focus of the BTL.  So having a "rdmacm" BTL doesn't really make  
sense.  If both the RC and UD verbs-based BTLs someday use the RDMA  
CM for connections, we might abstract the connection management out  
to a common piece of code between the two.  But that's a different  
issue.  If we end up having a mixed BTL someday that uses both RC and  
UD, then the need for the common code may go away.  But that's in the  
future.
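To make the layering concrete, here is a minimal sketch in Python (Open MPI itself is C, and none of these names come from the Open MPI code base; they are all made up for illustration). It assumes components that differ only by wire protocol and that delegate connection setup to one shared helper, which is roughly the factoring described above if both the RC and UD BTLs someday used the RDMA CM.

```python
class ConnectionManager:
    """Hypothetical shared helper; stands in for RDMA CM based setup."""

    def __init__(self):
        self.connected = set()

    def connect(self, peer):
        # A real implementation would resolve addresses and exchange
        # connection requests; this sketch only records the result.
        self.connected.add(peer)


class Btl:
    """Point-to-point byte mover; connecting is a side effect of sending."""

    protocol = None  # each component names its own wire protocol

    def __init__(self, cm):
        self.cm = cm
        self.wire = []  # (protocol, peer, payload) tuples, for illustration

    def send(self, peer, payload):
        if peer not in self.cm.connected:
            self.cm.connect(peer)  # lazy: connect on first send
        self.wire.append((self.protocol, peer, payload))


class RcVerbsBtl(Btl):
    protocol = "rc-verbs"


class UdVerbsBtl(Btl):
    protocol = "ud-verbs"


# Two components, one connection-management implementation.
rc = RcVerbsBtl(ConnectionManager())
ud = UdVerbsBtl(ConnectionManager())
rc.send("rank1", b"hello")
ud.send("rank2", b"world")
```

The point of the sketch is only that the connection code lives beside the BTLs rather than being a BTL of its own.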


> Reasoning here is made of many arguments, among them the quickest I
> can make are:
>
> A) it seems that ompi would want to use not only RC but also UD
> multicast and unicast, which are not covered by udapl
>
> B) there's actually no real justification to maintain two APIs
> (namely udapl vs libibverbs/librdmacm), so down the road, only one
> of them would survive (udapl is implemented ***over*** libibverbs/
> librdmacm, so if the latter dies, so does udapl). Specifically, I
> hear here and there that the OFED stack is now on its way to be
> deployed all over the place, specifically in commercial Unix OSs
> (which want modern! code that supports IPoIB-CM, RDS, SRP, iSER, etc.,
> you name it) so eventually the rdmacm btl can be used also over
> Solaris et al.


I think that's not quite the point.

1. A piece of history: the uDAPL BTL was originally developed by a  
grad student just as an excuse to learn the BTL interface and OMPI  
internals.  We already had an RC verbs-based BTL at the time.


2. When Sun joined Open MPI, they took over the development and  
maintenance of the uDAPL BTL because uDAPL is the only high  
performance stack on Solaris.


3. It's fine that Sun will someday support the same verbs interface  
that OFED does.  But *today*, they don't.  So for their current  
customers, they need to support uDAPL.  As such, we have done little/ 
no testing of uDAPL on OFED since Sun took over the uDAPL BTL -- all  
testing since that point has been on Solaris uDAPL.  All of our Linux/ 
OFED efforts have been on the verbs interface.


4. The Open MPI focus on uDAPL over OFED at the moment is simply to
jump-start iWARP testing.  Both NetEffect and Chelsio have chimed in
to say that they will do the RDMA CM work for Open MPI, but uDAPL can
serve as a temporary workaround that is usable [effectively]
immediately while they get up to speed on the Open MPI code base and
do the RDMA CM work.


--
Jeff Squyres
Cisco Systems



[OMPI devel] OMPI over ofed udapl - bugs opened

2007-05-09 Thread Steve Wise

606 opened to track the udapl change.

607 opened to track the ompi change to remove the port number stashing
hack.

Status: I have a patch from Arlin to test today.  I will test with that
patch and with the OMPI port hack removed.  Stay tuned...



Steve.

On Tue, 2007-05-08 at 15:47 -0700, Arlin Davis wrote:
> Steve Wise wrote:
> 
> >I would like the group to consider including changes needed to OMPI
> >and/or ofa udapl to get OMPI working again on udapl for ofed-1.2.  
> >
> >This will provide OMPI support over iwarp devices via udapl until we can
> >get rdma-cm support added to OMPI.  
> >
> >
> >Steve.
> >  
> >  
> >
> Steve, can you open a bug to track this?



Re: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM

2007-05-09 Thread Steve Wise
On Wed, 2007-05-09 at 08:37 +0300, Or Gerlitz wrote:
> Andrew Friedley wrote:
> > Jeff Squyres wrote:
> > > FWIW, yes, adding RDMA CM support has actually been on my to-do list
> > > for a while, but it keeps getting bumped by higher priority items.
> > > It would be *much* better if some iWARP companies got involved in
> > > Open MPI...
> 
> > Hmm I'm interested.  I've already done some work switching over to RDMA 
> > CM for some research stuff I've been doing; it's not publicly accessible 
> > w/o the 3rd party agreement.  I can help answer questions on what 
> > exactly needs to change, and do some testing.
> 
> Doing a bit of zoom out from the "how to make ofed's udapl work for 
> ompi" thread, my thinking is that the ompi udapl btl enablement is 
> actually only the first step, where for production/longterm/etc you want 
> to have an rdmacm btl. Reasoning here is made of many arguments, among 
> them the quickest i can make are:
> 
> A) it seems that ompi would want to use not only RC but rather also UD 
> multicast and unicast, which are not covered by udapl
> 
> B) there's actually no real justification to maintain two APIs (namely 
> udapl vs libibverbs/librdmacm), so down the road, only one of them would 
> survive (udapl is implemented ***over*** libibverbs/librdmacm, so if the 
> latter dies, so does udapl). Specifically, I hear here and there that 
> the OFED stack is now on its way to be deployed all over the place, 
> specifically in commercial Unix OSs (which want modern! code that 
> supports IPoIB-CM, RDS, SRP, iSER, etc., you name it) so eventually the 
> rdmacm btl can be used also over Solaris et al.
> 

Agreed.  Enabling udapl will get OMPI over iwarp immediately (and
hopefully in ofed-1.2).  Post ofed-1.2, I think OMPI _should_ create an
rdma-cm btl.  That's the plan...

Steve.







Re: [OMPI devel] OMPI over ofed udapl - bugs opened

2007-05-09 Thread Steve Wise

Although as Boris pointed out, perhaps the hack in OMPI is no longer
needed at all...


On Wed, 2007-05-09 at 08:41 -0500, Steve Wise wrote:
> 606 opened to track the udapl change.
> 
> 607 opened to track the ompi change to remove the port number stashing
> hack.
> 
> Status: I have a patch from Arlin to test today.  I will test with that
> patch and with the OMPI port hack removed.  Stay tuned...
> 
> 
> 
> Steve.
> 
> On Tue, 2007-05-08 at 15:47 -0700, Arlin Davis wrote:
> > Steve Wise wrote:
> > 
> > >I would like the group to consider including changes needed to OMPI
> > >and/or ofa udapl to get OMPI working again on udapl for ofed-1.2.  
> > >
> > >This will provide OMPI support over iwarp devices via udapl until we can
> > >get rdma-cm support added to OMPI.  
> > >
> > >
> > >Steve.
> > >  
> > >  
> > >
> > Steve, can you open a bug to track this?
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] OMPI over ofed udapl - bugs opened

2007-05-09 Thread Jeff Squyres
FWIW, I would marginally prefer if this bug is tracked in the Open  
MPI trac ticket system, not the OFA bugzilla (Steve W. will have  
write access there as soon as Chelsio submits their OMPI 3rd party  
contribution agreement).  We've traditionally [mostly] tracked OMPI  
bugs in the OMPI bug system and OFED-specific OMPI packaging problems  
in the OFA bugzilla.  It's a gray area, I admit.


But since I'm not the uDAPL maintainer in Open MPI, moving the bug
over there will allow the Right people to see it (some OMPI
developers are cross subscribed to the OFA general list, but not
all).  For example, this udapl problem is likely related to the
existing OMPI trac ticket 890
(https://svn.open-mpi.org/trac/ompi/ticket/890).




--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [ofa-general] OpenMPI and RDMA-CM

2007-05-09 Thread Jeff Squyres

On May 9, 2007, at 10:30 AM, Steve Wise wrote:


> Agreed.  Enabling udapl will get OMPI over iwarp immediately (and
> hopefully in ofed-1.2).  Post ofed-1.2, I think OMPI _should_ create an
> rdma-cm btl.  That's the plan...


Yes and no.  Please see my other reply about an "rdma cm" BTL...

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] OMPI over ofed udapl - bugs opened

2007-05-09 Thread Donald Kerr


I agree OMPI trac ticket #890 should cover this. I will test the 
suggested fix, just removing that one line from btl_udapl.c, on Solaris. 
I am still not set up on Linux so hopefully Steve can confirm there.


-DON



Re: [OMPI devel] OMPI over ofed udapl - bugs opened

2007-05-09 Thread Steve Wise
On Wed, 2007-05-09 at 11:42 -0400, Donald Kerr wrote:
> I agree OMPI trac ticket #890 should cover this. I will test the 
> suggested fix, just removing that one line from btl_udapl.c, on Solaris. 
> I am still not set up on Linux so hopefully Steve can confirm there.
> 

All,

First, I haven't tested Arlin's dat_ep_query() fix yet as we have
determined it's not needed.  The OMPI udapl btl never calls
dat_ep_query()... 

So running OMPI with the suggested fix (removing the overwriting of the
hca_addr port field in btl_udapl.c) over ofed udapl on chelsio's iwarp
rnic still doesn't work.  

There are two new issues so far:

1) this has uncovered a connection migration issue in the Chelsio
driver/firmware.  We are developing and testing a fix for this now.
Should be ready tomorrow hopefully.

2) OMPI is not adhering to the iwarp protocol requirement that the ULP,
in this case OMPI, initiating the iwarp connection (the side issuing the
dat_ep_connect() or rdma_connect()) _MUST_ be the first to send an RDMA
message.  So if an OMPI process _accepts_ an rdma connection, then it
cannot send on that connection until it receives some sort of rdma
operation from the client process.  It appears the current OMPI
connection setup model doesn't enforce this.

This combined with the bug above causes an immediate connection failure
on chelsio's rnic.  After I fix #1 above, things might get slightly
better but my guess is we will still have connection setup problems if
the server side sends before the client side finishes streaming->rdma
mode transition.  
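As an illustration of the ordering rule in issue 2, the following Python sketch checks a wire trace for violations. It is a toy model only: the event format and the function name are hypothetical, "active" means the side that called dat_ep_connect() or rdma_connect(), and "passive" means the side that accepted.

```python
def first_violation(events):
    """Return the index of the first event that breaks the rule, or None.

    events: iterable of (conn_id, side) tuples in wire order, where side
    is "active" (the connecting peer) or "passive" (the accepting peer).
    The rule modeled here: on each connection, the passive side may not
    send until the active side has sent at least once.
    """
    active_has_sent = set()
    for i, (conn, side) in enumerate(events):
        if side == "active":
            active_has_sent.add(conn)
        elif conn not in active_has_sent:
            return i
    return None


ok_trace = [("c1", "active"), ("c1", "passive")]
bad_trace = [("c1", "passive"), ("c1", "active")]
mixed_trace = [("c1", "active"), ("c2", "passive")]
```

Here `ok_trace` passes, while `bad_trace` and `mixed_trace` each contain a passive-side send on a connection whose active side has not yet sent.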

There has been a series of discussions on the ofa general list about
this issue, and the conclusion to date is that it cannot be resolved in
the rdma-cm or iwarp-cm code of the linux rdma stack, mainly because
sending an RDMA message involves the ULP's work queue and completion
queue, so the CM cannot do this under the covers in a manner that
doesn't affect the application.  Thus, the applications must deal with
this.


Here is a possible solution: 

I assume that in OMPI connections are only initiated when the mpi
application does a send operation.  Given that, the udapl btl must
ensure that if a given rank accepts a connection, it does not send
anything until the rank at the other end of the connection sends first.
Since the other side initiated the connection, it will have pending data
to send...

I haven't looked into how painful this will be to implement.

Thoughts?


FYI:

IETF Draft requiring this behavior:

http://www.ietf.org/internet-drafts/draft-ietf-rddp-mpa-08.txt

See section 7 for specifics.

Steve.




[OMPI devel] Nightly trunk tarball AC/AM change

2007-05-09 Thread Brian Barrett

Hi all -

After a minor hiccup last night, nightly tarballs for the trunk (and  
eventually v1.3 branch) are now made with AC 2.61, AM 1.10, and LT  
2.1a.  Don't forget the mandatory update of AC and AM for the trunk
coming Saturday morning!



Brian


Re: [OMPI devel] OMPI over ofed udapl - bugs opened

2007-05-09 Thread Andrew Friedley



Steve Wise wrote:

> There has been a series of discussions on the ofa general list about
> this issue, and the conclusion to date is that it cannot be resolved in
> the rdma-cm or iwarp-cm code of the linux rdma stack, mainly because
> sending an RDMA message involves the ULP's work queue and completion
> queue, so the CM cannot do this under the covers in a manner that
> doesn't affect the application.  Thus, the applications must deal with
> this.


Why can't uDAPL deal with this?  As a uDAPL user, I really don't care 
what API uDAPL is using under the hood to move data from one place to 
another, nor the quirks of that API.  The whole point of uDAPL is to 
form a network-agnostic abstraction layer.  AFAIK, the uDAPL spec 
doesn't enforce any such requirement on RDMA communication either.  In 
my opinion, exposing such behavior above uDAPL is incorrect and is part 
of why uDAPL has seen limited adoption -- every single uDAPL 
implementation behaves in different ways, making it extremely difficult 
to write an application to work on any uDAPL implementation.  Sorry if 
this sounds harsh, but this comes from many hours of banging my head on 
the wall due to working around these sorts of problems :)




> Here is a possible solution:
>
> I assume that in OMPI connections are only initiated when the mpi
> application does a send operation.  Given that, the udapl btl must
> ensure that if a given rank accepts a connection, it does not send
> anything until the rank at the other end of the connection sends first.
> Since the other side initiated the connection, it will have pending data
> to send...
>
> I haven't looked into how painful this will be to implement.
>
> Thoughts?


Following on what I wrote above, I think Open MPI is the wrong place to 
be dealing with this.  There are enough of these hacks as it is; I'm not 
interested in seeing more get added.


Andrew


Re: [OMPI devel] OMPI over ofed udapl - bugs opened

2007-05-09 Thread Donald Kerr
I'm missing some context here. Where are you plugging iwarp and OMPI 
together?



Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened

2007-05-09 Thread Steve Wise
On Wed, 2007-05-09 at 16:20 -0400, Donald Kerr wrote:
> I'm missing some context here. Where are you plugging iwarp and OMPI 
> together? 

ofed-1.2 supports iwarp and the chelsio rnic.  It can be accessed
directly via the ofa verbs and ofa rdma-cm _as well as_ via udapl.  

I'm attempting to run OMPI over udapl over chelsio's rnic.

Steve.





Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened

2007-05-09 Thread Donald Kerr
So then I agree with Andrew, I think you are trying to impose 
restrictions on uDAPL which are not part of the Spec.


-DON


Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened

2007-05-09 Thread Steve Wise
On Wed, 2007-05-09 at 16:27 -0400, Donald Kerr wrote:
> So then I agree with Andrew, I think you are trying to impose 
> restrictions on uDAPL which are not part of the Spec.
> 

True, but if you want a single btl for IB and IW, then you'll need to
address this issue in some way...




Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened

2007-05-09 Thread Donald Kerr
I guess I have not read enough about iwarp yet but if iwarp is sitting 
below ib verbs or udapl in the stack and is trying to impose 
restrictions which ib verbs or udapl do not adhere to then maybe iwarp 
is in the wrong place in the ofed stack.


Having said that, I do agree the OMPI community needs to consider where 
iwarp fits in its own stack, if it has not already.



Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened

2007-05-09 Thread Caitlin Bestler

> 
> 2) OMPI is not adhering to the iwarp protocol requirement that the ULP,
> in this case OMPI, initiating the iwarp connection (the side issuing the
> dat_ep_connect() or rdma_connect()) _MUST_ be the first to send an RDMA
> message.  So if an OMPI process _accepts_ an rdma connection, then it
> cannot send on that connection until it receives some sort of rdma
> operation from the client process.  It appears the current OMPI
> connection setup model doesn't enforce this.
> 

This is actually an MPA requirement, and according to *protocol* specs
having the active side send a zero length RDMA Write should be able
to fix the problem. However there is language in the RDMAC verbs that
clearly implies that the active side must Send something, and that an
RDMA Write is insufficient.

Therefore, the only truly safe thing for an iWARP btl to do (or a
udapl btl since that is also an iWARP btl) is to have the active
layer send an MPI Layer "nop" of some kind immediately after 
establishing the connection if there is nothing else to send.
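A rough Python sketch of that rule for the active side follows. Everything in it is hypothetical (the class, the method names, and the `NOP` marker are invented for illustration and are not Open MPI or uDAPL interfaces); it only models "send real data if any is pending, otherwise send a nop".

```python
# Hypothetical marker for an MPI-layer no-op; the receiver would discard it.
NOP = ("nop", b"")


class ActiveEndpoint:
    """Toy model of the side that initiated the connection."""

    def __init__(self):
        self.pending = []  # fragments queued before the connection is up
        self.sent = []     # what has gone onto the wire, in order

    def queue(self, frag):
        self.pending.append(frag)

    def on_established(self):
        # The active side must be first on the wire: flush real data if
        # there is any, otherwise send a nop so the passive side is
        # allowed to transmit.
        if not self.pending:
            self.sent.append(NOP)
        while self.pending:
            self.sent.append(self.pending.pop(0))


idle = ActiveEndpoint()
idle.on_established()          # nothing queued, so a nop goes out

busy = ActiveEndpoint()
busy.queue(b"first-fragment")
busy.on_established()          # real data goes out, no nop needed
```

Because the nop is modeled as a Send, the passive side would need a receive posted for it, just as for any other first message.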





Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened

2007-05-09 Thread Jeff Squyres

I talked with Steve a bunch on the phone about this.

1. This "connector must RDMA first" issue is an iWARP restriction --  
it's not specific to udapl or verbs.  For example, if you try to use  
udapl with iWARP on Solaris, you'll have the same issue (I have no  
idea whether you have iWARP drivers in Solaris or not).


2. Per his prior e-mail (which I didn't fully grok until I talked to  
him), using the RDMA CM in the openib BTL will not magically fix this  
issue for us.


3. So for any of the BTLs to support iWARP -- regardless of  
underlying protocol or OS -- they are going to have to obey this  
restriction.


4. Luckily, in iWARP, the restriction can be met by either send/ 
receive semantics *or* RDMA semantics.  You don't have to  
specifically use RDMA verbs semantics, for example.  This is good  
because of the way that OMPI works (the first fragment that will be  
transmitted is pretty much guaranteed to be a send/receive fragment,  
not an RDMA fragment) -- it makes the logistics slightly simpler.


Galen Shipman and I talked about this a bit and suggest the following:

- During the connection dance (probably for both the udapl and openib
BTLs), whichever peer ends up being the connection initiator (don't
forget about the race condition where 2 peers may simultaneously
decide to initiate -- this case is handled properly in the OMPI code;
but just make sure you modify the side that ends up being the actual
initiator) can send its pending fragment immediately (and
Steve is right that there will always be a pending fragment, because
OMPI doesn't make a connection until the first send).


- The other peer (the receiver of the connection) must wait to send  
its pending fragment(s) until it receives the first frag from the  
connection initiator.  This can be accomplished either with another  
flag on the OMPI module struct or perhaps making it part of the  
connection protocol (i.e., don't transition the endpoint to be  
CONNECTED until the first fragment is received).  Either approach can
be used to queue up fragments on the receiver until the first
fragment is received from the initiator.  I'd have to look in the  
code deeper, but I'm *guessing* that it might be best to use the  
already-existing state flag (i.e., checking for CONNECTED) because  
then you won't be introducing any more conditionals in the critical  
path.
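The two bullets above can be sketched as a small state machine. This is Python for brevity and is not Open MPI code: the class, the state names other than CONNECTED, and the callback names are assumptions made for illustration. It models the variant that reuses the existing state check, so the send path has a single conditional.

```python
from collections import deque
from enum import Enum


class State(Enum):
    CONNECTING = 1
    CONNECTED = 2


class Endpoint:
    """Toy endpoint; `initiator` is True on the side that connected."""

    def __init__(self, initiator):
        self.initiator = initiator
        self.state = State.CONNECTING
        self.pending = deque()  # fragments held back until CONNECTED
        self.wire = []          # fragments actually transmitted

    def send(self, frag):
        # Single check on the send path: anything not CONNECTED is queued.
        if self.state is State.CONNECTED:
            self.wire.append(frag)
        else:
            self.pending.append(frag)

    def on_transport_established(self):
        # The initiator may transmit at once; the accepter stays in
        # CONNECTING until the initiator's first fragment arrives.
        if self.initiator:
            self._mark_connected()

    def on_recv(self, frag):
        if self.state is not State.CONNECTED:
            self._mark_connected()
        return frag

    def _mark_connected(self):
        self.state = State.CONNECTED
        while self.pending:
            self.wire.append(self.pending.popleft())


accepter = Endpoint(initiator=False)
accepter.send(b"reply")               # queued, not transmitted
accepter.on_transport_established()   # still CONNECTING on this side
held_back = list(accepter.wire)       # nothing on the wire yet
accepter.on_recv(b"first")            # initiator's first fragment arrives
```

After `on_recv`, the accepter transitions to CONNECTED and drains its queue in order.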






--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened

2007-05-09 Thread Caitlin Bestler
Jeff Squyres wrote:

> 
> - The other peer (the receiver of the connection) must wait
> to send its pending fragment(s) until it receives the first
> frag from the connection initiator.  This can be accomplished
> either with another flag on the OMPI module struct or perhaps
> making it part of the connection protocol (i.e., don't
> transition the endpoint to be CONNECTED until the first
> fragment is received).  Either of which can be used to queue
> up fragments on the receiver until the first fragment is
> received from the initiator.  I'd have to look in the code
> deeper, but I'm *guessing* that it might be best to use the
> already-existing state flag (i.e., checking for CONNECTED)
> because then you won't be introducing any more conditionals
> in the critical path.
> 

The transport provider has several options on ensuring that
the passive side does not put a message on the wire before
the first message is received.

What the transport layer cannot do is create the first message
from the active side. Because it will have send/recv semantics
it will complete a receive work request, which the application
layer has to post with that expectation.

This nop does not have to be visible above OMPI, but I'm pretty
sure OMPI has to generate it. That isn't exactly fair to the 
application layer, but the RDMAC verbs are water under the
bridge. Assuming OMPI wants to work with *any* iWARP RNIC
then it needs to ensure that the active side will send something
promptly in all cases.







Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened

2007-05-09 Thread Jeff Squyres

Understood, and I agree.

FWIW: note that the CONNECTED state that I referred to is internal to
OMPI's endpoint abstraction (not an iwarp/udapl/verbs/etc. state).   
It's part of our connection dance protocol.




--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] OMPI over ofed udapl - bugs opened

2007-05-09 Thread Steve Wise
On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote:
> 
> Steve Wise wrote:
> > There has been a series of discussions on the ofa general list about
> > this issue, and the conclusion to date is that it cannot be resolved in
> > the rdma-cm or iwarp-cm code of the linux rdma stack, mainly because
> > sending an RDMA message involves the ULP's work queue and completion
> > queue, so the CM cannot do this under the covers in a manner that
> > doesn't affect the application.  Thus, the applications must deal with
> > this.
> 
> Why can't uDAPL deal with this?  As a uDAPL user, I really don't care 
> what API uDAPL is using under the hood to move data from one place to 
> another, nor the quirks of that API.  The whole point of uDAPL is to 
> form a network-agnostic abstraction layer.  AFAIK, the uDAPL spec 
> doesn't enforce any such requirement on RDMA communication either.  In 
> my opinion, exposing such behavior above uDAPL is incorrect and is part 
> of why uDAPL has seen limited adoption -- every single uDAPL 
> implementation behaves in different ways, making it extremely difficult 
> to write an application to work on any uDAPL implementation.  Sorry if 
> this sounds harsh, but this comes from many hours of banging my head on 
> the wall due to working around these sorts of problems :)
> 

I understand your frustration.  I think the MPA protocol is deficient in
this respect and should have required the necessary "first FPDU" to be
sent under the covers by the RNICs. An RTR packet if you will.  To
resolve this issue "properly", in my opinion, would involve changing the
IETF MPA spec and also breaking all the existing iwarp HW.  We can't do
that.

The reason it is hard or impossible to solve this in the DAPL layer is
that any rdma operation on the QP affects the state of that QP and the
associated CQs.  In addition, if you use an RDMA send to enforce this you
impact the other side by consuming a RECV buffer. So it's hard if not
impossible to do this under the covers without affecting the
application's resources.

Also, the DAPL specification had a goal to not impose any additional
protocol on the wire.  If you add this under the covers, then you add
such a "protocol" and break interoperability between a connection
accessed via DAPL on one end and some other API on the other end.
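To illustrate the RECV buffer point, here is a deliberately simplified Python model. It is not how any RNIC or DAPL provider is implemented; the class and the "hidden-nop" message are hypothetical, and a missing receive buffer is modeled as an exception purely for the sake of the example.

```python
class ReceiveQueue:
    """Counts receive buffers the application has posted."""

    def __init__(self, posted):
        self.posted = posted

    def deliver(self, msg):
        # Every incoming Send consumes one posted receive buffer.
        if self.posted == 0:
            raise RuntimeError("no receive buffer posted")
        self.posted -= 1
        return msg


rq = ReceiveQueue(posted=1)   # the application sized this for its own traffic
rq.deliver("hidden-nop")      # a message injected below the application
try:
    rq.deliver("app-data")    # the application's first real message
    outcome = "delivered"
except RuntimeError:
    outcome = "receiver not ready"
```

In this model the hidden message uses up the only buffer the application posted, so the application's own first message finds none left.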

> > 
> > Here is a possible solution: 
> > 
> > I assume in OMPI that connections are only initiated when the mpi
> > application does a send operation.   Given that, then udapl btl must
> > ensure that if a given rank accepts a connection, it cannot send
> > anything until the rank at the other end of the connection sends first.
> > Since the other side initiated the connection, it will have pending data
> > to send...
> > 
> > I haven't looked into how painful this will be to implement.
> > 
> > Thoughts?
> 
> Following on what I wrote above, I think Open MPI is the wrong place to 
> be dealing with this.  There's enough of these hacks as it is; I'm not 
> interested in seeing more get added.
> 

Unfortunately, I haven't been able to come up with a solution that works
with existing iWARP HW and is interoperable. 

Steve.
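
The rule proposed in the quoted text above (a rank that accepts a
connection must not send until the initiating rank has sent) can be
sketched as a small state machine.  This is only a toy model under stated
assumptions: the Endpoint class and its method names are hypothetical and
are not part of Open MPI, uDAPL or the verbs API.

```python
from collections import deque


class Endpoint:
    """Toy model of one side of a connection (hypothetical, not a real BTL)."""

    def __init__(self, initiated_connection):
        # The side that initiated the connection may send immediately.
        self.may_send = initiated_connection
        self.deferred = deque()   # sends held back on the accepting side
        self.wire = []            # messages actually handed to the transport

    def send(self, msg):
        if self.may_send:
            self.wire.append(msg)
        else:
            # Accepting side: hold the message until the peer has sent first.
            self.deferred.append(msg)

    def on_receive(self, msg):
        # The first message from the initiator unblocks the accepting side,
        # and any deferred sends are flushed in their original order.
        self.may_send = True
        while self.deferred:
            self.wire.append(self.deferred.popleft())
        return msg


passive = Endpoint(initiated_connection=False)
passive.send("reply-1")            # queued, nothing goes out yet
held_back = list(passive.wire)     # snapshot before the peer has sent
passive.on_receive("first message from initiator")
flushed = list(passive.wire)       # deferred send is released here
```

The cost of this approach is the extra queue and the ordering logic in
the BTL, which is the maintenance burden discussed in this thread.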



Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened

2007-05-09 Thread Andrew Friedley



> Therefore, the only truly safe thing for an iWARP btl to do (or a
> udapl btl since that is also an iWARP btl) is to have the active
> layer send an MPI Layer "nop" of some kind immediately after
> establishing the connection if there is nothing else to send.


This is fine for an iWARP/RDMACM/whatever BTL (or anything else that 
uses the OFA verbs interface(s)), but my argument is that uDAPL is NOT 
specifically there to support just iWARP (though it may include it), and 
that OFED's uDAPL should be adjusted to handle this.  Again, uDAPL is a 
network *independent* abstraction, so requiring network-dependent 
behavior from the uDAPL consumer is wrong.


A related question -- how does this 'connection initiator must send 
first' requirement relate to UD?


Andrew


Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened

2007-05-09 Thread Caitlin Bestler
general-boun...@lists.openfabrics.org wrote:
>> Therefore, the only truly safe thing for an iWARP btl to do (or a
>> udapl btl since that is also an iWARP btl) is to have the active
>> layer send an MPI Layer "nop" of some kind immediately after
>> establishing the connection if there is nothing else to send.
> 
> This is fine for an iWARP/RDMACM/whatever BTL (or anything
> else that uses the OFA verbs interface(s)), but my argument
> is that uDAPL is NOT specifically there to support just iWARP
> (though it may include it), and that OFED's uDAPL should be
> adjusted to handle this.  Again, uDAPL is a network
> *independent* abstraction, so requiring network-dependent
> behavior from the uDAPL consumer is wrong.
>

DAPL strives to define network-independent solutions. In this
case the network-independent solution is that the active side
*always* sends the first message. This works for both iWARP
and InfiniBand. And away from the HPC market it is almost a
non-requirement (which is why the RDMAC managed to goof on
this in its specification). A zero-length RDMA Write is enough
to deal with the wire protocol problem, but people implemented
to the RDMAC verbs.
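
A minimal sketch of the "active side always sends first" rule, assuming a
hypothetical ActiveSide class and a hypothetical NOP marker (neither is a
DAPL or verbs name).  If the initiator has no application data ready when
the connection comes up, it sends a nop so that it is still the first to
send.

```python
NOP = b""  # hypothetical zero-payload "nop" message


class ActiveSide:
    """Toy model of the connection initiator (hypothetical names)."""

    def __init__(self):
        self.pending = []  # application data waiting for the connection
        self.sent = []     # what has been sent, in order

    def queue(self, data):
        self.pending.append(data)

    def on_connection_established(self):
        # Send real data if there is any; otherwise send a nop, so the
        # initiator is always the first side to put a message on the wire.
        if self.pending:
            self.sent.extend(self.pending)
            self.pending.clear()
        else:
            self.sent.append(NOP)


idle = ActiveSide()
idle.on_connection_established()      # nothing queued, so a nop goes out

busy = ActiveSide()
busy.queue(b"hello")
busy.on_connection_established()      # real data goes out, no nop needed
```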


>
> A related question -- how does this 'connection initiator
> must send first' requirement relate to UD?
> 

iWARP UD is called UDP. It has nothing to do with MPA
or RDMA. An API that mapped to either IB UD or UDP is
definitely feasible, but hasn't been important enough
to anyone to draft as of yet.





Re: [OMPI devel] OMPI over ofed udapl - bugs opened

2007-05-09 Thread Andrew Friedley



Steve Wise wrote:
> On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote:
>> Steve Wise wrote:
>>> There have been a series of discussions on the ofa general list about
>>> this issue, and the conclusion to date is that it cannot be resolved in
>>> the rdma-cm or iwarp-cm code of the linux rdma stack.  Mainly because
>>> sending an RDMA message involves the ULP's work queue and completion
>>> queue, so the CM cannot do this under the covers in a manner that
>>> doesn't affect the application.  Thus, the applications must deal with
>>> this.
>> Why can't uDAPL deal with this?  As a uDAPL user, I really don't care
>> what API uDAPL is using under the hood to move data from one place to
>> another, nor the quirks of that API.  The whole point of uDAPL is to
>> form a network-agnostic abstraction layer.  AFAIK, the uDAPL spec
>> doesn't enforce any such requirement on RDMA communication either.  In
>> my opinion, exposing such behavior above uDAPL is incorrect and is part
>> of why uDAPL has seen limited adoption -- every single uDAPL
>> implementation behaves in different ways, making it extremely difficult
>> to write an application to work on any uDAPL implementation.  Sorry if
>> this sounds harsh, but this comes from many hours of banging my head on
>> the wall due to working around these sorts of problems :)




> I understand your frustration.  I think the MPA protocol is deficient in
> this respect and should have required the necessary "first FPDU" to be
> sent under the covers by the RNICs: an RTR packet, if you will.  To
> resolve this issue "properly", in my opinion, would involve changing the
> IETF MPA spec and also breaking all the existing iWARP HW.  We can't do
> that.


Understood.


> The reason it is hard or impossible to solve this in the DAPL layer is
> that any rdma operation on the QP affects the state of that QP and the
> associated CQs.  In addition, if you use an RDMA send to enforce this you
> impact the other side by consuming a RECV buffer. So it's hard if not
> impossible to do this under the covers without affecting the
> application's resources.


Is there no way to do this before passing connection established events 
to the uDAPL consumer?  I need to go read up on the uDAPL API to really 
understand why this wouldn't work.




> Also, the DAPL specification had a goal to not impose any additional
> protocol on the wire.  If you add this under the covers, then you add
> such a "protocol" and break interoperability between a connection
> accessed via DAPL on one end and some other API on the other end.


So I guess there's no 'right' solution, at least at the uDAPL level. 
With RDMACM/OFA verbs, there's at least the argument that you can design 
the API/semantics however you please, while uDAPL is already standardized.


I hope you guys are documenting this in a way that makes this issue 
extremely clear to both uDAPL and OFA verbs (is this the right naming?) 
users.  Maybe it's been done already, but is it possible to emit some 
sort of loud warning/error when the accept()'ing side tries to send 
before a receive?


Andrew


Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened

2007-05-09 Thread Steve Wise
On Wed, 2007-05-09 at 17:46 -0700, Andrew Friedley wrote:
> > Therefore, the only truly safe thing for an iWARP btl to do (or a
> > udapl btl since that is also an iWARP btl) is to have the active
> > layer send an MPI Layer "nop" of some kind immediately after 
> > establishing the connection if there is nothing else to send.
> 
> This is fine for an iWARP/RDMACM/whatever BTL (or anything else that 
> uses the OFA verbs interface(s)), but my argument is that uDAPL is NOT 
> specifically there to support just iWARP (though it may include it), and 
> that OFED's uDAPL should be adjusted to handle this.  Again, uDAPL is a 
> network *independent* abstraction, so requiring network-dependent 
> behavior from the uDAPL consumer is wrong.
> 
> A related question -- how does this 'connection initiator must send 
> first' requirement relate to UD?
> 

It doesn't.  UD isn't supported in iWARP.



Re: [OMPI devel] OMPI over ofed udapl - bugs opened

2007-05-09 Thread Caitlin Bestler
devel-boun...@open-mpi.org wrote:
> Steve Wise wrote:
>> There have been a series of discussions on the ofa general list about
>> this issue, and the conclusion to date is that it cannot be resolved
>> in the rdma-cm or iwarp-cm code of the linux rdma stack.  Mainly
>> because sending an RDMA message involves the ULP's work queue and
>> completion queue, so the CM cannot do this under the covers in a
>> manner that doesn't affect the application.  Thus, the applications
>> must deal with this.
> 
> Why can't uDAPL deal with this?  As a uDAPL user, I really
> don't care what API uDAPL is using under the hood to move
> data from one place to another, nor the quirks of that API.
> The whole point of uDAPL is to form a network-agnostic
> abstraction layer.  AFAIK, the uDAPL spec doesn't enforce any
> such requirement on RDMA communication either.  In my
> opinion, exposing such behavior above uDAPL is incorrect and
> is part of why uDAPL has seen limited adoption -- every
> single uDAPL implementation behaves in different ways, making
> it extremely difficult to write an application to work on any
> uDAPL implementation.  Sorry if this sounds harsh, but this
> comes from many hours of banging my head on the wall due to
> working around these sorts of problems :)
> 

The simple answer is that uDAPL cannot deal with this.

The RDMAC verbs specification was overly focused on client/server
and therefore did not realize that there was any harm in requiring
that the active side did the first send. But given that DAPL could
not rewrite either the RDMAC or InfiniBand verbs it had to come up
with the best solution that matched the verbs as they were. One of
the explicit ground rules was that DAPL MUST support all RDMA devices
that were IBTA or RDMAC compliant. Given those rules, if the active
side does not send a message the passive side might be held off
indefinitely, and sending a message causes consumption of a receive
buffer and therefore cannot be transparent to the uDAPL consumer.

Given those constraints there is literally nothing that can be
done to work around this problem by either DAPL or OFA.
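
The receive-buffer argument can be illustrated with a toy receive queue.
This is a sketch under stated assumptions, not the verbs or uDAPL API:
ReceiveQueue and its methods are hypothetical names.  It shows that a
"hidden" first message from the active side still consumes a buffer the
consumer posted, and still produces a completion the consumer can see.

```python
from collections import deque


class ReceiveQueue:
    """Toy receive queue: each incoming SEND consumes one posted buffer."""

    def __init__(self):
        self.posted = deque()    # buffers posted by the consumer, FIFO
        self.completions = []    # (buffer_id, payload) seen by the consumer

    def post_recv(self, buffer_id):
        self.posted.append(buffer_id)

    def incoming_send(self, payload):
        if not self.posted:
            raise RuntimeError("no receive buffer posted")
        # The buffer at the head of the queue is used, whoever the
        # message was really meant for.
        buffer_id = self.posted.popleft()
        self.completions.append((buffer_id, payload))


rq = ReceiveQueue()
rq.post_recv("app-buf-0")
rq.post_recv("app-buf-1")
rq.incoming_send(b"")           # hidden "first message" from the active side
rq.incoming_send(b"app data")   # the application's first real message
```

In this model the application's first buffer ends up holding the hidden
message, which is why the exchange cannot be made invisible to the
consumer.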




Re: [OMPI devel] OMPI over ofed udapl - bugs opened

2007-05-09 Thread Steve Wise
On Wed, 2007-05-09 at 17:55 -0700, Andrew Friedley wrote:
> 
> Steve Wise wrote:
> > On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote:
> >> Steve Wise wrote:
> >>> There have been a series of discussions on the ofa general list about
> >>> this issue, and the conclusion to date is that it cannot be resolved in
> >>> the rdma-cm or iwarp-cm code of the linux rdma stack.  Mainly because
> >>> sending an RDMA message involves the ULP's work queue and completion
> >>> queue, so the CM cannot do this under the covers in a manner that
> >>> doesn't affect the application.  Thus, the applications must deal with
> >>> this.
> >> Why can't uDAPL deal with this?  As a uDAPL user, I really don't care 
> >> what API uDAPL is using under the hood to move data from one place to 
> >> another, nor the quirks of that API.  The whole point of uDAPL is to 
> >> form a network-agnostic abstraction layer.  AFAIK, the uDAPL spec 
> >> doesn't enforce any such requirement on RDMA communication either.  In 
> >> my opinion, exposing such behavior above uDAPL is incorrect and is part 
> >> of why uDAPL has seen limited adoption -- every single uDAPL 
> >> implementation behaves in different ways, making it extremely difficult 
> >> to write an application to work on any uDAPL implementation.  Sorry if 
> >> this sounds harsh, but this comes from many hours of banging my head on 
> >> the wall due to working around these sorts of problems :)
> >>
> > 
> > I understand your frustration.  I think the MPA protocol is deficient in
> > this respect and should have required the necessary "first FPDU" to be
> > sent under the covers by the RNICs: an RTR packet, if you will.  To
> > resolve this issue "properly", in my opinion, would involve changing the
> > IETF MPA spec and also breaking all the existing iWARP HW.  We can't do
> > that.
> 
> Understood.
> 
> > The reason it is hard or impossible to solve this in the DAPL layer is
> > that any rdma operation on the QP affects the state of that QP and the
> > associated CQs.  In addition, if you use an RDMA send to enforce this you
> > impact the other side by consuming a RECV buffer. So it's hard if not
> > impossible to do this under the covers without affecting the
> > application's resources.
> 
> Is there no way to do this before passing connection established events 
> to the uDAPL consumer?  I need to go read up on the uDAPL API to really 
> understand why this wouldn't work.
> 

Perhaps DAPL or maybe even an OFA iWARP CM could defer passing up the
"established" event on the passive side until an incoming SEND is
detected.  I know we've discussed this before, but I'm not sure why this
was not a workable solution.  Perhaps Caitlin or some iWARP folks can
recall?

> > 
> > Also, the DAPL specification had a goal to not impose any additional
> > protocol on the wire.  If you add this under the covers, then you add
> > such a "protocol" and break interoperability between a connection
> > accessed via DAPL on one end and some other API on the other end.
> 
> So I guess there's no 'right' solution, at least at the uDAPL level. 
> With RDMACM/OFA verbs, there's at least the argument that you can design 
> the API/semantics however you please, while uDAPL is already standardized.

Yes, but it's still difficult to post a SEND under the covers because it
consumes the application resources in the form of QP and CQ space and a
RECV buffer.

So to date, we have...punted and pushed the problem to the ULP.

> 
> I hope you guys are documenting this in a way that makes this issue 
> extremely clear to both uDAPL and OFA verbs (is this the right naming?) 
> users.  Maybe it's been done already, but is it possible to emit some 
> sort of loud warning/error when the accept()'ing side tries to send 
> before a receive?
> 

The connection comes tumbling down.  How's that for loud? :)

Seriously though, it isn't documented well enough.  But we're bleeding
edge here. And I'm still hoping somebody will come up with an elegant
solution that doesn't break interoperability, applications and/or iWARP
hw (I'm a dreamer :).


Steve.






Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened

2007-05-09 Thread Steve Wise
On Wed, 2007-05-09 at 15:01 -0700, Sean Hefty wrote:
> > The reason it is hard or impossible to solve this in the DAPL layer is
> > that any rdma operation on the QP affects the state of that QP and the
> > associated CQs.  In addition, if you use an RDMA send to enforce this you
> > impact the other side by consuming a RECV buffer. So it's hard if not
> > impossible to do this under the covers without affecting the
> > application's resources.
> 
> I agree that this is hard, but I don't believe that it's impossible.
> 
> > Also, the DAPL specification had a goal to not impose any additional
> > protocol on the wire.  If you add this under the covers, then you add
> > such a "protocol" and break interoperability between a connection
> > accessed via DAPL on one end and some other API on the other end.
> 
> IMO, this is an unrealized dream.  DAPL does generate wire protocol.  For
> example, when running over IB, DAPL's selection of a service ID and CM
> protocol is visible on the wire.  A DAPL that establishes connections using
> the RDMA CM will likely have a different wire protocol than a version of
> DAPL that establishes connections talking directly to the IB CM.  The two
> DAPLs will not interoperate unless they agree on how they will map to
> service IDs and, in the case of using the RDMA CM, the format of the private
> data carried in the CM messages.

I wasn't aware of this.

> 
> Even in the case of iWARP, DAPL's selection of a local port number affects
> the data visible on the wire.  To communicate, a remote end point must know
> how this mapping occurs.

You mean the local port on the active side?  The remote end point
doesn't need to know this at all...

Steve.



Re: [OMPI devel] [ofa-general] Re: OMPI over ofed udapl - bugs opened

2007-05-09 Thread Caitlin Bestler
general-boun...@lists.openfabrics.org wrote:
> On Wed, 2007-05-09 at 17:55 -0700, Andrew Friedley wrote:
>> 
>> Steve Wise wrote:
>>> On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote:
>>>> Steve Wise wrote:
>>>>> There have been a series of discussions on the ofa general list
>>>>> about this issue, and the conclusion to date is that it cannot be
>>>>> resolved in the rdma-cm or iwarp-cm code of the linux rdma stack.
>>>>> Mainly because sending an RDMA message involves the ULP's work
>>>>> queue and completion queue, so the CM cannot do this under the
>>>>> covers in a manner that doesn't affect the application.  Thus, the
>>>>> applications must deal with this.
>>>> Why can't uDAPL deal with this?  As a uDAPL user, I really don't
>>>> care what API uDAPL is using under the hood to move data from one
>>>> place to another, nor the quirks of that API.  The whole point of
>>>> uDAPL is to form a network-agnostic abstraction layer. AFAIK, the
>>>> uDAPL spec doesn't enforce any such requirement on RDMA
>>>> communication either.  In my opinion, exposing such behavior above
>>>> uDAPL is incorrect and is part of why uDAPL has seen limited
>>>> adoption -- every single uDAPL implementation behaves in different
>>>> ways, making it extremely difficult to write an application to work
>>>> on any uDAPL implementation.  Sorry if this sounds harsh, but this
>>>> comes from many hours of banging my head on the wall due to working
>>>> around these sorts of problems :)
>>>>
>>> 
>>> I understand your frustration.  I think the MPA protocol is
>>> deficient in this respect and should have required the necessary
>>> "first FPDU" to be sent under the covers by the RNICs: an RTR packet,
>>> if you will.  To resolve this issue "properly", in my opinion, would
>>> involve changing the IETF MPA spec and also breaking all the
>>> existing iWARP HW.  We can't do that.
>> 
>> Understood.
>> 
>>> The reason it is hard or impossible to solve this in the DAPL layer
>>> is that any rdma operation on the QP affects the state of that QP
>>> and the associated CQs.  In addition, if you use an RDMA send to
>>> enforce this you impact the other side by consuming a RECV buffer.
>>> So it's hard if not impossible to do this under the covers without
>>> affecting the application's resources.
>> 
>> Is there no way to do this before passing connection established
>> events to the uDAPL consumer?  I need to go read up on the uDAPL API
>> to really understand why this wouldn't work.
>> 
> 
> Perhaps DAPL or maybe even an OFA iWARP CM could defer
> passing up the "established" event on the passive side until
> an incoming SEND is detected.  I know we've discussed this
> before, but I'm not sure why this was not a workable
> solution.  Perhaps Caitlin or some iWARP folks can recall?
> 

That was what the RNIC-PI flag would have enabled. DAPL could
check for that flag in a transport/device independent way, and
delay the established event until it was safe to post (but no
longer than required: for IB and iWARP NICs that fenced the first
transmit, the Established event could be generated immediately).

So yes, the transport layer (OFA or DAPL) CAN hide this on 
the passive side.

But as you point out, that doesn't solve the problem of needing
the Send from the active side. Since the Consumer posts RECV
buffers *before* indicating whether the QP/EP will be used 
on the passive or active end, and there are no standard verbs
to jam a receive buffer to the head of an RQ, there is no way
to hide a send/recv exchange from the application layer.

The fact that it can't be made transparent on the active side
certainly diminishes the value of making it transparent on
the receive side. It's still a good idea, but I don't think 
it has percolated to the top of anyone's TODO list yet.
When it does, the RNIC-PI proposed flag is a simple capability
flag that is quite easy for any provider to statically set.
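
A rough sketch of how such a capability flag could gate the established
event on the passive side.  Everything here is an assumption made for
illustration: PassiveConnection, fences_first_transmit and the event
string are hypothetical and do not come from the RNIC-PI proposal, DAPL
or OFA code.

```python
class PassiveConnection:
    """Toy model of the passive side's connection-event handling."""

    def __init__(self, fences_first_transmit):
        # Hypothetical stand-in for the proposed capability flag: True if
        # the device itself holds back the first transmit until it is safe.
        self.fences_first_transmit = fences_first_transmit
        self.events = []  # events delivered to the consumer

    def on_transport_connected(self):
        # Devices that fence the first transmit can report "established"
        # immediately; others must wait for the initiator's first message.
        if self.fences_first_transmit:
            self.events.append("ESTABLISHED")

    def on_first_incoming_message(self):
        # Deliver the deferred event once, when it is safe to post sends.
        if "ESTABLISHED" not in self.events:
            self.events.append("ESTABLISHED")


fenced = PassiveConnection(fences_first_transmit=True)
fenced.on_transport_connected()

unfenced = PassiveConnection(fences_first_transmit=False)
unfenced.on_transport_connected()
before = list(unfenced.events)          # nothing delivered yet
unfenced.on_first_incoming_message()    # now the consumer is told
```

As noted above, this only hides the issue on the passive side; the
active side still has to send something first.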







Re: [OMPI devel] OMPI over ofed udapl - bugs opened

2007-05-09 Thread Andrew Friedley

Steve Wise wrote:
>> I hope you guys are documenting this in a way that makes this issue
>> extremely clear to both uDAPL and OFA verbs (is this the right naming?)
>> users.  Maybe it's been done already, but is it possible to emit some
>> sort of loud warning/error when the accept()'ing side tries to send
>> before a receive?
>
> The connection comes tumbling down.  How's that for loud? :)


works :)


> Seriously though, it isn't documented well enough.  But we're bleeding
> edge here. And I'm still hoping somebody will come up with an elegant
> solution that doesn't break interoperability, applications and/or iWARP
> hw (I'm a dreamer :).


Well, if documenting it once saves someone a headache and a few hours of 
their time, it's probably worth it.


Seems like everyone understands now what the problem is, that it sucks,
and it can't be fixed lower down the stack :)  Thanks for explaining,
Caitlin/Steve.  As Jeff wrote, dealing with it in the BTLs really won't
be that hard; it just makes things a little more complicated to maintain.


Andrew