Re: [OMPI devel] OMPI OpenIB Credit Schema breaks Chelsio HW

2008-03-10 Thread Jeff Squyres

On Mar 9, 2008, at 3:39 PM, Gleb Natapov wrote:

1. There was a discussion about this on openfabrics mailing list and the
conclusion was that what Open MPI does is correct according to IB/iWarp
spec.

2. Is it possible to fix your FW to follow iWarp spec? Perhaps it is
possible to implement ibv_post_recv() so that it will not return before
post receive is processed?

3. I personally don't like the idea to add another layer of complexity to openib
BTL code just to work around HW that doesn't follow spec. If work around
is simple that is OK, but in this case it is not so simple and will add
code path that is rarely tested. A simple workaround for the problem may
be to not configure multiple QPs if HW has a bug (and we can extend ini
file to contain this info).



These are all valid points.

In thinking about Gleb's proposal a bit more (extend the INI file  
syntax to accept per-HCA receive_queues values), it might be only  
somewhat less efficient (and a lot less code) than sending all flow  
control messages on the respective qp's anyway.  So let's explore the  
math...


The "let's use multiple QP's for short messages" scheme (a.k.a. BSRQ)  
was invented to get better registered memory utilization.  Pushing all  
the FC messages down to the QP with the smallest buffer size was a  
desirable side-effect that made registered memory utilization even  
better (because short FC messages were naturally on the QP with the  
smallest buffer size).  Specifically, today in openib/IB (SVN trunk),  
here's the default queue layout:


pp: 256 buffers of size 128
srq: 256 buffers of size 4k
srq: 256 buffers of size 12k (eager limit)
srq: 256 buffers of size 64k (max send size)

And then we add 4 more buffers on the pp qp for flow control messages  
(since we only currently send FC messages for pp qp's).  Total  
registered memory for a job with 1 remote peer: (256+4)*128 + 256*4k +  
256*12k + 256*64k = ~20M.  This is somewhat deceiving, because the  
total registered memory scales slowly with the number of procs in the  
job (e.g., with 2 remote peers, it only increases by 33k because we're
using srq's).


With Gleb's proposal, you'd only have one pp qp, presumably 64k (or
whatever the max send size is):


pp: 256 buffers of size 64k (max send size)

And then add 4 more for flow control messages.  So total registered  
memory for a job with 1 remote peer: (256+4)*64k = ~17M.  But that  
figure is approximately a per-peer cost -- so a job with 2 remote  
peers would use ~34M of registered memory, etc.  This will [obviously]  
scale extremely poorly (and is one of the reasons that BSRQ exists).
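
To make the scaling concrete, here is a small stand-alone C sketch of the
arithmetic above.  It only models the receive buffers with the default
counts/sizes quoted in this mail (plus the 4 extra FC buffers per pp qp)
and ignores everything else the BTL registers, so treat the numbers as
ballpark only:

/* Rough model of registered receive-buffer memory for the two layouts
 * discussed above.  Simplification: srq sizes are assumed not to grow
 * with the number of peers. */
#include <stdio.h>

int main(void)
{
    const long KB = 1024;

    for (int peers = 1; peers <= 16; peers *= 2) {
        /* BSRQ layout: one small pp qp per peer + three shared srq's */
        long bsrq = peers * (256 + 4) * 128      /* pp:  128-byte buffers */
                  + 256 * 4 * KB                 /* srq: 4k buffers       */
                  + 256 * 12 * KB                /* srq: 12k (eager)      */
                  + 256 * 64 * KB;               /* srq: 64k (max send)   */

        /* Single pp qp at max send size, replicated per peer */
        long single = peers * (256 + 4) * 64 * KB;

        printf("%2d peer(s): BSRQ ~%ldM, single 64k pp qp ~%ldM\n",
               peers, bsrq / (KB * KB), single / (KB * KB));
    }
    return 0;
}

At 1 peer this prints roughly 20M vs. 16M, matching the figures above; at
16 peers the single-QP layout is already around 260M of registered memory
while the BSRQ layout stays near 20M, which is exactly the scaling problem.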


However, I wonder if there's a compromise (assuming you can't fix  
ibv_post_recv() to not return until the buffers are actually  
available, which, I agree with Gleb, seems like the best fix).  Since  
we only use FC messages on pp qp's, why not make a "you can only have  
1 pp qp and it must be qp 0" restriction for the Chelsio RNIC?  This  
fits nicely into our default receive_queues value, anyway.  That way,  
all FC messages will naturally go over qp 0 anyway (since that will be  
the only pp qp).  Then, the only problem you have to solve is sending  
the *initial* credits message at wireup time (to know when the receive  
buffers have actually been posted to the srq's).  Perhaps something  
like this:


1. you can export an attribute from the RNIC that advertises that  
ibv_post_recv() works this way (so that OMPI can detect it at run time)


2. hide the extra wireup / initial credit coordination in the rdma cpc  
when this attribute is detected (or make an mca param / ini file param
that specifically requests this extra rdma cm cpc behavior, or not).
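
To illustrate #1: a run-time check on the OMPI side could be as small as
the sketch below.  Note that HCA_CAP_ASYNC_POST_RECV is a made-up
capability bit, not part of the real verbs API -- how the RNIC would
actually export such an attribute is exactly what would have to be agreed
on with the libibverbs / vendor-library folks.

/* Hypothetical sketch only: HCA_CAP_ASYNC_POST_RECV is NOT a real verbs
 * flag; it stands in for whatever attribute the RNIC would export. */
#include <infiniband/verbs.h>

#define HCA_CAP_ASYNC_POST_RECV (1 << 30)   /* placeholder bit */

static int needs_initial_credit_workaround(struct ibv_context *ctx)
{
    struct ibv_device_attr attr;

    if (ibv_query_device(ctx, &attr) != 0) {
        return 0;   /* can't tell; assume spec-compliant post_recv */
    }
    /* Flag set => ibv_post_recv() may return before the buffer is really
     * ready for placement, so the CPC must delay the initial credit
     * message until the recvs are known to be posted. */
    return (attr.device_cap_flags & HCA_CAP_ASYNC_POST_RECV) ? 1 : 0;
}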


What would make this proposal moot is if the Chelsio RNIC can't do  
SRQs (I don't remember offhand).  If it can't (and you can't fix  
ibv_post_recv()), then you might as well do Gleb's "just use one qp"  
proposal.  You'll get lousy registered memory utilization, but the  
bigger problem you'll have is the scalability issues for large-peer- 
count jobs (e.g., using the values above, 17M of registered memory per  
peer; I assume you'll have to tune that down via .ini file params).


What about that?

--
Jeff Squyres
Cisco Systems



[OMPI devel] cisco weekend mtt failures

2008-03-10 Thread Jeff Squyres
Oops -- my "delete old MTT stuff" script broke recently and allowed my  
disks to fill up over the weekend.  So there's a bunch of false  
failures in Cisco's MTT from this weekend (builds failed because of  
lack of disk space).


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] use of AC_CACHE_CHECK in otf

2008-03-10 Thread Matthias Jurenz
Fixed. Thanks for your hint, Ralf.

On Do, 2008-03-06 at 22:23 +0100, Ralf Wildenhues wrote:

> In ompi/contrib/vt/vt/extlib/otf/acinclude.m4, in the macros WITH_DEBUG
> and WITH_VERBOSE, dubious constructs such as
> 
> AC_CACHE_CHECK([debug],
> [debug],
> [debug=])
> 
> are used.  These have the following problems:
> 
> * Cache variables need to match *_cv_* in order to actually be saved
> (where the bit before _cv_ is preferably a package or author prefix,
> for namespace cleanliness; see
> the Autoconf manual section on cache variable names).
> The next Autoconf version will warn about this.
> 
> * There is little need to cache information that the user provided on
> the configure command line.  If configure is rerun by './config.status
> --recheck', it remembers the original configure command line.  Only if
> the user manually reruns configure (and keeps config.cache) does this
> make a difference.
> 
> So I suggest you remove those two instances of AC_CACHE_CHECK usage,
> or forward this information to the author of otf.
> 
> Thanks,
> Ralf
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 

--
Matthias Jurenz,
Center for Information Services and 
High Performance Computing (ZIH), TU Dresden, 
Willersbau A106, Zellescher Weg 12, 01062 Dresden
phone +49-351-463-31945, fax +49-351-463-37773




Re: [OMPI devel] OMPI OpenIB Credit Schema breaks Chelsio HW

2008-03-10 Thread Steve Wise

Gleb Natapov wrote:

On Sun, Mar 09, 2008 at 02:48:09PM -0500, Jon Mason wrote:
  

Issue (as described by Steve Wise):

Currently OMPI uses qp 0 for all credit updates (by design).  This breaks
when running over the chelsio rnic due to a race condition between
advertising the availability of a buffer using qp0 when the buffer was
posted on one of the other qps.  It is possible (and easily reproducible)
that the peer gets the advertisement and sends data into the qp in question
_before_ the rnic has processed the recv buffer and made it available for
placement.  This results in a connection termination.  BTW, other hca's
have this issue too.  ehca, for example, claims they have the same race
condition.  I think the timing hole is much smaller though for devices that
have 2 separate work queues for the SQ and RQ of a QP.  Chelsio has a
single work queue to implement both SQ and RQ, so processing of RQ work
requests gets queued up behind pending SQ entries which can make this race
condition more prevalent.


There was a discussion about this on openfabrics mailing list and the
conclusion was that what Open MPI does is correct according to IB/iWarp
spec.

  
Hey Gleb.  Yes, the conclusion was that the rdma device and driver should 
ensure this.  But also note that the ehca IB device also has this same 
race condition.  So I wonder if the other IB devices really do also have 
this race condition?  I think it is worse for the cxgb3 device due to 
its architecture (a single queue for both send and recv work requests).



I don't know of any way to avoid this issue other than to ensure that all
credit updates for qp X are posted only on qp X.  If we do this, then the
chelsio HW/FW ensures that the RECV is posted before the subsequent send
operation that advertises the buffer is processed.


Is it possible to fix your FW to follow iWarp spec? Perhaps it is
possible to implement ibv_post_recv() so that it will not return before
post receive is processed?

  
I've been trying to come up with a solution in the lib/driver/fw to enforce 
this behavior.  The only way I can see doing it is to follow the recv 
work requests with a 0B write work request, and spinning or blocking 
until the 0B write completes (note: 0B write doesn't emit anything on 
the wire for the cxgb3 device).  This will guarantee that the recv's are 
ready before returning from the libcxgb3 post_recv function.  However 
this is problematic because there can be real OMPI work completions in 
the CQ that need processing.  So I don't know how to do this in the 
driver/library. 

Also note, any such solution will entirely drain the SQ each time a recv 
is posted.  This will kill performance.
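
For reference, a bare-libibverbs sketch of that "post the recvs, then
fence with a 0B write" idea might look like the code below.  It is
illustrative only: it assumes the QP is already in RTS, that a zero-length
RDMA write needs no meaningful remote address on this device, and it
naively spins on the same CQ that real completions land on -- which is
exactly the problem described above.

/* Sketch of "post recvs, then flush with a 0-byte RDMA write".  Any real
 * completions reaped while waiting are simply dropped here; a real
 * implementation would have to stash and replay them. */
#include <string.h>
#include <infiniband/verbs.h>

static int post_recvs_and_flush(struct ibv_qp *qp, struct ibv_cq *cq,
                                struct ibv_recv_wr *recv_list)
{
    struct ibv_recv_wr *bad_recv;
    struct ibv_send_wr wr, *bad_send;
    struct ibv_wc wc;
    int n;

    if (ibv_post_recv(qp, recv_list, &bad_recv))
        return -1;

    /* 0B RDMA write: on cxgb3 the SQ and RQ share one work queue, so this
     * WR is queued behind the recvs just posted and its completion implies
     * the recvs have been processed by the HW. */
    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 0xf1f1f1f1;             /* marker for the flush WR */
    wr.opcode     = IBV_WR_RDMA_WRITE;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.num_sge    = 0;

    if (ibv_post_send(qp, &wr, &bad_send))
        return -1;

    /* Spin until the flush WR completes. */
    do {
        n = ibv_poll_cq(cq, 1, &wc);
        if (n < 0)
            return -1;
    } while (n == 0 || wc.wr_id != 0xf1f1f1f1);

    return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}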


(just thinking out loud here): The OMPI code could be designed to _not_
assume recv's are posted until the CPC indicates they are ready, i.e. sort
of asynchronous behavior.  When the recvs are ready, the CPC could
up-call the btl and then the credits could be updated.  This sounds 
painful though :)

To address this Jeff Squyres recommends:

1. make an mca parameter that governs this behavior (i.e., whether to send
all flow control messages on QP0 or on their respective QPs)

2. extend the ini file parsing code to accept this parameter as well (need
to add a strcmp or two)

3. extend the ini file to fill in this value for all the nic's listed (to
include yours).

4. extend the logic in the rest of the btl to send the flow control
messages either across qp0 or the respective qp, depending on the value of
the mca param / ini value.


I am happy to do the work to enable this, but I would like to get
everyone's feed back before I start down this path.  Jeff said Gleb did
the work to change openib to behave this way, so any insight would be
helpful.



I personally don't like the idea to add another layer of complexity to openib
BTL code just to work around HW that doesn't follow spec. If work around
is simple that is OK, but in this case it is not so simple and will add
code path that is rarely tested. A simple workaround for the problem may
be to not configure multiple QPs if HW has a bug (and we can extend ini
file to contain this info).

  


It doesn't sound too complex to implement the above design.  In fact, 
that's the way this btl used to work, no?  There are large customers 
requesting OMPI over cxgb3 and we're ready to provide the effort to get 
this done.  So I request we come to an agreement on how to support this 
device efficiently (and for ompi-1.3).


On the single-QP angle, can I just run OMPI specifying only 1 QP?
Or will that require coding changes?



Thanks!

Steve.




Re: [OMPI devel] OMPI OpenIB Credit Schema breaks Chelsio HW

2008-03-10 Thread Steve Wise

Jeff Squyres wrote:

On Mar 9, 2008, at 3:39 PM, Gleb Natapov wrote:

  
1. There was a discussion about this on openfabrics mailing list and the
conclusion was that what Open MPI does is correct according to IB/iWarp
spec.

2. Is it possible to fix your FW to follow iWarp spec? Perhaps it is
possible to implement ibv_post_recv() so that it will not return before
post receive is processed?

3. I personally don't like the idea to add another layer of complexity to openib
BTL code just to work around HW that doesn't follow spec. If work around
is simple that is OK, but in this case it is not so simple and will add
code path that is rarely tested. A simple workaround for the problem may
be to not configure multiple QPs if HW has a bug (and we can extend ini
file to contain this info).




These are all valid points.

In thinking about Gleb's proposal a bit more (extend the INI file  
syntax to accept per-HCA receive_queues values), it might be only  
somewhat less efficient (and a lot less code) than sending all flow  
control messages on the respective qp's anyway.  So let's explore the  
math...


The "let's use multiple QP's for short messages" scheme (a.k.a. BSRQ)  
was invented to get better registered memory utilization.  Pushing all  
the FC messages down to the QP with the smallest buffer size was a  
desirable side-effect that made registered memory utilization even  
better (because short FC messages were naturally on the QP with the  
smallest buffer size).  Specifically, today in openib/IB (SVN trunk),  
here's the default queue layout:


pp: 256 buffers of size 128
srq: 256 buffers of size 4k
srq: 256 buffers of size 12k (eager limit)
srq: 256 buffers of size 64k (max send size)

And then we add 4 more buffers on the pp qp for flow control messages  
(since we only currently send FC messages for pp qp's).  Total  
registered memory for a job with 1 remote peer: (256+4)*128 + 256*4k +  
256*12k + 256*64k = ~20M.  This is somewhat deceiving, because the  
total registered memory scales slowly with the number of procs in the  
job (e.g., with 2 remote peers, it only increases by 33k because we're
using srq's).


With Gleb's proposal, you'd only have one pp qp, presumably 64k (or
whatever the max send size is):


pp: 256 buffers of size 64k (max send size)

And then add 4 more for flow control messages.  So total registered  
memory for a job with 1 remote peer: (256+4)*64k = ~17M.  But that  
figure is approximately a per-peer cost -- so a job with 2 remote  
peers would use ~34M of registered memory, etc.  This will [obviously]  
scale extremely poorly (and is one of the reasons that BSRQ exists).


However, I wonder if there's a compromise (assuming you can't fix  
ibv_post_recv() to not return until the buffers are actually  
available, which, I agree with Gleb, seems like the best fix).  Since  
we only use FC messages on pp qp's, why not make a "you can only have  
1 pp qp and it must be qp 0" restriction for the Chelsio RNIC?  This  
fits nicely into our default receive_queues value, anyway.  That way,  
all FC messages will naturally go over qp 0 anyway (since that will be  
the only pp qp).  Then, the only problem you have to solve is sending  
the *initial* credits message at wireup time (to know when the receive  
buffers have actually been posted to the srq's).  Perhaps something  
like this:


1. you can export an attribute from the RNIC that advertises that  
ibv_post_recv() works this way (so that OMPI can detect it at run time)


2. hide the extra wireup / initial credit coordination in the rdma cpc  
when this attribute is detected (or make an mca param / ini file param
that specifically requests this extra rdma cm cpc behavior, or not).


What would make this proposal moot is if the Chelsio RNIC can't do  
SRQs (I don't remember offhand).  If it can't (and you can't fix  
ibv_post_recv()), then you might as well do Gleb's "just use one qp"  
proposal.  You'll get lousy registered memory utilization, but the  
bigger problem you'll have is the scalability issues for large-peer- 
count jobs (e.g., using the values above, 17M of registered memory per  
peer; I assume you'll have to tune that down via .ini file params).


What about that?

  

This gen of the chelsio rnic doesn't support SRQs.

I don't think we can fix post_recv to behave like we want.

A single PP QP might be fine for now, and chelsio's next-gen part will 
support SRQs and not have this funky issue.


But why use such a large buffer size for a single PP QP?  Why not use 
something around 16KB?



Steve.


Re: [OMPI devel] OMPI OpenIB Credit Schema breaks Chelsio HW

2008-03-10 Thread Jeff Squyres

On Mar 10, 2008, at 9:50 AM, Steve Wise wrote:

(just thinking out loud here): The OMPI code could be designed to _not_
assume recv's are posted until the CPC indicates they are ready, i.e. sort
of asynchronous behavior.  When the recvs are ready, the CPC could
up-call the btl and then the credits could be updated.  This sounds
painful though :)


That's the way it works, but only for the initial credits.  The CPC is  
not involved beyond that.


So it's likely that you'll still have this problem after initial
wireup for OMPI PP QP's (except, as I noted below, if we only allow
the chelsio rnic to have one PP QP and it has to be qp 0).



On the single-QP angle, can I just run OMPI specifying only 1 QP?
Or will that require coding changes?



No coding changes required; just change the value of  
mca_btl_openib_receive_queues.
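
For example, something along the lines of the command below (the exact
receive_queues field syntax -- sizes, counts, watermarks -- has changed
between OMPI versions, so check ompi_info for the build you are running):

  mpirun --mca btl openib,self \
         --mca btl_openib_receive_queues "P,65536,256" \
         -np 2 ./your_app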


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] OMPI OpenIB Credit Schema breaks Chelsio HW

2008-03-10 Thread Jeff Squyres

On Mar 10, 2008, at 9:57 AM, Steve Wise wrote:


A single PP QP might be fine for now, and chelsio's next-gen part will
support SRQs and not have this funky issue.


Good!


But why use such a large buffer size for a single PP QP?  Why not use
something around 16KB?



You can do that, but you'll also need to make the max_send_size be  
16kb (and therefore ob1 will switch to rendezvous protocol above that  
size).  See our paper on the long message protocol that OMPI uses --  
the minimum "large" message size was specifically designed to be kinda  
big so that we could do some send/recv to offset the registration  
penalty of pinning user's large buffers.
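
Concretely, that pairing would look something like the following (again,
the exact receive_queues field syntax varies by OMPI version):

  mpirun --mca btl_openib_receive_queues "P,16384,256" \
         --mca btl_openib_max_send_size 16384 \
         -np 2 ./your_app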


--
Jeff Squyres
Cisco Systems



[OMPI devel] orte\mca\smr

2008-03-10 Thread Leonardo Fialho

Hi all,

Where is the "old" orte\mca\smr? I don´t found it in orte/mca/plm...

--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478



Re: [OMPI devel] orte\mca\smr

2008-03-10 Thread Jeff Squyres
Yes, it all got consolidated down into plm.  We need to update the  
FAQ; the ORTE frameworks changed quite a bit in the recent ORTE merge...


Ralph's on vacation this week.  A detailed answer to your question may  
not occur until he returns...



On Mar 10, 2008, at 10:05 AM, Leonardo Fialho wrote:


Hi all,

Where is the "old" orte\mca\smr? I don´t found it in orte/mca/plm...

--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems




Re: [OMPI devel] orte\mca\smr

2008-03-10 Thread Leonardo Fialho

Hi Jeff,

I need to implement a heartbeat/watchdog monitoring system, and I'm looking
for the "best place" to put it without duplicating code.
I'll try to put it into the PLM for now, and when I get Ralph's response
I'll change it if necessary.


Jeff Squyres wrote:
Yes, it all got consolidated down into plm.  We need to update the  
FAQ; the ORTE frameworks changed quite a bit in the recent ORTE merge...


Ralph's on vacation this week.  A detailed answer to your question may  
not occur until he returns...



On Mar 10, 2008, at 10:05 AM, Leonardo Fialho wrote:

  

Hi all,

Where is the "old" orte\mca\smr? I don´t found it in orte/mca/plm...

--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




  



--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478



[OMPI devel] MATLAB interface

2008-03-10 Thread aguillen

Hello,

 I developed an interface to call MPI functions from deployed MATLAB
applications. It works for many MPI implementations and, fortunately,
OpenMPI is not an exception.

If you are interested in knowing more, I would be very pleased to help
the project.

Thank you for your great work.

Alberto.




Re: [OMPI devel] OMPI OpenIB Credit Schema breaks Chelsio HW

2008-03-10 Thread Gleb Natapov
On Mon, Mar 10, 2008 at 09:50:13AM -0500, Steve Wise wrote:
> > I personally don't like the idea to add another layer of complexity to 
> > openib
> > BTL code just to work around HW that doesn't follow spec. If work around
> > is simple that is OK, but in this case it is not so simple and will add
> > code path that is rarely tested. A simple workaround for the problem may
> > be to not configure multiple QPs if HW has a bug (and we can extend ini
> > file to contain this info).
> >
> >   
> 
> It doesn't sound too complex to implement the above design.  In fact, 
> that's the way this btl used to work, no?  There are large customers 
> requesting OMPI over cxgb3 and we're ready to provide the effort to get 
> this done.  So I request we come to an agreement on how to support this 
> device efficiently (and for ompi-1.3).
Yes. The btl used to work like that before. But the current way of doing
credit management requires much less memory, so I don't think reverting
to the old way is the right thing. And having two different ways of
sending credit updates seems like additional complexity. The problem is
not just with writing code, but this code will have to be maintained for
unknown period of time (will this problem be solved in your next gen HW?).
I am OK with adding old fc in addition to current fc if the code is trivial
and easy to maintain. Do you think it is possible to add what you want
and meet these requirements?

--
Gleb.


Re: [OMPI devel] OMPI OpenIB Credit Schema breaks Chelsio HW

2008-03-10 Thread Steve Wise

Gleb Natapov wrote:

On Mon, Mar 10, 2008 at 09:50:13AM -0500, Steve Wise wrote:
  

I personally don't like the idea to add another layer of complexity to openib
BTL code just to work around HW that doesn't follow spec. If work around
is simple that is OK, but in this case it is not so simple and will add
code path that is rarely tested. A simple workaround for the problem may
be to not configure multiple QPs if HW has a bug (and we can extend ini
file to contain this info).

  
  
It doesn't sound too complex to implement the above design.  In fact, 
that's the way this btl used to work, no?  There are large customers 
requesting OMPI over cxgb3 and we're ready to provide the effort to get 
this done.  So I request we come to an agreement on how to support this 
device efficiently (and for ompi-1.3).


Yes. The btl used to work like that before. But the current way of doing
credit management requires much less memory, so I don't think reverting
to the old way is the right thing. And having two different ways of
sending credit updates seems like additional complexity. The problem is
not just with writing code, but this code will have to be maintained for
unknown period of time (will this problem be solved in your next gen HW?).
  

Yes.

I am OK with adding old fc in addition to current fc if the code is trivial
and easy to maintain. Do you think it is possible to add what you want
and meet these requirements?
  

I hope so! :)

But I think we're going to end up using just a single PP QP for this 
version of the chelsio HW. We're exploring how that works now. The next 
gen rnic from chelsio will support SRQs and fix this post_recv issue, so 
we can then plug in properly with OMPI.


Steve.



--
Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
  




Re: [OMPI devel] OMPI OpenIB Credit Schema breaks Chelsio HW

2008-03-10 Thread Steve Wise




Jeff Squyres wrote:

  On Mar 10, 2008, at 9:57 AM, Steve Wise wrote:

  
  
A single PP QP might be fine for now, and chelsio's next-gen part will
support SRQs and not have this funky issue.

  
  
Good!

  
  
But why use such a large buffer size for a single PP QP?  Why not use
something around 16KB?

  
  

You can do that, but you'll also need to make the max_send_size be  
16kb (and therefore ob1 will switch to rendezvous protocol above that  
size).  See our paper on the long message protocol that OMPI uses --  
the minimum "large" message size was specifically designed to be kinda  
big so that we could do some send/recv to offset the registration  
penalty of pinning user's large buffers.

  


Does OMPI do lazy dereg to maintain a cache of registered user buffers?





Re: [OMPI devel] OMPI OpenIB Credit Schema breaks Chelsio HW

2008-03-10 Thread Jon Mason
On Mon, Mar 10, 2008 at 10:03:27AM -0500, Jeff Squyres wrote:
> On Mar 10, 2008, at 9:50 AM, Steve Wise wrote:
> 
> > (just thinking out loud here): The OMPI code could be designed to _not_
> > assume recv's are posted until the CPC indicates they are ready, i.e. sort
> > of asynchronous behavior.  When the recvs are ready, the CPC could
> > up-call the btl and then the credits could be updated.  This sounds
> > painful though :)
> 
> That's the way it works, but only for the initial credits.  The CPC is  
> not involved beyond that.
> 
> So it's likely that you'll still have this problem after initial  
> wireup for OMPI PP QP's (except, as I noted below, if we only allow
> the chelsio rnic to have one PP QP and it has to be qp 0).
> 
> > On the single-QP angle, can I just run OMPI specifying only 1 QP?
> > Or will that require coding changes?
> 
> 
> No coding changes required; just change the value of  
> mca_btl_openib_receive_queues.

Specifying only 1 PP QP via command line seems to be working.  It now
passes a test that failed 100% of the time with the credit issue on my
2-node cluster.  Further tests on a larger setup are still pending, but
this looks like a good workaround.

I think adding an additional field to the mca-btl-openib-hca-params.ini
file to have the 1 PP QP by default would be a good long(er) term
solution to this.  This way those adapters that have this deficiency can
specify it and should work "out of the box".
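
For example, such an entry in mca-btl-openib-hca-params.ini might look
roughly like the following (the section name, vendor id, and the exact
receive_queues syntax are illustrative only -- the receive_queues key
itself is the new field being proposed here, and the ini parser would
need to be extended to accept it as described earlier in this thread):

  [Chelsio T3]
  vendor_id = 0x1425
  receive_queues = P,65536,256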

Thoughts?

Thanks,
Jon

> 
> -- 
> Jeff Squyres
> Cisco Systems
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel