Re: [OMPI devel] Threaded progress for CPCs

2008-05-21 Thread Jeff Squyres

One more point that Pasha and I hashed out yesterday in IM...

To avoid the problem of posting a short handshake buffer to already- 
existing SRQs, we will only do the extra handshake if there are PPRQ's  
in receive_queues.  The handshake will go across the smallest PPRQ,  
and represent all QPs in receive_queues (even the SRQs).


If there are no PPRQ's in the receive_queues value, we'll just skip  
the handshake and rely on IB's SRQ RNR retransmitting to fix any race  
conditions.


One point that needs clarification: whether IBCM and RDMACM *require*  
posting receive buffers on the new QP's.  If so, this scheme will run  
into trouble because we do not want to post any buffers on SRQs; that  
gets racy and difficult to synchronize right (especially if multiple  
remote peers are simultaneously trying to connect to a single SRQ).   
I'll check this out today or tomorrow.
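
For reference, if IBCM/RDMACM do turn out to require a posted receive at QP-creation time, the CPC side would look roughly like the verbs sketch below. This is a minimal sketch only; the function and buffer names (post_handshake_recv, hs_buf, hs_mr) are illustrative and not the actual openib BTL symbols.

/* Illustrative sketch: post one small, pre-registered receive buffer on a
 * newly created QP during connection setup.  Names are hypothetical. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

static int post_handshake_recv(struct ibv_qp *qp, void *hs_buf,
                               size_t hs_len, struct ibv_mr *hs_mr)
{
    struct ibv_sge sge;
    struct ibv_recv_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t) hs_buf;   /* small buffer registered at endpoint creation */
    sge.length = (uint32_t) hs_len;
    sge.lkey   = hs_mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id   = (uintptr_t) hs_buf;   /* lets the CQ handler find the buffer again */
    wr.sg_list = &sge;
    wr.num_sge = 1;

    return ibv_post_recv(qp, &wr, &bad_wr);   /* 0 on success */
}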


We'll have to re-visit this when iWARP NICs start supporting SRQ, but  
if the above assumption is true (no need to post any receive buffers  
for IBCM and RDMACM), it will be good enough for v1.3.



On May 20, 2008, at 12:37 PM, Jeff Squyres wrote:


Ok, I think we're mostly converged on a solution.  This might not get
implemented immediately (got some other pending v1.3 stuff to bug fix,
etc.), but it'll happen for v1.3.

- endpoint creation will mpool alloc/register a small buffer for
handshake
- cpc does not need to call _post_recvs(); instead, it can just post
the single small buffer on each BSRQ QP (from the small buffer on the
endpoint)
- cpc will call _connected() (in the main thread, not the CPC progress
thread) when all BSRQ QPs are connected
  - if _post_recvs() was previously called, do the normal "finish
setting up" stuff and declare the endpoint CONNECTED
  - if _post_recvs() was not previously called, then:
- call _post_recvs()
- send a short CTS message on the 1st BSRQ QP
- wait for CTS from peer
- when both the CTS from the peer has arrived *and* we have sent our CTS,
declare endpoint CONNECTED

Doing it this way adds no overhead to OOB/XOOB (who don't need this
extra handshake).  I think the code can be factored nicely to make
this not too complicated.
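
For concreteness, here is a rough C sketch of the state logic in the list above.  Every name in it (endpoint_t, its fields, post_recvs(), send_cts(), the STATE_* values) is made up for illustration and does not correspond to the real openib BTL code.

/* Hypothetical sketch of the CTS logic; all names are illustrative. */
enum { STATE_CONNECTING, STATE_CONNECTED };

typedef struct {
    int recvs_posted;   /* did the CPC already call _post_recvs()? (OOB/XOOB path) */
    int cts_sent;
    int cts_received;
    int state;
} endpoint_t;

static void post_recvs(endpoint_t *ep) { (void)ep; /* post real buffers on all BSRQ QPs */ }
static void send_cts(endpoint_t *ep)   { (void)ep; /* short CTS on the smallest BSRQ PPRQ */ }

static void maybe_declare_connected(endpoint_t *ep)
{
    if (ep->cts_sent && ep->cts_received) {
        ep->state = STATE_CONNECTED;   /* pending sends may now flow */
    }
}

/* Called in the main thread once the CPC reports all BSRQ QPs wired up. */
void endpoint_connected(endpoint_t *ep)
{
    if (ep->recvs_posted) {            /* buffers already posted: no extra handshake */
        ep->state = STATE_CONNECTED;
        return;
    }
    post_recvs(ep);
    send_cts(ep);
    ep->cts_sent = 1;
    maybe_declare_connected(ep);
}

/* Called from the receive path when the peer's CTS arrives. */
void endpoint_cts_received(endpoint_t *ep)
{
    ep->cts_received = 1;
    maybe_declare_connected(ep);
}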

I'll work on this once I figure out the memory corruption I'm seeing
in the receive_queues patch...

Note that this addresses the wireup multi-threading issues -- not
iWarp SRQ issues. We'll tackle those separately, and possibly not for
the initial v1.3.0 release.


On May 20, 2008, at 6:02 AM, Gleb Natapov wrote:


On Mon, May 19, 2008 at 01:38:53PM -0400, Jeff Squyres wrote:

5. ...?

What about moving posting of receive buffers into main thread. With
SRQ it is easy: don't post anything in CPC thread. Main thread will
prepost buffers automatically after first fragment received on the
endpoint (in btl_openib_handle_incoming()). With PPRQ it's more
complicated. What if we'll prepost dummy buffers (not from free list)
during IBCM connection stage and will run another three way handshake
protocol using those buffers, but from the main thread. We will need to
prepost one buffer on the active side and two buffers on the passive side.



This is probably the most viable alternative -- it would be easiest
if
we did this for all CPC's, not just for IBCM:

- for PPRQ: CPCs only post a small number of receive buffers,
suitable
for another handshake that will run in the upper-level openib BTL
- for SRQ: CPCs don't post anything (because the SRQ already
"belongs"
to the upper level openib BTL)

Do we have a BSRQ restriction that there *must* be at least one  
PPRQ?

No. We don't have such restriction and I wouldn't want to add it.


If so, we could always run the upper-level openib BTL really-post-the-
buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e.,
have the CPC post a single receive on this QP -- see below), which
would make things much easier.  If we don't already have this
restriction, would we mind adding it?  We have one PPRQ in our default
receive_queues value, anyway.

If there is no PPRQ then we can rely on RNR/retransmit logic in case
there are not enough buffers in the SRQ. We do that anyway in the
openib BTL code.



With this rationale, once the CPC says "ok, all BSRQ QP's are
connected", then _endpoint.c can run a CTS handshake to post the
"real" buffers, where each side does the following:

- CPC calls _endpoint_connected() to tell the upper level BTL that it
is fully connected (the function is invoked in the main thread)
- _endpoint_connected() posts all the "real" buffers to all the BSRQ
QP's on the endpoint
- _endpoint_connected() then sends a CTS control message to remote
peer via smallest RC PPRQ
- upon receipt of CTS:
 - release the buffer (***)
 - set endpoint state of CONNECTED and let all pending messages
flow... (as it happens today)

So it actually doesn't even have to be a handshake -- it's just an
additional CTS sent over the newly-created RC QP.  Since it's RC, we
don't have to do much -- just wait for the CTS to know that the remote
side has actually posted all the receives that we expect it to have.

Re: [OMPI devel] Threaded progress for CPCs

2008-05-20 Thread Jeff Squyres
Ok, I think we're mostly converged on a solution.  This might not get  
implemented immediately (got some other pending v1.3 stuff to bug fix,  
etc.), but it'll happen for v1.3.


- endpoint creation will mpool alloc/register a small buffer for  
handshake
- cpc does not need to call _post_recvs(); instead, it can just post
the single small buffer on each BSRQ QP (from the small buffer on the  
endpoint)
- cpc will call _connected() (in the main thread, not the CPC progress  
thread) when all BSRQ QPs are connected
  - if _post_recvs() was previously called, do the normal "finish  
setting up" stuff and declare the endpoint CONNECTED

  - if _post_recvs() was not previously called, then:
- call _post_recvs()
- send a short CTS message on the 1st BSRQ QP
- wait for CTS from peer
- when both the CTS from the peer has arrived *and* we have sent our CTS,
declare endpoint CONNECTED


Doing it this way adds no overhead to OOB/XOOB (who don't need this  
extra handshake).  I think the code can be factored nicely to make  
this not too complicated.


I'll work on this once I figure out the memory corruption I'm seeing  
in the receive_queues patch...


Note that this addresses the wireup multi-threading issues -- not  
iWarp SRQ issues. We'll tackle those separately, and possibly not for  
the initial v1.3.0 release.



On May 20, 2008, at 6:02 AM, Gleb Natapov wrote:


On Mon, May 19, 2008 at 01:38:53PM -0400, Jeff Squyres wrote:

5. ...?

What about moving posting of receive buffers into main thread. With
SRQ it is easy: don't post anything in CPC thread. Main thread will
prepost buffers automatically after first fragment received on the
endpoint (in btl_openib_handle_incoming()). With PPRQ it's more
complicated. What if we'll prepost dummy buffers (not from free list)
during IBCM connection stage and will run another three way handshake
protocol using those buffers, but from the main thread. We will need to
prepost one buffer on the active side and two buffers on the passive side.



This is probably the most viable alternative -- it would be easiest if
we did this for all CPC's, not just for IBCM:

- for PPRQ: CPCs only post a small number of receive buffers, suitable
for another handshake that will run in the upper-level openib BTL
- for SRQ: CPCs don't post anything (because the SRQ already "belongs"
to the upper level openib BTL)

Do we have a BSRQ restriction that there *must* be at least one PPRQ?

No. We don't have such restriction and I wouldn't want to add it.

If so, we could always run the upper-level openib BTL really-post-the-
buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e.,
have the CPC post a single receive on this QP -- see below), which
would make things much easier.  If we don't already have this
restriction, would we mind adding it?  We have one PPRQ in our default
receive_queues value, anyway.

If there is no PPRQ then we can rely on RNR/retransmit logic in case
there are not enough buffers in the SRQ. We do that anyway in the
openib BTL code.




With this rationale, once the CPC says "ok, all BSRQ QP's are
connected", then _endpoint.c can run a CTS handshake to post the
"real" buffers, where each side does the following:

- CPC calls _endpoint_connected() to tell the upper level BTL that it
is fully connected (the function is invoked in the main thread)
- _endpoint_connected() posts all the "real" buffers to all the BSRQ
QP's on the endpoint
- _endpoint_connected() then sends a CTS control message to remote
peer via smallest RC PPRQ
- upon receipt of CTS:
  - release the buffer (***)
  - set endpoint state of CONNECTED and let all pending messages
flow... (as it happens today)

So it actually doesn't even have to be a handshake -- it's just an
additional CTS sent over the newly-created RC QP.  Since it's RC, we
don't have to do much -- just wait for the CTS to know that the remote
side has actually posted all the receives that we expect it to have.
Since the CTS flows over a PPRQ, there's no issue about receiving the
CTS on an SRQ (because the SRQ may not have any buffers posted at any
given time).

Correct. A full handshake is not needed. The trick is to allocate those
initial buffers in a smart way. IMO the initial buffer should be very
small (a couple of bytes only) and be preallocated on endpoint creation.
This will solve the locking problem.
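
A minimal sketch of what "preallocated on endpoint creation" could look like at the verbs level, assuming plain malloc plus ibv_reg_mr rather than the BTL's mpool; the names and the 8-byte size are illustrative:

/* Sketch only: allocate and register one tiny handshake buffer per endpoint.
 * The real code would draw from the BTL mpool; names/sizes are illustrative. */
#include <stdlib.h>
#include <infiniband/verbs.h>

#define HANDSHAKE_BUF_SIZE 8          /* "a couple of bytes"; exact size is arbitrary */

struct handshake_buf {
    void          *base;
    struct ibv_mr *mr;
};

static int alloc_handshake_buf(struct ibv_pd *pd, struct handshake_buf *hb)
{
    hb->base = malloc(HANDSHAKE_BUF_SIZE);
    if (NULL == hb->base) {
        return -1;
    }
    hb->mr = ibv_reg_mr(pd, hb->base, HANDSHAKE_BUF_SIZE, IBV_ACCESS_LOCAL_WRITE);
    if (NULL == hb->mr) {
        free(hb->base);
        return -1;
    }
    return 0;   /* the CPC can later hand hb->base / hb->mr->lkey to ibv_post_recv() */
}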

--
Gleb.



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Threaded progress for CPCs

2008-05-20 Thread Pavel Shamis (Pasha)


Is it possible to have sane SRQ implementation without HW flow  
control?



It seems pretty unlikely if the only available HW flow control is to  
terminate the connection.  ;-)


  

Even if we can get the iWARP semantics to work, this feels kinda
icky.  Perhaps I'm overreacting and this isn't a problem that needs to
be fixed -- after all, this situation is no different than what
happens after the initial connection, but it still feels icky.

What is so icky about it? Sender is faster than a receiver so flow
control kicks in.



My point is that we have no real flow control for SRQ.

  

2. The CM progress thread posts its own receive buffers when creating
a QP (which is a necessary step in both CMs).  However, this is
problematic in two cases:

  

[skip]

I don't like 1,2 and 3. :(



4. Have a separate mpool for drawing initial receive buffers for the
CM-posted RQs.  We'd probably want this mpool to be always empty (or
close to empty) -- it's ok to be slow to allocate / register more
memory when a new connection request arrives.  The memory obtained
from this mpool should be able to be returned to the "main" mpool
after it is consumed.
  

This is slightly better, but still...



Agreed; my reactions were pretty much the same as yours.

  

5. ...?
  

What about moving posting of receive buffers into main thread. With
SRQ it is easy: don't post anything in CPC thread. Main thread will
prepost buffers automatically after first fragment received on the
endpoint (in btl_openib_handle_incoming()). With PPRQ it's more
complicated. What if we'll prepost dummy buffers (not from free list)
during IBCM connection stage and will run another three way handshake
protocol using those buffers, but from the main thread. We will need  
to
prepost one buffer on the active side and two buffers on the passive  
side.




This is probably the most viable alternative -- it would be easiest if  
we did this for all CPC's, not just for IBCM:


- for PPRQ: CPCs only post a small number of receive buffers, suitable  
for another handshake that will run in the upper-level openib BTL
- for SRQ: CPCs don't post anything (because the SRQ already "belongs"  
to the upper level openib BTL)
  
Currently iWARP does not have SRQ at all, and IMHO SRQ is not
possible without HW flow control.

So let's resolve the problem only for PPRQ?

Do we have a BSRQ restriction that there *must* be at least one PPRQ?   
  

No, there is no such restriction.
If so, we could always run the upper-level openib BTL really-post-the- 
buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e.,  
have the CPC post a single receive on this QP -- see below), which  
would make things much easier.  If we don't already have this  
restriction, would we mind adding it?  We have one PPRQ in our default  
receive_queues value, anyway.
  

I don't see a reason to add such a restriction, at least for IB.
We may add it for iWARP only (actually we already have it for iWARP).



Re: [OMPI devel] Threaded progress for CPCs

2008-05-20 Thread Gleb Natapov
On Mon, May 19, 2008 at 01:38:53PM -0400, Jeff Squyres wrote:
> >> 5. ...?
> > What about moving posting of receive buffers into main thread. With
> > SRQ it is easy: don't post anything in CPC thread. Main thread will
> > prepost buffers automatically after first fragment received on the
> > endpoint (in btl_openib_handle_incoming()). With PPRQ it's more
> > complicated. What if we'll prepost dummy buffers (not from free list)
> > during IBCM connection stage and will run another three way handshake
> > protocol using those buffers, but from the main thread. We will need  
> > to
> > prepost one buffer on the active side and two buffers on the passive  
> > side.
> 
> 
> This is probably the most viable alternative -- it would be easiest if  
> we did this for all CPC's, not just for IBCM:
> 
> - for PPRQ: CPCs only post a small number of receive buffers, suitable  
> for another handshake that will run in the upper-level openib BTL
> - for SRQ: CPCs don't post anything (because the SRQ already "belongs"  
> to the upper level openib BTL)
> 
> Do we have a BSRQ restriction that there *must* be at least one PPRQ?   
No. We don't have such a restriction, and I wouldn't want to add it.

> If so, we could always run the upper-level openib BTL really-post-the- 
> buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e.,  
> have the CPC post a single receive on this QP -- see below), which  
> would make things much easier.  If we don't already have this  
> restriction, would we mind adding it?  We have one PPRQ in our default  
> receive_queues value, anyway.
If there is no PPRQ then we can rely on RNR/retransmit logic in case
there are not enough buffers in the SRQ. We do that anyway in the openib BTL code.

> 
> With this rationale, once the CPC says "ok, all BSRQ QP's are  
> connected", then _endpoint.c can run a CTS handshake to post the  
> "real" buffers, where each side does the following:
> 
> - CPC calls _endpoint_connected() to tell the upper level BTL that it  
> is fully connected (the function is invoked in the main thread)
> - _endpoint_connected() posts all the "real" buffers to all the BSRQ  
> QP's on the endpoint
> - _endpoint_connected() then sends a CTS control message to remote  
> peer via smallest RC PPRQ
> - upon receipt of CTS:
>- release the buffer (***)
>- set endpoint state of CONNECTED and let all pending messages  
> flow... (as it happens today)
> 
> So it actually doesn't even have to be a handshake -- it's just an  
> additional CTS sent over the newly-created RC QP.  Since it's RC, we  
> don't have to do much -- just wait for the CTS to know that the remote  
> side has actually posted all the receives that we expect it to have.   
> Since the CTS flows over a PPRQ, there's no issue about receiving the  
> CTS on an SRQ (because the SRQ may not have any buffers posted at any  
> given time).
Correct. A full handshake is not needed. The trick is to allocate those
initial buffers in a smart way. IMO the initial buffer should be very
small (a couple of bytes only) and be preallocated on endpoint creation.
This will solve the locking problem.

--
Gleb.


Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Steve Wise

Jeff Squyres wrote:

On May 19, 2008, at 4:44 PM, Steve Wise wrote:

  

1. Posting more at low watermark can lead to DoS-like behavior when
you have a fast sender and a slow receiver.  This is exactly the
resource-exhaustion kind of behavior that a high quality MPI
implementation is supposed to avoid -- we really should throttle
the sender somehow.

2. Resending ad infinitum simply eats up more bandwidth and takes away
network resources (e.g., switch resources) from other, legitimate
traffic.  Particularly if the receiver doesn't dip into the MPI layer
for many hours.  So yes, it *works*, but it's definitely sub-optimal.


  
The SRQ low water mark is simply an API method to allow applications to
try and never hit the "we're totally out of recv bufs" problem.  That's a
tool that I think is needed for SRQ users no matter what flow control
method you use to try and avoid Jeff's #1 item above.



If you had these buffers available, why didn't you post them when the  
QP was created / this sender was added?


  
Because you're trying to reduce memory requirements at the expense of 
under-provisioning the SRQ.  If you don't want the transport to drop and 
retransmit, then you might want an algorithm to increase the low water 
mark during bursty periods.
This mechanism *might* make sense if there was a sensible approach to  
know when to remove the "additional" buffers posted to an SRQ due to  
bursty traffic.  But how do you know when that is?


  


Thinking out loud: 
   - keep the SRQ up to the low water mark as a normal course of events
   - increase the low water mark value as you get more and more "low 
water mark exceeded" events

   - decrease the low water mark as these events become less frequent.

Dunno if this is worth the effort.
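
For readers following along: the low-watermark mechanism being discussed maps onto the verbs SRQ limit, armed with ibv_modify_srq() and reported as IBV_EVENT_SRQ_LIMIT_REACHED on the async event queue.  The sketch below only shows that plumbing; repost_srq_buffers() and the watermark-doubling policy are hypothetical, not a recommendation.

/* Sketch of the SRQ low-watermark plumbing; the repost helper and the
 * adaptive policy are illustrative placeholders. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

static void repost_srq_buffers(struct ibv_srq *srq)
{
    (void)srq;   /* hypothetical: top the SRQ back up from the free list */
}

static int arm_srq_limit(struct ibv_srq *srq, uint32_t watermark)
{
    struct ibv_srq_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.srq_limit = watermark;
    /* The limit event is one-shot: it disarms after firing and must be re-armed. */
    return ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
}

static void async_event_loop(struct ibv_context *ctx, struct ibv_srq *srq,
                             uint32_t watermark)
{
    struct ibv_async_event ev;

    while (0 == ibv_get_async_event(ctx, &ev)) {
        if (IBV_EVENT_SRQ_LIMIT_REACHED == ev.event_type) {
            repost_srq_buffers(srq);
            watermark *= 2;            /* toy "bursty period" policy from the text above */
            arm_srq_limit(srq, watermark);
        }
        ibv_ack_async_event(&ev);
    }
}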



And if you don't like the RNR retry/TCP retrans approach, which is bad for
reason #2 (and because TCP will eventually give up and reset the
connection), then I think there needs to be some OMPI layer protocol to
stop senders that are abusing the SRQ pool for whatever reason (too fast
of a sender, sleeping beauty receiver never entering the OMPI layer,
whatever).




That implies a progress thread.  If/when we add a progress thread, it
will likely be for progressing long messages.  Myricom and MVAPICH
have shown that rapidly firing progress threads are problematic for
performance.  But even if you have that progress thread *only* wake up
on the low watermark for the SRQ, you have two problems:


- there still could be many inbound messages that will overflow the  
SRQ and/or even more could be inbound by the time your STOP message  
gets to everyone (gets even worse as the MPI job scales up in total  
number of processes)


- in the case of a very large MPI job, sending the STOP message has  
obvious scalability problems (have to send it to everyone, which  
requires its own set of send buffers and WQEs/CQEs)


  

Ok, STOP messages won't scale...dumb idea.




Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Jeff Squyres

On May 19, 2008, at 4:44 PM, Steve Wise wrote:


1. Posting more at low watermark can lead to DoS-like behavior when
you have a fast sender and a slow receiver.  This is exactly the
resource-exhaustion kind of behavior that a high quality MPI
implementation is supposed to avoid -- we really should throttle
the sender somehow.

2. Resending ad infinitum simply eats up more bandwidth and takes away
network resources (e.g., switch resources) from other, legitimate
traffic.  Particularly if the receiver doesn't dip into the MPI layer
for many hours.  So yes, it *works*, but it's definitely sub-optimal.


The SRQ low water mark is simply an API method to allow applications to
try and never hit the "we're totally out of recv bufs" problem.  That's a
tool that I think is needed for SRQ users no matter what flow control
method you use to try and avoid Jeff's #1 item above.


If you had these buffers available, why didn't you post them when the  
QP was created / this sender was added?


This mechanism *might* make sense if there was a sensible approach to  
know when to remove the "additional" buffers posted to an SRQ due to  
bursty traffic.  But how do you know when that is?



And if you don't like the RNR retry/TCP retrans approach, which is bad for
reason #2 (and because TCP will eventually give up and reset the
connection), then I think there needs to be some OMPI layer protocol to
stop senders that are abusing the SRQ pool for whatever reason (too fast
of a sender, sleeping beauty receiver never entering the OMPI layer,
whatever).



That implies a progress thread.  If/when we add a progress thread, it
will likely be for progressing long messages.  Myricom and MVAPICH
have shown that rapidly firing progress threads are problematic for
performance.  But even if you have that progress thread *only* wake up
on the low watermark for the SRQ, you have two problems:


- there still could be many inbound messages that will overflow the  
SRQ and/or even more could be inbound by the time your STOP message  
gets to everyone (gets even worse as the MPI job scales up in total  
number of processes)


- in the case of a very large MPI job, sending the STOP message has  
obvious scalability problems (have to send it to everyone, which  
requires its own set of send buffers and WQEs/CQEs)


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Steve Wise

Jeff Squyres wrote:

On May 19, 2008, at 3:40 PM, Jon Mason wrote:

  
iWARP needs preposted recv buffers (or it will drop the connection).  So
this isn't a good option.

I was talking about SRQ only. You said above that iwarp does
retransmit for SRQ.
openib BTL relies on HW retransmit when using SRQ, so if iwarp doesn't do it
reliably enough it can not be used with SRQ anyway.
  
How iWARP adapters behave with respect to SRQ retransmit is 100% HW  
dependent.



It was my understanding that it's at least the same as how TCP handles  
a dropped packet.  The HW may do better than that.


  
The HW can queue some of the receives internally or use the HW TCP
stack to have it retransmit.  Of course, this is a BAD thing to do.
The SRQ "low-water marker" event is the best way to handle these cases.



I disagree.  I even think that the IB-retry-forever approach is bad.   
Here's why:


1. Posting more at low watermark can lead to DoS-like behavior when
you have a fast sender and a slow receiver.  This is exactly the
resource-exhaustion kind of behavior that a high quality MPI
implementation is supposed to avoid -- we really should throttle
the sender somehow.

2. Resending ad infinitum simply eats up more bandwidth and takes away
network resources (e.g., switch resources) from other, legitimate
traffic.  Particularly if the receiver doesn't dip into the MPI layer
for many hours.  So yes, it *works*, but it's definitely sub-optimal.


  
The SRQ low water mark is simply an API method to allow applications to
try and never hit the "we're totally out of recv bufs" problem.  That's a
tool that I think is needed for SRQ users no matter what flow control
method you use to try and avoid Jeff's #1 item above.


And if you don't like the RNR retry/TCP retrans approach, which is bad for
reason #2 (and because TCP will eventually give up and reset the
connection), then I think there needs to be some OMPI layer protocol to
stop senders that are abusing the SRQ pool for whatever reason (too fast
of a sender, sleeping beauty receiver never entering the OMPI layer, whatever).


my 1/2 cent...


Steve.




Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Jeff Squyres

On May 19, 2008, at 3:40 PM, Jon Mason wrote:

iWARP needs preposted recv buffers (or it will drop the connection).  So
this isn't a good option.
I was talking about SRQ only. You said above that iwarp does
retransmit for SRQ.
openib BTL relies on HW retransmit when using SRQ, so if iwarp doesn't do it
reliably enough it can not be used with SRQ anyway.


How iWARP adapters behave with respect to SRQ retransmit is 100% HW  
dependent.


It was my understanding that it's at least the same as how TCP handles  
a dropped packet.  The HW may do better than that.


The HW can queue some of the receives internally or use the HW TCP
stack to have it retransmit.  Of course, this is a BAD thing to do.
The SRQ "low-water marker" event is the best way to handle these cases.



I disagree.  I even think that the IB-retry-forever approach is bad.   
Here's why:


1. Posting more at low watermark can lead to DoS-like behavior when
you have a fast sender and a slow receiver.  This is exactly the
resource-exhaustion kind of behavior that a high quality MPI
implementation is supposed to avoid -- we really should throttle
the sender somehow.

2. Resending ad infinitum simply eats up more bandwidth and takes away
network resources (e.g., switch resources) from other, legitimate
traffic.  Particularly if the receiver doesn't dip into the MPI layer
for many hours.  So yes, it *works*, but it's definitely sub-optimal.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Jon Mason
On Mon, May 19, 2008 at 10:12:19PM +0300, Gleb Natapov wrote:
> On Mon, May 19, 2008 at 01:52:22PM -0500, Jon Mason wrote:
> > On Mon, May 19, 2008 at 05:17:57PM +0300, Gleb Natapov wrote:
> > > On Mon, May 19, 2008 at 05:08:17PM +0300, Pavel Shamis (Pasha) wrote:
> > > > >> 5. ...?
> > > > >> 
> > > > > What about moving posting of receive buffers into main thread. With
> > > > > SRQ it is easy: don't post anything in CPC thread. Main thread will
> > > > > prepost buffers automatically after first fragment received on the
> > > > > endpoint (in btl_openib_handle_incoming()). 
> > > > It still doesn't guaranty that we will not see RNR (as I understand we 
> > > > trying to resolve this problem  for iwarp?!)
> > > > 
> > > I don't think that iwarp has SRQ at all. And if it has then it should
> > 
> > While Chelsio does not currently have an adapter that has SRQs, there are
> > some other iWARP vendors that do have them.
> > 
> > > have HW flow control for it too. I don't see what advantage SRQ without
> > > flow control can provide over PPRQ.
> > 
> > Technically, this is not flow control, it is a retransmit.  iWARP can use
> > the HW TCP stack to retransmit, but it will not have the "retransmit
> > forever" ability that setting rnr_retry to 7 has for IB.
> For how long will it try to retransmit before dropping connection.
> 
> > 
> > > > So this solution will cost 1 buffer on each srq ... sounds acceptable 
> > > > for me. But I don't see too much
> > > > difference compared to #1, as I understand  we anyway will be need the 
> > > > pipe for communication with main thread.
> > > > so why don't use #1 ?
> > > What communication? No communication at all. Just don't prepost buffers
> > > to SRQ during connection establishment. Problem solved (only for SRQ of
> > > cause).
> > 
> > iWARP needs preposted recv buffers (or it will drop the connection).  So
> > this isn't a good option.
> I was talking about SRQ only. You said above that iwarp does retransmit for 
> SRQ.
> openib BTL relies on HW retransmit when using SRQ, so if iwarp doesn't do it
> reliably enough it can not be used with SRQ anyway.

How iWARP adapters behave with respect to SRQ retransmit is 100% HW dependent.
The HW can queue some of the receives internally or use the HW TCP stack to have
it retransmit.  Of course, this is a BAD thing to do.  The SRQ "low-water
marker" event is the best way to handle these cases.

Thanks,
Jon

> 
> --
>   Gleb.


Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Gleb Natapov
On Mon, May 19, 2008 at 01:52:22PM -0500, Jon Mason wrote:
> On Mon, May 19, 2008 at 05:17:57PM +0300, Gleb Natapov wrote:
> > On Mon, May 19, 2008 at 05:08:17PM +0300, Pavel Shamis (Pasha) wrote:
> > > >> 5. ...?
> > > >> 
> > > > What about moving posting of receive buffers into main thread. With
> > > > SRQ it is easy: don't post anything in CPC thread. Main thread will
> > > > prepost buffers automatically after first fragment received on the
> > > > endpoint (in btl_openib_handle_incoming()). 
> > > It still doesn't guaranty that we will not see RNR (as I understand we 
> > > trying to resolve this problem  for iwarp?!)
> > > 
> > I don't think that iwarp has SRQ at all. And if it has then it should
> 
> While Chelsio does not currently have an adapter that has SRQs, there are
> some other iWARP vendors that do have them.
> 
> > have HW flow control for it too. I don't see what advantage SRQ without
> > flow control can provide over PPRQ.
> 
> Technically, this is not flow control, it is a retransmit.  iWARP can use
> the HW TCP stack to retransmit, but it will not have the "retransmit
> forever" ability that setting rnr_retry to 7 has for IB.
For how long will it try to retransmit before dropping the connection?

> 
> > > So this solution will cost 1 buffer on each srq ... sounds acceptable 
> > > for me. But I don't see too much
> > > difference compared to #1, as I understand  we anyway will be need the 
> > > pipe for communication with main thread.
> > > so why don't use #1 ?
> > What communication? No communication at all. Just don't prepost buffers
> > to SRQ during connection establishment. Problem solved (only for SRQ of
> > cause).
> 
> iWARP needs preposted recv buffers (or it will drop the connection).  So
> this isn't a good option.
I was talking about SRQ only. You said above that iwarp does retransmit for SRQ.
openib BTL relies on HW retransmit when using SRQ, so if iwarp doesn't do it
reliably enough it can not be used with SRQ anyway.

--
Gleb.


Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Jon Mason
On Mon, May 19, 2008 at 01:38:53PM -0400, Jeff Squyres wrote:
> On May 19, 2008, at 8:25 AM, Gleb Natapov wrote:
> 
> > Is it possible to have sane SRQ implementation without HW flow  
> > control?
> 
> It seems pretty unlikely if the only available HW flow control is to  
> terminate the connection.  ;-)
> 
> >> Even if we can get the iWARP semantics to work, this feels kinda
> >> icky.  Perhaps I'm overreacting and this isn't a problem that needs  
> >> to
> >> be fixed -- after all, this situation is no different than what
> >> happens after the initial connection, but it still feels icky.
> > What is so icky about it? Sender is faster than a receiver so flow  
> > control
> > kicks in.
> 
> My point is that we have no real flow control for SRQ.
> 
> >> 2. The CM progress thread posts its own receive buffers when creating
> >> a QP (which is a necessary step in both CMs).  However, this is
> >> problematic in two cases:
> >>
> > [skip]
> >
> > I don't like 1,2 and 3. :(
> >
> >> 4. Have a separate mpool for drawing initial receive buffers for the
> >> CM-posted RQs.  We'd probably want this mpool to be always empty (or
> >> close to empty) -- it's ok to be slow to allocate / register more
> >> memory when a new connection request arrives.  The memory obtained
> >> from this mpool should be able to be returned to the "main" mpool
> >> after it is consumed.
> >
> > This is slightly better, but still...
> 
> Agreed; my reactions were pretty much the same as yours.
> 
> >> 5. ...?
> > What about moving posting of receive buffers into main thread. With
> > SRQ it is easy: don't post anything in CPC thread. Main thread will
> > prepost buffers automatically after first fragment received on the
> > endpoint (in btl_openib_handle_incoming()). With PPRQ it's more
> > complicated. What if we'll prepost dummy buffers (not from free list)
> > during IBCM connection stage and will run another three way handshake
> > protocol using those buffers, but from the main thread. We will need  
> > to
> > prepost one buffer on the active side and two buffers on the passive  
> > side.
> 
> 
> This is probably the most viable alternative -- it would be easiest if  
> we did this for all CPC's, not just for IBCM:
> 
> - for PPRQ: CPCs only post a small number of receive buffers, suitable  
> for another handshake that will run in the upper-level openib BTL
> - for SRQ: CPCs don't post anything (because the SRQ already "belongs"  
> to the upper level openib BTL)
> 
> Do we have a BSRQ restriction that there *must* be at least one PPRQ?   
> If so, we could always run the upper-level openib BTL really-post-the- 
> buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e.,  
> have the CPC post a single receive on this QP -- see below), which  
> would make things much easier.  If we don't already have this  
> restriction, would we mind adding it?  We have one PPRQ in our default  
> receive_queues value, anyway.
> 
> With this rationale, once the CPC says "ok, all BSRQ QP's are  
> connected", then _endpoint.c can run a CTS handshake to post the  
> "real" buffers, where each side does the following:
> 
> - CPC calls _endpoint_connected() to tell the upper level BTL that it  
> is fully connected (the function is invoked in the main thread)
> - _endpoint_connected() posts all the "real" buffers to all the BSRQ  
> QP's on the endpoint
> - _endpoint_connected() then sends a CTS control message to remote  
> peer via smallest RC PPRQ
> - upon receipt of CTS:
>- release the buffer (***)
>- set endpoint state of CONNECTED and let all pending messages  
> flow... (as it happens today)
> 
> So it actually doesn't even have to be a handshake -- it's just an  
> additional CTS sent over the newly-created RC QP.  Since it's RC, we  
> don't have to do much -- just wait for the CTS to know that the remote  
> side has actually posted all the receives that we expect it to have.   
> Since the CTS flows over a PPRQ, there's no issue about receiving the  
> CTS on an SRQ (because the SRQ may not have any buffers posted at any  
> given time).
> 
> (***) The CTS can even be a zero byte message (maybe with inline data  
> if we need it?); we're just waiting for the *first* message to arrive  
> on the smallest BSRQ PPQP.  Here's a dumb question (because I've never  
> tried it and am on a plane where I can't try it) -- can you post a 0  
> byte buffer (or NULL) for a receive?  This would make returning the  
> buffer to the CPC much easier (i.e., you won't have to) because the  
> CPC [thread] will post the receive, but the upper level openib BTL  
> [main thread] will actually receive it.
> 
> We still have to solve what happens with iWARP on SRQ's, but that's  
> really a different issue.  I don't know if the iWARP vendors have  
> thought about this much yet...?

I like the idea of the cpc only posting enough buffers to handle its
connection setup.  This sounds the most optimal for RDMACM, and there
can even be HW 

Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Jon Mason
On Mon, May 19, 2008 at 05:17:57PM +0300, Gleb Natapov wrote:
> On Mon, May 19, 2008 at 05:08:17PM +0300, Pavel Shamis (Pasha) wrote:
> > >> 5. ...?
> > >> 
> > > What about moving posting of receive buffers into main thread. With
> > > SRQ it is easy: don't post anything in CPC thread. Main thread will
> > > prepost buffers automatically after first fragment received on the
> > > endpoint (in btl_openib_handle_incoming()). 
> > It still doesn't guaranty that we will not see RNR (as I understand we 
> > trying to resolve this problem  for iwarp?!)
> > 
> I don't think that iwarp has SRQ at all. And if it has then it should

While Chelsio does not currently have an adapter that has SRQs, there are
some other iWARP vendors that do have them.

> have HW flow control for it too. I don't see what advantage SRQ without
> flow control can provide over PPRQ.

Technically, this is not flow control, it is a retransmit.  iWARP can use
the HW TCP stack to retransmit, but it will not have the "retransmit
forever" ability that setting rnr_retry to 7 has for IB.

> > So this solution will cost 1 buffer on each srq ... sounds acceptable 
> > for me. But I don't see too much
> > difference compared to #1, as I understand  we anyway will be need the 
> > pipe for communication with main thread.
> > so why don't use #1 ?
> What communication? No communication at all. Just don't prepost buffers
> to SRQ during connection establishment. Problem solved (only for SRQ of
> cause).

iWARP needs preposted recv buffers (or it will drop the connection).  So
this isn't a good option.

Thanks,
Jon

> 
> --
>   Gleb.


Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Jeff Squyres

On May 19, 2008, at 8:25 AM, Gleb Natapov wrote:

Is it possible to have sane SRQ implementation without HW flow  
control?


It seems pretty unlikely if the only available HW flow control is to  
terminate the connection.  ;-)



Even if we can get the iWARP semantics to work, this feels kinda
icky.  Perhaps I'm overreacting and this isn't a problem that needs to
be fixed -- after all, this situation is no different than what
happens after the initial connection, but it still feels icky.

What is so icky about it? Sender is faster than a receiver so flow
control kicks in.


My point is that we have no real flow control for SRQ.


2. The CM progress thread posts its own receive buffers when creating
a QP (which is a necessary step in both CMs).  However, this is
problematic in two cases:


[skip]

I don't like 1,2 and 3. :(


4. Have a separate mpool for drawing initial receive buffers for the
CM-posted RQs.  We'd probably want this mpool to be always empty (or
close to empty) -- it's ok to be slow to allocate / register more
memory when a new connection request arrives.  The memory obtained
from this mpool should be able to be returned to the "main" mpool
after it is consumed.


This is slightly better, but still...


Agreed; my reactions were pretty much the same as yours.


5. ...?

What about moving posting of receive buffers into main thread. With
SRQ it is easy: don't post anything in CPC thread. Main thread will
prepost buffers automatically after first fragment received on the
endpoint (in btl_openib_handle_incoming()). With PPRQ it's more
complicated. What if we'll prepost dummy buffers (not from free list)
during IBCM connection stage and will run another three way handshake
protocol using those buffers, but from the main thread. We will need  
to
prepost one buffer on the active side and two buffers on the passive  
side.



This is probably the most viable alternative -- it would be easiest if  
we did this for all CPC's, not just for IBCM:


- for PPRQ: CPCs only post a small number of receive buffers, suitable  
for another handshake that will run in the upper-level openib BTL
- for SRQ: CPCs don't post anything (because the SRQ already "belongs"  
to the upper level openib BTL)


Do we have a BSRQ restriction that there *must* be at least one PPRQ?   
If so, we could always run the upper-level openib BTL really-post-the- 
buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e.,  
have the CPC post a single receive on this QP -- see below), which  
would make things much easier.  If we don't already have this  
restriction, would we mind adding it?  We have one PPRQ in our default  
receive_queues value, anyway.


With this rationale, once the CPC says "ok, all BSRQ QP's are  
connected", then _endpoint.c can run a CTS handshake to post the  
"real" buffers, where each side does the following:


- CPC calls _endpoint_connected() to tell the upper level BTL that it  
is fully connected (the function is invoked in the main thread)
- _endpoint_connected() posts all the "real" buffers to all the BSRQ  
QP's on the endpoint
- _endpoint_connected() then sends a CTS control message to remote  
peer via smallest RC PPRQ

- upon receipt of CTS:
  - release the buffer (***)
  - set endpoint state of CONNECTED and let all pending messages  
flow... (as it happens today)


So it actually doesn't even have to be a handshake -- it's just an  
additional CTS sent over the newly-created RC QP.  Since it's RC, we  
don't have to do much -- just wait for the CTS to know that the remote  
side has actually posted all the receives that we expect it to have.   
Since the CTS flows over a PPRQ, there's no issue about receiving the  
CTS on an SRQ (because the SRQ may not have any buffers posted at any  
given time).


(***) The CTS can even be a zero byte message (maybe with inline data  
if we need it?); we're just waiting for the *first* message to arrive  
on the smallest BSRQ PPQP.  Here's a dumb question (because I've never  
tried it and am on a plane where I can't try it) -- can you post a 0  
byte buffer (or NULL) for a receive?  This would make returning the  
buffer to the CPC much easier (i.e., you won't have to) because the  
CPC [thread] will post the receive, but the upper level openib BTL  
[main thread] will actually receive it.
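
As an experiment, the zero-byte receive in question would be posted roughly as below.  Whether the trick is acceptable (and portable across HCAs) is exactly the open question in the paragraph above, so treat this as a test case rather than settled usage.

/* Experimental sketch: a receive work request with no scatter/gather entry.
 * Only a zero-length (or immediate-only) send should complete into it. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

static int post_zero_byte_recv(struct ibv_qp *qp, uint64_t wr_id)
{
    struct ibv_recv_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id   = wr_id;      /* tag so the main thread can recognize the CTS completion */
    wr.sg_list = NULL;
    wr.num_sge = 0;          /* no buffer to return to the CPC afterwards */

    return ibv_post_recv(qp, &wr, &bad_wr);
}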


We still have to solve what happens with iWARP on SRQ's, but that's  
really a different issue.  I don't know if the iWARP vendors have  
thought about this much yet...?


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Gleb Natapov
On Mon, May 19, 2008 at 07:39:13PM +0300, Pavel Shamis (Pasha) wrote:
>>>> So this solution will cost 1 buffer on each srq ... sounds
>>>> acceptable for me. But I don't see too much
>>>> difference compared to #1, as I understand we anyway will be need
>>>> the pipe for communication with main thread.
>>>> so why don't use #1 ?
>>> What communication? No communication at all. Just don't prepost buffers
>>> to SRQ during connection establishment. Problem solved (only for SRQ of
>>> cause).
> As i know Jeff use the pipe for some status update (Jeff, please correct
> me if I wrong).
> If we still need pipe for communication , I prefer #1.
> If we don't have the pipe , I prefer your solution
>
The pipe will still be there. The pipe itself is not the problem. The
problem is that currently initial post_receives are done in the CPC
thread. post_receives involves access to some data structures that are
used in the main thread too (free lists, mpool, SRQ) so it has to be
either protected or eliminated. I think that eliminating it is a better
solution for now. For SRQ case it is also easy to do. PPRQ is more
complicated but IMHO possible.
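
To make the "protected" alternative concrete: every prepost path shared between the CPC thread and the main thread (free-list dequeue, mpool accounting, the SRQ post) would need something like the lock below, which is exactly the heavyweight-locking cost this work is trying to avoid.  This is a sketch only; the names are illustrative.

/* Sketch of the rejected "protect it" option; the lock guards the BTL's
 * free-list/mpool bookkeeping around the post, not verbs itself. */
#include <pthread.h>
#include <infiniband/verbs.h>

static pthread_mutex_t prepost_lock = PTHREAD_MUTEX_INITIALIZER;

static int locked_srq_prepost(struct ibv_srq *srq, struct ibv_recv_wr *wr_list)
{
    struct ibv_recv_wr *bad_wr = NULL;
    int rc;

    pthread_mutex_lock(&prepost_lock);     /* CPC thread vs. main thread */
    /* ... dequeue fragments from the shared free list here ... */
    rc = ibv_post_srq_recv(srq, wr_list, &bad_wr);
    pthread_mutex_unlock(&prepost_lock);
    return rc;
}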

--
Gleb.


Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Pavel Shamis (Pasha)




What about moving posting of receive buffers into main thread. With
SRQ it is easy: don't post anything in CPC thread. Main thread will
prepost buffers automatically after first fragment received on the
endpoint (in btl_openib_handle_incoming()).   
It still doesn't guaranty that we will not see RNR (as I understand 
we trying to resolve this problem  for iwarp?!)




I don't think that iwarp has SRQ at all. And if it has then it should
have HW flow control for it too. I don't see what advantage SRQ without
flow control can provide over PPRQ.   

I agree that without HW flow control there is no reason for SRQ.
 
So this solution will cost 1 buffer on each srq ... sounds 
acceptable for me. But I don't see too much
difference compared to #1, as I understand  we anyway will be need 
the pipe for communication with main thread.

so why don't use #1 ?


What communication? No communication at all. Just don't prepost buffers
to SRQ during connection establishment. Problem solved (only for SRQ of
cause).  
As far as I know, Jeff uses the pipe for some status updates (Jeff, please
correct me if I am wrong).

If we still need the pipe for communication, I prefer #1.
If we don't have the pipe, I prefer your solution.

Pasha



Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Gleb Natapov
On Mon, May 19, 2008 at 05:08:17PM +0300, Pavel Shamis (Pasha) wrote:
> >> 5. ...?
> >> 
> > What about moving posting of receive buffers into main thread. With
> > SRQ it is easy: don't post anything in CPC thread. Main thread will
> > prepost buffers automatically after first fragment received on the
> > endpoint (in btl_openib_handle_incoming()). 
> It still doesn't guaranty that we will not see RNR (as I understand we 
> trying to resolve this problem  for iwarp?!)
> 
I don't think that iwarp has SRQ at all. And if it has then it should
have HW flow control for it too. I don't see what advantage SRQ without
flow control can provide over PPRQ.

> So this solution will cost 1 buffer on each srq ... sounds acceptable 
> for me. But I don't see too much
> difference compared to #1, as I understand  we anyway will be need the 
> pipe for communication with main thread.
> so why don't use #1 ?
What communication? No communication at all. Just don't prepost buffers
to SRQ during connection establishment. Problem solved (only for SRQ of
course).

--
Gleb.


Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Pavel Shamis (Pasha)


1. When CM progress thread completes an incoming connection, it sends  
a command down a pipe to the main thread indicating that a new  
endpoint is ready to use.  The pipe message will be noticed by  
opal_progress() in the main thread and will run a function to do all  
necessary housekeeping (sets the endpoint state to CONNECTED, etc.).   
But it is possible that the receiver process won't dip into the MPI  
layer for a long time (and therefore not call opal_progress and the  
housekeeping function).  Therefore, it is possible that with an active  
sender and a slow receiver, the sender can overwhelm an SRQ.  On IB,  
this will just generate RNRs and be ok (we configure SRQs to have  
infinite RNRs), but I don't understand the semantics of what will  
happen on iWARP (it may terminate?  I sent an off-list question to  
Steve Wise to ask for detail -- we may have other issues with SRQ on  
iWARP if this is the case, but let's skip that discussion for now).




Is it possible to have sane SRQ implementation without HW flow control?
Anyway the described problem exists with SRQ right now too. If receiver
doesn't enter progress for a long time sender can overwhelm an SRQ.
I don't see how this can be fixed without progress thread (and I am not
even sure that this is the problem that has to be fixed).
  
It may be partially resolved by srq_limit_event (this event is
generated when the number of posted receive buffers drops below a
predefined watermark). But I'm not sure that we want to move the RNR
problem from the sender side to the receiver.


The full solution will be progress thread + srq_limit_event.

  
Even if we can get the iWARP semantics to work, this feels kinda  
icky.  Perhaps I'm overreacting and this isn't a problem that needs to  
be fixed -- after all, this situation is no different than what  
happens after the initial connection, but it still feels icky.


What is so icky about it? Sender is faster than a receiver so flow control
kicks in.

  
2. The CM progress thread posts its own receive buffers when creating  
a QP (which is a necessary step in both CMs).  However, this is  
problematic in two cases:




[skip]
 
I don't like 1,2 and 3. :(
  

If iWARP can handle RNR, #1 sounds OK to me, at least for 1.3.
  

4. Have a separate mpool for drawing initial receive buffers for the
CM-posted RQs.  We'd probably want this mpool to be always empty (or
close to empty) -- it's ok to be slow to allocate / register more
memory when a new connection request arrives.  The memory obtained
from this mpool should be able to be returned to the "main" mpool
after it is consumed.



This is slightly better, but still...

  

5. ...?


What about moving posting of receive buffers into main thread. With
SRQ it is easy: don't post anything in CPC thread. Main thread will
prepost buffers automatically after first fragment received on the
endpoint (in btl_openib_handle_incoming()). 
It still doesn't guarantee that we will not see RNR (as I understand we
are trying to resolve this problem for iwarp?!)


So this solution will cost 1 buffer on each SRQ ... sounds acceptable
to me. But I don't see much difference compared to #1; as I understand
it, we will need the pipe for communication with the main thread anyway.

So why not use #1?

With PPRQ it's more
complicated. What if we'll prepost dummy buffers (not from free list)
during IBCM connection stage and will run another three way handshake
protocol using those buffers, but from the main thread. We will need to
prepost one buffer on the active side and two buffers on the passive side.

--
Gleb.

  




Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Gleb Natapov
On Sun, May 18, 2008 at 11:38:36AM -0400, Jeff Squyres wrote:
> ==> Remember that the goal for this work was to have a separate  
> progress thread *without* all the heavyweight OMPI thread locks.   
> Specifically: make it work in a build without --enable-progress- 
> threads or --enable-mpi-threads (we did some preliminary testing with  
> that stuff enabled and it had a big performance impact).
> 
> 1. When CM progress thread completes an incoming connection, it sends  
> a command down a pipe to the main thread indicating that a new  
> endpoint is ready to use.  The pipe message will be noticed by  
> opal_progress() in the main thread and will run a function to do all  
> necessary housekeeping (sets the endpoint state to CONNECTED, etc.).   
> But it is possible that the receiver process won't dip into the MPI  
> layer for a long time (and therefore not call opal_progress and the  
> housekeeping function).  Therefore, it is possible that with an active  
> sender and a slow receiver, the sender can overwhelm an SRQ.  On IB,  
> this will just generate RNRs and be ok (we configure SRQs to have  
> infinite RNRs), but I don't understand the semantics of what will  
> happen on iWARP (it may terminate?  I sent an off-list question to  
> Steve Wise to ask for detail -- we may have other issues with SRQ on  
> iWARP if this is the case, but let's skip that discussion for now).
> 
Is it possible to have a sane SRQ implementation without HW flow control?
Anyway, the described problem exists with SRQ right now too. If the receiver
doesn't enter progress for a long time, the sender can overwhelm an SRQ.
I don't see how this can be fixed without a progress thread (and I am not
even sure that this is a problem that has to be fixed).

> Even if we can get the iWARP semantics to work, this feels kinda  
> icky.  Perhaps I'm overreacting and this isn't a problem that needs to  
> be fixed -- after all, this situation is no different than what  
> happens after the initial connection, but it still feels icky.
What is so icky about it? Sender is faster than a receiver so flow control
kicks in.

> 
> 2. The CM progress thread posts its own receive buffers when creating  
> a QP (which is a necessary step in both CMs).  However, this is  
> problematic in two cases:
> 
[skip]

I don't like 1,2 and 3. :(

> 4. Have a separate mpool for drawing initial receive buffers for the
> CM-posted RQs.  We'd probably want this mpool to be always empty (or
> close to empty) -- it's ok to be slow to allocate / register more
> memory when a new connection request arrives.  The memory obtained
> from this mpool should be able to be returned to the "main" mpool
> after it is consumed.

This is slightly better, but still...

> 5. ...?
What about moving posting of receive buffers into main thread. With
SRQ it is easy: don't post anything in CPC thread. Main thread will
prepost buffers automatically after first fragment received on the
endpoint (in btl_openib_handle_incoming()). With PPRQ it's more
complicated. What if we'll prepost dummy buffers (not from free list)
during IBCM connection stage and will run another three way handshake
protocol using those buffers, but from the main thread. We will need to
prepost one buffer on the active side and two buffers on the passive side.
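
A rough sketch of the SRQ half of this proposal is below.  The per-endpoint flag, the helper, and the simplified signature are all illustrative; the real function is btl_openib_handle_incoming(), whose actual arguments and bookkeeping differ.

/* Illustrative only: defer SRQ preposting to the main thread's receive path. */
typedef struct {
    int srq_buffers_posted;                 /* per-endpoint (or per-SRQ) flag */
    /* ... rest of the endpoint ... */
} endpoint_t;

static void prepost_srq_buffers(endpoint_t *ep)
{
    (void)ep;   /* hypothetical: post receives drawn from the free list / mpool */
}

static void handle_incoming(endpoint_t *ep /* , fragment, qp index, ... */)
{
    if (!ep->srq_buffers_posted) {
        /* Main-thread context: safe to touch free lists, mpool, and the SRQ. */
        prepost_srq_buffers(ep);
        ep->srq_buffers_posted = 1;
    }
    /* ... normal fragment processing continues ... */
}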

--
Gleb.