Re: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-14 Thread William Jordan
On 6/14/05, James Lentini <[EMAIL PROTECTED]> wrote:
> 
> Sounds like I need to understand the difference between the
> ib_cm_req_param's retry_count and max_cm_retries fields. We set the
> former to 0 and the latter to 4.

The retry_count is the number of retries you want to configure for
data on your connection once your connection is established. The
max_cm_retries field is how many times you want the cm to retry
establishing the connection.
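
To make the distinction concrete, here is a minimal sketch. The struct is a hypothetical stand-in that only borrows the field names mentioned in this thread; it is not the real ib_cm_req_param layout, and per_attempt_usec is an invented field. It assumes the CM sends the REQ once and then re-sends it up to max_cm_retries times at a fixed interval, which matches the "timeout * 5 (since there are 4 retries)" description given elsewhere in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields discussed here; NOT the real
 * ib_cm_req_param definition. */
struct req_param_sketch {
    uint8_t  retry_count;       /* data-path retries once the connection is up */
    uint8_t  max_cm_retries;    /* times the CM re-sends the REQ during setup */
    uint64_t per_attempt_usec;  /* assumed wait per REQ attempt (invented field) */
};

/* Worst-case time spent establishing the connection: the initial REQ plus
 * max_cm_retries re-sends. retry_count plays no part in this. */
static uint64_t worst_case_setup_usec(const struct req_param_sketch *p)
{
    assert(p != NULL);
    return ((uint64_t)p->max_cm_retries + 1) * p->per_attempt_usec;
}
```

With retry_count = 0, max_cm_retries = 4 and a 1 second wait per attempt, the worst case is 5 seconds, and changing retry_count does not alter it.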

-- 
Bill Jordan
SilverStorm Technologies
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-14 Thread James Lentini



On Tue, 14 Jun 2005, Hal Rosenstock wrote:


On Mon, 2005-06-13 at 18:33, James Lentini wrote:

On Mon, 13 Jun 2005, Hal Rosenstock wrote:

halr> On Wed, 2005-06-08 at 17:53, James Lentini wrote:
halr> > On Wed, 8 Jun 2005, Hal Rosenstock wrote:
halr> >
halr> > halr> On Wed, 2005-06-08 at 11:44, James Lentini wrote:
halr> > halr> > We interpreted the above to mean "give the connection protocol as
halr> > halr> > much time as it needs to establish a connection, but don't mask
halr> > halr> > errors (no path to the remote node, etc.)". For that reason we changed
halr> > halr> > the variable name to DAT_TIMEOUT_MAX.
halr> > halr>
halr> > halr> But if the REQ is lost, the timeout is really really long (longer than
halr> > halr> most will wait for an error).
halr> >
halr> > If a user doesn't want to wait DAT_TIMEOUT_MAX time, it can pass a
halr> > smaller amount of time to dat_ep_connect. Does this satisfy your
halr> > requirements?
halr>
halr> Is it intended that the only way out is via user intervention (e.g.
halr> ctl-C) ? If one connection attempt (REQ) is made and it is lost, then
halr> there is no chance of it completing and the user needs to intervene.

Why does the user need to intervene? Did I misunderstand the CM
API?

When dapl_ep_connect() is called with a timeout value of
DAT_TIMEOUT_MAX, DAPL passes ib_send_cm_req the value 0x1F in the
ib_cm_req_param structure's remote_cm_response_timeout field. My
understanding was that this is the maximum timeout and that once it
expires the CM will inform the user that the REQ timed out.


Yes but it is a long time (4.096 * 2 ^ 31 usec ~ 8796 sec ~ 146.60 min
(if my calcs are correct)). This is longer than (most) users would wait.
They would usually hit ctl-C before this timeout is reached.


Understood. As long as it is not infinite we've made a step in the 
right direction. I like your ideas below on how to improve this 
further.



halr> If that is the intended behavior, we are there. (This (lost REQ)
halr> can even occur when the timeout is non infinite too).

We didn't intend for the active side to wait forever if a REQ was
lost.


The active side has no way of knowing that the REQ was lost (other than
timeout/retry) and when the timeout is long, this is effectively the
case.


This behavior is ok. The DAT consumer should choose a timeout value that
makes sense; it doesn't need to use DAT_TIMEOUT_MAX (and probably
shouldn't in most cases). We should update our dapltest program to use
a smaller value (like 1 min).



halr> An alternative (as Sean suggested) is to continually retry (at a
halr> periodicity below the supplied timeout) until the time period specified
halr> expires. That seems to be better (at least to me and Sean) in terms of
halr> handling the lost REQ case. As retries is not part of the API for
halr> connect, I would presume the implementor is free to do what they want under
halr> the covers of dapl_ib_connect.

You're correct.


The current implementation is:
1. address resolution phase for some amount of time
followed by:
2. dapl_ib_connect timeout * 5 (since there are 4 retries)


Sounds like I need to understand the difference between the 
ib_cm_req_param's retry_count and max_cm_retries fields. We set the 
former to 0 and the latter to 4.



A better algorithm would be to divide down the timeout by some number of
retries (which would vary based on the timeout requested) and have the
number of retries vary based on the total timeout requested.


I agree that would be better. As you point out, we should also account 
for the address resolution time. I know that no one is working on 
this. Are you interested?




-- Hal




RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-14 Thread Hal Rosenstock
On Tue, 2005-06-14 at 10:00, Talpey, Thomas wrote:
> At 09:49 AM 6/14/2005, Hal Rosenstock wrote:
> >Are you proposing that the number of retries be set to 0 then
> >(regardless of the timeout requested) ?
> 
> All I am suggesting is that the number of retries is not something the
> consumer can or should be specifying. Whatever the appropriate
> number is, is something for the transport to choose. It's an
> internal detail.

Yes, I was proposing that this be calculated internally based on
the requested timeout. Sorry if that was not clear.

> >The CM is not using exponential backoff.
> 
> Okay, though I would suggest it should. In any case, iWARP (TCP)
> does, and that's important to bear in mind.

It could easily be made to do this. What do others think about this ?

-- Hal



RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-14 Thread Talpey, Thomas
At 09:49 AM 6/14/2005, Hal Rosenstock wrote:
>Are you proposing that the number of retries be set to 0 then
>(regardless of the timeout requested) ?

All I am suggesting is that the number of retries is not something the
consumer can or should be specifying. Whatever the appropriate
number is, is something for the transport to choose. It's an
internal detail.

>The CM is not using exponential backoff.

Okay, though I would suggest it should. In any case, iWARP (TCP)
does, and that's important to bear in mind.

Tom.


RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-14 Thread Hal Rosenstock
On Tue, 2005-06-14 at 09:36, Talpey, Thomas wrote:
> At 08:41 AM 6/14/2005, Hal Rosenstock wrote:
> >The current implementation is:
> >1. address resolution phase for some amount of time 
> >followed by:
> >2. dapl_ib_connect timeout * 5 (since there are 4 retries)
> >
> >A better algorithm would be to divide down the timeout by some number of
> >retries (which would vary based on the timeout requested) and have the
> >number of retries vary based on the total timeout requested.
> 
> Why is address resolution exempt from the timeout? If the caller
> wants a timeout, it should be independent of low-level link resolution.
> Socket connect()s don't care about ARP, for example.

I was just stating the way the algorithm is right now. The address
resolution phase can be included in the calculation but this complicates
things a little. 

I thought it was previously said that the timeout can be approximate.
Also, the CM timeouts are approximate and not precise either.

> I don't like the idea of retry counts because there is no deterministic
> length of time that they will take. 

Are you proposing that the number of retries be set to 0 then
(regardless of the timeout requested) ?

> Exponential backoff could drive
> even a few retries to many minutes. Of course, if an IB provider
> can guarantee that N retries will be performed in M seconds, then
> okay, but not in general.

The CM is not using exponential backoff.

-- Hal



RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-14 Thread Talpey, Thomas
At 08:41 AM 6/14/2005, Hal Rosenstock wrote:
>The current implementation is:
>1. address resolution phase for some amount of time 
>followed by:
>2. dapl_ib_connect timeout * 5 (since there are 4 retries)
>
>A better algorithm would be to divide down the timeout by some number of
>retries (which would vary based on the timeout requested) and have the
>number of retries vary based on the total timeout requested.

Why is address resolution exempt from the timeout? If the caller
wants a timeout, it should be independent of low-level link resolution.
Socket connect()s don't care about ARP, for example.

I don't like the idea of retry counts because there is no deterministic
length of time that they will take. Exponential backoff could drive
even a few retries to many minutes. Of course, if an IB provider
can guarantee that N retries will be performed in M seconds, then
okay, but not in general.
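
A rough illustration of this concern, as a sketch only: the two helpers below compare the total wait for a fixed retry interval against a doubling interval. The doubling policy and the function names are assumptions made for illustration (the replies in this thread note that the IB CM does not currently back off), and the arithmetic ignores overflow for very large inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Total wait for one initial attempt plus `retries` re-sends when every
 * attempt waits the same amount of time. */
static uint64_t total_fixed_usec(uint64_t first_usec, unsigned retries)
{
    return first_usec * ((uint64_t)retries + 1);
}

/* Same, but each wait is double the previous one:
 * first * (1 + 2 + ... + 2^retries) = first * (2^(retries+1) - 1). */
static uint64_t total_backoff_usec(uint64_t first_usec, unsigned retries)
{
    assert(retries < 32);  /* keep the shift small for this sketch */
    return first_usec * (((uint64_t)1 << (retries + 1)) - 1);
}
```

For a first wait of 3 seconds and 6 retries, the fixed interval totals 21 seconds, while doubling totals 381 seconds (a little over 6 minutes).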

Tom.


RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-14 Thread Hal Rosenstock
On Mon, 2005-06-13 at 18:33, James Lentini wrote:
> On Mon, 13 Jun 2005, Hal Rosenstock wrote:
> 
> halr> On Wed, 2005-06-08 at 17:53, James Lentini wrote: 
> halr> > On Wed, 8 Jun 2005, Hal Rosenstock wrote:
> halr> > 
> halr> > halr> On Wed, 2005-06-08 at 11:44, James Lentini wrote:
> halr> > halr> > We interpreted the above to mean "give the connection protocol as
> halr> > halr> > much time as it needs to establish a connection, but don't mask
> halr> > halr> > errors (no path to the remote node, etc.)". For that reason we changed
> halr> > halr> > the variable name to DAT_TIMEOUT_MAX.
> halr> > halr>
> halr> > halr> But if the REQ is lost, the timeout is really really long (longer than
> halr> > halr> most will wait for an error).
> halr> > 
> halr> > If a user doesn't want to wait DAT_TIMEOUT_MAX time, it can pass a 
> halr> > smaller amount of time to dat_ep_connect. Does this satisfy your 
> halr> > requirements?
> halr> 
> halr> Is it intended that the only way out is via user intervention (e.g.
> halr> ctl-C) ? If one connection attempt (REQ) is made and it is lost, then
> halr> there is no chance of it completing and the user needs to intervene.
>
> Why does the user need to intervene? Did I misunderstand the CM
> API?
>
> When dapl_ep_connect() is called with a timeout value of
> DAT_TIMEOUT_MAX, DAPL passes ib_send_cm_req the value 0x1F in the
> ib_cm_req_param structure's remote_cm_response_timeout field. My
> understanding was that this is the maximum timeout and that once it
> expires the CM will inform the user that the REQ timed out.

Yes but it is a long time (4.096 * 2 ^ 31 usec ~ 8796 sec ~ 146.60 min
(if my calcs are correct)). This is longer than (most) users would wait.
They would usually hit ctl-C before this timeout is reached.
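
The arithmetic above can be checked with a small sketch. It assumes only the encoding stated in this thread (timeout = 4.096 usec * 2^exp, with 0x1F as the largest exponent); the function name is made up for illustration. Working in nanoseconds keeps everything in integers, since 4.096 usec is 4096 ns.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a timeout exponent into nanoseconds: 4.096 us * 2^exp,
 * i.e. 4096 ns shifted left by exp. */
static uint64_t ib_timeout_exp_to_ns(unsigned exp)
{
    assert(exp <= 31);  /* 0x1F is the maximum cited in this thread */
    return (uint64_t)4096 << exp;
}
```

For exp = 31 this gives 8796093022208 ns, which is about 8796 seconds or 146.6 minutes, in agreement with the figures above.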

> halr> If that is the intended behavior, we are there. (This (lost REQ) 
> halr> can even occur when the timeout is non infinite too).
> 
> We didn't intend for the active side to wait forever if a REQ was 
> lost.

The active side has no way of knowing that the REQ was lost (other than
timeout/retry) and when the timeout is long, this is effectively the
case.

> halr> An alternative (as Sean suggested) is to continually retry (at a
> halr> periodicity below the supplied timeout) until the time period specified
> halr> expires. That seems to be better (at least to me and Sean) in terms of
> halr> handling the lost REQ case. As retries is not part of the API for
> halr> connect, I would presume the implementor is free to do what they want under
> halr> the covers of dapl_ib_connect.
> 
> You're correct.

The current implementation is:
1. address resolution phase for some amount of time 
followed by:
2. dapl_ib_connect timeout * 5 (since there are 4 retries)

A better algorithm would be to divide down the timeout by some number of
retries (which would vary based on the timeout requested) and have the
number of retries vary based on the total timeout requested.
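
One way the "divide down" idea could look is sketched below. The policy is purely hypothetical: the caller supplies a target per-attempt wait and a cap on retries, and neither value comes from this thread. The point is only that attempts * per-attempt wait stays within the total the consumer asked for, instead of multiplying it by 5.

```c
#include <assert.h>
#include <stdint.h>

struct cm_plan {
    uint64_t per_attempt_usec;  /* wait for each REQ attempt */
    unsigned retries;           /* re-sends after the first attempt */
};

/* Hypothetical policy: aim for attempts of roughly target_usec each, with at
 * most max_retries re-sends, then spread total_usec evenly over them. */
static struct cm_plan plan_attempts(uint64_t total_usec, uint64_t target_usec,
                                    unsigned max_retries)
{
    struct cm_plan plan;
    uint64_t attempts;

    assert(total_usec > 0 && target_usec > 0);

    attempts = total_usec / target_usec;
    if (attempts < 1)
        attempts = 1;                        /* short timeout: single attempt */
    if (attempts > (uint64_t)max_retries + 1)
        attempts = (uint64_t)max_retries + 1;

    plan.retries = (unsigned)(attempts - 1);
    plan.per_attempt_usec = total_usec / attempts;
    return plan;
}
```

With a 10 second total and a 2 second target, this yields 4 retries of 2 seconds each; with a 60 second total and a cap of 15 retries, it yields 16 attempts of 3.75 seconds each.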

-- Hal



RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-13 Thread James Lentini


On Mon, 13 Jun 2005, Hal Rosenstock wrote:

halr> On Wed, 2005-06-08 at 17:53, James Lentini wrote: 
halr> > On Wed, 8 Jun 2005, Hal Rosenstock wrote:
halr> > 
halr> > halr> On Wed, 2005-06-08 at 11:44, James Lentini wrote:
halr> > halr> > We interpreted the above to mean "give the connection protocol as
halr> > halr> > much time as it needs to establish a connection, but don't mask
halr> > halr> > errors (no path to the remote node, etc.)". For that reason we changed
halr> > halr> > the variable name to DAT_TIMEOUT_MAX.
halr> > halr>
halr> > halr> But if the REQ is lost, the timeout is really really long (longer than
halr> > halr> most will wait for an error).
halr> > 
halr> > If a user doesn't want to wait DAT_TIMEOUT_MAX time, it can pass a 
halr> > smaller amount of time to dat_ep_connect. Does this satisfy your 
halr> > requirements?
halr> 
halr> Is it intended that the only way out is via user intervention (e.g.
halr> ctl-C) ? If one connection attempt (REQ) is made and it is lost, then
halr> there is no chance of it completing and the user needs to intervene.

Why does the user need to intervene? Did I misunderstand the CM
API?

When dapl_ep_connect() is called with a timeout value of
DAT_TIMEOUT_MAX, DAPL passes ib_send_cm_req the value 0x1F in the
ib_cm_req_param structure's remote_cm_response_timeout field. My
understanding was that this is the maximum timeout and that once it
expires the CM will inform the user that the REQ timed out.

halr> If that is the intended behavior, we are there. (This (lost REQ) 
halr> can even occur when the timeout is non infinite too).

We didn't intend for the active side to wait forever if a REQ was 
lost.

halr> 
halr> An alternative (as Sean suggested) is to continually retry (at a
halr> periodicity below the supplied timeout) until the time period specified
halr> expires. That seems to be better (at least to me and Sean) in terms of
halr> handling the lost REQ case. As retries is not part of the API for
halr> connect, I would presume the implementor is free to do what they want under
halr> the covers of dapl_ib_connect.

You're correct.

halr> 
halr> -- Hal
halr> 


RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-13 Thread Hal Rosenstock
On Wed, 2005-06-08 at 17:53, James Lentini wrote: 
> On Wed, 8 Jun 2005, Hal Rosenstock wrote:
> 
> halr> On Wed, 2005-06-08 at 11:44, James Lentini wrote:
> halr> > We interpreted the above to mean "give the connection protocol as 
> halr> > much time as it needs to establish a connection, but don't mask 
> halr> > errors (no path to the remote node, etc.)". For that reason we changed
> halr> > the variable name to DAT_TIMEOUT_MAX.
> halr> 
> halr> But if the REQ is lost, the timeout is really really long (longer than
> halr> most will wait for an error). 
> 
> If a user doesn't want to wait DAT_TIMEOUT_MAX time, it can pass a 
> smaller amount of time to dat_ep_connect. Does this satisfy your 
> requirements?

Is it intended that the only way out is via user intervention (e.g.
ctl-C) ? If one connection attempt (REQ) is made and it is lost, then
there is no chance of it completing and the user needs to intervene. If
that is the intended behavior, we are there. (This (lost REQ) can even
occur when the timeout is non infinite too).

An alternative (as Sean suggested) is to continually retry (at a
periodicity below the supplied timeout) until the time period specified
expires. That seems to be better (at least to me and Sean) in terms of
handling the lost REQ case. As retries is not part of the API for
connect, I would presume the implementor is free to do what they want under
the covers of dapl_ib_connect.
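
The retry-until-deadline alternative might be sketched as follows. Everything here is hypothetical: the callback type stands in for "send one REQ and wait for a reply", time is tracked by summing the waits rather than reading a clock, and lose_first_n is only a toy model of a fabric that drops the first few REQs. It is not how dapl_ib_connect is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: send one REQ and wait up to wait_usec for a reply.
 * Returns true if the connection was established. */
typedef bool (*send_req_fn)(uint64_t wait_usec, void *ctx);

/* Re-send the REQ every period_usec until total_usec has been used up.
 * Returns the number of attempts made on success, or 0 on overall timeout. */
static unsigned connect_with_retries(uint64_t total_usec, uint64_t period_usec,
                                     send_req_fn send_req, void *ctx)
{
    uint64_t elapsed = 0;
    unsigned attempts = 0;

    assert(period_usec > 0 && send_req != NULL);

    while (elapsed < total_usec) {
        uint64_t remaining = total_usec - elapsed;
        uint64_t wait = remaining < period_usec ? remaining : period_usec;

        attempts++;
        if (send_req(wait, ctx))
            return attempts;
        elapsed += wait;  /* simulated time: the attempt used its full wait */
    }
    return 0;
}

/* Toy callback: the first *ctx REQs are lost, later ones succeed. */
static bool lose_first_n(uint64_t wait_usec, void *ctx)
{
    int *drops_left = ctx;

    (void)wait_usec;
    if (*drops_left > 0) {
        (*drops_left)--;
        return false;
    }
    return true;
}
```

With two lost REQs, a 10 second total and a 1 second period, the third attempt succeeds. A single lost REQ therefore costs one period rather than the whole timeout.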

-- Hal



Re: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-08 Thread James Lentini


On Wed, 8 Jun 2005, Sean Hefty wrote:


Hal Rosenstock wrote:

On Wed, 2005-06-08 at 11:44, James Lentini wrote:

We interpreted the above to mean "give the connection protocol as much 
time as it needs to establish a connection, but don't mask errors (no path 
to the remote node, etc.)". For that reason we changed the variable name
to DAT_TIMEOUT_MAX.



But if the REQ is lost, the timeout is really really long (longer than
most will wait for an error). Transaction test also appears to be using
this as well as the quit test.


My interpretation was that this is a DAPL level timeout and did not 
necessarily relate to a timeout for a single CM REQ.  That is, there could 
still be a different timeout specified to the CM, but the number of retries 
could be infinite.


If there are kernel users in need of a truly infinite timeout, we
could do that.


Note that I'm not saying that an infinite timeout makes sense, but the use of 
TIMEOUT_MAX seems reasonable.  To me that indicates that DAPL decides how 
long is needed to establish a timeout, and it manages all retries.


- Sean




RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-08 Thread James Lentini


On Wed, 8 Jun 2005, Hal Rosenstock wrote:

halr> On Wed, 2005-06-08 at 11:44, James Lentini wrote:
halr> > We interpreted the above to mean "give the connection protocol as 
halr> > much time as it needs to establish a connection, but don't mask 
halr> > errors (no path to the remote node, etc.)". For that reason we changed
halr> > the variable name to DAT_TIMEOUT_MAX.
halr> 
halr> But if the REQ is lost, the timeout is really really long (longer than
halr> most will wait for an error). 

If a user doesn't want to wait DAT_TIMEOUT_MAX time, it can pass a 
smaller amount of time to dat_ep_connect. Does this satisfy your 
requirements?

halr> Transaction test also appears to be using this as well as the 
halr> quit test.
halr> 
halr> -- Hal
halr> 


RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-08 Thread James Lentini

That would be another way to interpret this: that each time the CM 
times out, DAPL should re-attempt to establish a connection. However, 
the original implementation didn't do this and the feedback I've 
received is that such a feature would not be appropriate for the 
kernel. 

If there are kernel applications that want this, we can add it.

On Wed, 8 Jun 2005, Tom Duffy wrote:

tduffy> On Wed, 2005-06-08 at 11:44 -0400, James Lentini wrote:
tduffy> > We interpreted the above to mean "give the connection protocol as 
tduffy> > much time as it needs to establish a connection, but don't mask 
tduffy> > errors (no path to the remote node, etc.)". For that reason we changed
tduffy> > the variable name to DAT_TIMEOUT_MAX.
tduffy> 
tduffy> Well, let's say the end node is not there yet.  Should the CM keep
tduffy> trying indefinitely waiting for somebody to show up and respond?
tduffy> 
tduffy> -tduffy
tduffy> 


Re: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-08 Thread Sean Hefty

Hal Rosenstock wrote:

On Wed, 2005-06-08 at 11:44, James Lentini wrote:

We interpreted the above to mean "give the connection protocol as 
much time as it needs to establish a connection, but don't mask 
errors (no path to the remote node, etc.)". For that reason we changed
the variable name to DAT_TIMEOUT_MAX.



But if the REQ is lost, the timeout is really really long (longer than
most will wait for an error). Transaction test also appears to be using
this as well as the quit test.


My interpretation was that this is a DAPL level timeout and did not 
necessarily relate to a timeout for a single CM REQ.  That is, there could 
still be a different timeout specified to the CM, but the number of retries 
could be infinite.


Note that I'm not saying that an infinite timeout makes sense, but the use 
of TIMEOUT_MAX seems reasonable.  To me that indicates that DAPL decides how 
long is needed to establish a timeout, and it manages all retries.


- Sean


RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-08 Thread Hal Rosenstock
On Wed, 2005-06-08 at 11:44, James Lentini wrote:
> We interpreted the above to mean "give the connection protocol as 
> much time as it needs to establish a connection, but don't mask 
> errors (no path to the remote node, etc.)". For that reason we changed
> the variable name to DAT_TIMEOUT_MAX.

But if the REQ is lost, the timeout is really really long (longer than
most will wait for an error). Transaction test also appears to be using
this as well as the quit test.

-- Hal



RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-08 Thread Tom Duffy
On Wed, 2005-06-08 at 11:44 -0400, James Lentini wrote:
> We interpreted the above to mean "give the connection protocol as 
> much time as it needs to establish a connection, but don't mask 
> errors (no path to the remote node, etc.)". For that reason we changed
> the variable name to DAT_TIMEOUT_MAX.

Well, let's say the end node is not there yet.  Should the CM keep
trying indefinitely waiting for somebody to show up and respond?

-tduffy



RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-08 Thread James Lentini


On Tue, 7 Jun 2005, Hal Rosenstock wrote:


On Tue, 2005-05-31 at 14:17, James Lentini wrote:

Here's the specification's exact description:

  timeout: Duration of time, in microseconds, that a consumer waits for
   connection establishment. The value of DAT_TIMEOUT_INFINITE
   represents no timeout, indefinite wait. Values must be
   positive.


What is the purpose of an infinite timeout (other than the obvious) ?
The quit test uses this feature. Not sure if other tests do as well.
What happens if the REQ is lost ? Why would someone want an infinite
timeout ?


We interpreted the above to mean "give the connection protocol as 
much time as it needs to establish a connection, but don't mask 
errors (no path to the remote node, etc.)". For that reason we changed
the variable name to DAT_TIMEOUT_MAX.



RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-07 Thread Hal Rosenstock
On Tue, 2005-05-31 at 14:17, James Lentini wrote:
> Here's the specification's exact description:
> 
>   timeout: Duration of time, in microseconds, that a consumer waits for
>    connection establishment. The value of DAT_TIMEOUT_INFINITE
>    represents no timeout, indefinite wait. Values must be
>    positive.

What is the purpose of an infinite timeout (other than the obvious) ?
The quit test uses this feature. Not sure if other tests do as well.
What happens if the REQ is lost ? Why would someone want an infinite
timeout ?

-- Hal



RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-02 Thread Tom Duffy
On Thu, 2005-06-02 at 10:17 -0400, James Lentini wrote:
> 
> On Tue, 31 May 2005, Tom Duffy wrote:
> 
> > On Tue, 2005-05-31 at 14:17 -0400, James Lentini wrote:
> >> Here's the specification's exact description:
> >>
> >>   timeout: Duration of time, in microseconds, that a consumer waits for
> >>    connection establishment. The value of DAT_TIMEOUT_INFINITE
> >>    represents no timeout, indefinite wait. Values must be
> >>    positive.
> >
> > Let me make sure I got this right: timeout is in µs (10^-6 seconds), not
> > ms (10^-3 seconds).  If so, I am off by 3 orders of magnitude in my
> > calculation.  Right?
> 
> Correct, the value was intended to be in microseconds.

OK, so this should be the conversion function:

/*
 * approximately transforms microseconds to 4.096us*2^x
 * 63(+8) is max return
 */
static inline u8 dapl_convert_us_to_kookyib(unsigned long us) {
	unsigned long ms = us/1000UL, converged = 2;
	u8 i;

	if (2 > ms)
		return 8;

	for (i = 1; i < 63; i++) {
		if (converged >= ms)
			break;
		converged = 2*converged;
	}

	return i+8;
}
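
For comparison, here is a userspace sketch of the same conversion that clamps the result to 31 (0x1F), the largest exponent mentioned elsewhere in this thread, whereas the function above can return values up to 71. It is an illustrative variant, not the code that was merged: the function name and macro are invented, it uses stdint types instead of the kernel's u8, and it picks the smallest exponent x with 4.096us * 2^x >= us by comparing in nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

#define IB_TIMEOUT_EXP_MAX 31u  /* 0x1F, the maximum cited in this thread */

/* Smallest exponent x such that 4.096 us * 2^x >= us, clamped to 31. */
static uint8_t usec_to_ib_timeout_exp(uint64_t us)
{
    const uint64_t max_ns = (uint64_t)4096 << IB_TIMEOUT_EXP_MAX;
    uint8_t x = 0;

    /* Saturate for very long timeouts; this also keeps us * 1000 below
     * from overflowing 64 bits. */
    if (us >= max_ns / 1000)
        return IB_TIMEOUT_EXP_MAX;

    while (((uint64_t)4096 << x) < us * 1000)
        x++;

    assert(x <= IB_TIMEOUT_EXP_MAX);
    return x;
}
```

For 1000 usec this returns 8 and for 10 seconds it returns 22, the same values the function above produces for those inputs; only the upper end differs.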




RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-02 Thread James Lentini



On Wed, 1 Jun 2005, Tom Duffy wrote:


On Tue, 2005-05-31 at 14:52 -0700, Tom Duffy wrote:

On Tue, 2005-05-31 at 14:17 -0400, James Lentini wrote:

Here's the specification's exact description:

  timeout: Duration of time, in microseconds, that a consumer waits for
   connection establishment. The value of DAT_TIMEOUT_INFINITE
   represents no timeout, indefinite wait. Values must be
   positive.


Let me make sure I got this right: timeout is in µs (10^-6 seconds), not
ms (10^-3 seconds).  If so, I am off by 3 orders of magnitude in my
calculation.  Right?


This is from DT_fft_connect() in test/dapltest/test/dapl_fft_util.c:

   /* attempt to connect, timeout = 10 secs */
   rc = dat_ep_connect (conn->ep_handle, conn->remote_netaddr,
   SERVER_PORT_NUMBER, 10*1000, 0, (void *)0,
   DAT_QOS_BEST_EFFORT, DAT_CONNECT_DEFAULT_FLAG);
   DT_assert_dat (phead, rc == DAT_SUCCESS);

leading me to believe we are talking about milliseconds.  Or is this a
bug?


That is a bug.
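
To spell out the unit mismatch: if the timeout argument is in microseconds, as the specification text says, then the literal 10*1000 is 10 milliseconds rather than the 10 seconds the comment promises. A small helper along these lines would make the intent explicit. The macro and function names are made up for this sketch and are not part of DAT or dapltest.

```c
#include <assert.h>
#include <stdint.h>

#define USEC_PER_MSEC 1000ULL
#define USEC_PER_SEC  1000000ULL

/* Convert whole seconds to the microsecond units the timeout expects. */
static uint64_t seconds_to_usec(uint64_t seconds)
{
    assert(seconds <= UINT64_MAX / USEC_PER_SEC);  /* guard against overflow */
    return seconds * USEC_PER_SEC;
}
```

Here seconds_to_usec(10) is 10000000, three orders of magnitude larger than 10*1000.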

RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-02 Thread James Lentini



On Tue, 31 May 2005, Tom Duffy wrote:


On Tue, 2005-05-31 at 14:17 -0400, James Lentini wrote:

Here's the specification's exact description:

  timeout: Duration of time, in microseconds, that a consumer waits for
   connection establishment. The value of DAT_TIMEOUT_INFINITE
   represents no timeout, indefinite wait. Values must be
   positive.


Let me make sure I got this right: timeout is in µs (10^-6 seconds), not
ms (10^-3 seconds).  If so, I am off by 3 orders of magnitude in my
calculation.  Right?


Correct, the value was intended to be in microseconds.


My perspective is that we are not implementing this API for a real
time operating system and therefore should take a fuzzy view of time.


Trust me, it is going to be fuzzy, what with the mechanism IB uses to encode
timeouts.

BTW, what do you think would be a good test case to make sure the new
code is working as intended?


dapltest could be updated to allow the timeout value to be specified 
on the command line.

RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-06-01 Thread Tom Duffy
On Tue, 2005-05-31 at 14:52 -0700, Tom Duffy wrote:
> On Tue, 2005-05-31 at 14:17 -0400, James Lentini wrote:
> > Here's the specification's exact description:
> > 
> >   timeout: Duration of time, in microseconds, that a consumer waits for
> > >    connection establishment. The value of DAT_TIMEOUT_INFINITE
> > >    represents no timeout, indefinite wait. Values must be
> > >    positive.
> 
> Let me make sure I got this right: timeout is in µs (10^-6 seconds), not
> ms (10^-3 seconds).  If so, I am off by 3 orders of magnitude in my
> calculation.  Right?

This is from DT_fft_connect() in test/dapltest/test/dapl_fft_util.c:

   /* attempt to connect, timeout = 10 secs */
   rc = dat_ep_connect (conn->ep_handle, conn->remote_netaddr,
   SERVER_PORT_NUMBER, 10*1000, 0, (void *)0,
   DAT_QOS_BEST_EFFORT, DAT_CONNECT_DEFAULT_FLAG);
   DT_assert_dat (phead, rc == DAT_SUCCESS);

leading me to believe we are talking about milliseconds.  Or is this a
bug?

-tduffy



RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-05-31 Thread Tom Duffy
On Tue, 2005-05-31 at 14:17 -0400, James Lentini wrote:
> Here's the specification's exact description:
> 
>   timeout: Duration of time, in microseconds, that a consumer waits for
>    connection establishment. The value of DAT_TIMEOUT_INFINITE
>    represents no timeout, indefinite wait. Values must be
>    positive.

Let me make sure I got this right: timeout is in µs (10^-6 seconds), not
ms (10^-3 seconds).  If so, I am off by 3 orders of magnitude in my
calculation.  Right?

> My perspective is that we are not implementing this API for a real 
> time operating system and therefore should take a fuzzy view of time.

Trust me, it is going to be fuzzy, what with the mechanism IB uses to encode
timeouts.

BTW, what do you think would be a good test case to make sure the new
code is working as intended?

-tduffy



RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-05-31 Thread James Lentini



On Tue, 31 May 2005, Hal Rosenstock wrote:

> On Tue, 2005-05-31 at 15:57, James Lentini wrote:
> > If we included address resolution, how would we divide up the time
> > between address resolution and cm protocol? Wouldn't we have to
> > track how long address resolution took to complete?
>
> Yes, to follow the requirement closely, one would need to time the
> duration of the address translation but that is pretty straightforward
> to do. IBAT already has to time out requests anyway. The worst case for
> address resolution is currently 4 * 100 msec.

If we can account for all of the time properly, then we should
implement it that way.

> Other alternatives are to subtract the maximal address translation
> time off the time supplied and use the rest for CM, or as you said
> ignore this time and use it all for CM purposes (and just go over by
> whatever amount this is). Did other implementations factor this in
> or did they ignore this ?



RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-05-31 Thread Hal Rosenstock
On Tue, 2005-05-31 at 15:57, James Lentini wrote:
> If we included address resolution, how would we divide up the time 
> between address resolution and cm protocol? Wouldn't we have to 
> track how long address resolution took to complete?

Yes, to follow the requirement closely, one would need to time the
duration of the address translation but that is pretty straightforward
to do. IBAT already has to time out requests anyway. The worst case for
address resolution is currently 4 * 100 msec. Other alternatives are to
subtract the maximal address translation time off the time supplied and
use the rest for CM, or as you said ignore this time and use it all for
CM purposes (and just go over by whatever amount this is). Did other
implementations factor this in or did they ignore this ?

-- Hal
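
The two accounting options above (time the address translation, or subtract its worst case up front) can be sketched as follows. This is only an illustration: `cm_budget_usecs` and `AT_WORST_CASE_USECS` are hypothetical names, and the only figure taken from the thread is the 4 * 100 msec worst case.

```c
#include <stdint.h>

/* Worst-case address translation time quoted in the thread: 4 * 100 msec. */
#define AT_WORST_CASE_USECS (4ULL * 100ULL * 1000ULL)

/*
 * Hypothetical helper: given the consumer's total timeout and the time
 * charged to address translation, return what is left for the CM.
 * Returns 0 when address translation has consumed the whole budget.
 */
static inline uint64_t cm_budget_usecs(uint64_t total_usecs,
                                       uint64_t at_usecs)
{
    if (at_usecs >= total_usecs)
        return 0;
    return total_usecs - at_usecs;
}
```

Passing the measured translation time as `at_usecs` gives the "follow the requirement closely" variant; passing `AT_WORST_CASE_USECS` gives the "subtract the maximal time" variant.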




RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-05-31 Thread James Lentini



On Tue, 31 May 2005, Hal Rosenstock wrote:

> On Tue, 2005-05-31 at 14:17, James Lentini wrote:
> > Here's the specification's exact description:
> >
> >   timeout: Duration of time, in microseconds, that a consumer waits for
> >            connection establishment. The value of DAT_TIMEOUT_INFINITE
> >            represents no timeout, indefinite wait. Values must be
> >            positive.
> >
> > My perspective is that we are not implementing this API for a real
> > time operating system and therefore should take a fuzzy view of
> > time.
>
> Fuzzy in that we are certainly not concerned with the granularity of
> microseconds.
>
> > My interpretation of the definition above is that a provider should
> > attempt to establish a connection for at least [timeout] time.
>
> So any number of retries is allowed up to the time period specified
> (depending on the timeout used) ?

Correct, any number of retries (including 0) is allowed. Once the time
period expires, the provider should post a result as quickly as
possible.

> > If a connection is not established after attempting for at least
> > [timeout] time, the provider should give up and post a
> > connection failure event. If there is some reasonable additional
> > time needed for address resolution, etc., I think that is
> > acceptable.
>
> This all can be bundled in. One just needs to know what the
> requirement is.

If we included address resolution, how would we divide up the time
between address resolution and cm protocol? Wouldn't we have to
track how long address resolution took to complete?

> -- Hal
>
> > james
> >
> > On Tue, 31 May 2005, Hal Rosenstock wrote:
> >
> > > On Tue, 2005-05-31 at 13:27, James Lentini wrote:
> > >>> James, what does the timeout value passed into dapl_ep_connect mean, the
> > >>> total timeout time?  Or how much for each retry?
> > >>
> > >> It is the total timeout value.
> > >
> > > Total meaning everything inclusive? If that is what it is supposed
> > > to be, that is not what is implemented now:
> > >
> > > DAPL_IB_CM_RESPONSE_TIMEOUT 20 /* 4 sec */
> > > DAPL_IB_MAX_CM_RETRIES 4
> > >
> > > There are also the timeout/retries of IBAT as well.
> > > DAPL_IB_MAX_AT_RETRY 3
> > > IB_AT_REQ_RETRY_MS  100
> > >
> > > -- Hal






RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-05-31 Thread Hal Rosenstock
On Tue, 2005-05-31 at 14:17, James Lentini wrote:
> Here's the specification's exact description:
> 
>   timeout: Duration of time, in microseconds, that a consumer waits for
>            connection establishment. The value of DAT_TIMEOUT_INFINITE
>            represents no timeout, indefinite wait. Values must be
>            positive.
> 
> My perspective is that we are not implementing this API for a real 
> time operating system and therefore should take a fuzzy view of time.

Fuzzy in that we are certainly not concerned with the granularity of
microseconds.

> My interpretation of the definition above is that a provider should 
> attempt to establish a connection for at least [timeout] time.


So any number of retries is allowed up to the time period specified
(depending on the timeout used) ?

> If a connection is not established after attempting for at least
> [timeout] time, the provider should give up and post a connection
> failure event. If there is some reasonable additional time needed for
> address resolution, etc., I think that is acceptable.

This all can be bundled in. One just needs to know what the requirement
is.

-- Hal

> james
> 
> On Tue, 31 May 2005, Hal Rosenstock wrote:
> 
> > On Tue, 2005-05-31 at 13:27, James Lentini wrote:
> >>> James, what does the timeout value passed into dapl_ep_connect mean, the
> >>> total timeout time?  Or how much for each retry?
> >>
> >> It is the total timeout value.
> >
> > Total meaning everything inclusive? If that is what it is supposed
> > to be, that is not what is implemented now:
> >
> > DAPL_IB_CM_RESPONSE_TIMEOUT 20 /* 4 sec */
> > DAPL_IB_MAX_CM_RETRIES 4
> >
> > There are also the timeout/retries of IBAT as well.
> > DAPL_IB_MAX_AT_RETRY 3
> > IB_AT_REQ_RETRY_MS  100
> >
> > -- Hal
> >



RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-05-31 Thread James Lentini


Here's the specification's exact description:

  timeout: Duration of time, in microseconds, that a consumer waits for
           connection establishment. The value of DAT_TIMEOUT_INFINITE
           represents no timeout, indefinite wait. Values must be
           positive.

My perspective is that we are not implementing this API for a real 
time operating system and therefore should take a fuzzy view of time.


My interpretation of the definition above is that a provider should
attempt to establish a connection for at least [timeout] time. If a
connection is not established after attempting for at least [timeout]
time, the provider should give up and post a connection failure
event. If there is some reasonable additional time needed for address
resolution, etc., I think that is acceptable.


james

On Tue, 31 May 2005, Hal Rosenstock wrote:

> On Tue, 2005-05-31 at 13:27, James Lentini wrote:
> > > James, what does the timeout value passed into dapl_ep_connect mean, the
> > > total timeout time?  Or how much for each retry?
> >
> > It is the total timeout value.
>
> Total meaning everything inclusive? If that is what it is supposed
> to be, that is not what is implemented now:
>
> DAPL_IB_CM_RESPONSE_TIMEOUT 20 /* 4 sec */
> DAPL_IB_MAX_CM_RETRIES 4
>
> There are also the timeout/retries of IBAT as well.
> DAPL_IB_MAX_AT_RETRY 3
> IB_AT_REQ_RETRY_MS  100
>
> -- Hal




RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-05-31 Thread Hal Rosenstock
On Tue, 2005-05-31 at 13:27, James Lentini wrote:
> > James, what does the timeout value passed into dapl_ep_connect mean, the
> > total timeout time?  Or how much for each retry?
> 
> It is the total timeout value.

Total meaning everything inclusive? If that is what it is supposed
to be, that is not what is implemented now:

DAPL_IB_CM_RESPONSE_TIMEOUT 20 /* 4 sec */
DAPL_IB_MAX_CM_RETRIES 4

There are also the timeout/retries of IBAT as well.
DAPL_IB_MAX_AT_RETRY 3
IB_AT_REQ_RETRY_MS  100

-- Hal
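
As a rough check on how far the current constants are from a single "total" timeout, the worst case can be worked out as below. This is a sketch under stated assumptions: each phase is assumed to make one initial attempt plus the configured number of retries, every attempt is assumed to run to its full timeout, and any extra latency the CM adds per attempt is ignored. The function names are hypothetical; only the four constants come from the thread.

```c
#include <stdint.h>

/* Values quoted in the thread. */
#define DAPL_IB_CM_RESPONSE_TIMEOUT 20   /* exponent: 4.096 us * 2^20 */
#define DAPL_IB_MAX_CM_RETRIES      4
#define DAPL_IB_MAX_AT_RETRY        3
#define IB_AT_REQ_RETRY_MS          100

/* Decode an IB timeout exponent into nanoseconds: 4.096 us * 2^exp. */
static inline uint64_t ib_timeout_to_nsecs(unsigned int exp)
{
    return 4096ULL << exp;
}

/*
 * Assumption: attempts = retries + 1 for both the CM and address
 * translation, and every attempt times out in full.
 */
static inline uint64_t worst_case_connect_nsecs(void)
{
    uint64_t cm = (uint64_t)(DAPL_IB_MAX_CM_RETRIES + 1) *
                  ib_timeout_to_nsecs(DAPL_IB_CM_RESPONSE_TIMEOUT);
    uint64_t at = (uint64_t)(DAPL_IB_MAX_AT_RETRY + 1) *
                  (uint64_t)IB_AT_REQ_RETRY_MS * 1000000ULL;

    return cm + at;
}
```

Under those assumptions the exponent 20 decodes to about 4.29 seconds per CM attempt, and the overall worst case comes to roughly 21.9 seconds, regardless of the timeout the consumer passed in.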



RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-05-31 Thread James Lentini


On Fri, 27 May 2005, Tom Duffy wrote:

> On Thu, 2005-05-26 at 22:25 -0700, Sean Hefty wrote:
> > > So, here is the strategy I am taking.  Please let me know if it is
> > > wrong.
> > >
> > > When dapl_ep_connect() is called, I save off the timeout value into the
> > > dapl_ep struct.  Then, when we get ready to call ib_send_cm_req(), I
> > > stuff the timeout value (after munging it into IB's strange format) into
> > > the conn params remote_cm_response_timeout.
> >
> > From a CM perspective, this sounds fine.  Note that the CM timeout will not
> > occur until the number of retries has been met.  So I don't know if the
> > timeout passed to dapl_ep_connect() should convert directly into the
> > remote_cm_response_timeout, or needs to be divided by the number of retries.
>
> So, are you saying that if you have a timeout of 4 seconds (you pass in
> 20) and you have retries set to 2, that it will fail after 8 seconds?
>
> James, what does the timeout value passed into dapl_ep_connect mean, the
> total timeout time?  Or how much for each retry?

It is the total timeout value.

> Also, did you notice that dapl_ib_connect always sets the timeout to 20
> (4 seconds) no matter what?  Should this be the case?

The timeout should not be constant as it is now. It was being
unnecessarily emulated with the extra "timeout" thread.

> > > If the connection fails to complete within the timeout,
> > > dapl_cm_active_cb_handler() is called with IB_CM_REQ_ERROR which in turn
> > > calls dapl_evd_connection_callback() which does the same thing that
> > > dapl_ep_timeout() used to do -- tear down the connection.
> >
> > I haven't looked at your changes, but note that calling ib_destroy_cm_id
> > from within the CM callback thread will hang.  The callback holds a
> > reference on the cm_id.  The good news is that there should be code in kDAPL
> > to catch this.
>
> I will take a look and see if this could happen.

Tom, I don't believe that you've changed Hal and Sean's implementation
of this.



RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-05-27 Thread Tom Duffy
On Thu, 2005-05-26 at 22:25 -0700, Sean Hefty wrote:
> >So, here is the strategy I am taking.  Please let me know if it is
> >wrong.
> >
> >When dapl_ep_connect() is called, I save off the timeout value into the
> >dapl_ep struct.  Then, when we get ready to call ib_send_cm_req(), I
> >stuff the timeout value (after munging it into IB's strange format) into
> >the conn params remote_cm_response_timeout.
> 
> From a CM perspective, this sounds fine.  Note that the CM timeout will not
> occur until the number of retries has been met.  So I don't know if the
> timeout passed to dapl_ep_connect() should convert directly into the
> remote_cm_response_timeout, or needs to be divided by the number of retries.

So, are you saying that if you have a timeout of 4 seconds (you pass in
20) and you have retries set to 2, that it will fail after 8 seconds?

James, what does the timeout value passed into dapl_ep_connect mean, the
total timeout time?  Or how much for each retry?

Also, did you notice that dapl_ib_connect always sets the timeout to 20
(4 seconds) no matter what?  Should this be the case?

> >If the connection fails to complete within the timeout,
> >dapl_cm_active_cb_handler() is called with IB_CM_REQ_ERROR which in turn
> >calls dapl_evd_connection_callback() which does the same thing that
> >dapl_ep_timeout() used to do -- tear down the connection.
> 
> I haven't looked at your changes, but note that calling ib_destroy_cm_id
> from within the CM callback thread will hang.  The callback holds a
> reference on the cm_id.  The good news is that there should be code in kDAPL
> to catch this.

I will take a look and see if this could happen.

-tduffy
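
If the value passed to dapl_ep_connect() turns out to be a total, Sean's "divide by the number of retries" suggestion could look like the sketch below. `per_attempt_usecs` is a hypothetical helper, and it assumes the retry count does not include the initial REQ, so the total is split across `retries + 1` attempts.

```c
#include <stdint.h>

/*
 * Hypothetical helper: split a total timeout evenly across the initial
 * attempt and `retries` retries, so that the CM gives up after roughly
 * total_usecs instead of total_usecs * (retries + 1).
 */
static inline uint64_t per_attempt_usecs(uint64_t total_usecs,
                                         unsigned int retries)
{
    return total_usecs / ((uint64_t)retries + 1);
}
```

Whether the 4 second, 2 retry example above fails after 8 or 12 seconds depends on whether the retry count includes the first REQ; the sketch assumes it does not. The per-attempt result would still need to be converted into IB's exponent format before being placed in remote_cm_response_timeout.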



RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-05-27 Thread Tom Duffy
On Fri, 2005-05-27 at 08:44 -0700, Sean Hefty wrote:
> >+/*
> >+ * approximately transforms milliseconds to 4.096us*2^x
> >+ * 63(+8) is max return
> 
> I think that the max return is 64+8.

I guess that is technically true, but converged overflows when i is 64.
So, maybe I should stop one before that.

-tduffy



RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-05-27 Thread Sean Hefty
>+/*
>+ * approximately transforms milliseconds to 4.096us*2^x
>+ * 63(+8) is max return

I think that the max return is 64+8.

>+ */
>+static inline u8 dapl_convert_ms_to_kookyib(unsigned long ms) {

I like the function name.  :)

>+	unsigned long converged = 2;
>+	u8 i;
>+
>+	if (2 > ms)
>+		return 8;
>+
>+	for (i = 1; i < 64; i++) {
>+		if (converged >= ms)
>+			break;
>+		converged = 2*converged;
>+	}
>+
>+	return i+8;
>+}

I didn't notice any other issues looking over the changes, but see my other
email regarding setting the timeout based on the number of retries.

- Sean
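
For comparison with the patch above, here is a standalone sketch of the same conversion done in exact integer arithmetic, taking microseconds as input. It is not the patch's code: `usecs_to_ib_timeout` and `IB_TIMEOUT_EXP_MAX` are hypothetical names, and the cap of 31 assumes the CM REQ timeout field is 5 bits wide, which would also sidestep the overflow discussed for large loop counts.

```c
#include <stdint.h>

/* Assumption: the CM REQ timeout field is 5 bits wide, so 31 is the cap. */
#define IB_TIMEOUT_EXP_MAX 31u

/*
 * Return the smallest exponent x such that 4.096 us * 2^x >= usecs,
 * clamped to IB_TIMEOUT_EXP_MAX. Works in nanoseconds to stay in integers.
 */
static inline uint8_t usecs_to_ib_timeout(uint64_t usecs)
{
    uint64_t nsecs;
    uint8_t x;

    /* Guard the multiplication below against 64-bit overflow. */
    if (usecs > UINT64_MAX / 1000ULL)
        return IB_TIMEOUT_EXP_MAX;

    nsecs = usecs * 1000ULL;
    for (x = 0; x < IB_TIMEOUT_EXP_MAX; x++) {
        if ((4096ULL << x) >= nsecs)
            break;
    }
    return x;
}
```

With this rounding, 1 millisecond maps to exponent 8 and 4 seconds maps to exponent 20, which agrees with the "+8" offset in the patch and with the "20 /* 4 sec */" constant quoted elsewhere in the thread.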



RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own

2005-05-26 Thread Sean Hefty
>So, here is the strategy I am taking.  Please let me know if it is
>wrong.
>
>When dapl_ep_connect() is called, I save off the timeout value into the
>dapl_ep struct.  Then, when we get ready to call ib_send_cm_req(), I
>stuff the timeout value (after munging it into IB's strange format) into
>the conn params remote_cm_response_timeout.

From a CM perspective, this sounds fine.  Note that the CM timeout will not
occur until the number of retries has been met.  So I don't know if the
timeout passed to dapl_ep_connect() should convert directly into the
remote_cm_response_timeout, or needs to be divided by the number of retries.

>If the connection fails to complete within the timeout,
>dapl_cm_active_cb_handler() is called with IB_CM_REQ_ERROR which in turn
>calls dapl_evd_connection_callback() which does the same thing that
>dapl_ep_timeout() used to do -- tear down the connection.

I haven't looked at your changes, but note that calling ib_destroy_cm_id
from within the CM callback thread will hang.  The callback holds a
reference on the cm_id.  The good news is that there should be code in kDAPL
to catch this.

>Here is a patch that implements this, *untested*, please take a look.

I'll look over the patch tomorrow and let you know if anything stands out,
but I'm not overly familiar with the kDAPL code...

- Sean
