Re: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On 6/14/05, James Lentini <[EMAIL PROTECTED]> wrote:
>
> Sounds like I need to understand the difference between the
> ib_cm_req_param's retry_count and max_cm_retries fields. We set the
> former to 0 and the latter to 4.

The retry_count is the number of retries you want to configure for data on your connection once your connection is established. The max_cm_retries field is how many times you want the cm to retry establishing the connection.

-- 
Bill Jordan
SilverStorm Technologies
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
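To make the distinction concrete, here is a minimal sketch. The struct below is a hypothetical stand-in that copies only the two field names from the discussion; it is not the real ib_cm_req_param, and cm_req_attempts() is an illustrative helper, not a CM call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the two fields under discussion (not the real struct). */
struct cm_req_sketch {
	uint8_t retry_count;    /* data-path retries once the connection is up */
	uint8_t max_cm_retries; /* how many times the CM retries connection setup */
};

/* Total connection attempts the CM may make: the first try plus the retries. */
static unsigned int cm_req_attempts(const struct cm_req_sketch *p)
{
	assert(p != NULL);
	return 1u + p->max_cm_retries;
}
```

With the values quoted above (retry_count 0, max_cm_retries 4) this gives 5 attempts, and retry_count plays no part in connection establishment.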
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Tue, 14 Jun 2005, Hal Rosenstock wrote:
On Mon, 2005-06-13 at 18:33, James Lentini wrote:
On Mon, 13 Jun 2005, Hal Rosenstock wrote:

halr> On Wed, 2005-06-08 at 17:53, James Lentini wrote:
halr> > On Wed, 8 Jun 2005, Hal Rosenstock wrote:
halr> >
halr> > halr> On Wed, 2005-06-08 at 11:44, James Lentini wrote:
halr> > halr> > We interpreted the above to mean "give the connection protocol as
halr> > halr> > much time as it needs to establish a connection, but don't mask
halr> > halr> > errors (no path to the remote node, etc.)". For that reason we changed
halr> > halr> > the variable name to DAT_TIMEOUT_MAX.
halr> > halr>
halr> > halr> But if the REQ is lost, the timeout is really really long (longer than
halr> > halr> most will wait for an error).
halr> >
halr> > If a user doesn't want to wait DAT_TIMEOUT_MAX time, it can pass a
halr> > smaller amount of time to dat_ep_connect. Does this satisfy your
halr> > requirements?
halr>
halr> Is it intended that the only way out is via user intervention (e.g.
halr> ctl-C) ? If one connection attempt (REQ) is made and it is lost, then
halr> there is no chance of it completing and the user needs to intervene.

Why does the user need to intervene? Did I misunderstand the CM API?

When dapl_ep_connect() is called with a timeout value of DAT_TIMEOUT_MAX, DAPL passes ib_send_cm_req the value 0x1F in the ib_cm_req_param structure's remote_cm_response_timeout value. My understanding was that this is the maximum timeout and that once it expires the CM will inform the user that the REQ timed out.

Yes but it is a long time (4.096 * 2 ^ 31 usec ~ 8796 sec ~ 146.60 min (if my calcs are correct)). This is longer than (most) users would wait. They would usually hit ctl-C before this timeout is reached.

Understood. As long as it is not infinite we've made a step in the right direction. I like your ideas below on how to improve this further.

halr> If that is the intended behavior, we are there. (This (lost REQ)
halr> can even occur when the timeout is non infinite too).

We didn't intend for the active side to wait forever if a REQ was lost.

The active side has no way of knowing that the REQ was lost (other than timeout/retry) and when the timeout is long, this is effectively the case.

This behavior is ok. The DAT consumer should choose a timeout value that makes sense; it doesn't need to use DAT_TIMEOUT_MAX (and probably shouldn't in most cases). We should update our dapltest program to use a smaller value (like 1 min).

halr> An alternative (as Sean suggested) is to continually retry (at a
halr> periodicity below the supplied timeout) until the time period specified
halr> expires. That seems to be better (at least to me and Sean) in terms of
halr> handling the lost REQ case. As retries is not part of the API for
halr> connect, I would presume the implementor is free to do what they want under
halr> the covers of dapl_ib_connect.

You're correct.

The current implementation is:
1. address resolution phase for some amount of time
followed by:
2. dapl_ib_connect timeout * 5 (since there are 4 retries)

Sounds like I need to understand the difference between the ib_cm_req_param's retry_count and max_cm_retries fields. We set the former to 0 and the latter to 4.

A better algorithm would be to divide down the timeout by some number of retries (which would vary based on the timeout requested) and have the number of retries vary based on the total timeout requested.

I agree that would be better. As you point out, we should also account for the address resolution time. I know that no one is working on this. Are you interested?

-- Hal
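The arithmetic behind the "4.096 * 2 ^ 31 usec" figure can be checked with a few lines of C. This is only a worked calculation; ib_timeout_ns() is an illustrative name, and the upper bound of 31 is taken from the 0x1F value mentioned in this message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Duration encoded by timeout exponent x, in nanoseconds:
 * 4.096 us * 2^x == 4096 ns << x.
 */
static uint64_t ib_timeout_ns(unsigned int x)
{
	assert(x <= 31); /* 0x1F is treated as the largest value in this thread */
	return (uint64_t)4096 << x;
}
```

For x = 31 this is 2^43 ns, i.e. about 8796 seconds or roughly 146.6 minutes, which matches the estimate quoted above.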
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Tue, 2005-06-14 at 10:00, Talpey, Thomas wrote:
> At 09:49 AM 6/14/2005, Hal Rosenstock wrote:
> >Are you proposing that the number of retries be set to 0 then
> >(regardless of the timeout requested) ?
>
> All I am suggesting is that the number of retries is not something the
> consumer can or should be specifying. Whatever the appropriate
> number is, is something for the transport to choose. It's an
> internal detail.

Yes, I was proposing that this be calculated internally based on the requested timeout. Sorry if that was not clear.

> >The CM is not using exponential backoff.
>
> Okay, though I would suggest it should. In any case, iWARP (TCP)
> does, and that's important to bear in mind.

It could easily be made to do this. What do others think about this ?

-- Hal
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
At 09:49 AM 6/14/2005, Hal Rosenstock wrote:
>Are you proposing that the number of retries be set to 0 then
>(regardless of the timeout requested) ?

All I am suggesting is that the number of retries is not something the consumer can or should be specifying. Whatever the appropriate number is, is something for the transport to choose. It's an internal detail.

>The CM is not using exponential backoff.

Okay, though I would suggest it should. In any case, iWARP (TCP) does, and that's important to bear in mind.

Tom.
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Tue, 2005-06-14 at 09:36, Talpey, Thomas wrote:
> At 08:41 AM 6/14/2005, Hal Rosenstock wrote:
> >The current implementation is:
> >1. address resolution phase for some amount of time
> >followed by:
> >2. dapl_ib_connect timeout * 5 (since there are 4 retries)
> >
> >A better algorithm would be to divide down the timeout by some number of
> >retries (which would vary based on the timeout requested) and have the
> >number of retries vary based on the total timeout requested.
>
> Why is address resolution exempt from the timeout? If the caller
> wants a timeout, it should be independent of low-level link resolution.
> Socket connect()s don't care about ARP, for example.

I was just stating the way the algorithm is right now. The address resolution phase can be included in the calculation but this complicates things a little. I thought it was previously said that the timeout can be approximate. Also, the CM timeouts are approximate and not precise either.

> I don't like the idea of retry counts because there is no deterministic
> length of time that they will take.

Are you proposing that the number of retries be set to 0 then (regardless of the timeout requested) ?

> Exponential backoff could drive
> even a few retries to many minutes. Of course, if an IB provider
> can guarantee that N retries will be performed in M seconds, then
> okay, but not in general.

The CM is not using exponential backoff.

-- Hal
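One way the "divide down the timeout" idea could look is sketched below. It assumes a fixed per-attempt interval (no backoff), a caller-chosen retry count, and the 4.096 us * 2^x encoding with x capped at 31; per_attempt_exponent() is a hypothetical helper, not existing kDAPL code, and it does not attempt the second half of the proposal (varying the retry count with the requested timeout).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the largest exponent x (0..31) such that (retries + 1) attempts of
 * 4.096 us * 2^x each still fit inside the consumer's total timeout.
 */
static uint8_t per_attempt_exponent(uint64_t total_us, unsigned int retries)
{
	uint64_t budget_ns;
	uint8_t x = 0;

	assert(total_us <= UINT64_MAX / 1000); /* keep the us -> ns conversion safe */
	budget_ns = total_us * 1000 / ((uint64_t)retries + 1);

	/* Grow x while the next larger interval still fits the per-attempt budget. */
	while (x < 31 && ((uint64_t)4096 << (x + 1)) <= budget_ns)
		x++;

	return x;
}
```

For a 60 second timeout and 4 retries this yields x = 21 (about 8.6 s per attempt, about 43 s in total), so the result errs on the short side of the requested time, in keeping with the point that the timeout may be approximate.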
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
At 08:41 AM 6/14/2005, Hal Rosenstock wrote:
>The current implementation is:
>1. address resolution phase for some amount of time
>followed by:
>2. dapl_ib_connect timeout * 5 (since there are 4 retries)
>
>A better algorithm would be to divide down the timeout by some number of
>retries (which would vary based on the timeout requested) and have the
>number of retries vary based on the total timeout requested.

Why is address resolution exempt from the timeout? If the caller wants a timeout, it should be independent of low-level link resolution. Socket connect()s don't care about ARP, for example.

I don't like the idea of retry counts because there is no deterministic length of time that they will take. Exponential backoff could drive even a few retries to many minutes. Of course, if an IB provider can guarantee that N retries will be performed in M seconds, then okay, but not in general.

Tom.
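The concern about backoff can be illustrated with a small calculation. This is a sketch only: total_wait_ms() is a made-up helper, the doubling policy is an assumption chosen to model "exponential backoff", and it does not describe how any particular CM behaves.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Total time spent on one initial attempt plus `retries` retries.
 * With backoff != 0 the interval doubles after every attempt;
 * otherwise every attempt uses the same interval.
 */
static uint64_t total_wait_ms(uint64_t first_ms, unsigned int retries, int backoff)
{
	uint64_t total = 0, interval = first_ms;
	unsigned int i;

	assert(first_ms > 0);
	for (i = 0; i <= retries; i++) {
		total += interval;
		if (backoff)
			interval *= 2;
	}
	return total;
}
```

Starting from 4 seconds with 4 retries, a fixed interval adds up to 20 seconds, while doubling adds up to 124 seconds; with 6 retries the doubling case already exceeds 8 minutes.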
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Mon, 2005-06-13 at 18:33, James Lentini wrote:
> On Mon, 13 Jun 2005, Hal Rosenstock wrote:
>
> halr> On Wed, 2005-06-08 at 17:53, James Lentini wrote:
> halr> > On Wed, 8 Jun 2005, Hal Rosenstock wrote:
> halr> >
> halr> > halr> On Wed, 2005-06-08 at 11:44, James Lentini wrote:
> halr> > halr> > We interpreted the above to mean "give the connection protocol as
> halr> > halr> > much time as it needs to establish a connection, but don't mask
> halr> > halr> > errors (no path to the remote node, etc.)". For that reason we changed
> halr> > halr> > the variable name to DAT_TIMEOUT_MAX.
> halr> > halr>
> halr> > halr> But if the REQ is lost, the timeout is really really long (longer than
> halr> > halr> most will wait for an error).
> halr> >
> halr> > If a user doesn't want to wait DAT_TIMEOUT_MAX time, it can pass a
> halr> > smaller amount of time to dat_ep_connect. Does this satisfy your
> halr> > requirements?
> halr>
> halr> Is it intended that the only way out is via user intervention (e.g.
> halr> ctl-C) ? If one connection attempt (REQ) is made and it is lost, then
> halr> there is no chance of it completing and the user needs to intervene.
>
> Why does the user need to intervene? Did I misunderstand the CM
> API?
>
> When dapl_ep_connect() is called with a timeout value of
> DAT_TIMEOUT_MAX, DAPL passes ib_send_cm_req the value 0x1F in the
> ib_cm_req_param structure's remote_cm_response_timeout value. My
> understanding was that this is the maximum timeout and that once it
> expires the CM will inform the user that the REQ timed out.

Yes but it is a long time (4.096 * 2 ^ 31 usec ~ 8796 sec ~ 146.60 min (if my calcs are correct)). This is longer than (most) users would wait. They would usually hit ctl-C before this timeout is reached.

> halr> If that is the intended behavior, we are there. (This (lost REQ)
> halr> can even occur when the timeout is non infinite too).
>
> We didn't intend for the active side to wait forever if a REQ was
> lost.

The active side has no way of knowing that the REQ was lost (other than timeout/retry) and when the timeout is long, this is effectively the case.

> halr> An alternative (as Sean suggested) is to continually retry (at a
> halr> periodicity below the supplied timeout) until the time period specified
> halr> expires. That seems to be better (at least to me and Sean) in terms of
> halr> handling the lost REQ case. As retries is not part of the API for
> halr> connect, I would presume the implementor is free to do what they want under
> halr> the covers of dapl_ib_connect.
>
> You're correct.

The current implementation is:
1. address resolution phase for some amount of time
followed by:
2. dapl_ib_connect timeout * 5 (since there are 4 retries)

A better algorithm would be to divide down the timeout by some number of retries (which would vary based on the timeout requested) and have the number of retries vary based on the total timeout requested.

-- Hal
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Mon, 13 Jun 2005, Hal Rosenstock wrote:

halr> On Wed, 2005-06-08 at 17:53, James Lentini wrote:
halr> > On Wed, 8 Jun 2005, Hal Rosenstock wrote:
halr> >
halr> > halr> On Wed, 2005-06-08 at 11:44, James Lentini wrote:
halr> > halr> > We interpreted the above to mean "give the connection protocol as
halr> > halr> > much time as it needs to establish a connection, but don't mask
halr> > halr> > errors (no path to the remote node, etc.)". For that reason we changed
halr> > halr> > the variable name to DAT_TIMEOUT_MAX.
halr> > halr>
halr> > halr> But if the REQ is lost, the timeout is really really long (longer than
halr> > halr> most will wait for an error).
halr> >
halr> > If a user doesn't want to wait DAT_TIMEOUT_MAX time, it can pass a
halr> > smaller amount of time to dat_ep_connect. Does this satisfy your
halr> > requirements?
halr>
halr> Is it intended that the only way out is via user intervention (e.g.
halr> ctl-C) ? If one connection attempt (REQ) is made and it is lost, then
halr> there is no chance of it completing and the user needs to intervene.

Why does the user need to intervene? Did I misunderstand the CM API?

When dapl_ep_connect() is called with a timeout value of DAT_TIMEOUT_MAX, DAPL passes ib_send_cm_req the value 0x1F in the ib_cm_req_param structure's remote_cm_response_timeout value. My understanding was that this is the maximum timeout and that once it expires the CM will inform the user that the REQ timed out.

halr> If that is the intended behavior, we are there. (This (lost REQ)
halr> can even occur when the timeout is non infinite too).

We didn't intend for the active side to wait forever if a REQ was lost.

halr>
halr> An alternative (as Sean suggested) is to continually retry (at a
halr> periodicity below the supplied timeout) until the time period specified
halr> expires. That seems to be better (at least to me and Sean) in terms of
halr> handling the lost REQ case. As retries is not part of the API for
halr> connect, I would presume the implementor is free to do what they want under
halr> the covers of dapl_ib_connect.

You're correct.

halr>
halr> -- Hal
halr>
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Wed, 2005-06-08 at 17:53, James Lentini wrote:
> On Wed, 8 Jun 2005, Hal Rosenstock wrote:
>
> halr> On Wed, 2005-06-08 at 11:44, James Lentini wrote:
> halr> > We interpreted the above to mean "give the connection protocol as
> halr> > much time as it needs to establish a connection, but don't mask
> halr> > errors (no path to the remote node, etc.)". For that reason we changed
> halr> > the variable name to DAT_TIMEOUT_MAX.
> halr>
> halr> But if the REQ is lost, the timeout is really really long (longer than
> halr> most will wait for an error).
>
> If a user doesn't want to wait DAT_TIMEOUT_MAX time, it can pass a
> smaller amount of time to dat_ep_connect. Does this satisfy your
> requirements?

Is it intended that the only way out is via user intervention (e.g. ctl-C) ? If one connection attempt (REQ) is made and it is lost, then there is no chance of it completing and the user needs to intervene.

If that is the intended behavior, we are there. (This (lost REQ) can even occur when the timeout is non infinite too).

An alternative (as Sean suggested) is to continually retry (at a periodicity below the supplied timeout) until the time period specified expires. That seems to be better (at least to me and Sean) in terms of handling the lost REQ case. As retries is not part of the API for connect, I would presume the implementor is free to do what they want under the covers of dapl_ib_connect.

-- Hal
Re: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Wed, 8 Jun 2005, Sean Hefty wrote:
Hal Rosenstock wrote:
On Wed, 2005-06-08 at 11:44, James Lentini wrote:

We interpreted the above to mean "give the connection protocol as much time as it needs to establish a connection, but don't mask errors (no path to the remote node, etc.)". For that reason we changed the variable name to DAT_TIMEOUT_MAX.

But if the REQ is lost, the timeout is really really long (longer than most will wait for an error).

Transaction test also appears to be using this as well as the quit test.

My interpretation was that this is a DAPL level timeout and did not necessarily relate to a timeout for a single CM REQ. That is, there could still be a different timeout specified to the CM, but the number of retries could be infinite.

If there are kernel users in need of a truly infinite timeout, we could do that.

Note that I'm not saying that an infinite timeout makes sense, but the use of TIMEOUT_MAX seems reasonable. To me that indicates that DAPL decides how long is needed to establish a timeout, and it manages all retries.

- Sean
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Wed, 8 Jun 2005, Hal Rosenstock wrote:

halr> On Wed, 2005-06-08 at 11:44, James Lentini wrote:
halr> > We interpreted the above to mean "give the connection protocol as
halr> > much time as it needs to establish a connection, but don't mask
halr> > errors (no path to the remote node, etc.)". For that reason we changed
halr> > the variable name to DAT_TIMEOUT_MAX.
halr>
halr> But if the REQ is lost, the timeout is really really long (longer than
halr> most will wait for an error).

If a user doesn't want to wait DAT_TIMEOUT_MAX time, it can pass a smaller amount of time to dat_ep_connect. Does this satisfy your requirements?

halr> Transaction test also appears to be using this as well as the
halr> quit test.
halr>
halr> -- Hal
halr>
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
That would be another way to interpret this: that each time the CM times out, DAPL should re-attempt to establish a connection. However, the original implementation didn't do this and the feedback I've received is that such a feature would not be appropriate for the kernel. If there are kernel applications that want this, we can add it.

On Wed, 8 Jun 2005, Tom Duffy wrote:

tduffy> On Wed, 2005-06-08 at 11:44 -0400, James Lentini wrote:
tduffy> > We interpreted the above to mean "give the connection protocol as
tduffy> > much time as it needs to establish a connection, but don't mask
tduffy> > errors (no path to the remote node, etc.)". For that reason we changed
tduffy> > the variable name to DAT_TIMEOUT_MAX.
tduffy>
tduffy> Well, let's say the end node is not there yet. Should the CM keep
tduffy> trying indefinitely waiting for somebody to show up and respond?
tduffy>
tduffy> -tduffy
tduffy>
Re: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
Hal Rosenstock wrote:
On Wed, 2005-06-08 at 11:44, James Lentini wrote:

We interpreted the above to mean "give the connection protocol as much time as it needs to establish a connection, but don't mask errors (no path to the remote node, etc.)". For that reason we changed the variable name to DAT_TIMEOUT_MAX.

But if the REQ is lost, the timeout is really really long (longer than most will wait for an error).

Transaction test also appears to be using this as well as the quit test.

My interpretation was that this is a DAPL level timeout and did not necessarily relate to a timeout for a single CM REQ. That is, there could still be a different timeout specified to the CM, but the number of retries could be infinite.

Note that I'm not saying that an infinite timeout makes sense, but the use of TIMEOUT_MAX seems reasonable. To me that indicates that DAPL decides how long is needed to establish a timeout, and it manages all retries.

- Sean
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Wed, 2005-06-08 at 11:44, James Lentini wrote:
> We interpreted the above to mean "give the connection protocol as
> much time as it needs to establish a connection, but don't mask
> errors (no path to the remote node, etc.)". For that reason we changed
> the variable name to DAT_TIMEOUT_MAX.

But if the REQ is lost, the timeout is really really long (longer than most will wait for an error).

Transaction test also appears to be using this as well as the quit test.

-- Hal
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Wed, 2005-06-08 at 11:44 -0400, James Lentini wrote:
> We interpreted the above to mean "give the connection protocol as
> much time as it needs to establish a connection, but don't mask
> errors (no path to the remote node, etc.)". For that reason we changed
> the variable name to DAT_TIMEOUT_MAX.

Well, let's say the end node is not there yet. Should the CM keep trying indefinitely waiting for somebody to show up and respond?

-tduffy
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Tue, 7 Jun 2005, Hal Rosenstock wrote:
On Tue, 2005-05-31 at 14:17, James Lentini wrote:

Here's the specification's exact description:

timeout: Duration of time, in microseconds, that a consumer waits for
         connection establishment. The value of DAT_TIMEOUT_INFINITE
         represents no timeout, indefinite wait. Values must be
         positive.

What is the purpose of an infinite timeout (other than the obvious) ? The quit test uses this feature. Not sure if other tests do as well. What happens if the REQ is lost ? Why would someone want an infinite timeout ?

We interpreted the above to mean "give the connection protocol as much time as it needs to establish a connection, but don't mask errors (no path to the remote node, etc.)". For that reason we changed the variable name to DAT_TIMEOUT_MAX.
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Tue, 2005-05-31 at 14:17, James Lentini wrote:
> Here's the specification's exact description:
>
> timeout: Duration of time, in microseconds, that a consumer waits for
>          connection establishment. The value of DAT_TIMEOUT_INFINITE
>          represents no timeout, indefinite wait. Values must be
>          positive.

What is the purpose of an infinite timeout (other than the obvious) ? The quit test uses this feature. Not sure if other tests do as well. What happens if the REQ is lost ? Why would someone want an infinite timeout ?

-- Hal
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Thu, 2005-06-02 at 10:17 -0400, James Lentini wrote:
>
> On Tue, 31 May 2005, Tom Duffy wrote:
>
> > On Tue, 2005-05-31 at 14:17 -0400, James Lentini wrote:
> >> Here's the specification's exact description:
> >>
> >> timeout: Duration of time, in microseconds, that a consumer waits for
> >>          connection establishment. The value of DAT_TIMEOUT_INFINITE
> >>          represents no timeout, indefinite wait. Values must be
> >>          positive.
> >
> > Let me make sure I got this right: timeout is in µs (10^-6 seconds), not
> > ms (10^-3 seconds). If so, I am off by 3 orders of magnitude in my
> > calculation. Right?
>
> Correct, the value was intended to be in microseconds.

OK, so this should be the conversion function:

/*
 * approximately transforms microseconds to 4.096us*2^x
 * 63(+8) is max return
 */
static inline u8 dapl_convert_us_to_kookyib(unsigned long us)
{
	unsigned long ms = us/1000UL, converged = 2;
	u8 i;

	if (2 > ms)
		return 8;

	for (i = 1; i < 63; i++) {
		if (converged >= ms)
			break;
		converged = 2*converged;
	}

	return i+8;
}
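For comparison, a variant that clamps its result is sketched below. It is not the function proposed in this message: us_to_ib_timeout() is a hypothetical userspace helper, it works in nanoseconds rather than rounding through milliseconds, it rounds up to the next encodable interval, and it assumes the largest usable exponent is 31 (the 0x1F value mentioned elsewhere in the thread).

```c
#include <assert.h>
#include <stdint.h>

#define IB_TIMEOUT_MAX_EXP 31 /* assumption: 0x1F is the largest exponent */

/*
 * Smallest exponent x such that 4.096 us * 2^x >= us,
 * clamped to IB_TIMEOUT_MAX_EXP.
 */
static uint8_t us_to_ib_timeout(uint64_t us)
{
	uint64_t ns;
	uint8_t x = 0;

	if (us > UINT64_MAX / 1000) /* conversion to ns would overflow */
		return IB_TIMEOUT_MAX_EXP;
	ns = us * 1000;

	/* 4.096 us == 4096 ns, so the encoded interval is 4096 ns << x. */
	while (x < IB_TIMEOUT_MAX_EXP && ((uint64_t)4096 << x) < ns)
		x++;

	assert(x <= IB_TIMEOUT_MAX_EXP);
	return x;
}
```

A request of 4 seconds maps to exponent 20 (about 4.29 s), and any request beyond roughly 8796 seconds saturates at 31.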
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Wed, 1 Jun 2005, Tom Duffy wrote:
On Tue, 2005-05-31 at 14:52 -0700, Tom Duffy wrote:
On Tue, 2005-05-31 at 14:17 -0400, James Lentini wrote:

Here's the specification's exact description:

timeout: Duration of time, in microseconds, that a consumer waits for
         connection establishment. The value of DAT_TIMEOUT_INFINITE
         represents no timeout, indefinite wait. Values must be
         positive.

Let me make sure I got this right: timeout is in µs (10^-6 seconds), not ms (10^-3 seconds). If so, I am off by 3 orders of magnitude in my calculation. Right?

This is from DT_fft_connect() in test/dapltest/test/dapl_fft_util.c:

	/* attempt to connect, timeout = 10 secs */
	rc = dat_ep_connect (conn->ep_handle, conn->remote_netaddr,
			     SERVER_PORT_NUMBER, 10*1000, 0, (void *)0,
			     DAT_QOS_BEST_EFFORT, DAT_CONNECT_DEFAULT_FLAG);
	DT_assert_dat (phead, rc == DAT_SUCCESS);

leading me to believe we are talking about milliseconds. Or is this a bug?

That is a bug.
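Given that the timeout is in microseconds, the literal 10*1000 in that call amounts to 10 milliseconds, not the 10 seconds the comment promises. A small sketch of the intended value follows; USEC_PER_SEC and secs_to_usecs() are hypothetical helpers written for this illustration, not DAT definitions.

```c
#include <assert.h>
#include <stdint.h>

#define USEC_PER_SEC 1000000ULL /* hypothetical helper constant */

/* Convert whole seconds to the microsecond unit the timeout is specified in. */
static uint64_t secs_to_usecs(uint64_t secs)
{
	assert(secs <= UINT64_MAX / USEC_PER_SEC); /* guard against overflow */
	return secs * USEC_PER_SEC;
}
```

Under that assumption, a 10 second timeout would be passed as 10 * 1000 * 1000 microseconds.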
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Tue, 31 May 2005, Tom Duffy wrote:
On Tue, 2005-05-31 at 14:17 -0400, James Lentini wrote:

Here's the specification's exact description:

timeout: Duration of time, in microseconds, that a consumer waits for
         connection establishment. The value of DAT_TIMEOUT_INFINITE
         represents no timeout, indefinite wait. Values must be
         positive.

Let me make sure I got this right: timeout is in µs (10^-6 seconds), not ms (10^-3 seconds). If so, I am off by 3 orders of magnitude in my calculation. Right?

Correct, the value was intended to be in microseconds.

My perspective is that we are not implementing this API for a real time operating system and therefore should take a fuzzy view of time.

Trust me, it is going to be fuzzy what with the mechanism IB uses to encode timeouts.

BTW, what do you think would be a good test case to make sure the new code is working as intended?

dapltest could be updated to allow the timeout value to be specified on the command line.
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Tue, 2005-05-31 at 14:52 -0700, Tom Duffy wrote:
> On Tue, 2005-05-31 at 14:17 -0400, James Lentini wrote:
> > Here's the specification's exact description:
> >
> > timeout: Duration of time, in microseconds, that a consumer waits for
> >          connection establishment. The value of DAT_TIMEOUT_INFINITE
> >          represents no timeout, indefinite wait. Values must be
> >          positive.
>
> Let me make sure I got this right: timeout is in µs (10^-6 seconds), not
> ms (10^-3 seconds). If so, I am off by 3 orders of magnitude in my
> calculation. Right?

This is from DT_fft_connect() in test/dapltest/test/dapl_fft_util.c:

	/* attempt to connect, timeout = 10 secs */
	rc = dat_ep_connect (conn->ep_handle, conn->remote_netaddr,
			     SERVER_PORT_NUMBER, 10*1000, 0, (void *)0,
			     DAT_QOS_BEST_EFFORT, DAT_CONNECT_DEFAULT_FLAG);
	DT_assert_dat (phead, rc == DAT_SUCCESS);

leading me to believe we are talking about milliseconds. Or is this a bug?

-tduffy
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Tue, 2005-05-31 at 14:17 -0400, James Lentini wrote:
> Here's the specification's exact description:
>
> timeout: Duration of time, in microseconds, that a consumer waits for
>          connection establishment. The value of DAT_TIMEOUT_INFINITE
>          represents no timeout, indefinite wait. Values must be
>          positive.

Let me make sure I got this right: timeout is in µs (10^-6 seconds), not ms (10^-3 seconds). If so, I am off by 3 orders of magnitude in my calculation. Right?

> My perspective is that we are not implementing this API for a real
> time operating system and therefore should take a fuzzy view of time.

Trust me, it is going to be fuzzy what with the mechanism IB uses to encode timeouts.

BTW, what do you think would be a good test case to make sure the new code is working as intended?

-tduffy
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Tue, 31 May 2005, Hal Rosenstock wrote:
On Tue, 2005-05-31 at 15:57, James Lentini wrote:

If we included address resolution, how would we divide up the time between address resolution and cm protocol? Wouldn't we have to track how long address resolution took to complete?

Yes, to follow the requirement closely, one would need to time the duration of the address translation but that is pretty straightforward to do. IBAT already has to time out requests anyway. The worst case for address resolution is currently 4 * 100 msec.

If we can account for all of the time properly, then we should implement it that way.

Other alternatives are to subtract the maximal address translation time off the time supplied and use the rest for CM, or as you said ignore this time and use it all for CM purposes (and just go over by whatever amount this is).

Did other implementations factor this in or did they ignore this ?
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Tue, 2005-05-31 at 15:57, James Lentini wrote:
> If we included address resolution, how would we divide up the time
> between address resolution and cm protocol? Wouldn't we have to
> track how long address resolution took to complete?

Yes, to follow the requirement closely, one would need to time the duration of the address translation but that is pretty straightforward to do. IBAT already has to time out requests anyway. The worst case for address resolution is currently 4 * 100 msec.

Other alternatives are to subtract the maximal address translation time off the time supplied and use the rest for CM, or as you said ignore this time and use it all for CM purposes (and just go over by whatever amount this is).

Did other implementations factor this in or did they ignore this?

-- Hal
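The alternatives Hal lists could look roughly like this. It is a minimal sketch with hypothetical names (`remaining_us`, `cm_budget_measured_us`, `cm_budget_worst_case_us`, `AT_WORST_CASE_US`); only the 4 * 100 msec worst case comes from the message above. The third alternative, ignoring address translation time, simply passes the full timeout on to the CM.

```c
#include <stdint.h>

/* Worst case for address translation quoted in the thread: 4 * 100 msec. */
#define AT_WORST_CASE_US (4u * 100u * 1000u)

/* Budget left after spending spent_us, never going below zero. */
static uint64_t remaining_us(uint64_t total_us, uint64_t spent_us)
{
	return spent_us >= total_us ? 0 : total_us - spent_us;
}

/* Option 1: time the address translation and subtract what it used. */
static uint64_t cm_budget_measured_us(uint64_t total_us, uint64_t at_elapsed_us)
{
	return remaining_us(total_us, at_elapsed_us);
}

/* Option 2: subtract the maximal address translation time up front. */
static uint64_t cm_budget_worst_case_us(uint64_t total_us)
{
	return remaining_us(total_us, AT_WORST_CASE_US);
}
```

A result of 0 means the budget is already spent, in which case a caller would presumably report a connection failure rather than start the CM exchange.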
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Tue, 31 May 2005, Hal Rosenstock wrote:
> On Tue, 2005-05-31 at 14:17, James Lentini wrote:
> > Here's the specification's exact description:
> >
> > timeout: Duration of time, in microseconds, that a consumer waits for
> >          connection establishment. The value of DAT_TIMEOUT_INFINITE
> >          represents no timeout, indefinite wait. Values must be
> >          positive.
> >
> > My perspective is that we are not implementing this API for a real
> > time operating system and therefore should take a fuzzy view of time.
>
> Fuzzy in that we are certainly not concerned with the granularity of
> microseconds.
>
> > My interpretation of the definition above is that a provider should
> > attempt to establish a connection for at least [timeout] time.
>
> So any number of retries is allowed up to the time period specified
> (depending on the timeout used)?

Correct, any number of retries (including 0) is allowed. Once the time period expires, the provider should post a result as quickly as possible.

> > If a connection is not established after attempting for at least
> > [timeout] time, the provider should give up and post a connection
> > failure event. If there is some reasonable additional time needed for
> > address resolution, etc., I think that is acceptable.
>
> This all can be bundled in. One just needs to know what the requirement
> is.

If we included address resolution, how would we divide up the time between address resolution and cm protocol? Wouldn't we have to track how long address resolution took to complete?

> -- Hal
>
> > james
> >
> > On Tue, 31 May 2005, Hal Rosenstock wrote:
> > > On Tue, 2005-05-31 at 13:27, James Lentini wrote:
> > > > > James, what does the timeout value passed into dapl_ep_connect
> > > > > mean, the total timeout time? Or how much for each retry?
> > > >
> > > > It is the total timeout value.
> > >
> > > Total meaning everything inclusive? If that is what it is supposed
> > > to be, that is not what is implemented now:
> > >
> > > DAPL_IB_CM_RESPONSE_TIMEOUT 20 /* 4 sec */
> > > DAPL_IB_MAX_CM_RETRIES 4
> > >
> > > There are also the timeout/retries of IBAT as well.
> > > DAPL_IB_MAX_AT_RETRY 3
> > > IB_AT_REQ_RETRY_MS 100
> > >
> > > -- Hal
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Tue, 2005-05-31 at 14:17, James Lentini wrote:
> Here's the specification's exact description:
>
> timeout: Duration of time, in microseconds, that a consumer waits for
>          connection establishment. The value of DAT_TIMEOUT_INFINITE
>          represents no timeout, indefinite wait. Values must be
>          positive.
>
> My perspective is that we are not implementing this API for a real
> time operating system and therefore should take a fuzzy view of time.

Fuzzy in that we are certainly not concerned with the granularity of microseconds.

> My interpretation of the definition above is that a provider should
> attempt to establish a connection for at least [timeout] time.

So any number of retries is allowed up to the time period specified (depending on the timeout used)?

> If a
> connection is not established after attempting for at least [timeout]
> time, the provider should give up and post a connection failure
> event. If there is some reasonable additional time needed for address
> resolution, etc., I think that is acceptable.

This all can be bundled in. One just needs to know what the requirement is.

-- Hal

> james
>
> On Tue, 31 May 2005, Hal Rosenstock wrote:
> > On Tue, 2005-05-31 at 13:27, James Lentini wrote:
> > > > James, what does the timeout value passed into dapl_ep_connect
> > > > mean, the total timeout time? Or how much for each retry?
> > >
> > > It is the total timeout value.
> >
> > Total meaning everything inclusive? If that is what it is supposed
> > to be, that is not what is implemented now:
> >
> > DAPL_IB_CM_RESPONSE_TIMEOUT 20 /* 4 sec */
> > DAPL_IB_MAX_CM_RETRIES 4
> >
> > There are also the timeout/retries of IBAT as well.
> > DAPL_IB_MAX_AT_RETRY 3
> > IB_AT_REQ_RETRY_MS 100
> >
> > -- Hal
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
Here's the specification's exact description:

timeout: Duration of time, in microseconds, that a consumer waits for
         connection establishment. The value of DAT_TIMEOUT_INFINITE
         represents no timeout, indefinite wait. Values must be
         positive.

My perspective is that we are not implementing this API for a real time operating system and therefore should take a fuzzy view of time.

My interpretation of the definition above is that a provider should attempt to establish a connection for at least [timeout] time. If a connection is not established after attempting for at least [timeout] time, the provider should give up and post a connection failure event. If there is some reasonable additional time needed for address resolution, etc., I think that is acceptable.

james

On Tue, 31 May 2005, Hal Rosenstock wrote:
> On Tue, 2005-05-31 at 13:27, James Lentini wrote:
> > > James, what does the timeout value passed into dapl_ep_connect
> > > mean, the total timeout time? Or how much for each retry?
> >
> > It is the total timeout value.
>
> Total meaning everything inclusive? If that is what it is supposed
> to be, that is not what is implemented now:
>
> DAPL_IB_CM_RESPONSE_TIMEOUT 20 /* 4 sec */
> DAPL_IB_MAX_CM_RETRIES 4
>
> There are also the timeout/retries of IBAT as well.
> DAPL_IB_MAX_AT_RETRY 3
> IB_AT_REQ_RETRY_MS 100
>
> -- Hal
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Tue, 2005-05-31 at 13:27, James Lentini wrote:
> > James, what does the timeout value passed into dapl_ep_connect mean,
> > the total timeout time? Or how much for each retry?
>
> It is the total timeout value.

Total meaning everything inclusive? If that is what it is supposed to be, that is not what is implemented now:

DAPL_IB_CM_RESPONSE_TIMEOUT 20 /* 4 sec */
DAPL_IB_MAX_CM_RETRIES 4

There are also the timeout/retries of IBAT as well.

DAPL_IB_MAX_AT_RETRY 3
IB_AT_REQ_RETRY_MS 100

-- Hal
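To see what these constants add up to, here is a rough worked calculation. The four constant values are the ones quoted above; the helper names are hypothetical, and the sketch assumes that N retries means N + 1 attempts (consistent with the "4 * 100 msec" worst case mentioned elsewhere in this thread for address resolution, but unconfirmed for the CM). It also ignores any other delay the CM may add.

```c
#include <stdint.h>

#define DAPL_IB_CM_RESPONSE_TIMEOUT 20	/* exponent: 4.096 us * 2^20 */
#define DAPL_IB_MAX_CM_RETRIES      4
#define DAPL_IB_MAX_AT_RETRY        3
#define IB_AT_REQ_RETRY_MS          100

/* 4.096 us * 2^exp in whole microseconds (computed as 4096 ns << exp). */
static uint64_t ib_timeout_to_us(unsigned int exp)
{
	return ((uint64_t)4096 << exp) / 1000;
}

/* Assumption: N retries means N + 1 attempts in total. */
static uint64_t worst_case_connect_us(void)
{
	uint64_t cm = ib_timeout_to_us(DAPL_IB_CM_RESPONSE_TIMEOUT) *
		      (DAPL_IB_MAX_CM_RETRIES + 1);
	uint64_t at = (uint64_t)IB_AT_REQ_RETRY_MS * 1000 *
		      (DAPL_IB_MAX_AT_RETRY + 1);

	return cm + at;
}
```

Under these assumptions one CM attempt is about 4.29 s, five attempts about 21.5 s, and address translation adds up to 0.4 s, for roughly 21.9 s in total regardless of the timeout the consumer passed in.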
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Fri, 27 May 2005, Tom Duffy wrote:
> On Thu, 2005-05-26 at 22:25 -0700, Sean Hefty wrote:
> > > So, here is the strategy I am taking. Please let me know if it is
> > > wrong.
> > >
> > > When dapl_ep_connect() is called, I save off the timeout value into the
> > > dapl_ep struct. Then, when we get ready to call ib_send_cm_req(), I
> > > stuff the timeout value (after munging it into IB's strange format) into
> > > the conn params remote_cm_response_timeout.
> >
> > From a CM perspective, this sounds fine. Note that the CM timeout will not
> > occur until the number of retries has been met. So I don't know if the
> > timeout passed to dapl_ep_connect() should convert directly into the
> > remote_cm_response_timeout, or needs to be divided by the number of retries.
>
> So, are you saying that if you have a timeout of 4 seconds (you pass in
> 20) and you have retries set to 2, that it will fail after 8 seconds?
>
> James, what does the timeout value passed into dapl_ep_connect mean, the
> total timeout time? Or how much for each retry?

It is the total timeout value.

> Also, did you notice that dapl_ib_connect always sets the timeout to 20
> (4 seconds) no matter what? Should this be the case?

The timeout should not be constant as it is now. It was being unnecessarily emulated with the extra "timeout" thread.

> > > If the connection fails to complete within the timeout,
> > > dapl_cm_active_cb_handler() is called with IB_CM_REQ_ERROR which in turn
> > > calls dapl_evd_connection_callback() which does the same thing that
> > > dapl_ep_timeout() used to do -- tear down the connection.
> >
> > I haven't looked at your changes, but note that calling ib_destroy_cm_id
> > from within the CM callback thread will hang. The callback holds a
> > reference on the cm_id. The good news is that there should be code in kDAPL
> > to catch this.
>
> I will take a look and see if this could happen.

Tom, I don't believe that you've changed Hal and Sean's implementation of this.
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Thu, 2005-05-26 at 22:25 -0700, Sean Hefty wrote:
> >So, here is the strategy I am taking. Please let me know if it is
> >wrong.
> >
> >When dapl_ep_connect() is called, I save off the timeout value into the
> >dapl_ep struct. Then, when we get ready to call ib_send_cm_req(), I
> >stuff the timeout value (after munging it into IB's strange format) into
> >the conn params remote_cm_response_timeout.
>
> From a CM perspective, this sounds fine. Note that the CM timeout will not
> occur until the number of retries has been met. So I don't know if the
> timeout passed to dapl_ep_connect() should convert directly into the
> remote_cm_response_timeout, or needs to be divided by the number of retries.

So, are you saying that if you have a timeout of 4 seconds (you pass in 20) and you have retries set to 2, that it will fail after 8 seconds?

James, what does the timeout value passed into dapl_ep_connect mean, the total timeout time? Or how much for each retry?

Also, did you notice that dapl_ib_connect always sets the timeout to 20 (4 seconds) no matter what? Should this be the case?

> >If the connection fails to complete within the timeout,
> >dapl_cm_active_cb_handler() is called with IB_CM_REQ_ERROR which in turn
> >calls dapl_evd_connection_callback() which does the same thing that
> >dapl_ep_timeout() used to do -- tear down the connection.
>
> I haven't looked at your changes, but note that calling ib_destroy_cm_id
> from within the CM callback thread will hang. The callback holds a
> reference on the cm_id. The good news is that there should be code in kDAPL
> to catch this.

I will take a look and see if this could happen.

-tduffy
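Sean's question of whether the dapl_ep_connect() timeout should be divided by the number of retries can be sketched as follows. This is only an illustration: `per_attempt_us` is a hypothetical helper, and it assumes the CM makes max_cm_retries + 1 attempts before reporting IB_CM_REQ_ERROR, which this thread does not confirm. It rounds up so that the attempts together cover at least the requested time.

```c
#include <stdint.h>

/* Hypothetical: per-attempt share of a total timeout in microseconds,
 * assuming max_cm_retries + 1 attempts are made in total. */
static uint64_t per_attempt_us(uint64_t total_us, uint8_t max_cm_retries)
{
	uint64_t attempts = (uint64_t)max_cm_retries + 1;

	/* Ceiling division, written to avoid overflow for large totals. */
	return total_us / attempts + (total_us % attempts != 0);
}
```

The per-attempt value would still have to be converted into the 4.096 us * 2^x encoding, which rounds up again, so the total wait can exceed the requested timeout by a fair margin.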
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
On Fri, 2005-05-27 at 08:44 -0700, Sean Hefty wrote:
> >+/*
> >+ * approximately transforms milliseconds to 4.096us*2^x
> >+ * 63(+8) is max return
>
> I think that the max return is 64+8.

I guess that is technically true, but converged overflows when i is 64. So, maybe I should stop one before that.

-tduffy
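One way to sidestep both the overflow and the oversized return value is to clamp the exponent. The sketch below is not the patch: `ms_to_ib_timeout` and `IB_CM_TIMEOUT_MAX` are made-up names, and the cap of 31 is an assumption based on the 0x1F maximum mentioned elsewhere in this thread. It keeps the patch's approximation that exponent 8 (4.096 us * 2^8, about 1.05 ms) stands for 1 ms.

```c
#include <stdint.h>

#define IB_CM_TIMEOUT_MAX 31	/* assumed: largest value of a 5-bit field */

/* Approximate ms -> exponent x such that 4.096 us * 2^x >= ms. */
static uint8_t ms_to_ib_timeout(unsigned long ms)
{
	uint8_t x = 8;			/* 4.096 us * 2^8 ~= 1 ms */
	unsigned long covered = 1;	/* milliseconds covered by exponent x */

	/* covered never exceeds 2^23 here, so it cannot overflow. */
	while (covered < ms && x < IB_CM_TIMEOUT_MAX) {
		covered *= 2;
		x++;
	}
	return x;
}
```

For small inputs this gives the same results as the loop in the patch (8 for 0 or 1 ms, 9 for 2 ms, 10 for 3 ms), and it stops at 31 instead of running up to 64+8.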
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
>+/*
>+ * approximately transforms milliseconds to 4.096us*2^x
>+ * 63(+8) is max return

I think that the max return is 64+8.

>+ */
>+static inline u8 dapl_convert_ms_to_kookyib(unsigned long ms) {

I like the function name. :)

>+	unsigned long converged = 2;
>+	u8 i;
>+
>+	if (2 > ms)
>+		return 8;
>+
>+	for (i = 1; i < 64; i++) {
>+		if (converged >= ms)
>+			break;
>+		converged = 2*converged;
>+	}
>+
>+	return i+8;
>+}

I didn't notice any other issues looking over the changes, but see my other email regarding setting the timeout based on the number of retries.

- Sean
RE: [openib-general] [PATCHv2][RFC] kDAPL: use cm timers instead of own
>So, here is the strategy I am taking. Please let me know if it is
>wrong.
>
>When dapl_ep_connect() is called, I save off the timeout value into the
>dapl_ep struct. Then, when we get ready to call ib_send_cm_req(), I
>stuff the timeout value (after munging it into IB's strange format) into
>the conn params remote_cm_response_timeout.

From a CM perspective, this sounds fine. Note that the CM timeout will not occur until the number of retries has been met. So I don't know if the timeout passed to dapl_ep_connect() should convert directly into the remote_cm_response_timeout, or needs to be divided by the number of retries.

>If the connection fails to complete within the timeout,
>dapl_cm_active_cb_handler() is called with IB_CM_REQ_ERROR which in turn
>calls dapl_evd_connection_callback() which does the same thing that
>dapl_ep_timeout() used to do -- tear down the connection.

I haven't looked at your changes, but note that calling ib_destroy_cm_id from within the CM callback thread will hang. The callback holds a reference on the cm_id. The good news is that there should be code in kDAPL to catch this.

>Here is a patch that implements this, *untested*, please take a look.

I'll look over the patch tomorrow and let you know if anything stands out, but I'm not overly familiar with the kDAPL code...

- Sean