Update proposal for handling BUSY responses from the SA/SM

2010-08-11 Thread Mike Heinz
I never got a response to our last discussions of this issue; to remind 
everyone, this is a patch for handling BUSY responses from the SA in the same 
manner time outs are handled. The purpose of this change is to prevent poorly 
written applications from overloading the SM, or failing when they receive a 
BUSY response.

To recap where we were, the problem is that no existing OFED module or user 
application currently handles a BUSY response from the SA at all. For the most 
part, they treat a return code of IB_MGMT_MAD_STATUS_BUSY as requiring an 
immediate resend of the query or, worse, as a fatal error.

I have a proposed patch which causes the BUSY response to be handled within the 
InfiniBand core modules in the same way timeouts are handled (i.e., BUSY 
responses are treated as if no response was received). I have a version of this 
patch which addresses the objections about handling TRAP REPRESS, but which does 
not affect any existing APIs and does not require changes to any existing 
kernel modules or applications. As far as I can tell, this patch has been neither 
accepted nor rejected at this point.

In addition to the patch itself, there has been significant discussion about 
changing the APIs to allow the application to explicitly specify how to handle 
BUSY. 

Reviewing the code, I tried to get an idea of the impact that allowing 
applications to set the number of busy retries and the busy timeout value would have.

At the kernel level, two data structures (ib_mad_send_wr and 
ib_mad_send_wr_private) and three key functions are affected. The functions are 
mad.c/ib_mad_complete_recv(), mad.c/ib_post_send_mad() and 
sa_query.c/send_mad(). At the very minimum, the latter two functions would need 
to be patched to copy the new fields while ib_mad_complete_recv() would be 
patched in a manner similar to my current patch, but using the busy retries and 
busy timeout values.

Beyond that, setting the new parameters will require patching the agent portion 
of ib_mad, the cm, the mlx4 and mthca drivers, and srp. Other modules will be 
indirectly affected if we choose to expose the busy values through the sa query 
interface (ib_sa_path_rec_get(), etcetera).
 
Beyond those, user_mad.c/ib_umad_read() and user_mad.c/ib_umad_write() would 
need to be altered in a similar manner to allow the new fields to be passed 
into user space. That requires changes to the ib_user_mad structure (which would 
change the ABI), affects ibsim, the qlvnic tools and opensm, and would, I think, 
require adding new calls to libibmad.
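To make the ABI point concrete, here is a minimal sketch using two hypothetical, simplified header layouts (umad_hdr_v1 and umad_hdr_v2 are illustrative names, not the real struct ib_user_mad_hdr). In this model the MAD payload follows the header in each read()/write() buffer, so appending two fields shifts the payload offset that existing binaries were compiled against.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified layouts: NOT the real struct ib_user_mad_hdr. */
struct umad_hdr_v1 {
	uint32_t id;
	uint32_t status;
	uint32_t timeout_ms;
	uint32_t retries;
	uint32_t length;
};

/* The same header with two proposed BUSY fields appended. */
struct umad_hdr_v2 {
	uint32_t id;
	uint32_t status;
	uint32_t timeout_ms;
	uint32_t retries;
	uint32_t length;
	uint32_t busy_timeout_ms;	/* hypothetical new field */
	uint32_t busy_retries;		/* hypothetical new field */
};

/* In this model the MAD payload starts right after the header, so the
 * payload offset equals the header size. */
size_t payload_offset_v1(void) { return sizeof(struct umad_hdr_v1); }
size_t payload_offset_v2(void) { return sizeof(struct umad_hdr_v2); }
```

A binary built against the old layout would read or write the payload at the wrong offset, which is why such a change needs a new ABI revision.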


Obviously, once that is done it will trigger a cascade of changes to user space 
tools such as saquery, ibdiagnet and so on.

Rather than biting off all that at once, what I would suggest is that we 
simply add the new fields to struct ib_mad_send_wr and struct 
ib_mad_send_wr_private in kernel space, and have send_mad() and 
ib_post_send_mad() check the values in ib_mad_send_wr and, if they are zero, 
set the values in ib_mad_send_wr_private to match the existing retries and 
timeout_ms fields. This would allow existing code to work without modification 
while laying the groundwork for the broader change. 
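A minimal sketch of the defaulting rule described above, assuming hypothetical field names busy_timeout_ms and busy_retries on simplified stand-ins for ib_mad_send_wr and ib_mad_send_wr_private (the real structures carry many more fields, and init_send_state() is an illustrative name):

```c
/* Hypothetical subset of the caller-visible send request. */
struct send_wr_params {
	int timeout_ms;
	int retries;
	int busy_timeout_ms;	/* new: 0 means "not set by the caller" */
	int busy_retries;	/* new: 0 means "not set by the caller" */
};

/* Hypothetical subset of the private per-send state kept by the MAD layer. */
struct send_wr_state {
	int timeout_ms;
	int retries;
	int busy_timeout_ms;
	int busy_retries;
};

/* Copy the caller's values, falling back to the ordinary timeout/retry
 * values when the new BUSY fields were left at zero, so that existing
 * callers keep their current behavior. */
void init_send_state(struct send_wr_state *st, const struct send_wr_params *p)
{
	st->timeout_ms = p->timeout_ms;
	st->retries = p->retries;
	st->busy_timeout_ms = p->busy_timeout_ms ? p->busy_timeout_ms : p->timeout_ms;
	st->busy_retries = p->busy_retries ? p->busy_retries : p->retries;
}
```

One consequence of using zero as "unset" is that a caller cannot ask for zero busy retries while keeping nonzero ordinary retries; that would need a separate flag.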

Once these changes were made, it would be possible to add support for 
explicitly setting BUSY behavior to the ULPs on a case-by-case basis, although 
it should be noted that adding the new fields to the user space interfaces will 
trigger another new ABI revision - so I would suggest leaving that change for a 
major update to OFED.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Update proposal for handling BUSY responses from the SA/SM

2010-08-11 Thread Hefty, Sean
 In addition to the patch itself, there has been significant discussion
 about changing the APIs to allow the application to explicitly specify how
 to handle BUSY.

The current code allows an app to explicitly handle BUSY replies, so we don't 
need changes for that.  I was advocating making things simpler for the user, 
with more intelligent retry/timeout handling done in the kernel.

For example:

umad_send() takes the timeout_ms and retries as int.  If a negative timeout_ms 
is given with retries set to 0, then the timeout is treated as the total amount 
of time to wait for a response.  The number of retries and timeout values 
between each one would be handled by the kernel.  This includes the kernel 
handling BUSY responses in whatever way seems most appropriate.
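A sketch of how the kernel side might decode the (timeout_ms, retries) pair under this proposal. The struct send_policy type and decode_send_args() are hypothetical names, and other combinations are assumed to keep their current meaning.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical decoded form of a (timeout_ms, retries) pair. */
struct send_policy {
	bool kernel_managed;	/* kernel picks retry count and per-try timeouts */
	int total_ms;		/* overall budget when kernel_managed */
	int timeout_ms;		/* per-try timeout otherwise */
	int retries;		/* retry count otherwise */
};

struct send_policy decode_send_args(int timeout_ms, int retries)
{
	struct send_policy p = { false, 0, timeout_ms, retries };

	/* Negative timeout with zero retries: treat |timeout_ms| as the total
	 * time to wait for a response. */
	if (timeout_ms < 0 && retries == 0) {
		p.kernel_managed = true;
		p.total_ms = timeout_ms == INT_MIN ? INT_MAX : -timeout_ms;
		p.timeout_ms = 0;
		p.retries = 0;
	}
	return p;
}
```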

The kernel can be updated to use random retry intervals, exponential back-offs, 
windowing techniques, response time history, etc.  Of course, we can start with 
some simple kernel changes at first, then enhance them.
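As one example of the smarter policies mentioned above, the hypothetical helper below splits a total budget into per-attempt timeouts that double each time (exponential back-off), giving the last attempt whatever budget remains. It is a sketch only; random intervals or response-time history could be layered on top.

```c
/* Split total_ms into at most max_attempts per-attempt timeouts, starting at
 * first_ms and doubling each time; the last attempt gets the remaining budget.
 * Returns the number of attempts written to out[]. */
int plan_backoff(int total_ms, int first_ms, int out[], int max_attempts)
{
	int used = 0, n = 0, next = first_ms;

	if (total_ms <= 0 || first_ms <= 0)
		return 0;
	while (n < max_attempts && used < total_ms) {
		int remaining = total_ms - used;
		int t = next < remaining ? next : remaining;

		out[n++] = t;
		used += t;
		/* double the next timeout, capped at the budget to avoid overflow */
		next = next <= total_ms / 2 ? next * 2 : total_ms;
	}
	return n;
}
```

For a 1000 ms budget with a 100 ms first timeout this yields attempts of 100, 200, 400 and 300 ms.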

- Sean


RE: Update proposal for handling BUSY responses from the SA/SM

2010-08-11 Thread Mike Heinz
The problem is that none of the apps **do** handle BUSY - at all - and your 
proposal still requires the apps to be changed to stop them from degrading the 
fabric.

The whole benefit of this change is that it implements a reasonable default 
without requiring the apps be changed, but still allows them to override the 
behavior if desired.

-Original Message-
From: Hefty, Sean [mailto:sean.he...@intel.com] 
Sent: Wednesday, August 11, 2010 1:00 PM
To: Mike Heinz; linux-rdma@vger.kernel.org; Roland Dreier; Jason Gunthorpe; Hal 
Rosenstock
Cc: Todd Rimmer
Subject: RE: Update proposal for handling BUSY responses from the SA/SM

 In addition to the patch itself, there has been significant discussion
 about changing the APIs to allow the application to explicitly specify how
 to handle BUSY.

The current code allows an app to explicitly handle BUSY replies, so we don't 
need changes for that.  I was advocating making things simpler for the user, 
with more intelligent retry/timeout handling done in the kernel.

For example:

umad_send() takes the timeout_ms and retries as int.  If a negative timeout_ms 
is given with retries set to 0, then the timeout is treated as the total amount 
of time to wait for a response.  The number of retries and timeout values 
between each one would be handled by the kernel.  This includes the kernel 
handling BUSY responses in whatever way seems most appropriate.

The kernel can be updated to use random retry intervals, exponential back-offs, 
windowing techniques, response time history, etc.  Of course, we can start with 
some simple kernel changes at first, then enhance them.

- Sean



RE: Update proposal for handling BUSY responses from the SA/SM

2010-08-11 Thread Hefty, Sean
 The problem is that none of the apps **do** handle BUSY - at all - and your
 proposal still requires the apps to be changed to stop them from degrading
 the fabric.

Yes - the apps are busted, so I do believe that the fixes are required there 
and not in the kernel.  If you want to fix them by applying a work-around in a 
user space library, that's still doable.  Take the timeout/retry values 
provided by the app, calculate the total timeout, and pass that into the kernel.
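A minimal sketch of the library-side workaround, assuming the total is simply the original send plus each retry waiting the full per-try timeout. total_timeout_ms() is a hypothetical helper; how the result is handed to the kernel (for example as a negative timeout_ms with retries set to 0, per the earlier suggestion) is left open.

```c
#include <limits.h>

/* Original send plus `retries` resends, each waiting timeout_ms.
 * Saturates at INT_MAX instead of overflowing. */
int total_timeout_ms(int timeout_ms, int retries)
{
	long long total;

	if (timeout_ms <= 0 || retries < 0)
		return 0;	/* hypothetical: invalid input means "no budget" */
	total = (long long)timeout_ms * ((long long)retries + 1);
	return total > INT_MAX ? INT_MAX : (int)total;
}
```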

- Sean 


RE: Update proposal for handling BUSY responses from the SA/SM

2010-08-11 Thread Todd Rimmer
 From: Hefty, Sean [mailto:sean.he...@intel.com]
 
  The problem is that none of the apps **do** handle BUSY - at all -
 and your
  proposal still requires the apps to be changed to stop them from
 degrading
  the fabric.
 
 Yes - the apps are busted, so I do believe that the fixes are required
 there and not in the kernel.  If you want to fix them by applying a
 work-around in a user space library, that's still doable.  Take the
 timeout/retry values provided by the app, calculate the total timeout,
 and pass that into the kernel.
 
 - Sean

Coding IB applications is hard enough, let's not require it to be harder.  We 
need a solution that fixes all the apps and makes it easy for future 
applications to have a sensible default behavior.  

I think Mike's approach does that, minimizes risk, addresses 3rd party apps 
which may not be part of OFA, and has a path toward allowing sophisticated 
applications to control the behavior (few if any apps will really want to do 
that).

I look at this as analogous to TCP sockets and the getsockopt/setsockopt calls. 
They allow a lot of fine-grained control; however, for applications which choose 
not to use them, the defaults provide good, network-friendly behavior.

Having the capability in the kernel is needed so that all kernel ULPs behave 
well, including ones not under OFA control (such as Lustre and other 
filesystems).

Mike's approach also allows more sophisticated algorithms, such as random 
backoff, to be easily added and selected in the future.

Todd Rimmer





RE: Update proposal for handling BUSY responses from the SA/SM

2010-08-11 Thread Hefty, Sean
 Coding IB applications is hard enough, let's not require it to be harder.
 We need a solution that fixes all the apps and makes it easy for future
 applications to have a sensible default behavior.

The mad interface is privileged, not some generic API available to any user 
space app.

 I think Mike's approach does that, minimizes risk, addresses 3rd party apps
 which may not be part of OFA, and has a path toward allowing sophisticated
 applications to control the behavior (few if any apps will really want to
 do that).

It breaks the ABI and existing apps that *do* handle BUSY replies.  We can't 
assume that no app out there is written correctly. 

- Sean


RE: Update proposal for handling BUSY responses from the SA/SM

2010-08-11 Thread Mike Heinz
 It breaks the ABI and existing apps that *do* handle BUSY replies.  We can't 
 assume that no app out there is written correctly.

I could be wrong, but I couldn't find a single example in the OFED 1.5.2 
package.

-Original Message-
From: Hefty, Sean [mailto:sean.he...@intel.com] 
Sent: Wednesday, August 11, 2010 1:58 PM
To: Todd Rimmer; Mike Heinz; linux-rdma@vger.kernel.org; Roland Dreier; Jason 
Gunthorpe; Hal Rosenstock
Subject: RE: Update proposal for handling BUSY responses from the SA/SM

 Coding IB applications is hard enough, let's not require it to be harder.
 We need a solution that fixes all the apps and makes it easy for future
 applications to have a sensible default behavior.

The mad interface is privileged, not some generic API available to any user 
space app.

 I think Mike's approach does that, minimizes risk, addresses 3rd party apps
 which may not be part of OFA, and has a path toward allowing sophisticated
 applications to control the behavior (few if any apps will really want to
 do that).

It breaks the ABI and existing apps that *do* handle BUSY replies.  We can't 
assume that no app out there is written correctly. 

- Sean



RE: Update proposal for handling BUSY responses from the SA/SM

2010-08-11 Thread Mike Heinz
Sean,

What if we reversed the sense of your idea - if the app or ULP provides a 
positive timeout number, apply the combined timeout concept, but if it 
provides a negative number, force it to handle BUSY itself? This would provide 
a good-quality default behavior.
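A sketch of the reversed convention, with hypothetical names: a non-negative timeout selects kernel-managed BUSY handling (the default), while a negative one opts the caller out so that every BUSY reaches it. Treating the magnitude as the timeout in both cases is an assumption.

```c
#include <limits.h>

enum busy_mode {
	BUSY_KERNEL_HANDLES,	/* default: BUSY retried like a timeout */
	BUSY_APP_HANDLES	/* opt-out: every BUSY reply goes to the caller */
};

struct send_mode {
	enum busy_mode busy;
	int timeout_ms;		/* magnitude of the caller's timeout */
};

struct send_mode decode_reversed(int timeout_ms)
{
	struct send_mode m;

	if (timeout_ms < 0) {
		m.busy = BUSY_APP_HANDLES;
		m.timeout_ms = timeout_ms == INT_MIN ? INT_MAX : -timeout_ms;
	} else {
		/* positive: combined timeout, kernel handles BUSY */
		m.busy = BUSY_KERNEL_HANDLES;
		m.timeout_ms = timeout_ms;
	}
	return m;
}
```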

Also - it still makes sense to me that we take the approach of not doing 
anything that requires immediate changes to ABIs and APIs but rather set up the 
underlying architecture to allow the change to be propagated over time.


RE: Update proposal for handling BUSY responses from the SA/SM

2010-08-11 Thread Todd Rimmer
 From: Hefty, Sean [mailto:sean.he...@intel.com]
 It breaks the ABI and existing apps that *do* handle BUSY replies.  We
 can't assume that no app out there is written correctly.
 
 The mad interface is privileged, not some generic API available to any
 user space app.

I have yet to find a single app (OFA or 3rd party) which handles BUSY properly. 
 Right now we can identify numerous apps which are broken.

While umad is privileged, the SA queries it allows are used by every IB 
compliant app and ULP.  It is also used by many management apps which are also 
privileged.

The proposed change will not break any apps and does not change the ABI; it 
will simply limit when they see BUSY (in Mike's most recently posted patch, 
after all retries are exhausted a BUSY is returned, so the app can handle 
long-duration BUSY while the kernel handles short-duration BUSY).


Todd Rimmer




Re: Handling busy responses from the SA

2010-06-17 Thread Hal Rosenstock
Mike,

On Wed, Jun 16, 2010 at 3:57 PM, Mike Heinz michael.he...@qlogic.com wrote:
 Hal,

 But if the original trap had retries > 0, wouldn't resending the trap be what 
 the issuer intended?

I suppose so, as there's nothing in the IBA spec that precludes using busy
on TrapRepresses, although I'd be hard pressed to rationalize using
that, particularly for SMP traps.

-- Hal

 I guess I'm confused why treating BUSY as similar to simply never getting a 
 response at all is a bad thing. In my mind, receiving a BUSY response is like 
 getting a busy signal when you call someone on the phone - a sign you need to 
 wait a bit then try again. Similarly, if I call someone and never get an 
 answer my strategy is going to be to wait, then try again.

 -Original Message-
 From: Hal Rosenstock [mailto:hal.rosenst...@gmail.com]
 Sent: Tuesday, June 08, 2010 8:16 PM
 To: Mike Heinz
 Cc: Hefty, Sean; linux-rdma@vger.kernel.org
 Subject: Re: Handling busy responses from the SA

 Mike,

 I'm referring to the receipt of the TrapRepress with busy status.
 Wouldn't your patch cause the original Trap to be resent when retries > 0?
 TrapRepress is essentially a response to Trap and classified as
 such by ib_response_mad. Your proposed patch treats a busy as a
 timeout and can cause retry of the original sent Trap.

 -- Hal



RE: Handling busy responses from the SA

2010-06-17 Thread Mike Heinz
To be honest, we haven't been able to think of a case where a sender would use 
retries on a trap or a busy on a repress, but I don't think it would 
hurt to omit represses from the busy handling either.

Would that be acceptable to everyone? To alter the patch to allow BUSY trap 
repress MADs to pass through?
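A minimal sketch of the per-response decision the patch would make, including the proposed TrapRepress exception. The constants mirror IB_MGMT_MAD_STATUS_BUSY, the response bit of the method byte and IB_MGMT_METHOD_TRAP_REPRESS, but the helper names are hypothetical and is_response_method() is a simplification of ib_response_mad(), which may also have class-specific cases.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAD_STATUS_BUSY		0x0001	/* BUSY bit of the MAD status field */
#define MAD_METHOD_RESP		0x80	/* response bit of the method byte */
#define MAD_METHOD_TRAP_REPRESS	0x07

/* Simplified stand-in for ib_response_mad(). */
static bool is_response_method(uint8_t method)
{
	return (method & MAD_METHOD_RESP) || method == MAD_METHOD_TRAP_REPRESS;
}

/* Return true if a received response should be discarded so that the
 * normal timeout/retry path resends the request. status is in host
 * byte order. */
bool drop_busy_response(uint8_t method, uint16_t status, int retries_left)
{
	if (!is_response_method(method))
		return false;
	if (method == MAD_METHOD_TRAP_REPRESS)
		return false;		/* let BUSY TrapRepress MADs pass through */
	if (!(status & MAD_STATUS_BUSY))
		return false;		/* not BUSY: deliver normally */
	return retries_left > 0;	/* last try: hand the BUSY to the caller */
}
```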

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Hal Rosenstock
Sent: Thursday, June 17, 2010 9:30 AM
To: Mike Heinz
Cc: Hefty, Sean; linux-rdma@vger.kernel.org; Todd Rimmer
Subject: Re: Handling busy responses from the SA

Mike,

On Wed, Jun 16, 2010 at 3:57 PM, Mike Heinz michael.he...@qlogic.com wrote:
 Hal,

 But if the original trap had retries > 0, wouldn't resending the trap be what 
 the issuer intended?

I suppose so, as there's nothing in the IBA spec that precludes using busy
on TrapRepresses, although I'd be hard pressed to rationalize using
that, particularly for SMP traps.

-- Hal

 I guess I'm confused why treating BUSY as similar to simply never getting a 
 response at all is a bad thing. In my mind, receiving a BUSY response is like 
 getting a busy signal when you call someone on the phone - a sign you need to 
 wait a bit then try again. Similarly, if I call someone and never get an 
 answer my strategy is going to be to wait, then try again.

 -Original Message-
 From: Hal Rosenstock [mailto:hal.rosenst...@gmail.com]
 Sent: Tuesday, June 08, 2010 8:16 PM
 To: Mike Heinz
 Cc: Hefty, Sean; linux-rdma@vger.kernel.org
 Subject: Re: Handling busy responses from the SA

 Mike,

 I'm referring to the receipt of the TrapRepress with busy status.
 Wouldn't your patch cause the original Trap to be resent when retries > 0?
 TrapRepress is essentially a response to Trap and classified as
 such by ib_response_mad. Your proposed patch treats a busy as a
 timeout and can cause retry of the original sent Trap.

 -- Hal



RE: Handling busy responses from the SA

2010-06-16 Thread Mike Heinz
Hal,

But if the original trap had retries > 0, wouldn't resending the trap be what 
the issuer intended?

I guess I'm confused why treating BUSY as similar to simply never getting a 
response at all is a bad thing. In my mind, receiving a BUSY response is like 
getting a busy signal when you call someone on the phone - a sign you need to 
wait a bit then try again. Similarly, if I call someone and never get an answer 
my strategy is going to be to wait, then try again. 

-Original Message-
From: Hal Rosenstock [mailto:hal.rosenst...@gmail.com] 
Sent: Tuesday, June 08, 2010 8:16 PM
To: Mike Heinz
Cc: Hefty, Sean; linux-rdma@vger.kernel.org
Subject: Re: Handling busy responses from the SA

Mike,

I'm referring to the receipt of the TrapRepress with busy status.
Wouldn't your patch cause the original Trap to be resent when retries > 0?
TrapRepress is essentially a response to Trap and classified as
such by ib_response_mad. Your proposed patch treats a busy as a
timeout and can cause retry of the original sent Trap.

-- Hal


Re: Handling busy responses from the SA

2010-06-08 Thread Hal Rosenstock
Mike,

On Mon, Jun 7, 2010 at 12:00 PM, Mike Heinz michael.he...@qlogic.com wrote:
 Hal said:
 Should a busy be retried at all at the mad layer? Is a special (longer) 
 timeout policy for busy needed?

 Also, should this be done for all MADs classified by ib_response_mad (e.g. 
 trap represses) ?

 Hal,

 The idea of processing BUSY responses in the MAD layer is to treat BUSY responses 
 like timeouts - which are currently handled by the MAD layer. Right now there 
 is an issue where various apps and ULPs either treat BUSY as a cause to 
 immediately retry or as a permanent error. This doesn't seem to affect users 
 of the OpenSM so much because (as I understand it) the OpenSM seems to 
 discard requests when it gets too busy - but for other SA/SMs, it can cause a 
 major packet storm or, worse, a simple loss of connectivity where MPI jobs or 
 kernel ULPs simply assume the SA is broken because they got a BUSY reply.

 By treating the BUSY reply as a timeout, we're actually simplifying matters 
 by fitting into existing practice.

Understood. Timing these out makes sense to me but still does not
preclude the client from potentially handling this if the retries
fail.

 As for needing a longer timeout - in our old proprietary stack, QLogic did 
 have a longer timeout for retrying busy replies than for normal timeouts

How much longer ? What are the two timeouts used ?

 - but we should try to get this in now so we can get some relief before we 
 begin the long term discussion of the best way to handle this issue overall.

All I was getting at here was: does retrying when busy work ? If not,
why retry at all at the MAD layer (regardless of retries requested)
and perhaps use a longer timeout for this. If it does work, maybe the
timeout on the subsequent retries should be extended.

I think my two other comments on details are relevant to an updated patch.

-- Hal


RE: Handling busy responses from the SA

2010-06-08 Thread Hefty, Sean
 As for needing a longer timeout - in our old proprietary stack, QLogic did
 have a longer timeout for retrying busy replies than for normal timeouts -
 but we should try to get this in now so we can get some relief before we
 begin the long term discussion of the best way to handle this issue
 overall.

Because applications may handle BUSY replies differently, we shouldn't simply 
start hiding them from the user.  I would much rather agree on the longer term 
plan, so that the ABI can reflect the proper semantics.  I don't see any issue 
with changing the current behavior for kernel clients, however.

- Sean


RE: Handling busy responses from the SA

2010-06-08 Thread Mike Heinz
Sean said,

 Because applications may handle BUSY replies differently, we shouldn't simply 
 start hiding them from the user.  

Sean - remember that this patch will still return a BUSY status to the caller, 
if retries are exhausted and the last return code was BUSY, then that's what 
the caller will get. Thus, code which sets retries to zero will not be affected 
by this patch at all.
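To illustrate the behavior described above, here is a small simulation (hypothetical names, not the patch itself): BUSY and timeouts both consume a retry, and only the final permitted attempt's result reaches the caller, so a caller with retries set to 0 still sees BUSY immediately.

```c
enum outcome { OUT_OK, OUT_BUSY, OUT_TIMEOUT };

/* outcomes[] must hold retries + 1 entries: what each attempt would see. */
enum outcome run_request(const enum outcome *outcomes, int retries)
{
	for (int attempt = 0; attempt < retries; attempt++) {
		if (outcomes[attempt] == OUT_OK)
			return OUT_OK;
		/* BUSY is dropped like a lost reply; resend after the timeout */
	}
	return outcomes[retries];	/* final attempt's result goes to the caller */
}
```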

Hal said,

 All I was getting at here was: does retrying when busy work ? If not,
 why retry at all at the MAD layer (regardless of retries requested)
 and perhaps use a longer timeout for this. If it does work, maybe the
 timeout on the subsequent retries should be extended.

Personally, I think it's been extremely helpful - we've been using busy status 
to tell compute nodes to slow down since our old proprietary stack and we've 
seen a significant improvement in overall traffic congestion when we added this 
patch to OFED clusters using our SM. In addition, use of the BUSY return code
simplifies debugging traffic congestion problems (since it allows you to 
immediately differentiate between SA overload and other traffic issues) and it 
paves the way for more sophisticated back-off strategies in the future.

As to that, and your question, our old stack used two different timeout values 
specified by the client. One value was for actual timeouts and one for busy 
responses. In the case of busy responses, we added a randomization factor to 
spread out the traffic.

The issue with adapting that to the Linux-RDMA stack is that it's an API 
change. What I would personally suggest is something like this:

1. Take either the timeout passed by the caller OR a predefined constant, 
whichever is larger. I would suggest setting the predefined constant to 
something moderate, say 2 seconds.
2. Add a randomization factor - say between -250 and +250 ms?
3. Update the packet timeout with this new value.
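The three steps above can be sketched as follows, using the suggested 2 second floor and a -250 to +250 ms window. rand() stands in for whatever random source the kernel would actually use, and busy_retry_timeout_ms() is a hypothetical name.

```c
#include <limits.h>
#include <stdlib.h>

#define BUSY_MIN_TIMEOUT_MS	2000	/* step 1: "say 2 seconds" */
#define BUSY_JITTER_MS		250	/* step 2: -250 .. +250 ms */

int busy_retry_timeout_ms(int caller_timeout_ms)
{
	/* Step 1: the larger of the caller's timeout and the floor. */
	int base = caller_timeout_ms > BUSY_MIN_TIMEOUT_MS ?
		   caller_timeout_ms : BUSY_MIN_TIMEOUT_MS;
	/* Step 2: uniform offset in [-BUSY_JITTER_MS, +BUSY_JITTER_MS]
	 * (the modulo bias of rand() % 501 is negligible here). */
	int jitter = rand() % (2 * BUSY_JITTER_MS + 1) - BUSY_JITTER_MS;

	if (base > INT_MAX - BUSY_JITTER_MS)
		base = INT_MAX - BUSY_JITTER_MS;	/* keep base + jitter in range */
	return base + jitter;
}
```

Step 3 is then just storing the returned value as the packet's timeout before it is resent.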



RE: Handling busy responses from the SA

2010-06-08 Thread Mike Heinz
Anyone know why my messages are being appended with interesting garbage?

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Mike Heinz
Sent: Tuesday, June 08, 2010 11:49 AM
To: Hal Rosenstock
Cc: linux-rdma@vger.kernel.org
Subject: RE: Handling busy responses from the SA



RE: Handling busy responses from the SA

2010-06-08 Thread Hefty, Sean
 Anyone know why my messages are being appended with interesting garbage?

I get that too.  I first noticed it a couple of weeks ago.  It eventually went 
back to the normal 'To unsubscribe from this list' message.


RE: Handling busy responses from the SA

2010-06-08 Thread Hefty, Sean
 Sean - remember that this patch will still return a BUSY status to the
 caller, if retries are exhausted and the last return code was BUSY, then
 that's what the caller will get. Thus, code which sets retries to zero will
 not be affected by this patch at all.

It looks like it only returns the BUSY response if that matches with the last 
retry, otherwise, the BUSY response is dropped.  It also looks like it applies 
to all MADs, including vendor specific ones, and not just those from the SA.

- Sean


RE: Handling busy responses from the SA

2010-06-08 Thread Mike Heinz
Right. Effectively this is similar to the I/O resolution timeout policy laid 
out in the spec.

-Original Message-
From: Hefty, Sean [mailto:sean.he...@intel.com] 
Sent: Tuesday, June 08, 2010 12:27 PM
To: Mike Heinz; Hal Rosenstock
Cc: linux-rdma@vger.kernel.org
Subject: RE: Handling busy responses from the SA

 Sean - remember that this patch will still return a BUSY status to the
 caller, if retries are exhausted and the last return code was BUSY, then
 that's what the caller will get. Thus, code which sets retries to zero will
 not be affected by this patch at all.

It looks like it only returns the BUSY response if that matches with the last 
retry, otherwise, the BUSY response is dropped.  It also looks like it applies 
to all MADs, including vendor specific ones, and not just those from the SA.

- Sean


RE: [PATCH] Handling busy responses from the SA

2010-06-08 Thread Hefty, Sean
 Also, I guess, it would be a good API choice if the caller could say
 'get me a reply for this mad or error within 60s' rather than specify
 details like retry counts, etc. The timeout values should be globally
 set and derived from the usual SA provided data for network transits...

I agree with this.  Within the framework of the existing umad ABI, this could 
be specified by setting the high bit in the ib_user_mad_hdr:timeout_ms field, 
assuming that no one is using that bit in practice.  The kernel could then 
freely select the retry/timeout policy for these clients, which for starters 
could include dropping BUSY responses and adjusting the timeout using an 
approach similar to what Mike mentioned in a separate email.  Kernel clients 
could be updated to use this new mode.
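A sketch of the flag encoding, assuming timeout_ms is a 32-bit unsigned field; TIMEOUT_TOTAL_FLAG and the helper names are hypothetical. With the top bit set, the value would otherwise mean a timeout of over 24 days, which is why it is unlikely to be in real use.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: the top bit of the 32-bit timeout_ms field. */
#define TIMEOUT_TOTAL_FLAG 0x80000000u

/* Mark total_ms as "total time to wait; kernel chooses retry policy". */
uint32_t encode_total_timeout(uint32_t total_ms)
{
	return (total_ms & ~TIMEOUT_TOTAL_FLAG) | TIMEOUT_TOTAL_FLAG;
}

/* Returns true if the caller asked the kernel to choose the retry policy;
 * *ms receives the timeout with the flag bit stripped either way. */
bool decode_timeout(uint32_t field, uint32_t *ms)
{
	*ms = field & ~TIMEOUT_TOTAL_FLAG;
	return (field & TIMEOUT_TOTAL_FLAG) != 0;
}
```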

Any disagreements to this approach?  


RE: Handling busy responses from the SA

2010-06-08 Thread Mike Heinz
Sean -

Is there a case where we would ever want to treat BUSY responses differently from 
timeouts?



-Original Message-
From: Hefty, Sean [mailto:sean.he...@intel.com] 
Sent: Tuesday, June 08, 2010 12:27 PM
To: Mike Heinz; Hal Rosenstock
Cc: linux-rdma@vger.kernel.org
Subject: RE: Handling busy responses from the SA

 Sean - remember that this patch will still return a BUSY status to the
 caller, if retries are exhausted and the last return code was BUSY, then
 that's what the caller will get. Thus, code which sets retries to zero will
 not be affected by this patch at all.

It looks like it only returns the BUSY response if that matches with the last 
retry, otherwise, the BUSY response is dropped.  It also looks like it applies 
to all MADs, including vendor specific ones, and not just those from the SA.

- Sean


RE: Handling busy responses from the SA

2010-06-08 Thread Hefty, Sean
 Is there a case where we would ever want to treat BUSY responses differently
 from timeouts?

I doubt it for a single MAD, but I can't say what people may have implemented.  
The main difference I can think of is that a busy response requires a retry, 
whereas a timeout does not.  This affects the retry policy when multiple MADs 
are outstanding.  E.g. if there are 10 requests outstanding and the first times 
out, we may only resend the first request and increase the timeouts of the 
other 9.  If the 10 requests all receive a busy, then they must all be retried.
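A simplified model of the difference described above, with hypothetical names: a timeout costs one resend plus extended deadlines for the other outstanding requests, while a BUSY on every outstanding request costs n resends.

```c
/* Hypothetical model of one outstanding request. */
struct pending {
	int deadline_ms;
	int sends;	/* how many times this request has been sent */
};

/* Timeout on request idx: resend only that one and push the other
 * deadlines out by extend_ms. Returns the number of MADs resent. */
int on_timeout(struct pending *p, int n, int idx, int now_ms,
	       int timeout_ms, int extend_ms)
{
	for (int i = 0; i < n; i++) {
		if (i == idx) {
			p[i].sends++;
			p[i].deadline_ms = now_ms + timeout_ms;
		} else {
			p[i].deadline_ms += extend_ms;
		}
	}
	return 1;
}

/* Every request got BUSY: each one must be resent. */
int on_all_busy(struct pending *p, int n, int now_ms, int timeout_ms)
{
	for (int i = 0; i < n; i++) {
		p[i].sends++;
		p[i].deadline_ms = now_ms + timeout_ms;
	}
	return n;
}
```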

To me, it looks like it makes more sense to never send busy, except maybe when 
receive buffer space is fully consumed, but implement a more intelligent 
timeout/retry mechanism on the sender side.  The SA almost needs some sort of 
MRA-like message.

- Sean


RE: [PATCH] Handling busy responses from the SA

2010-06-08 Thread Mike Heinz
It's workable, although I really wish there was a way to handle stupid apps 
that aren't written to handle a busy response.

-Original Message-
From: Hefty, Sean [mailto:sean.he...@intel.com] 
Sent: Tuesday, June 08, 2010 12:44 PM
To: Jason Gunthorpe
Cc: Mike Heinz; linux-rdma@vger.kernel.org; e...@openfabrics.org
Subject: RE: [PATCH] Handling busy responses from the SA

 Also, I guess, it would be a good API choice if the caller could say
 'get me a reply for this mad or error within 60s' rather than specify
 details like retry counts, etc. The timeout values should be globally
 set and derived from the usual SA provided data for network transits...

I agree with this.  Within the framework of the existing umad ABI, this could 
be specified by setting the high bit in the ib_user_mad_hdr:timeout_ms field, 
assuming that no one is using that bit in practice.  The kernel could then 
freely select the retry/timeout policy for these clients, which for starters 
could include dropping BUSY responses and adjusting the timeout using an 
approach similar to what Mike mentioned in a separate email.  Kernel clients 
could be updated to use this new mode.

Any disagreements to this approach?  


Re: Handling busy responses from the SA

2010-06-08 Thread Hal Rosenstock
On Tue, Jun 8, 2010 at 12:27 PM, Hefty, Sean sean.he...@intel.com wrote:
 Sean - remember that this patch will still return a BUSY status to the
 caller, if retries are exhausted and the last return code was BUSY, then
 that's what the caller will get. Thus, code which sets retries to zero will
 not be affected by this patch at all.

 It looks like it only returns the BUSY response if that matches with the last 
 retry, otherwise, the BUSY response is dropped.  It also looks like it 
 applies to all MADs, including vendor specific ones, and not just those from 
 the SA.

Per the proposed patch, it currently includes trap represses (as
determined by ib_response_mad). Shouldn't busy be ignored for that
case ? I don't think that would be used but it seems safer to me.

-- Hal


 - Sean



RE: Handling busy responses from the SA

2010-06-08 Thread Mike Heinz
Hal,

I may be confused - but I thought the spec said there was no valid response to 
a trap repress. I interpreted

o14-3.a4: The SMA shall not send any message in response to a valid 
SubnTrapRepress() message

to mean that the SMA isn't allowed to respond with a BUSY status for a trap 
repress.

-Original Message-
From: Hal Rosenstock [mailto:hal.rosenst...@gmail.com] 
Sent: Tuesday, June 08, 2010 3:09 PM
To: Hefty, Sean
Cc: Mike Heinz; linux-rdma@vger.kernel.org
Subject: Re: Handling busy responses from the SA

On Tue, Jun 8, 2010 at 12:27 PM, Hefty, Sean <sean.he...@intel.com> wrote:
>> Sean - remember that this patch will still return a BUSY status to the
>> caller, if retries are exhausted and the last return code was BUSY, then
>> that's what the caller will get. Thus, code which sets retries to zero will
>> not be affected by this patch at all.
>
> It looks like it only returns the BUSY response if that matches with the last
> retry, otherwise, the BUSY response is dropped.  It also looks like it
> applies to all MADs, including vendor specific ones, and not just those from
> the SA.

Per the proposed patch, it currently includes trap represses (as
determined by ib_response_mad). Shouldn't busy be ignored for that
case ? I don't think that would be used but it seems safer to me.

-- Hal


> - Sean


Re: Handling busy responses from the SA

2010-06-08 Thread Roland Dreier
 > Is there a case where we would ever want to treat BUSY responses
 > differently from timeouts?

If there isn't then it's silly for the SA to ever send a BUSY response.

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


Re: Handling busy responses from the SA

2010-06-08 Thread Hal Rosenstock
Mike,

On Tue, Jun 8, 2010 at 3:59 PM, Mike Heinz <michael.he...@qlogic.com> wrote:
> Hal,
>
> I may be confused - but I thought the spec said there was no valid response
> to a trap repress. I interpreted
>
> o14-3.a4: The SMA shall not send any message in response to a valid
> SubnTrapRepress() message
>
> to mean that the SMA isn't allowed to respond with a BUSY status for a trap
> repress.

I'm referring to the receipt of the TrapRepress with busy status.
Wouldn't your patch cause the original Trap to be resent when
retries > 0? TrapRepress is essentially a response to Trap and is
classified as such by ib_response_mad. Your proposed patch treats a
busy as a timeout and can cause retry of the original sent Trap.

-- Hal
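Hal's point can be illustrated with a small user-space sketch of the drop decision that exempts TrapRepress. It is not the ib_mad code: `struct mad_hdr_sketch` and `should_drop_busy()` are hypothetical, and the sketch assumes the IBA common MAD header layout (Method in byte 3, followed by a big-endian Status field), a TrapRepress method value of 0x07, and BUSY as bit 0 of Status.

```c
#include <arpa/inet.h>  /* ntohs(), htons() */
#include <stdbool.h>
#include <stdint.h>

#define MAD_STATUS_BUSY         0x0001 /* bit 0 of the MAD Status field */
#define MAD_METHOD_TRAP_REPRESS 0x07   /* SubnTrapRepress() method (assumed) */

/* Hypothetical stand-in for the first fields of the MAD common header. */
struct mad_hdr_sketch {
	uint8_t  base_version;
	uint8_t  mgmt_class;
	uint8_t  class_version;
	uint8_t  method;
	uint16_t status;        /* big-endian, as on the wire */
};

/*
 * Return true if a response should be silently dropped so that the
 * request times out and is retried (the behavior of the proposed patch),
 * but never for a TrapRepress, where dropping it would cause the
 * original Trap to be resent.
 */
static bool should_drop_busy(const struct mad_hdr_sketch *hdr, int retries_left)
{
	bool busy = (ntohs(hdr->status) & MAD_STATUS_BUSY) != 0;

	if (hdr->method == MAD_METHOD_TRAP_REPRESS)
		return false;
	return busy && retries_left > 0;
}
```

As posted, the patch would drop such a TrapRepress whenever retries_left is non-zero; the extra method check is the kind of exemption Hal is asking about.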


> -----Original Message-----
> From: Hal Rosenstock [mailto:hal.rosenst...@gmail.com]
> Sent: Tuesday, June 08, 2010 3:09 PM
> To: Hefty, Sean
> Cc: Mike Heinz; linux-rdma@vger.kernel.org
> Subject: Re: Handling busy responses from the SA
>
> On Tue, Jun 8, 2010 at 12:27 PM, Hefty, Sean <sean.he...@intel.com> wrote:
>>> Sean - remember that this patch will still return a BUSY status to the
>>> caller, if retries are exhausted and the last return code was BUSY, then
>>> that's what the caller will get. Thus, code which sets retries to zero will
>>> not be affected by this patch at all.
>>
>> It looks like it only returns the BUSY response if that matches with the
>> last retry, otherwise, the BUSY response is dropped.  It also looks like it
>> applies to all MADs, including vendor specific ones, and not just those from
>> the SA.
>
> Per the proposed patch, it currently includes trap represses (as
> determined by ib_response_mad). Shouldn't busy be ignored for that
> case ? I don't think that would be used but it seems safer to me.
>
> -- Hal
>
>
>> - Sean




RE: [PATCH] Handling busy responses from the SA

2010-06-07 Thread Mike Heinz
Roland Dreier said:

> I don't have a strong opinion on this but it seems a bit odd.  If we're just
> going to drop the response anyway, why did the SA send it in the first place?
> On the other hand, if the SA told us it's busy, it does seem we could do
> something more sensible than retrying immediately.

The spec provides for the SA to return a BUSY response. When that happens, this 
patch causes us to wait for the original request to time out before retrying, 
not trying again immediately. In effect, we are pretending we never got the 
BUSY response and allowing the request to time out, instead.

Roland Dreier said:

> The indentation of values seems pretty crazy here.  Also I'm not sure what
> most of these defines are for?  They seem unused in this patch.

The indentation is probably from the conversion of tabs to spaces when the 
patch was pasted into the email - correcting it is no problem.  The value 
IB_MGMT_MAD_STATUS_BUSY is used in the patch, the others are defined because 
they are the other possible values for the same status field. We might as well 
define them all, for completeness.
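Only BUSY and REDIRECT_REQD are true single-bit flags; per the IBA MAD Status layout (assumed here: bit 0 busy, bit 1 redirect required, bits 2-4 a 3-bit invalid-field code), values such as 0x000c are codes that must be compared after masking rather than tested with `&`. A minimal user-space sketch of that decoding, with hypothetical helper names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Status bits and the invalid-field code mask, host byte order. */
#define MAD_STATUS_BUSY               0x0001 /* bit 0: flag */
#define MAD_STATUS_REDIRECT_REQD      0x0002 /* bit 1: flag */
#define MAD_STATUS_INVALID_FIELD_MASK 0x001c /* bits 2-4: 3-bit code */

static bool mad_status_is_busy(uint16_t status)
{
	return (status & MAD_STATUS_BUSY) != 0;
}

/* Extract the invalid-field code (0..7) from bits 2-4. */
static unsigned int mad_status_invalid_code(uint16_t status)
{
	return (status & MAD_STATUS_INVALID_FIELD_MASK) >> 2;
}

/* Human-readable meaning of the invalid-field code. */
static const char *mad_status_invalid_str(uint16_t status)
{
	switch (mad_status_invalid_code(status)) {
	case 0: return "no invalid fields";
	case 1: return "bad version";
	case 2: return "unsupported class or method";
	case 3: return "unsupported method/attribute combination";
	case 7: return "invalid attribute value";
	default: return "reserved";
	}
}
```

With this decoding, a naive test such as `status & 0x0008` would also be true for codes 3 and 7, which is why the code values are better compared than masked individually.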



RE: [PATCH] Handling busy responses from the SA

2010-06-07 Thread Mike Heinz
Sean said:
> I don't object to the concept of treating a busy response as a timeout, but
> how does this help prevent overwhelming the SA?  It continues to retry the
> queries, even if the SA says that it's too busy to respond without adjusting
> the timeout specified by the user.  I would think that you'd at least want to
> adjust the timeout (double it or use some random backoff).


Well, the current behavior is to simply return the BUSY to the client or ULP,
which is either treated as a permanent error or causes an immediate retry.
This can be a big problem with, for example, ipoib, which sets retries to 15 and
(as I understand it) immediately retries to connect when getting an error
response from the SA. Other ULPs have similar settings. Without some kind of
delay, starting up ipoib on a large fabric (at boot time, for example) can
cause a real packet storm.

By treating BUSY replies identically to timeouts, this patch at least 
introduces a delay between attempts. In the case of the ULPs, the delay is 
typically 4 seconds.
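To put these figures in numbers: if each dropped BUSY costs one full timeout, a request with 15 retries and a 4-second timeout that is answered BUSY every time reaches the caller as BUSY after about 15 * 4 = 60 seconds (the final BUSY is returned as soon as it arrives), and each client sends at most one attempt per timeout while the SA keeps answering BUSY. The helper names below are hypothetical and the node count in the usage note is only an illustrative assumption.

```c
#include <stdint.h>

/*
 * Hypothetical helper: time (ms) until a request answered BUSY on every
 * attempt finally reports BUSY to the caller, when BUSY is handled like a
 * timeout. Each dropped BUSY costs one full timeout; the last attempt's
 * BUSY is returned on arrival (response latency is ignored here).
 */
static uint64_t busy_as_timeout_delay_ms(unsigned int retries, uint32_t timeout_ms)
{
	return (uint64_t)retries * timeout_ms;
}

/*
 * Hypothetical helper: upper bound on the aggregate attempt rate (per
 * second) from `nodes` clients with one outstanding request each, while
 * every attempt waits out a full timeout.
 */
static double max_aggregate_rate(unsigned int nodes, uint32_t timeout_ms)
{
	return nodes * (1000.0 / timeout_ms);
}
```

For example, 1,000 nodes with a 4-second timeout would send at most about 250 attempts per second while the SA keeps answering BUSY, rather than retrying as fast as the BUSY replies come back.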

Sean said:
> The general guideline that we've been using for adjusting timeouts has been
> to report the failures and let the caller make the necessary adjustments.
> As far as I know, the only way for user space applications to query the SA
> is through the librdmacm, which sets retries to 0, or through the libibumad
> interface directly.  I would expect any application using the latter to be
> intelligent enough to handle a busy response.


And this approach encourages applications to adjust their timeouts 
appropriately by treating BUSY responses as non-events and forcing the 
applications to wait for their request to time out.

Depending on the application developers to take BUSY responses into account 
seems to be asking for trouble - it allows one rogue app to bring the SA to its 
knees, for example. By enforcing this timeout model in the kernel, we guarantee 
that there will be at least some delay between each message when the SA is 
reporting a busy status. And as I previously mentioned this patch also affects 
kernel code, much of which does use retries.

Sean said:
> Maybe we should re-think that guideline and allow users to simply indicate
> that the MAD layer should use reasonable defaults.  This would enable the
> ib_mad module to adjust the timeout values for all consumers based on actual
> destination response times.  It could also back off retrying multiple
> requests that were initiated around the same time, instead only retrying the
> first request, while simply increasing the timeout values for the others.
> This is more complex, but we should be able to start with something fairly
> simple.

It's an interesting idea, but in the meantime this is a problem that affects 
large clusters today.


RE: Handling busy responses from the SA

2010-06-07 Thread Mike Heinz
Hal said:
Should a busy be retried at all at the mad layer? Is a special (longer)
timeout policy for busy needed?

Also, should this be done for all MADs classified by ib_response_mad (e.g. trap
represses)?

Hal, 

The idea of processing BUSY responses in the MAD layer is to treat BUSY
responses like timeouts - which are currently handled by the MAD layer. Right now there
is an issue where various apps and ULPs either treat BUSY as a cause to 
immediately retry or as a permanent error. This doesn't seem to affect users of 
the OpenSM so much because (as I understand it) the OpenSM seems to discard 
requests when it gets too busy - but for other SA/SMs, it can cause a major 
packet storm or, worse, a simple loss of connectivity where MPI jobs or kernel 
ULPs simply assume the SA is broken because they got a BUSY reply.

By treating the BUSY reply as a timeout, we're actually simplifying matters by 
fitting into existing practice.

As for needing a longer timeout - in our old proprietary stack, QLogic did have 
a longer timeout for retrying busy replies than for normal timeouts - but we 
should try to get this in now so we can get some relief before we begin the 
long term discussion of the best way to handle this issue overall.



RE: [PATCH] Handling busy responses from the SA

2010-06-07 Thread Mike Heinz
> But, I also agree with Roland.. having the SA return busy when it is
> under load seems insane :)

In that case, what is the purpose of the BUSY response? 

-----Original Message-----
From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com] 
Sent: Friday, June 04, 2010 6:58 PM
To: Hefty, Sean
Cc: Mike Heinz; linux-rdma@vger.kernel.org; e...@openfabrics.org
Subject: Re: [PATCH] Handling busy responses from the SA

On Fri, Jun 04, 2010 at 02:05:10PM -0700, Hefty, Sean wrote:

 Maybe we should re-think that guideline and allow users to simply
 indicate that the MAD layer should use reasonable defaults.  This
 would enable the ib_mad module to adjust the timeout values for all
 consumers based on actual destination response times.  It could also
 back off retrying multiple requests that were initiated around the
 same time, instead only retrying the first request, while simply
 increasing the timeout values for the others.  This is more complex,
 but we should be able to start with something fairly simple.

A common method for handling this sort of thing is to randomize
the retry timeout. It would be a good idea to randomize all timeouts,
but the BUSY replies should probably randomize over a longer time
period.

Randomization prevents nodes in the cluster from self-synchronizing
and making the load on the SA worse.

But, I also agree with Roland.. having the SA return busy when it is
under load seems insane :) But if you really want to do this then I
think a different, larger, timeout should be used than the standard
mad timeout.

Also, I guess, it would be a good API choice if the caller could say
'get me a reply for this mad or error within 60s' rather than specify
details like retry counts, etc. The timeout values should be globally
set and derived from the usual SA provided data for network transits...

Jason


Handling busy responses from the SA

2010-06-04 Thread Mike Heinz
The purpose of this patch is to cause the ib_mad driver to discard busy 
responses from the SA, effectively causing busy responses to become time outs.

This ensures that naïve IB applications cannot overwhelm the SA with queries, 
which could happen when a cluster is being rebooted, or when a large HPC 
application is started.

Note that this patch directly changes the same code affected by the mad user 
rmpp patch - it cannot be successfully applied without that patch.

Signed-off-by: Michael Heinz <michael.he...@qlogic.com>



diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index efca783..05f2930 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -1815,9 +1815,20 @@ static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,
 	 */
 	/* Complete corresponding request */
 	if (ib_response_mad(mad_recv_wc->recv_buf.mad)) {
+		u16 busy = __be16_to_cpu(mad_recv_wc->recv_buf.mad->mad_hdr.status) &
+			   IB_MGMT_MAD_STATUS_BUSY;
+
 		spin_lock_irqsave(&mad_agent_priv->lock, flags);
 		mad_send_wr = ib_find_send_mad(mad_agent_priv, mad_recv_wc);
 		if (mad_send_wr) {
+			if (busy && mad_send_wr->retries_left) {
+				/* Just let the query timeout and have it requeued later */
+				spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
+				ib_free_recv_mad(mad_recv_wc);
+				deref_mad_agent(mad_agent_priv);
+				printk(KERN_NOTICE PFX "Response returned with MAD_STATUS_BUSY\n");
+				return;
+			}
 			ib_mark_mad_done(mad_send_wr);
 			spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 
diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h
index 2651e93..e9dc4cc 100644
--- a/include/rdma/ib_mad.h
+++ b/include/rdma/ib_mad.h
@@ -77,6 +77,15 @@
 
 #define IB_MGMT_MAX_METHODS			128
 
+/* MAD Status field bit masks */
+#define IB_MGMT_MAD_STATUS_SUCCESS			0x0000
+#define IB_MGMT_MAD_STATUS_BUSY				0x0001
+#define IB_MGMT_MAD_STATUS_REDIRECT_REQD		0x0002
+#define IB_MGMT_MAD_STATUS_BAD_VERERSION		0x0004
+#define IB_MGMT_MAD_STATUS_UNSUPPORTED_METHOD		0x0008
+#define IB_MGMT_MAD_STATUS_UNSUPPORTED_METHOD_ATTRIB	0x000c
+#define IB_MGMT_MAD_STATUS_INVALID_ATTRIB_VALUE		0x001c
+
 /* RMPP information */
 #define IB_MGMT_RMPP_VERSION			1
 #define IB_MGMT_RMPP_PASSTHRU			255


[PATCH] Handling busy responses from the SA

2010-06-04 Thread Mike Heinz
The purpose of this patch is to cause the ib_mad driver to discard busy 
responses from the SA, effectively causing busy responses to become time outs.

This ensures that naïve IB applications cannot overwhelm the SA with queries, 
which could happen when a cluster is being rebooted, or when a large HPC 
application is started.

Note that this patch directly changes the same code affected by the mad user 
rmpp patch - it cannot be successfully applied without that patch.

Signed-off-by: Michael Heinz <michael.he...@qlogic.com>



diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index efca783..05f2930 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -1815,9 +1815,20 @@ static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,
 	 */
 	/* Complete corresponding request */
 	if (ib_response_mad(mad_recv_wc->recv_buf.mad)) {
+		u16 busy = __be16_to_cpu(mad_recv_wc->recv_buf.mad->mad_hdr.status) &
+			   IB_MGMT_MAD_STATUS_BUSY;
+
 		spin_lock_irqsave(&mad_agent_priv->lock, flags);
 		mad_send_wr = ib_find_send_mad(mad_agent_priv, mad_recv_wc);
 		if (mad_send_wr) {
+			if (busy && mad_send_wr->retries_left) {
+				/* Just let the query timeout and have it requeued later */
+				spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
+				ib_free_recv_mad(mad_recv_wc);
+				deref_mad_agent(mad_agent_priv);
+				printk(KERN_NOTICE PFX "Response returned with MAD_STATUS_BUSY\n");
+				return;
+			}
 			ib_mark_mad_done(mad_send_wr);
 			spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 
diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h
index 2651e93..e9dc4cc 100644
--- a/include/rdma/ib_mad.h
+++ b/include/rdma/ib_mad.h
@@ -77,6 +77,15 @@
 
 #define IB_MGMT_MAX_METHODS			128
 
+/* MAD Status field bit masks */
+#define IB_MGMT_MAD_STATUS_SUCCESS			0x0000
+#define IB_MGMT_MAD_STATUS_BUSY				0x0001
+#define IB_MGMT_MAD_STATUS_REDIRECT_REQD		0x0002
+#define IB_MGMT_MAD_STATUS_BAD_VERERSION		0x0004
+#define IB_MGMT_MAD_STATUS_UNSUPPORTED_METHOD		0x0008
+#define IB_MGMT_MAD_STATUS_UNSUPPORTED_METHOD_ATTRIB	0x000c
+#define IB_MGMT_MAD_STATUS_INVALID_ATTRIB_VALUE		0x001c
+
 /* RMPP information */
 #define IB_MGMT_RMPP_VERSION			1
 #define IB_MGMT_RMPP_PASSTHRU			255


Re: [PATCH] Handling busy responses from the SA

2010-06-04 Thread Roland Dreier
 > The purpose of this patch is to cause the ib_mad driver to discard
 > busy responses from the SA, effectively causing busy responses to
 > become time outs.

I don't have a strong opinion on this but it seems a bit odd.  If we're
just going to drop the response anyway, why did the SA send it in the
first place?  On the other hand, if the SA told us it's busy, it does
seem we could do something more sensible than retrying immediately.

Any opinions from anyone who worked on fabric scalability?

 > +				printk(KERN_NOTICE PFX "Response returned with MAD_STATUS_BUSY\n");

Do we want to spam kernel logs with this?  Seems it could generate a lot
of messages.
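One way to address the log-spam concern is to rate-limit the notice; current Linux kernels provide printk_ratelimited() for this. The user-space sketch below only illustrates the interval/burst idea behind such a limiter: `struct log_ratelimit` and `log_ratelimit_ok()` are hypothetical, and the 5-second window with a burst of 10 used in the example is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical interval/burst limiter for log messages. */
struct log_ratelimit {
	uint64_t interval_ms;  /* length of one window */
	unsigned int burst;    /* messages allowed per window */
	uint64_t window_start; /* start of the current window, in ms */
	unsigned int printed;  /* messages emitted in the current window */
	unsigned int missed;   /* messages suppressed in the current window */
};

/* Return true if a message may be logged at time now_ms. */
static bool log_ratelimit_ok(struct log_ratelimit *rl, uint64_t now_ms)
{
	if (now_ms - rl->window_start >= rl->interval_ms) {
		/* Window expired: open a new one and reset the counters. */
		rl->window_start = now_ms;
		rl->printed = 0;
		rl->missed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}
```

With a 5000 ms window and a burst of 10, a storm of 100 BUSY responses within one second would produce 10 log lines, and the suppressed count could be reported once when the window rolls over.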

 > +#define IB_MGMT_MAD_STATUS_SUCCESS			0x0000
 > +#define IB_MGMT_MAD_STATUS_BUSY				0x0001
 > +#define IB_MGMT_MAD_STATUS_REDIRECT_REQD		0x0002
 > +#define IB_MGMT_MAD_STATUS_BAD_VERERSION		0x0004
 > +#define IB_MGMT_MAD_STATUS_UNSUPPORTED_METHOD		0x0008
 > +#define IB_MGMT_MAD_STATUS_UNSUPPORTED_METHOD_ATTRIB	0x000c
 > +#define IB_MGMT_MAD_STATUS_INVALID_ATTRIB_VALUE		0x001c

The indentation of values seems pretty crazy here.  Also I'm not sure
what most of these defines are for?  They seem unused in this patch.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


RE: [PATCH] Handling busy responses from the SA

2010-06-04 Thread Hefty, Sean
> This ensures that naïve IB applications cannot overwhelm the SA with
> queries, which could happen when a cluster is being rebooted, or when a
> large HPC application is started.

I don't object to the concept of treating a busy response as a timeout, but how 
does this help prevent overwhelming the SA?  It continues to retry the queries, 
even if the SA says that it's too busy to respond without adjusting the timeout 
specified by the user.  I would think that you'd at least want to adjust the 
timeout (double it or use some random backoff).
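The "double it" option can be sketched as a capped exponential backoff. This is an illustration, not ib_mad code: `busy_backoff_ms()` is hypothetical, and the 4-second base and 60-second cap in the usage note are assumptions.

```c
#include <stdint.h>

/*
 * Hypothetical helper: timeout for attempt number `attempt` (0 = first
 * send) when the timeout is doubled after each BUSY, capped at max_ms.
 */
static uint32_t busy_backoff_ms(uint32_t base_ms, unsigned int attempt, uint32_t max_ms)
{
	uint64_t t = base_ms;  /* 64-bit so doubling cannot overflow before the cap */

	while (attempt-- > 0 && t < max_ms)
		t *= 2;
	return t > max_ms ? max_ms : (uint32_t)t;
}
```

With a 4000 ms base and a 60000 ms cap this yields 4000, 8000, 16000, 32000 and then 60000 ms for every later attempt.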

The general guideline that we've been using for adjusting timeouts has been to 
report the failures and let the caller make the necessary adjustments.  As
far as I know, the only way for user space applications to query the SA is
through the librdmacm, which sets retries to 0, or through the libibumad
interface directly.  I would expect any application using the latter to be 
intelligent enough to handle a busy response.

Maybe we should re-think that guideline and allow users to simply indicate that 
the MAD layer should use reasonable defaults.  This would enable the ib_mad 
module to adjust the timeout values for all consumers based on actual 
destination response times.  It could also back off retrying multiple requests 
that were initiated around the same time, instead only retrying the first 
request, while simply increasing the timeout values for the others.  This is 
more complex, but we should be able to start with something fairly simple.

- Sean
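Adapting timeouts to actual destination response times resembles TCP's retransmission-timeout estimator (RFC 6298: smoothed response time plus four times its mean deviation). The sketch below borrows that technique as a swapped-in illustration; it was not proposed in this thread, and `struct resp_time_est`, `est_sample()` and `est_timeout_ms()` are hypothetical names.

```c
/* Per-destination response-time estimator (hypothetical, not part of ib_mad). */
struct resp_time_est {
	int valid;         /* 0 until the first sample arrives */
	double srtt_ms;    /* smoothed response time */
	double rttvar_ms;  /* smoothed mean deviation */
};

/* Feed one measured request/response time, RFC 6298 style (alpha 1/8, beta 1/4). */
static void est_sample(struct resp_time_est *e, double rtt_ms)
{
	if (!e->valid) {
		e->srtt_ms = rtt_ms;
		e->rttvar_ms = rtt_ms / 2;
		e->valid = 1;
		return;
	}
	double err = rtt_ms - e->srtt_ms;  /* uses the old SRTT, as in RFC 6298 */
	e->rttvar_ms += 0.25 * ((err < 0 ? -err : err) - e->rttvar_ms);
	e->srtt_ms += 0.125 * err;
}

/* Next timeout: SRTT + 4 * RTTVAR, or default_ms before any sample, clamped. */
static double est_timeout_ms(const struct resp_time_est *e, double default_ms,
			     double min_ms, double max_ms)
{
	double t = e->valid ? e->srtt_ms + 4.0 * e->rttvar_ms : default_ms;

	if (t < min_ms)
		t = min_ms;
	if (t > max_ms)
		t = max_ms;
	return t;
}
```

A MAD layer could keep one such estimator per destination and fall back to the caller's timeout until the first response has been measured.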


Re: [PATCH] Handling busy responses from the SA

2010-06-04 Thread Jason Gunthorpe
On Fri, Jun 04, 2010 at 02:05:10PM -0700, Hefty, Sean wrote:

> Maybe we should re-think that guideline and allow users to simply
> indicate that the MAD layer should use reasonable defaults.  This
> would enable the ib_mad module to adjust the timeout values for all
> consumers based on actual destination response times.  It could also
> back off retrying multiple requests that were initiated around the
> same time, instead only retrying the first request, while simply
> increasing the timeout values for the others.  This is more complex,
> but we should be able to start with something fairly simple.

A common method for handling this sort of thing is to randomize
the retry timeout. It would be a good idea to randomize all timeouts,
but the BUSY replies should probably randomize over a longer time
period.

Randomization prevents nodes in the cluster from self-synchronizing
and making the load on the SA worse.

But, I also agree with Roland.. having the SA return busy when it is
under load seems insane :) But if you really want to do this then I
think a different, larger, timeout should be used than the standard
mad timeout.
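The randomized retry with a longer window for BUSY replies might look like the sketch below. The multipliers (BUSY retries wait between 2x and 6x the normal timeout, ordinary retries between 1x and 1.5x) are assumptions chosen only for illustration, `jittered_retry_ms()` is hypothetical, and rand() is used for brevity; its modulo bias and limited range (RAND_MAX may be as small as 32767) would matter in real code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: pick a retry delay uniformly in [base, base + spread).
 * BUSY replies use a larger base and spread than ordinary timeouts so that
 * nodes which were all told BUSY at once do not retry in lockstep.
 */
static uint32_t jittered_retry_ms(bool busy, uint32_t timeout_ms)
{
	uint32_t base = busy ? 2 * timeout_ms : timeout_ms;
	uint32_t spread = busy ? 4 * timeout_ms : timeout_ms / 2;

	return base + (uint32_t)(rand() % (spread ? spread : 1));
}
```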

Also, I guess, it would be a good API choice if the caller could say
'get me a reply for this mad or error within 60s' rather than specify
details like retry counts, etc. The timeout values should be globally
set and derived from the usual SA provided data for network transits...
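A "reply or error within 60s" interface could be mapped onto the existing retry mechanism by deriving the retry count from the caller's deadline and a per-attempt timeout (for example, one derived from SA-provided transit data). `struct retry_plan` and `plan_for_deadline()` are hypothetical names used only for this sketch.

```c
#include <stdint.h>

/* Hypothetical retry parameters derived from a caller's deadline. */
struct retry_plan {
	uint32_t timeout_ms;   /* per-attempt timeout */
	unsigned int retries;  /* attempts after the first one */
};

/* Split an overall deadline into whole attempts of per_attempt_ms each. */
static struct retry_plan plan_for_deadline(uint32_t deadline_ms, uint32_t per_attempt_ms)
{
	struct retry_plan p;
	uint32_t attempts;

	p.timeout_ms = per_attempt_ms ? per_attempt_ms : 1;
	attempts = deadline_ms / p.timeout_ms;
	if (attempts == 0) {
		/* Deadline shorter than one attempt: make one shorter attempt. */
		attempts = 1;
		p.timeout_ms = deadline_ms ? deadline_ms : 1;
	}
	p.retries = attempts - 1;
	return p;
}
```

For a 60-second deadline and an assumed 4-second per-attempt timeout this gives 14 retries (15 attempts in total).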

Jason