[PATCH] scsi: Allow error handling timeout to be specified

2013-05-09 Thread Martin K. Petersen
Introduce eh_timeout which can be used for error handling purposes. This was previously hardcoded to 10 seconds in the SCSI error handling code. However, for some fast-fail scenarios it is necessary to be able to tune this as it can take several iterations (bus device, target, bus, controller) bef

RE: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-13 Thread Elliott, Robert (Server Storage)
> > On Fri, 2013-05-10 at 16:24 +0200, Hannes Reinecke wrote:
> > On 05/10/2013 04:01 PM, Ewan Milne wrote:
> > > On Fri, 2013-05-10 at 16:22 +0300, Baruch Even wrote:
> > >> On Fri, May 10, 2013 at 3:43

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-13 Thread Jeremy Linton
On 5/13/2013 10:03 AM, Hannes Reinecke wrote:
> The other LUNs haven't reported an error. But how do you know whether they
> are still okay? The other LUNs might simply be idle, and no commands have
> been sent to them.

Well, how about generating std inquiry against them if they are idle

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-13 Thread Baruch Even
On Mon, May 13, 2013 at 6:58 PM, Jeremy Linton wrote:
> On 5/13/2013 10:03 AM, Hannes Reinecke wrote:
>> The other LUNs haven't reported an error. But how do you know whether they
>> are still okay? The other LUNs might simply be idle, and no commands have
>> been sent to them.
>
> Well, h

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-13 Thread Martin K. Petersen
> "Jeremy" == Jeremy Linton writes:

Jeremy> Well, how about generating std inquiry against them if they are
Jeremy> idle and the given HBA has a device in error state? Then you can
Jeremy> make a rough approximation of what has failed, and escalate the
Jeremy> error handling if all the device

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-13 Thread Jeremy Linton
On 5/13/2013 3:29 PM, Martin K. Petersen wrote:
> others. We see cases fairly often where a misbehaving target has
> confused the HBA enough that we can not bring the device back without
> doing an HBA firmware reset. Despite I/O completing successfully on
> other targets connected to the same HBA

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-14 Thread Martin K. Petersen
> "Jeremy" == Jeremy Linton writes:

>> others. We see cases fairly often where a misbehaving target has
>> confused the HBA enough that we can not bring the device back without
>> doing an HBA firmware reset. Despite I/O completing successfully on
>> other targets connected to the same HBA.

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-09 Thread Bart Van Assche
On 05/10/13 05:11, Martin K. Petersen wrote:
Introduce eh_timeout which can be used for error handling purposes. This was previously hardcoded to 10 seconds in the SCSI error handling code. However, for some fast-fail scenarios it is necessary to be able to tune this as it can take several iterat

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-10 Thread Ewan Milne
On Thu, 2013-05-09 at 23:11 -0400, Martin K. Petersen wrote:
> Introduce eh_timeout which can be used for error handling purposes. This
> was previously hardcoded to 10 seconds in the SCSI error handling
> code. However, for some fast-fail scenarios it is necessary to be able
> to tune this as it c

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-10 Thread Hannes Reinecke
On 05/10/2013 02:43 PM, Ewan Milne wrote:
> On Thu, 2013-05-09 at 23:11 -0400, Martin K. Petersen wrote:
>> Introduce eh_timeout which can be used for error handling purposes. This
>> was previously hardcoded to 10 seconds in the SCSI error handling
>> code. However, for some fast-fail scenarios it

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-10 Thread Bryn M. Reeves
On 05/10/2013 01:43 PM, Ewan Milne wrote:
On Thu, 2013-05-09 at 23:11 -0400, Martin K. Petersen wrote:
Introduce eh_timeout which can be used for error handling purposes. This was previously hardcoded to 10 seconds in the SCSI error handling code. However, for some fast-fail scenarios it is nece

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-10 Thread Baruch Even
On Fri, May 10, 2013 at 3:43 PM, Ewan Milne wrote:
>
> On Thu, 2013-05-09 at 23:11 -0400, Martin K. Petersen wrote:
> > Introduce eh_timeout which can be used for error handling purposes. This
> > was previously hardcoded to 10 seconds in the SCSI error handling
> > code. However, for some fast-fa

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-10 Thread Ewan Milne
On Fri, 2013-05-10 at 16:22 +0300, Baruch Even wrote:
> On Fri, May 10, 2013 at 3:43 PM, Ewan Milne wrote:
> >
> > On Thu, 2013-05-09 at 23:11 -0400, Martin K. Petersen wrote:
> > > Introduce eh_timeout which can be used for error handling purposes. This
> > > was previously hardcoded to 10 second

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-10 Thread Hannes Reinecke
On 05/10/2013 04:01 PM, Ewan Milne wrote:
> On Fri, 2013-05-10 at 16:22 +0300, Baruch Even wrote:
>> On Fri, May 10, 2013 at 3:43 PM, Ewan Milne wrote:
>>>
>>> On Thu, 2013-05-09 at 23:11 -0400, Martin K. Petersen wrote:
>>>> Introduce eh_timeout which can be used for error handling purposes. This

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-10 Thread Bryn M. Reeves
On 05/10/2013 03:24 PM, Hannes Reinecke wrote:
However, this time is only defined _on the initiator_. The specification does _NOT_ have any fixed timeout values for _any_ command. As such it could in theory (and does, if you happen to run against certain arrays under certain conditions) take seve

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-10 Thread Martin K. Petersen
> "Bart" == Bart Van Assche writes:

Bart> Have you considered to move the eh_timeout assignment statement to
Bart> just before the transport_configure_device() and slave_configure()
Bart> calls ? That would allow transport drivers and LLD drivers to
Bart> override the default eh_timeout valu

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-10 Thread Martin K. Petersen
> "Baruch" == Baruch Even writes:

Baruch> Actually reducing the timeouts is probably not a good approach
Baruch> since it will cause the host to take a more radical approach
Baruch> without waiting sufficiently for a potential recovery.

Reducing the eh timeout is a requirement in many cluste

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-10 Thread Martin K. Petersen
> "Martin" == Martin K Petersen writes:

Martin> I'm also working on a patch to add some heuristics to avoid the
Martin> HBA and bus resets

Or rather: Defer the HBA and bus resets...

Martin> if I/O is completing successfully on other attached targets. But
Martin> that's an orthogonal issue.

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-10 Thread Ewan Milne
On Fri, 2013-05-10 at 16:24 +0200, Hannes Reinecke wrote:
> On 05/10/2013 04:01 PM, Ewan Milne wrote:
> > On Fri, 2013-05-10 at 16:22 +0300, Baruch Even wrote:
> >> On Fri, May 10, 2013 at 3:43 PM, Ewan Milne wrote:
> >>>
> >>> On Thu, 2013-05-09 at 23:11 -0400, Martin K. Petersen wrote:
> I

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-10 Thread Baruch Even
On Fri, May 10, 2013 at 5:01 PM, Ewan Milne wrote:
> On Fri, 2013-05-10 at 16:22 +0300, Baruch Even wrote:
>> On Fri, May 10, 2013 at 3:43 PM, Ewan Milne wrote:
>> >
>> > On Thu, 2013-05-09 at 23:11 -0400, Martin K. Petersen wrote:
>> > > Introduce eh_timeout which can be used for error handling

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-10 Thread Baruch Even
On Fri, May 10, 2013 at 5:53 PM, Martin K. Petersen wrote:
>> "Baruch" == Baruch Even writes:
>
> Baruch> Actually reducing the timeouts is probably not a good approach
> Baruch> since it will cause the host to take a more radical approach
> Baruch> without waiting sufficiently for a potentia

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-10 Thread Hannes Reinecke
On 05/10/2013 07:51 PM, Baruch Even wrote:
On Fri, May 10, 2013 at 5:01 PM, Ewan Milne wrote:
On Fri, 2013-05-10 at 16:22 +0300, Baruch Even wrote:
On Fri, May 10, 2013 at 3:43 PM, Ewan Milne wrote:
On Thu, 2013-05-09 at 23:11 -0400, Martin K. Petersen wrote:
Introduce eh_timeout which can

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-10 Thread Baruch Even
On Fri, May 10, 2013 at 11:18 PM, Hannes Reinecke wrote:
> On 05/10/2013 07:51 PM, Baruch Even wrote:
>>
>> The error handling I have in mind (admittedly, not fully thought out)
>> should work for both FC and SAS. Currently the error recovery
>> progresses at the host level regardless of if the er

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-12 Thread Hannes Reinecke
On 05/10/2013 09:27 PM, Baruch Even wrote:
> On Fri, May 10, 2013 at 11:18 PM, Hannes Reinecke wrote:
>> On 05/10/2013 07:51 PM, Baruch Even wrote:
>>>
>>> The error handling I have in mind (admittedly, not fully thought out)
>>> should work for both FC and SAS. Currently the error recovery
>>> pr

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-13 Thread Jeremy Linton
On 5/13/2013 12:46 AM, Hannes Reinecke wrote:
> True. But at the end of the day, we _do_ want to recover the failed LUN.
> If we were to disable that faulty LUN and continue running with the others
> we won't have a chance of _ever_ recovering that one LUN.

I don't buy this. Especially f

Re: [PATCH] scsi: Allow error handling timeout to be specified

2013-05-13 Thread Hannes Reinecke
On 05/13/2013 04:40 PM, Jeremy Linton wrote:
> On 5/13/2013 12:46 AM, Hannes Reinecke wrote:
>
>> True. But at the end of the day, we _do_ want to recover the failed LUN.
>> If we were to disable that faulty LUN and continue running with the others
>> we won't have a chance of _ever_ recovering t