Re: [PATCH] scsi device recovery

Bernd Schubert Wed, 12 Dec 2007 09:55:59 -0800

[Hmm, resending since mail after more than 30min still not on the ML, maybe 
the attachment was too large? I have uploaded the log to 
http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/scsi/kern.log.1]


On Wednesday 12 December 2007 16:59:36 James Bottomley wrote:
> On Wed, 2007-12-12 at 15:36 +0100, Bernd Schubert wrote:
> > On Wednesday 12 December 2007 14:39:27 Matthew Wilcox wrote:
> > > On Wed, Dec 12, 2007 at 01:54:14PM +0100, Bernd Schubert wrote:
> > > > below is a patch introducing device recovery, trying to prevent i/o
> > > > errors when a DID_NO_CONNECT or SOFT_ERROR does happen.
> > >
> > > Why doesn't the regular scsi_eh do what you need?
> >
> > First of all, it is presently simply not called when the two errors above
> > do happen. This could be changed, of course.
>
> Erm, I think you'll find the error handler does activate on
> DID_SOFT_ERROR.  It causes a retry via the eh.  DID_NO_CONNECT is an

Dec  7 23:48:45 beo-96 kernel: [94605.297924] sd 2:0:5:0: [sdd] Result: 
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
Dec  7 23:48:45 beo-96 kernel: [94605.297932] end_request: I/O error, dev sdd, 
sector 7706802052
Dec  7 23:48:45 beo-96 kernel: [94605.297937] raid5:md5: read error not 
correctable (sector 871932472 on sdd3).

Full log attached.

> immediate error with no eh intervention because it means that the target
> went away.  Handling this as a retryable error isn't an option because
> it will interfere with hotplug.

Then we need a sysfs flag one can set to manually enable eh for these devices
on DID_NO_CONNECT. 

>
> > Secondly, I think scsi_eh is in most cases doing too much. We are
> > fighting with flaky Infortrend boxes here, and scsi_eh sometimes manages
> > to crash their scsi channels. In most cases it is sufficient to stall any
> > io to the device and then to resume.
>
> But that's basically the default behaviour of the error handler (stall
> then resume).
>
> > For most scsi devices one probably doesn't need a suspend time or it can
> > be very small, this still needs to become configurable via sysfs.
>
> You mean a wait time beyond what the error handler currently does
> (basically it waits for the quiesce, begins error handling and then
> sends a test unit ready when it finishes before restarting).

In deh just waits on the first error and then only does a DV. For 
these infortrend devices, thats mostly sufficient.

>
> > Thirdly, scsi_eh doesn't give up, in most cases, when the scsi channel of
> > a Infortrend box crashed, it tried forever to recover.
> > To improve this is still on my todo list.
>
> Could you send traces for this.  I thought the error handler had been
> fixed over the last few years always to terminate.  If there's a case
> where it doesn't, this needs fixing.

I'm attaching the syslog, this is 2.6.22 + additional printks, dump_stack()'s
and msleep()'s.
At 03:59:36 the system finally went into wait_for_completion(), similar
to the "everything in wait_for_completion, what is my system doing?" thread.


Thanks,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] scsi device recovery

Reply via email to