Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-02-02 Thread Ric Wheeler



James Bottomley wrote:


On Thu, 2007-02-01 at 15:02 -0500, Mark Lord wrote:
 


I believe you made the first change in response to my prodding at the time,
when libata was not returning valid sense data (no LBA) for media errors.
The SCSI EH handling of that was rather poor at the time,
and so having it not retry the remaining sectors was actually
a very good fix at the time.

But now, libata *does* return valid sense data for LBA/DMA drives,
and the workaround from circa 2.6.16 is no longer the best we can do.
Now that we know which sector failed, we ought to be able to skip
over it, and continue with the rest of the merged request.
   



We can ... the big concern with your approach, which you haven't
addressed, is the time factor.  For most SCSI devices, returning a fatal
MEDIUM ERROR means we're out of remapping table, and also that there's
probably a bunch of sectors on the track that are now out.  Thus, there
are almost always multiple sector failures.  In Linux, the average
request size on a filesystem is around 64-128kB; that's 128-256 sectors.
If we fail at the initial sector, we have to go through another 128-256
attempts, with the internal device retries, before we fail the entire
request.  Some devices can take a second or so for each read before they
finally give up and decide they really can't read the sector, so you're
looking at 2-5 minutes before the machine finally fails this one
request ... and much worse for devices that retry more times.
 



This is not the case on a read error - we commonly see transient errors 
on reads from disks.  What we push our vendors to do is to try to keep 
the worst-case response down to tens of seconds as they retry, etc., 
internally in the drive. When they take that long (and they do), adding 
retries up the stack can translate into minutes per sector.


The interesting point of this question is about the typical pattern of 
IO errors. On a read, it is safe to assume that you will have issues 
with some bounded number of adjacent sectors.


 


One thing that could be even better than the patch below,
would be to have it perhaps skip the entire bio that includes
the failed sector, rather than only the bad sector itself.
   



Er ... define "skip over the bio".  A bio is simply a block
representation for a bunch of sg elements coming into the elevator.
Mostly what we see in SCSI is a single bio per request, so skipping the
bio is really the current behaviour (to fail the rest of the request).
 

This is really a tricky one - what can happen when we fail merged IO 
requests is really unpredictable behavior up at the application level 
since the IO error might not be at all relevant to my part of the 
request.  Merging can produce a request that is much larger than any 
normal drive error.


I really like the idea of being able to set this kind of policy on a per 
drive instance since what you want here will change depending on what 
your system requirements are, what the system is trying to do (i.e., 
when trying to recover a failing but not dead yet disk, IO errors should 
be as quick as possible and we should choose an IO scheduler that does 
not combine IO's).


 


I think doing that might address most concerns expressed here.
Have you got an alternate suggestion, James?
   



James
 




Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-02-02 Thread Alan
 The interesting point of this question is about the typically pattern of 
 IO errors. On a read, it is safe to assume that you will have issues 
 with some bounded numbers of adjacent sectors.

Which in theory you can get by asking the drive for the real sector size
from the ATA7 info. (We ought to dig this out more, as it's relevant for
partition layout too).
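
As a rough sketch of that lookup, assuming the 256 little-endian words
from IDENTIFY DEVICE are already in hand (fetching them -- HDIO ioctl,
ATA pass-through, whatever -- is left out), and with a made-up helper name:

#include <stdint.h>

/* Physical sector size in bytes, per ATA-7/ATA8-ACS words 106 and 117-118. */
static unsigned int ata_phys_sector_bytes(const uint16_t *id)
{
	uint16_t w106 = id[106];
	unsigned int logical = 512;
	unsigned int log_per_phys = 1;

	/* Word 106 is only meaningful when bit 14 is set and bit 15 is clear. */
	if ((w106 & 0xc000) != 0x4000)
		return 512;			/* assume classic 512-byte sectors */

	if (w106 & (1 << 12))			/* logical sector longer than 256 words */
		logical = (((uint32_t)id[118] << 16) | id[117]) * 2;

	if (w106 & (1 << 13))			/* 2^N logical sectors per physical */
		log_per_phys = 1u << (w106 & 0x0f);

	return logical * log_per_phys;
}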

 I really like the idea of being able to set this kind of policy on a per 
 drive instance since what you want here will change depending on what 
 your system requirements are, what the system is trying to do (i.e., 
 when trying to recover a failing but not dead yet disk, IO errors should 
 be as quick as possible and we should choose an IO scheduler that does 
 not combine IO's).

That seems to be arguing for a bounded lifetime, including retry run
time, for a command. That's also more intuitive for real-time work and for
end-user setup: either work or fail within n seconds.


Alan


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-02-02 Thread Alan
 your system requirements are, what the system is trying to do (i.e., 
 when trying to recover a failing but not dead yet disk, IO errors should 
 be as quick as possible and we should choose an IO scheduler that does 
 not combine IO's).

If this is the right strategy for disk recovery for a given type of
device then this ought to be an automatic strategy. Most end users will
not have the knowledge to frob about in sysfs, and if the bad sector hits
at the wrong moment a sensible automatic recovery strategy is going to do
the right thing faster than human intervention, which may be some hours
away.



Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-02-02 Thread Mark Lord

Alan wrote:


If this is the right strategy for disk recovery for a given type of
device then this ought to be an automatic strategy. Most end users will
not have the knowledge to frob about in sysfs, and if the bad sector hits
at the wrong moment a sensible automatic recovery strategy is going to do
the right thing faster than human intervention, which may be some hours
away.


I think what we're really discussing here are the opposing aims
of enterprise RAID (the traditional SCSI market) versus desktop filesystems
(now the number one user of Linux SCSI, courtesy of libata).

With RAID, it's safe to suggest that a very fast, bounded exit from EH
is almost always what the customer / end-user wants, because the higher
level RAID management can then deal with the failed drive offline.

But for a desktop filesystem, failing out quickly and bulk-failing megabytes
around a couple of bad sectors is not good -- no redundancy, and this will
lead to unneeded data loss.

It's beginning to look like this needs to be run-time tuneable,
per block minor device (per-partition), through sysfs at a minimum.
The RAID tools could then automatically choose settings to bias more
towards an instant exit when errors are found, whereas a non-RAID
desktop could select a more reliable recovery strategy.

Right now, with James' patch (earlier in this thread), the whole scheme
is heavily weighted towards the RAID instant-exit strategy, which is
making me quite nervous about the data on my laptop.

Cheers 


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-02-02 Thread Ric Wheeler



James Bottomley wrote:

On Fri, 2007-02-02 at 14:42 +, Alan wrote:
  
The interesting point of this question is about the typically pattern of 
IO errors. On a read, it is safe to assume that you will have issues 
with some bounded numbers of adjacent sectors.
  

Which in theory you can get by asking the drive for the real sector size
from the ATA7 info. (We ought to dig this out more as its relevant for
partition layout too).



Actually, my point is that damage typically impacts a cluster of disk 
sectors that are adjacent. Think of a drive that has junk on the platter 
or some such thing - the contamination is likely to be localized.

I really like the idea of being able to set this kind of policy on a per 
drive instance since what you want here will change depending on what 
your system requirements are, what the system is trying to do (i.e., 
when trying to recover a failing but not dead yet disk, IO errors should 
be as quick as possible and we should choose an IO scheduler that does 
not combine IO's).
  

That seems to be arguing for a bounded live time including retry run
time for a command. That's also more intuitive for real time work and for
end user setup. Either work or fail within n seconds



Actually, then I think perhaps we use the allowed retries for this ...
  
I really am not a big retry fan for most modern drives - the drive will 
try really, really hard to complete an IO for us, and multiple retries 
just delay the higher-level application's recovery.

So you would fail a single sector and count it against the retries.
When you've done this allowed retries times, you fail the rest of the
request.

James

  
I think that we need to play with some of these possible solutions on 
some  real-world bad drives and see how they react. 


We should definitely talk more about this at the workshop ;-)

ric



Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-02-02 Thread Matt Mackall
On Fri, Feb 02, 2007 at 11:06:19AM -0500, Mark Lord wrote:
 Alan wrote:
 
 If this is the right strategy for disk recovery for a given type of
 device then this ought to be an automatic strategy. Most end users will
 not have the knowledge to frob about in sysfs, and if the bad sector hits
 at the wrong moment a sensible automatic recovery strategy is going to do
 the right thing faster than human intervention, which may be some hours
 away.
 
 I think something we seem to be discussing here are the opposite aims
 of enterprise RAID (traditional SCSI market) versus desktop filesystems
 (now the number one user of Linux SCSI, courtesy of libata).
 
 With RAID, it's safe to suggest that a very fast, bounded exit from EH
 is almost always what the customer / end-user wants, because the higher
 level RAID management can then deal with the failed drive offline.
 
 But for a desktop filesystem, failing out quickly and bulk-failing megabytes
 around a couple of bad sectors is not good -- no redundancy, and this will
 lead to unneeded data loss.
 
 It's beginning to look like this needs to be run-time tuneable,
 per block minor device (per-partition), through sysfs at a minimum.
 The RAID tools could then automatically choose settings to bias more
 towards an instant exit when errors are found, whereas a non-RAID
 desktop could select a more reliable recovery strategy.
 
 Right now, with Jame's patch (earlier in this thread), the whole scheme
 is heavily weighted towards the RAID instant exit strategy, which is
 making me quite nervous about the data on my laptop.

Also worth considering is that spending minutes trying to reread
damaged sectors is likely to accelerate your death spiral. More data
may be recoverable if you give up quickly in a first pass, then go
back and manually retry damaged bits with smaller I/Os.

Another approach is to have separate "retries per hard sector" and
"hard errors per MB" limits at the block device level. We don't want to have
the same number of retries for a 64KB block as for a 1MB block, and we
certainly don't want to do 2K retries in a row.

So for instance, I could have the first set to 1 and the second set to
16. For a 256KB read, this gets scaled down to 4, which means we'll
retry each sector once and fail the whole I/O when we hit the 5th
sector error.

More reasonable numbers might be 0 and 1, meaning don't do OS-level
retries on sectors and fail the whole I/O on the second sector error
(or immediately for smaller reads).
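
A small sketch of how those two knobs might scale with request size, purely
to illustrate the arithmetic above -- the structure and helper names are
made up, nothing like them exists in the tree:

#include <stdint.h>

struct bad_sector_policy {
	unsigned int retries_per_sector;	/* e.g. 0 or 1 */
	unsigned int hard_errors_per_mb;	/* e.g. 1 or 16 */
};

/* Failed sectors to tolerate before failing the rest of the request:
 * 16 errors/MB on a 256KB request scales down to 4, as in the example
 * above; for small requests this reaches 0, i.e. give up on the rest of
 * the I/O at the first bad sector. */
static unsigned int error_budget(const struct bad_sector_policy *p,
				 uint64_t request_bytes)
{
	return (unsigned int)((p->hard_errors_per_mb * request_bytes) >> 20);
}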

It might also be informative to add a kernel message reporting if a retry
(in the non-transient cases) ever actually succeeds. 

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-02-02 Thread Douglas Gilbert
Alan wrote:
 The interesting point of this question is about the typically pattern of 
 IO errors. On a read, it is safe to assume that you will have issues 
 with some bounded numbers of adjacent sectors.
 
 Which in theory you can get by asking the drive for the real sector size
 from the ATA7 info. (We ought to dig this out more as its relevant for
 partition layout too).
 
 I really like the idea of being able to set this kind of policy on a per 
 drive instance since what you want here will change depending on what 
 your system requirements are, what the system is trying to do (i.e., 
 when trying to recover a failing but not dead yet disk, IO errors should 
 be as quick as possible and we should choose an IO scheduler that does 
 not combine IO's).
 
 That seems to be arguing for a bounded live time including retry run
 time for a command. That's also more intuitive for real time work and for
 end user setup. Either work or fail within n seconds

Which is more or less the streaming feature set in recent
ATA standards. [Alas, streaming and NCQ/TCQ can't be done
with the same access.] SCSI has its Read-Write Error Recovery
mode page, which doesn't have timeouts but does have Read
and Write Retry Counts amongst other fields that control
the amount (and indirectly the time) of attempted error
recovery.

Doug Gilbert




Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-02-02 Thread Mark Lord

Matt Mackall wrote:

..
Also worth considering is that spending minutes trying to reread
damaged sectors is likely to accelerate your death spiral. More data
may be recoverable if you give up quickly in a first pass, then go
back and manually retry damaged bits with smaller I/Os.


All good input.  But what was being debated here is not so much
the retrying of known-bad sectors, but rather what to do about
the kiBs or MiBs of sectors remaining in a merged request after
hitting a single bad sector mid-way.

Currently, SCSI just abandons the entire remaining workload.

Cheers


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-02-02 Thread Matt Mackall
On Fri, Feb 02, 2007 at 05:58:04PM -0500, Mark Lord wrote:
 Matt Mackall wrote:
 ..
 Also worth considering is that spending minutes trying to reread
 damaged sectors is likely to accelerate your death spiral. More data
 may be recoverable if you give up quickly in a first pass, then go
 back and manually retry damaged bits with smaller I/Os.
 
 All good input.  But what was being debated here is not so much
 the retrying of known-bad sectors, but rather what to do about
 the kiBs or MiBs of sectors remaining in a merged request after
 hitting a single bad sector mid-way.

Yep, that's precisely what was addressed in the part you snipped. 

My main point being that what to do about the remaining workload
should be dependent on the size of the I/O. If we encounter errors on
sectors 4,5,6,7,8.. of a 1MB request, we should have a threshold for
giving up. It's not unreasonable for that threshold to be larger than
1, but it should not be 2048.

And if we do the I/O as four 256KB requests, we should have
approximately the same number of retries (assuming the whole region's
bad).

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-02-01 Thread Mark Lord

James Bottomley wrote:

On Tue, 2007-01-30 at 19:47 -0500, Mark Lord wrote:

Kernels since about 2.6.16 or so have been broken in this regard.
They complete the good sectors before the error,
and then fail the entire remaining portions of the request.


What was the commit that introduced the change? ... I have a vague
memory of it being deliberate.


I believe you made the first change in response to my prodding at the time,
when libata was not returning valid sense data (no LBA) for media errors.
The SCSI EH handling of that was rather poor at the time,
and so having it not retry the remaining sectors was actually
a very good fix at the time.

But now, libata *does* return valid sense data for LBA/DMA drives,
and the workaround from circa 2.6.16 is no longer the best we can do.
Now that we know which sector failed, we ought to be able to skip
over it, and continue with the rest of the merged request.

One thing that could be even better than the patch below,
would be to have it perhaps skip the entire bio that includes
the failed sector, rather than only the bad sector itself.

I think doing that might address most concerns expressed here.
Have you got an alternate suggestion, James?

..

Signed-off-by:  Mark Lord [EMAIL PROTECTED]
---
diff -u --recursive --new-file --exclude-from=linux_17//Documentation/dontdiff old/drivers/scsi/scsi_lib.c linux/drivers/scsi/scsi_lib.c
--- old/drivers/scsi/scsi_lib.c 2007-01-30 13:58:05.0 -0500
+++ linux/drivers/scsi/scsi_lib.c   2007-01-30 18:30:01.0 -0500
@@ -865,6 +865,12 @@
 		 */
 		if (sense_valid && !sense_deferred) {
 			switch (sshdr.sense_key) {
+			case MEDIUM_ERROR:
+				// Bad sector.  Fail it, and then continue the rest of the request:
+				if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) {


The sense key may have come with additional information; I think we want
to parse that (if it exists) rather than just blindly failing the first
sector of the request.


+				cmd->retries = 0;	// go around again..
+				return;
+			}


This would drop through to the UNIT_ATTENTION case if scsi_end_request()
fails ... I don't think that's correct.


 			case UNIT_ATTENTION:
 				if (cmd->device->removable) {
 					/* Detected disc change.  Set a bit




Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-02-01 Thread Mark Lord

James Bottomley wrote:

On Thu, 2007-02-01 at 15:02 -0500, Mark Lord wrote:
..

One thing that could be even better than the patch below,
would be to have it perhaps skip the entire bio that includes
the failed sector, rather than only the bad sector itself.


Er ... define skip over the bio.  A bio is simply a block
representation for a bunch of sg elements coming in to the elevator.


Exactly.  Or rather, a block of sg_elements from a single point
of request, is it not?


Mostly what we see in SCSI is a single bio per request, so skipping the
bio is really the current behaviour (to fail the rest of the request).


Very good.  That's what it's supposed to do.

But if each request contained only a single bio, then all of Jens'
work on IO scheduling would be for nothing, n'est-ce pas?

In the case where a request consists of multiple bio's
which have been merged under a single request struct,
we really should give at least one attempt to each bio.

This way, in most cases, only the process that requested the
failed sector(s) will see an error, not the innocent victims
that happened to get merged onto the end.  Which could be very
critical stuff (or not -- it could be quite random).

So the time factor works out to one disk I/O timeout per failed bio.
That's what would have happened with the NOP scheduler anyway.
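
As a rough illustration of the idea only (not a tested patch; field names
follow the 2.6.2x block layer), finding how many bytes remain in the bio
that contains the failing sector might look something like this:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Bytes from bad_sector to the end of the bio containing it, so the
 * completion path could fail that whole bio rather than one sector. */
static unsigned int bytes_to_end_of_bad_bio(struct request *req,
					    sector_t bad_sector)
{
	struct bio *bio;

	rq_for_each_bio(bio, req) {
		sector_t start = bio->bi_sector;
		sector_t end = start + (bio->bi_size >> 9);

		if (bad_sector >= start && bad_sector < end)
			return (unsigned int)(end - bad_sector) << 9;
	}
	return 512;	/* not found: fall back to a single sector */
}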

On the systems I'm working with, I don't see huge numbers of bad sectors.
What they tend to show is just one or two bad sectors, widely scattered.

So:

I think doing that might address most concerns expressed here.
Have you got an alternate suggestion, James?


Cheers


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Jeff Garzik

Mark Lord wrote:

Eric D. Mudama wrote:


Actually, it's possibly worse, since each failure in libata will 
generate 3-4 retries.  With existing ATA error recovery in the drives, 
that's about 3 seconds per retry on average, or 12 seconds per 
failure.  Multiply that by the number of blocks past the error to 
complete the request..


It really beats the alternative of a forced reboot
due to, say, superblock I/O failing because it happened
to get merged with an unrelated I/O which then failed..
Etc..



FWIW -- speaking generally -- I think there are inevitable areas where 
libata error handling combined with SCSI error handling results in 
suboptimal error handling.


Just creating a list of "this behavior should be handled this way, 
but in reality is handled in this silly way" would be very helpful.


Error handling is tough to get right, because the code is exercised so 
infrequently.  Tejun has actually done an above-average job here, by 
making device probe, hotplug and other exceptions go through the 
libata EH code, thereby exercising the EH code more than one might 
normally assume.


Some errors in libata probably should not be retried more than once, 
when we have a definitive diagnosis.  Suggestions for improvements are 
welcome.


Jeff




Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Ric Wheeler



Jeff Garzik wrote:

Mark Lord wrote:

Eric D. Mudama wrote:


Actually, it's possibly worse, since each failure in libata will 
generate 3-4 retries.  With existing ATA error recovery in the 
drives, that's about 3 seconds per retry on average, or 12 seconds 
per failure.  Multiply that by the number of blocks past the error to 
complete the request..


It really beats the alternative of a forced reboot
due to, say, superblock I/O failing because it happened
to get merged with an unrelated I/O which then failed..
Etc..



FWIW -- speaking generally -- I think there are inevitable areas where 
libata error handling combined with SCSI error handling results in 
suboptimal error handling.


Just creating a list of this behavior should be handled this way, 
but in reality is handled in this silly way would be very helpful.


I agree - Tejun has done a great job at giving us a great base. Next step is to 
get clarity on what the types of errors are and how to differentiate between 
them (and maybe how that would change by class of device?).




Error handling is tough to get right, because the code is exercised so 
infrequently.  Tejun has actually done an above-average job here, by 
making device probe, hotplug and other exceptions go through the 
libata EH code, thereby exercising the EH code more than one might 
normally assume.


Some errors in libata probably should not be retried more than once, 
when we have a definitive diagnosis.  Suggestions for improvements are 
welcome.


Jeff


One thing that we find really useful is to inject real errors into devices. Mark 
has some patches that let us inject media errors; we also bring back failed 
drives and run them through testing, and occasionally get to use analyzers, etc., 
to inject odd-ball errors.


Hopefully, we will get some time to brainstorm about this at the workshop,

ric


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Mark Lord

Ric Wheeler wrote:

Mark Lord wrote:

Eric D. Mudama wrote:
Actually, it's possibly worse, since each failure in libata will 
generate 3-4 retries. 


(note: libata does *not* generate retries for medium errors;
the looping is driven by the SCSI mid-layer code).


It really beats the alternative of a forced reboot
due to, say, superblock I/O failing because it happened
to get merged with an unrelated I/O which then failed..
Etc..

Definitely an improvement.

The number of retries is an entirely separate issue.
If we really care about it, then we should fix SD_MAX_RETRIES.

The current value of 5 is *way* too high.  It should be zero or one.

..
I think that drives retry enough; we should leave retry at zero for 
normal (non-removable) drives. Should this be a policy we can set, like 
we do with NCQ queue depth, via /sys?


Or perhaps we could have the mid-layer always early-exit
without retries for MEDIUM_ERROR, and still do retries for the rest.

When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
as the drive itself has already done internal retries (libata uses the
"with retry" ATA opcodes for this).

But meanwhile, we still have the original issue too, where a single stray
bad sector can blow a system out of the water, because the mid-layer
currently aborts everything after it from a large merged request.

Thus the original patch from this thread.  :)

Cheers


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Alan
 When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
 as the drive itself has already done internal retries (libata uses the
 with retry ATA opcodes for this).

This depends on the firmware. Some of the raid firmware drives don't
appear to do retries in firmware.

 But meanwhile, we still have the original issue too, where a single stray
 bad sector can blow a system out of the water, because the mid-layer
 currently aborts everything after it from a large merged request.
 
 Thus the original patch from this thread.  :)

Agreed


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Mark Lord

James Bottomley wrote:


For the MD case, this is what REQ_FAILFAST is for.


I cannot find where SCSI honours that flag.  James?

And for that matter, even when I patch SCSI so that it *does* honour it,
I don't actually see the flag making it into the SCSI layer from above.

And I don't see where/how the block layer takes care when considering
merging FAILFAST/READA requests with non-FAILFAST/READA requests.
To me, it looks perfectly happy to add non-FAILFAST/READA bios
to a FAILFAST request, risking data loss if a lower-layer decides
to honour the FAILFAST/READA flags.

So it's a pretty Good Thing(tm) that SCSI doesn't currently honour it. ;)




Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Mark Lord

Mark Lord wrote:

James Bottomley wrote:


For the MD case, this is what REQ_FAILFAST is for.


I cannot find where SCSI honours that flag.  James?


Scratch that thought.. SCSI honours it in scsi_end_request().

But I'm not certain that the block layer handles it correctly,
at least not in the 2.6.16/2.6.18 kernel that I'm working with today.

Cheers


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread James Bottomley
On Wed, 2007-01-31 at 10:13 -0500, Mark Lord wrote:
 James Bottomley wrote:
 
  For the MD case, this is what REQ_FAILFAST is for.

 I cannot find where SCSI honours that flag.  James?

Er, it's in scsi_error.c:scsi_decide_disposition():

 maybe_retry:

	/* we requeue for retry because the error was retryable, and
	 * the request was not marked fast fail.  Note that above,
	 * even if the request is marked fast fail, we still requeue
	 * for queue congestion conditions (QUEUE_FULL or BUSY) */
	if ((++scmd->retries) <= scmd->allowed
	    && !blk_noretry_request(scmd->request)) {
		return NEEDS_RETRY;
	} else {
		/*
		 * no more retries - report this one back to upper level.
		 */
		return SUCCESS;
	}

 And for that matter, even when I patch SCSI so that it *does* honour it,
 I don't actually see the flag making it into the SCSI layer from above.
 
 And I don't see where/how the block layer takes care when considering
 merge FAILFAST/READA requests with non FAILFAST/READA requests.
 To me, it looks perfectly happy to add non-FAILFAST/READA bios
 to a FAILFAST request, risking data loss if a lower-layer decides
 to honour the FAILFAST/READA flags.
 
 So it's a pretty Good Thing(tm) that SCSI doesn't currently honour it. ;)

James




Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Douglas Gilbert
Ric Wheeler wrote:
 
 
 Jeff Garzik wrote:
 Mark Lord wrote:
 Eric D. Mudama wrote:

 Actually, it's possibly worse, since each failure in libata will
 generate 3-4 retries.  With existing ATA error recovery in the
 drives, that's about 3 seconds per retry on average, or 12 seconds
 per failure.  Multiply that by the number of blocks past the error
 to complete the request..

 It really beats the alternative of a forced reboot
 due to, say, superblock I/O failing because it happened
 to get merged with an unrelated I/O which then failed..
 Etc..


 FWIW -- speaking generally -- I think there are inevitable areas where
 libata error handling combined with SCSI error handling results in
 suboptimal error handling.

 Just creating a list of this behavior should be handled this way,
 but in reality is handled in this silly way would be very helpful.
 
 I agree - Tejun has done a great job at giving us a great base. Next
 step is to get clarity on what the types of errors are and how to
 differentiate between them (and maybe how that would change by class of
 device?).
 

 Error handling is tough to get right, because the code is exercised so
 infrequently.  Tejun has actually done an above-average job here, by
 making device probe, hotplug and other exceptions go through the
 libata EH code, thereby exercising the EH code more than one might
 normally assume.

 Some errors in libata probably should not be retried more than once,
 when we have a definitive diagnosis.  Suggestions for improvements are
 welcome.

 Jeff
 
 One thing that we find really useful is to inject real errors into
 devices. Mark has some patches that let us inject media errors, we also
 bring back failed drives and run them through testing and occasionally
 get to use analyzers, etc to inject odd ball errors.

Ric,
Both ATA (ATA8-ACS) and SCSI (SBC-3) have recently added
command support to flag a block as uncorrectable. There
is no need to send bad long data to it and suppress the
disk's automatic re-allocation logic.

In the case of ATA it is the WRITE UNCORRECTABLE command.
In the case of SCSI it is the WR_UNCOR bit in the WRITE
LONG command.
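
On the SCSI side, a hedged sketch of what issuing WRITE LONG(10) with the
WR_UNCOR bit through the SG_IO ioctl could look like; the 0x3F opcode and
the 0x40 bit position in CDB byte 1 are from memory of SBC-3 and should be
double-checked against the standard, and opening the sg/block device is
left out:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* Mark one LBA as a pseudo-uncorrectable block: WRITE LONG(10), WR_UNCOR
 * set, no data phase, transfer length zero. */
static int mark_uncorrectable(int fd, uint32_t lba)
{
	uint8_t cdb[10] = { 0x3f, 0x40, 0, 0, 0, 0, 0, 0, 0, 0 };
	uint8_t sense[32];
	struct sg_io_hdr io;

	cdb[2] = lba >> 24;
	cdb[3] = lba >> 16;
	cdb[4] = lba >> 8;
	cdb[5] = lba;

	memset(&io, 0, sizeof(io));
	io.interface_id = 'S';
	io.cmd_len = sizeof(cdb);
	io.cmdp = cdb;
	io.dxfer_direction = SG_DXFER_NONE;
	io.sbp = sense;
	io.mx_sb_len = sizeof(sense);
	io.timeout = 60000;		/* ms */

	return ioctl(fd, SG_IO, &io);
}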

It seems that due to SAT any useful capability in the ATA
command set will soon appear in the corresponding SCSI
command set, if it is not already there.

Doug Gilbert



Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Mark Lord

Douglas Gilbert wrote:


Ric,
Both ATA (ATA8-ACS) and SCSI (SBC-3) have recently added
command support to flag a block as uncorrectable. There
is no need to send bad long data to it and suppress the
disk's automatic re-allocation logic.


That'll be useful in a couple of years, once drives that have it
become more common.  For now, though, we're hacking current drives
using READ/WRITE LONG commands, with a corresponding patch to libata
to allow for the longer sector size involved.

Having real bad sectors, exactly where we want them on the media,
sure does make testing / fixing the EH mechanisms a lot more feasible.

Cheers



Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Ric Wheeler



Alan wrote:

When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
as the drive itself has already done internal retries (libata uses the
with retry ATA opcodes for this).


This depends on the firmware. Some of the raid firmware drives don't
appear to do retries in firmware.


I think that even for these devices, the actual drives behind the controller 
will do retries. In any case, it would be reasonable to be able to set this 
retry/no-retry via /sys to handle exceptional cases...





But meanwhile, we still have the original issue too, where a single stray
bad sector can blow a system out of the water, because the mid-layer
currently aborts everything after it from a large merged request.

Thus the original patch from this thread.  :)


Agreed





Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Mark Lord

Alan wrote:

When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
as the drive itself has already done internal retries (libata uses the
with retry ATA opcodes for this).


This depends on the firmware. Some of the raid firmware drives don't
appear to do retries in firmware.


One way to tell if this is true is simply to time how long
the failed operation takes.  If the drive truly does not do retries,
then the media error should be reported more or less instantly
(assuming the drive was already spun up).

If the failure takes more than a few hundred milliseconds to be reported,
or in this case 4-7 seconds typically, then we know the drive was doing
retries before it reported back.
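
A hedged userspace sketch of that timing test; the device path and bad LBA
are placeholders, O_DIRECT keeps the page cache out of the way, and any
mid-layer retries will of course be included in the measured time:

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	const char *dev = "/dev/sdX";		/* placeholder */
	const off_t bad_lba = 123456789;	/* placeholder: a known bad sector */
	struct timespec t0, t1;
	void *buf;
	ssize_t n;
	int fd = open(dev, O_RDONLY | O_DIRECT);

	if (fd < 0 || posix_memalign(&buf, 4096, 512))
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	n = pread(fd, buf, 512, bad_lba * 512);	/* read just the bad sector */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("pread returned %zd after %.1f ms\n", n,
	       (t1.tv_sec - t0.tv_sec) * 1000.0 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	close(fd);
	return 0;
}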

I haven't seen any drive fail instantly yet.
Can anyone with those newfangled RAID edition drives try it
and report back?  Oh.. you'll need a way to create a bad sector.
I've got patches and a command-line utility for the job.

If your drive supports WRITE UNCORRECTABLE (hdparm -I, w/latest hdparm),
then the patches aren't needed.


But meanwhile, we still have the original issue too, where a single stray
bad sector can blow a system out of the water, because the mid-layer
currently aborts everything after it from a large merged request.

Thus the original patch from this thread.  :)


Agreed



Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread James Bottomley
On Wed, 2007-01-31 at 12:57 -0500, Mark Lord wrote:
 Alan wrote:
  When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
  as the drive itself has already done internal retries (libata uses the
  with retry ATA opcodes for this).
  
  This depends on the firmware. Some of the raid firmware drives don't
  appear to do retries in firmware.
 
 One way to tell if this is true, is simply to time how long
 the failed operation takes.  If the drive truly does not do retries,
 then the media error should be reported more or less instantly
 (assuming drive was already spun up).

Well, the simpler way (and one we have a hope of implementing) is to
examine the ASC/ASCQ codes to see if the error is genuinely unretryable.
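
Purely as an illustration (and not the scsi_error.c change linked below),
a first cut could key off the T10 "unrecovered read error" family, ASC 0x11;
treating everything else as retryable is an assumption for the sketch:

#include <scsi/scsi.h>
#include <scsi/scsi_eh.h>

/* Only a MEDIUM ERROR carrying ASC 0x11 is treated as genuinely
 * unretryable in this sketch. */
static int medium_error_unretryable(const struct scsi_sense_hdr *sshdr)
{
	return sshdr->sense_key == MEDIUM_ERROR && sshdr->asc == 0x11;
}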

I seem to have dropped the ball on this one in that the scsi_error.c
pieces of this patch

http://marc.theaimsgroup.com/?l=linux-scsi&m=116485834119885

I thought I'd applied.  Apparently I didn't, so I'll go back and put
them in.

James




Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Mark Lord

James Bottomley wrote:

On Wed, 2007-01-31 at 12:57 -0500, Mark Lord wrote:

Alan wrote:

When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
as the drive itself has already done internal retries (libata uses the
with retry ATA opcodes for this).

This depends on the firmware. Some of the raid firmware drives don't
appear to do retries in firmware.

One way to tell if this is true, is simply to time how long
the failed operation takes.  If the drive truly does not do retries,
then the media error should be reported more or less instantly
(assuming drive was already spun up).


Well, the simpler way (and one we have a hope of implementing) is to
examine the ASC/ASCQ codes to see if the error is genuinely unretryable.


My suggestion above was not for a kernel fix,
but rather just as a way of determining if drives
which claim no retries actually do them or not.  :)


I seem to have dropped the ball on this one in that the scsi_error.c
pieces of this patch

http://marc.theaimsgroup.com/?l=linux-scsi&m=116485834119885

I thought I'd applied.  Apparently I didn't, so I'll go back and put
them in.


Good.  That would be a useful supplement to the patch I posted here.

Cheers


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-30 Thread Mark Lord

Eric D. Mudama wrote:


Actually, it's possibly worse, since each failure in libata will 
generate 3-4 retries.  With existing ATA error recovery in the drives, 
that's about 3 seconds per retry on average, or 12 seconds per failure.  
Multiply that by the number of blocks past the error to complete the 
request..


It really beats the alternative of a forced reboot
due to, say, superblock I/O failing because it happened
to get merged with an unrelated I/O which then failed..
Etc..

Definitely an improvement.

The number of retries is an entirely separate issue.
If we really care about it, then we should fix SD_MAX_RETRIES.

The current value of 5 is *way* too high.  It should be zero or one.

Cheers


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-30 Thread James Bottomley
First off, please send SCSI patches to the SCSI list:
linux-scsi@vger.kernel.org

On Tue, 2007-01-30 at 19:47 -0500, Mark Lord wrote:
 In ancient kernels, the SCSI disk code used to continue after
 encountering a MEDIUM_ERROR.  It would complete the good
 sectors before the error, fail the bad sector/block, and then
 continue with the rest of the request.
 
 Kernels since about 2.6.16 or so have been broken in this regard.
 They complete the good sectors before the error,
 and then fail the entire remaining portions of the request.

What was the commit that introduced the change? ... I have a vague
memory of it being deliberate.

 This is very risky behaviour, as a request is often a merge
 of several bios, and just because one application hits a bad sector
 is no reason to pretend that (for example) an adjacent directory lookup also 
 failed.
 
 This patch fixes the behaviour to be similar to what we had originally.
 
 When a bad sector is encountered, SCSI will now work around it again,
 failing *only* the bad sector itself.

Erm, but the corollary is that if we get a large read failure because of
a bad track, you're going to try and chunk it up a sector at a time;
forcing an individual error for each sector is going to annoy some
people ... particularly removable-medium ones, which return this error if
the medium isn't present ... Are you sure this is really what we want to
do?

 Signed-off-by:  Mark Lord [EMAIL PROTECTED]
 ---
 diff -u --recursive --new-file --exclude-from=linux_17//Documentation/dontdiff old/drivers/scsi/scsi_lib.c linux/drivers/scsi/scsi_lib.c
 --- old/drivers/scsi/scsi_lib.c   2007-01-30 13:58:05.0 -0500
 +++ linux/drivers/scsi/scsi_lib.c 2007-01-30 18:30:01.0 -0500
 @@ -865,6 +865,12 @@
  		 */
  		if (sense_valid && !sense_deferred) {
  			switch (sshdr.sense_key) {
 +			case MEDIUM_ERROR:
 +				// Bad sector.  Fail it, and then continue the rest of the request:
 +				if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) {

The sense key may have come with additional information; I think we want
to parse that (if it exists) rather than just blindly failing the first
sector of the request.

 +				cmd->retries = 0;	// go around again..
 +				return;
 +			}

This would drop through to the UNIT_ATTENTION case if scsi_end_request()
fails ... I don't think that's correct.

  			case UNIT_ATTENTION:
  				if (cmd->device->removable) {
  					/* Detected disc change.  Set a bit



Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-30 Thread Mark Lord

James Bottomley wrote:

First off, please send SCSI patches to the SCSI list:
linux-scsi@vger.kernel.org


Fixed already, thanks!


This patch fixes the behaviour to be similar to what we had originally.

When a bad sector is encountered, SCSI will now work around it again,
failing *only* the bad sector itself.


Erm, but the corollary is that if we get a large read failure because of
a bad track, you're going to try and chunk it up a sector at a time


That's better than the huge data-loss scenario that we currently
have for single-sector errors.  MUCH better.


forcing an individual error for each sector is going to annoy some
people ... particularly removable medium ones which return this error if
the medium isn't present ... Are you sure this is really what we want to
do?


No, for removed-medium everything just fails right away.
This patch is *only* for media errors, not any other failures.

Cheers


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-30 Thread Ric Wheeler



Mark Lord wrote:


Eric D. Mudama wrote:



Actually, it's possibly worse, since each failure in libata will 
generate 3-4 retries.  With existing ATA error recovery in the 
drives, that's about 3 seconds per retry on average, or 12 seconds 
per failure.  Multiply that by the number of blocks past the error to 
complete the request..



It really beats the alternative of a forced reboot
due to, say, superblock I/O failing because it happened
to get merged with an unrelated I/O which then failed..
Etc..

Definitely an improvement.

The number of retries is an entirely separate issue.
If we really care about it, then we should fix SD_MAX_RETRIES.

The current value of 5 is *way* too high.  It should be zero or one.

Cheers

I think that drives retry enough; we should leave retry at zero for 
normal (non-removable) drives. Should this be a policy we can set, like 
we do with NCQ queue depth, via /sys?


We need to be able to layer things like MD on top of normal drive errors 
in a way that will produce a system that provides reasonable response 
time despite any possible IO error on a single component.  Another case 
that we end up doing on a regular basis is drive recovery. Errors need 
to be limited in scope to just the impacted area and dispatched up to 
the application layer as quickly as we can so that you don't spend days 
watching a copy of  huge drive (think 750GB or more) ;-)


ric



Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-30 Thread James Bottomley
On Tue, 2007-01-30 at 22:20 -0500, Ric Wheeler wrote:
 Mark Lord wrote:
  The number of retries is an entirely separate issue.
  If we really care about it, then we should fix SD_MAX_RETRIES.
 
  The current value of 5 is *way* too high.  It should be zero or one.
 
  Cheers
 
 I think that drives retry enough, we should leave retry at zero for 
 normal (non-removable) drives. Should this  be a policy we can set like 
 we do with NCQ queue depth via /sys ?

I don't disagree that it should be settable.  However, retries occur for
other reasons than failures inside the device.  The most standard ones
are unit attentions generated because of other activity (target reset
etc).  The key to the problem is retrying only operations that are
genuinely retryable, which the mid-layer doesn't do such a good job on.

 We need to be able to layer things like MD on top of normal drive errors 
 in a way that will produce a system that provides reasonable response 
 time despite any possible IO error on a single component.  Another case 
 that we end up doing on a regular basis is drive recovery. Errors need 
 to be limited in scope to just the impacted area and dispatched up to 
 the application layer as quickly as we can so that you don't spend days 
 watching a copy of  huge drive (think 750GB or more) ;-)

For the MD case, this is what REQ_FAILFAST is for.

James




Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-30 Thread Douglas Gilbert
Ric Wheeler wrote:
 
 
 Mark Lord wrote:
 
 Eric D. Mudama wrote:


 Actually, it's possibly worse, since each failure in libata will
 generate 3-4 retries.  With existing ATA error recovery in the
 drives, that's about 3 seconds per retry on average, or 12 seconds
 per failure.  Multiply that by the number of blocks past the error to
 complete the request..


 It really beats the alternative of a forced reboot
 due to, say, superblock I/O failing because it happened
 to get merged with an unrelated I/O which then failed..
 Etc..

 Definitely an improvement.

 The number of retries is an entirely separate issue.
 If we really care about it, then we should fix SD_MAX_RETRIES.

 The current value of 5 is *way* too high.  It should be zero or one.

 Cheers

 I think that drives retry enough, we should leave retry at zero for
 normal (non-removable) drives. Should this  be a policy we can set like
 we do with NCQ queue depth via /sys ?

The transport might also want a say. I see ABORTED COMMAND
errors often enough with SAS (e.g. due to expander congestion)
to warrant at least one retry (which works in my testing).
SATA disks behind SAS infrastructure would also be
susceptible to the same random failures.

Transport Layer Retries (TLR) in SAS should remove this class
of transport errors but only SAS tape drives support TLR as
far as I know.

Doug Gilbert

 We need to be able to layer things like MD on top of normal drive errors
 in a way that will produce a system that provides reasonable response
 time despite any possible IO error on a single component.  Another case
 that we end up doing on a regular basis is drive recovery. Errors need
 to be limited in scope to just the impacted area and dispatched up to
 the application layer as quickly as we can so that you don't spend days
 watching a copy of  huge drive (think 750GB or more) ;-)
 
 ric

