Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
James Bottomley wrote: On Thu, 2007-02-01 at 15:02 -0500, Mark Lord wrote: I believe you made the first change in response to my prodding at the time, when libata was not returning valid sense data (no LBA) for media errors. The SCSI EH handling of that was rather poor at the time, and so having it not retry the remaining sectors was actually a very good fix at the time. But now, libata *does* return valid sense data for LBA/DMA drives, and the workaround from circa 2.6.16 is no longer the best we can do. Now that we know which sector failed, we ought to be able to skip over it, and continue with the rest of the merged request. We can ... the big concern with your approach, which you haven't addressed is the time factor. For most SCSI devices, returning a fatal MEDIUM ERROR means we're out of remapping table, and also that there's probably a bunch of sectors on the track that are now out. Thus, there are almost always multiple sector failures. In linux, the average request size on a filesystem is around 64-128kb; thats 128-256 sectors. If we fail at the initial sector, we have to go through another 128-256 attempts, with the internal device retries, before we fail the entire request. Some devices can take a second or so for each read before they finally give up and decide they really can't read the sector, so you're looking at 2-5 minutes before the machine finally fails this one request ... and much worse for devices that retry more times. This is not the case on a read error - we commonly see transient errors on reads from disks. What we push our vendors to do is to try to keep the worst case response down to tens of seconds as they retry, etc internally with a device. When they take that long (and they do), adding retries up the stack can translate into minutes per sector. The interesting point of this question is about the typically pattern of IO errors. On a read, it is safe to assume that you will have issues with some bounded numbers of adjacent sectors. One thing that could be even better than the patch below, would be to have it perhaps skip the entire bio that includes the failed sector, rather than only the bad sector itself. Er ... define skip over the bio. A bio is simply a block representation for a bunch of sg elements coming in to the elevator. Mostly what we see in SCSI is a single bio per request, so skipping the bio is really the current behaviour (to fail the rest of the request). This is really a tricky one - what can happen when we fail merged IO requests is really unpredictable behavior up at the application level since the IO error might not be at all relevant to my part of the request. Merging can produce a request that is much larger than any normal drive error. I really like the idea of being able to set this kind of policy on a per drive instance since what you want here will change depending on what your system requirements are, what the system is trying to do (i.e., when trying to recover a failing but not dead yet disk, IO errors should be as quick as possible and we should choose an IO scheduler that does not combine IO's). I think doing that might address most concerns expressed here. Have you got an alternate suggestion, James? James - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
The interesting point of this question is about the typically pattern of IO errors. On a read, it is safe to assume that you will have issues with some bounded numbers of adjacent sectors. Which in theory you can get by asking the drive for the real sector size from the ATA7 info. (We ought to dig this out more as its relevant for partition layout too). I really like the idea of being able to set this kind of policy on a per drive instance since what you want here will change depending on what your system requirements are, what the system is trying to do (i.e., when trying to recover a failing but not dead yet disk, IO errors should be as quick as possible and we should choose an IO scheduler that does not combine IO's). That seems to be arguing for a bounded live time including retry run time for a command. That's also more intuitive for real time work and for end user setup. Either work or fail within n seconds Alan - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
your system requirements are, what the system is trying to do (i.e., when trying to recover a failing but not dead yet disk, IO errors should be as quick as possible and we should choose an IO scheduler that does not combine IO's). If this is the right strategy for disk recovery for a given type of device then this ought to be an automatic strategy. Most end users will not have the knowledge to frob about in sysfs, and if the bad sector hits at the wrong moment a sensible automatic recovery strategy is going to do the right thing faster than human intervention, which may be some hours away. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Alan wrote: If this is the right strategy for disk recovery for a given type of device then this ought to be an automatic strategy. Most end users will not have the knowledge to frob about in sysfs, and if the bad sector hits at the wrong moment a sensible automatic recovery strategy is going to do the right thing faster than human intervention, which may be some hours away. I think something we seem to be discussing here are the opposite aims of enterprise RAID (traditional SCSI market) versus desktop filesystems (now the number one user of Linux SCSI, courtesy of libata). With RAID, it's safe to suggest that a very fast, bounded exit from EH is almost always what the customer / end-user wants, because the higher level RAID management can then deal with the failed drive offline. But for a desktop filesystem, failing out quickly and bulk-failing megabytes around a couple of bad sectors is not good -- no redundancy, and this will lead to unneeded data loss. It's beginning to look like this needs to be run-time tuneable, per block minor device (per-partition), through sysfs at a minimum. The RAID tools could then automatically choose settings to bias more towards an instant exit when errors are found, whereas a non-RAID desktop could select a more reliable recovery strategy. Right now, with Jame's patch (earlier in this thread), the whole scheme is heavily weighted towards the RAID instant exit strategy, which is making me quite nervous about the data on my laptop. Cheers - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
James Bottomley wrote: On Fri, 2007-02-02 at 14:42 +, Alan wrote: The interesting point of this question is about the typically pattern of IO errors. On a read, it is safe to assume that you will have issues with some bounded numbers of adjacent sectors. Which in theory you can get by asking the drive for the real sector size from the ATA7 info. (We ought to dig this out more as its relevant for partition layout too). Actually, my point is that damage typically impacts a cluster of disk sectors that are adjacent. Think of a drive that has junk on the platter or a some such thing - the contamination is likely to be localized. I really like the idea of being able to set this kind of policy on a per drive instance since what you want here will change depending on what your system requirements are, what the system is trying to do (i.e., when trying to recover a failing but not dead yet disk, IO errors should be as quick as possible and we should choose an IO scheduler that does not combine IO's). That seems to be arguing for a bounded live time including retry run time for a command. That's also more intuitive for real time work and for end user setup. Either work or fail within n seconds Actually, then I think perhaps we use the allowed retries for this ... I really am not a big retry fan for most modern drives - the drive will try really, really hard to complete an IO for us and multiple retries can just slow down the higher level application from recovering. So you would fail a single sector and count it against the retries. When you've done this allowed retries times, you fail the rest of the request. James I think that we need to play with some of these possible solutions on some real-world bad drives and see how they react. We should definitely talk more about this at the workshop ;-) ric - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
On Fri, Feb 02, 2007 at 11:06:19AM -0500, Mark Lord wrote: Alan wrote: If this is the right strategy for disk recovery for a given type of device then this ought to be an automatic strategy. Most end users will not have the knowledge to frob about in sysfs, and if the bad sector hits at the wrong moment a sensible automatic recovery strategy is going to do the right thing faster than human intervention, which may be some hours away. I think something we seem to be discussing here are the opposite aims of enterprise RAID (traditional SCSI market) versus desktop filesystems (now the number one user of Linux SCSI, courtesy of libata). With RAID, it's safe to suggest that a very fast, bounded exit from EH is almost always what the customer / end-user wants, because the higher level RAID management can then deal with the failed drive offline. But for a desktop filesystem, failing out quickly and bulk-failing megabytes around a couple of bad sectors is not good -- no redundancy, and this will lead to unneeded data loss. It's beginning to look like this needs to be run-time tuneable, per block minor device (per-partition), through sysfs at a minimum. The RAID tools could then automatically choose settings to bias more towards an instant exit when errors are found, whereas a non-RAID desktop could select a more reliable recovery strategy. Right now, with Jame's patch (earlier in this thread), the whole scheme is heavily weighted towards the RAID instant exit strategy, which is making me quite nervous about the data on my laptop. Also worth considering is that spending minutes trying to reread damaged sectors is likely to accelerate your death spiral. More data may be recoverable if you give up quickly in a first pass, then go back and manually retry damaged bits with smaller I/Os. Another approach is to have separate retries per hard sector and hard errors per MB at the block device level. We don't want to have the same number of retries for a 64KB block as a 1MB block and we certainly don't want to do 2K retries in a row. So for instance, I could have the first set to 1 and the second set to 16. For a 256KB read, this gets scaled down to 4, which means we'll retry each sector once and fail the whole I/O when we hit the 5th sector error. More reasonable number might be 0 and 1, meaning don't do OS level retries on sectors and fail the whole I/O on the second sector error (or immediately for smaller reads). It might also be informative to add a kernel message reporting if a retry (in the non-transient cases) ever actually succeeds. -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Alan wrote: The interesting point of this question is about the typically pattern of IO errors. On a read, it is safe to assume that you will have issues with some bounded numbers of adjacent sectors. Which in theory you can get by asking the drive for the real sector size from the ATA7 info. (We ought to dig this out more as its relevant for partition layout too). I really like the idea of being able to set this kind of policy on a per drive instance since what you want here will change depending on what your system requirements are, what the system is trying to do (i.e., when trying to recover a failing but not dead yet disk, IO errors should be as quick as possible and we should choose an IO scheduler that does not combine IO's). That seems to be arguing for a bounded live time including retry run time for a command. That's also more intuitive for real time work and for end user setup. Either work or fail within n seconds Which is more or less the streaming feature set in recent ATA standards. [Alas, streaming and NCQ/TCQ can't be done with the same access.] SCSI has its Read Write Error Recovery mode page which doesn't have timeouts but does have Read and Write Retry Counts amongst other fields that control the amount (and indirectly the time) of attempted error recovery. Doug Gilbert - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Matt Mackall wrote: .. Also worth considering is that spending minutes trying to reread damaged sectors is likely to accelerate your death spiral. More data may be recoverable if you give up quickly in a first pass, then go back and manually retry damaged bits with smaller I/Os. All good input. But what was being debated here is not so much the retrying of known-bad sectors, but rather what to do about the kiBs or MiBs of sectors remaining in a merged request after hitting a single bad sector mid-way. Currently, SCSI just abandons the entire remaining workload. Cheers - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
On Fri, Feb 02, 2007 at 05:58:04PM -0500, Mark Lord wrote: Matt Mackall wrote: .. Also worth considering is that spending minutes trying to reread damaged sectors is likely to accelerate your death spiral. More data may be recoverable if you give up quickly in a first pass, then go back and manually retry damaged bits with smaller I/Os. All good input. But what was being debated here is not so much the retrying of known-bad sectors, but rather what to do about the kiBs or MiBs of sectors remaining in a merged request after hitting a single bad sector mid-way. Yep, that's precisely what was addressed in the part you snipped. My main point being that what to do about the remaining workload should be dependent on the size of the I/O. If we encounter errors on sectors 4,5,6,7,8.. of a 1MB request, we should have a threshold for giving up. It's not unreasonable for that threshold to be larger than 1, but it should not be 2048. And if we do the I/O as four 256KB requests, we should have approximately the same number of retries (assuming the whole region's bad). -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
James Bottomley wrote: On Tue, 2007-01-30 at 19:47 -0500, Mark Lord wrote: Kernels since about 2.6.16 or so have been broken in this regard. They complete the good sectors before the error, and then fail the entire remaining portions of the request. What was the commit that introduced the change? ... I have a vague memory of it being deliberate. I believe you made the first change in response to my prodding at the time, when libata was not returning valid sense data (no LBA) for media errors. The SCSI EH handling of that was rather poor at the time, and so having it not retry the remaining sectors was actually a very good fix at the time. But now, libata *does* return valid sense data for LBA/DMA drives, and the workaround from circa 2.6.16 is no longer the best we can do. Now that we know which sector failed, we ought to be able to skip over it, and continue with the rest of the merged request. One thing that could be even better than the patch below, would be to have it perhaps skip the entire bio that includes the failed sector, rather than only the bad sector itself. I think doing that might address most concerns expressed here. Have you got an alternate suggestion, James? .. Signed-off-by: Mark Lord [EMAIL PROTECTED] --- diff -u --recursive --new-file --exclude-from=linux_17//Documentation/dontdiff old/drivers/scsi/scsi_lib.c linux/drivers/scsi/scsi_lib.c --- old/drivers/scsi/scsi_lib.c 2007-01-30 13:58:05.0 -0500 +++ linux/drivers/scsi/scsi_lib.c 2007-01-30 18:30:01.0 -0500 @@ -865,6 +865,12 @@ */ if (sense_valid !sense_deferred) { switch (sshdr.sense_key) { + case MEDIUM_ERROR: + // Bad sector. Fail it, and then continue the rest of the request: + if (scsi_end_request(cmd, 0, cmd-device-sector_size, 1) == NULL) { The sense key may have come with additional information I think we want to parse that (if it exists) rather than just blindly failing the first sector of the request. + cmd-retries = 0;// go around again.. + return; + } This would drop through to the UNIT_ATTENTION case if scsi_end_request() fails ... I don't think that's correct. case UNIT_ATTENTION: if (cmd-device-removable) { /* Detected disc change. Set a bit - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
James Bottomley wrote: On Thu, 2007-02-01 at 15:02 -0500, Mark Lord wrote: .. One thing that could be even better than the patch below, would be to have it perhaps skip the entire bio that includes the failed sector, rather than only the bad sector itself. Er ... define skip over the bio. A bio is simply a block representation for a bunch of sg elements coming in to the elevator. Exactly. Or rather, a block of sg_elements from a single point of request, is it not? Mostly what we see in SCSI is a single bio per request, so skipping the bio is really the current behaviour (to fail the rest of the request). Very good. That's what it's supposed to do. But if each request contained only a single bio, then all of Jens' work on IO scheduling would be for nothing, n'est-ce pas? In the case where a request consists of multiple bio's which have been merged under a single request struct, we really should give at least one attempt to each bio. This way, in most cases, only the process that requested the failed sector(s) will see an error, not the innocent victims that happened to get merged onto the end. Which could be very critical stuff (or not -- it could be quite random). So the time factor works out to one disk I/O timeout per failed bio. That's what would have happened with the NOP scheduler anyway. On the sytems I'm working with, I don't see huge numbers of bad sectors. What they tend to show is just one or two bad sectors, widely scattered. So: I think doing that might address most concerns expressed here. Have you got an alternate suggestion, James? Cheers - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Mark Lord wrote: Eric D. Mudama wrote: Actually, it's possibly worse, since each failure in libata will generate 3-4 retries. With existing ATA error recovery in the drives, that's about 3 seconds per retry on average, or 12 seconds per failure. Multiply that by the number of blocks past the error to complete the request.. It really beats the alternative of a forced reboot due to, say, superblock I/O failing because it happened to get merged with an unrelated I/O which then failed.. Etc.. FWIW -- speaking generally -- I think there are inevitable areas where libata error handling combined with SCSI error handling results in suboptimal error handling. Just creating a list of this behavior should be handled this way, but in reality is handled in this silly way would be very helpful. Error handling is tough to get right, because the code is exercised so infrequently. Tejun has actually done an above-average job here, by making device probe, hotplug and other exceptions go through the libata EH code, thereby exercising the EH code more than one might normally assume. Some errors in libata probably should not be retried more than once, when we have a definitive diagnosis. Suggestions for improvements are welcome. Jeff - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Jeff Garzik wrote: Mark Lord wrote: Eric D. Mudama wrote: Actually, it's possibly worse, since each failure in libata will generate 3-4 retries. With existing ATA error recovery in the drives, that's about 3 seconds per retry on average, or 12 seconds per failure. Multiply that by the number of blocks past the error to complete the request.. It really beats the alternative of a forced reboot due to, say, superblock I/O failing because it happened to get merged with an unrelated I/O which then failed.. Etc.. FWIW -- speaking generally -- I think there are inevitable areas where libata error handling combined with SCSI error handling results in suboptimal error handling. Just creating a list of this behavior should be handled this way, but in reality is handled in this silly way would be very helpful. I agree - Tejun has done a great job at giving us a great base. Next step is to get clarity on what the types of errors are and how to differentiate between them (and maybe how that would change by class of device?). Error handling is tough to get right, because the code is exercised so infrequently. Tejun has actually done an above-average job here, by making device probe, hotplug and other exceptions go through the libata EH code, thereby exercising the EH code more than one might normally assume. Some errors in libata probably should not be retried more than once, when we have a definitive diagnosis. Suggestions for improvements are welcome. Jeff One thing that we find really useful is to inject real errors into devices. Mark has some patches that let us inject media errors, we also bring back failed drives and run them through testing and occasionally get to use analyzers, etc to inject odd ball errors. Hopefully, we will get some time to brainstorm about this at the workshop, ric - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Ric Wheeler wrote: Mark Lord wrote: Eric D. Mudama wrote: Actually, it's possibly worse, since each failure in libata will generate 3-4 retries. (note: libata does *not* generate retries for medium errors; the looping is driven by the SCSI mid-layer code). It really beats the alternative of a forced reboot due to, say, superblock I/O failing because it happened to get merged with an unrelated I/O which then failed.. Etc.. Definitely an improvement. The number of retries is an entirely separate issue. If we really care about it, then we should fix SD_MAX_RETRIES. The current value of 5 is *way* too high. It should be zero or one. .. I think that drives retry enough, we should leave retry at zero for normal (non-removable) drives. Should this be a policy we can set like we do with NCQ queue depth via /sys ? Or perhaps we could have the mid-layer always early-exit without retries for MEDIUM_ERROR, and still do retries for the rest. When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable, as the drive itself has already done internal retries (libata uses the with retry ATA opcodes for this). But meanwhile, we still have the original issue too, where a single stray bad sector can blow a system out of the water, because the mid-layer currently aborts everything after it from a large merged request. Thus the original patch from this thread. :) Cheers - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable, as the drive itself has already done internal retries (libata uses the with retry ATA opcodes for this). This depends on the firmware. Some of the raid firmware drives don't appear to do retries in firmware. But meanwhile, we still have the original issue too, where a single stray bad sector can blow a system out of the water, because the mid-layer currently aborts everything after it from a large merged request. Thus the original patch from this thread. :) Agreed - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
James Bottomley wrote: For the MD case, this is what REQ_FAILFAST is for. I cannot find where SCSI honours that flag. James? And for that matter, even when I patch SCSI so that it *does* honour it, I don't actually see the flag making it into the SCSI layer from above. And I don't see where/how the block layer takes care when considering merge FAILFAST/READA requests with non FAILFAST/READA requests. To me, it looks perfectly happy to add non-FAILFAST/READA bios to a FAILFAST request, risking data loss if a lower-layer decides to honour the FAILFAST/READA flags. So it's a pretty Good Thing(tm) that SCSI doesn't currently honour it. ;) - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Mark Lord wrote: James Bottomley wrote: For the MD case, this is what REQ_FAILFAST is for. I cannot find where SCSI honours that flag. James? Scratch that thought.. SCSI honours it in scsi_end_request(). But I'm not certain that the block layer handles it correctly, at least not in the 2.6.16/2.6.18 kernel that I'm working with today. Cheers - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
On Wed, 2007-01-31 at 10:13 -0500, Mark Lord wrote: James Bottomley wrote: For the MD case, this is what REQ_FAILFAST is for. I cannot find where SCSI honours that flag. James? Er, it's in scsi_error.c:scsi_decide_disposition(): maybe_retry: /* we requeue for retry because the error was retryable, and * the request was not marked fast fail. Note that above, * even if the request is marked fast fail, we still requeue * for queue congestion conditions (QUEUE_FULL or BUSY) */ if ((++scmd-retries) = scmd-allowed !blk_noretry_request(scmd-request)) { return NEEDS_RETRY; } else { /* * no more retries - report this one back to upper level. */ return SUCCESS; } And for that matter, even when I patch SCSI so that it *does* honour it, I don't actually see the flag making it into the SCSI layer from above. And I don't see where/how the block layer takes care when considering merge FAILFAST/READA requests with non FAILFAST/READA requests. To me, it looks perfectly happy to add non-FAILFAST/READA bios to a FAILFAST request, risking data loss if a lower-layer decides to honour the FAILFAST/READA flags. So it's a pretty Good Thing(tm) that SCSI doesn't currently honour it. ;) James - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Ric Wheeler wrote: Jeff Garzik wrote: Mark Lord wrote: Eric D. Mudama wrote: Actually, it's possibly worse, since each failure in libata will generate 3-4 retries. With existing ATA error recovery in the drives, that's about 3 seconds per retry on average, or 12 seconds per failure. Multiply that by the number of blocks past the error to complete the request.. It really beats the alternative of a forced reboot due to, say, superblock I/O failing because it happened to get merged with an unrelated I/O which then failed.. Etc.. FWIW -- speaking generally -- I think there are inevitable areas where libata error handling combined with SCSI error handling results in suboptimal error handling. Just creating a list of this behavior should be handled this way, but in reality is handled in this silly way would be very helpful. I agree - Tejun has done a great job at giving us a great base. Next step is to get clarity on what the types of errors are and how to differentiate between them (and maybe how that would change by class of device?). Error handling is tough to get right, because the code is exercised so infrequently. Tejun has actually done an above-average job here, by making device probe, hotplug and other exceptions go through the libata EH code, thereby exercising the EH code more than one might normally assume. Some errors in libata probably should not be retried more than once, when we have a definitive diagnosis. Suggestions for improvements are welcome. Jeff One thing that we find really useful is to inject real errors into devices. Mark has some patches that let us inject media errors, we also bring back failed drives and run them through testing and occasionally get to use analyzers, etc to inject odd ball errors. Ric, Both ATA (ATA8-ACS) and SCSI (SBC-3) have recently added command support to flag a block as uncorrectable. There is no need to send bad long data to it and suppress the disk's automatic re-allocation logic. In the case of ATA it is the WRITE UNCORRECTABLE command. In the case of SCSI it is the WR_UNCOR bit in the WRITE LONG command. It seems that due to SAT any useful capability in the ATA command set will soon appear in the corresponding SCSI command set, if it is not already there. Doug Gilbert - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Douglas Gilbert wrote: Ric, Both ATA (ATA8-ACS) and SCSI (SBC-3) have recently added command support to flag a block as uncorrectable. There is no need to send bad long data to it and suppress the disk's automatic re-allocation logic. That'll be useful in a couple of years, once drives that have it become more common. For now, though, we're hacking current drives using READ/WRITE LONG commands, with a corresponding patch to libata to allow for the longer sector size involved. Having real bad sectors, exactly where we want them on the media, sure does make testing / fixing the EH mechanisms a lot more feasible. Cheers - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Alan wrote: When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable, as the drive itself has already done internal retries (libata uses the with retry ATA opcodes for this). This depends on the firmware. Some of the raid firmware drives don't appear to do retries in firmware. I think that even for these devices, the actual drives behind the controller will do retries. In any case, it would be reasonable to be able to set this retry/no-retry via /sys to handle exceptional cases... But meanwhile, we still have the original issue too, where a single stray bad sector can blow a system out of the water, because the mid-layer currently aborts everything after it from a large merged request. Thus the original patch from this thread. :) Agreed - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Alan wrote: When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable, as the drive itself has already done internal retries (libata uses the with retry ATA opcodes for this). This depends on the firmware. Some of the raid firmware drives don't appear to do retries in firmware. One way to tell if this is true, is simply to time how long the failed operation takes. If the drive truly does not do retries, then the media error should be reported more or less instantly (assuming drive was already spun up). If the failure takes more than a few hundred milliseconds to be reported, or in this case 4-7 seconds typically, then we know the drive was doing retries before it reported back. I haven't seen any drive fail instantly yet. Can anyone with those newfangled RAID edition drives try it and report back? Oh.. you'll need a way to create a bad sector. I've got patches and a command-line utility for the job. If your drive supports WRITE UNCORRECTABLE (hdparm -I, w/latest hdparm), then the patches aren't needed. But meanwhile, we still have the original issue too, where a single stray bad sector can blow a system out of the water, because the mid-layer currently aborts everything after it from a large merged request. Thus the original patch from this thread. :) Agreed - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
On Wed, 2007-01-31 at 12:57 -0500, Mark Lord wrote: Alan wrote: When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable, as the drive itself has already done internal retries (libata uses the with retry ATA opcodes for this). This depends on the firmware. Some of the raid firmware drives don't appear to do retries in firmware. One way to tell if this is true, is simply to time how long the failed operation takes. If the drive truly does not do retries, then the media error should be reported more or less instantly (assuming drive was already spun up). Well, the simpler way (and one we have a hope of implementing) is to examine the ASC/ASCQ codes to see if the error is genuinely unretryable. I seem to have dropped the ball on this one in that the scsi_error.c pieces of this patch http://marc.theaimsgroup.com/?l=linux-scsim=116485834119885 I thought I'd applied. Apparently I didn't, so I'll go back and put them in. James - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
James Bottomley wrote: On Wed, 2007-01-31 at 12:57 -0500, Mark Lord wrote: Alan wrote: When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable, as the drive itself has already done internal retries (libata uses the with retry ATA opcodes for this). This depends on the firmware. Some of the raid firmware drives don't appear to do retries in firmware. One way to tell if this is true, is simply to time how long the failed operation takes. If the drive truly does not do retries, then the media error should be reported more or less instantly (assuming drive was already spun up). Well, the simpler way (and one we have a hope of implementing) is to examine the ASC/ASCQ codes to see if the error is genuinely unretryable. My suggestion above was not for a kernel fix, but rather just as a way of determining if drives which claim no retries actually do them or not. :) I seem to have dropped the ball on this one in that the scsi_error.c pieces of this patch http://marc.theaimsgroup.com/?l=linux-scsim=116485834119885 I thought I'd applied. Apparently I didn't, so I'll go back and put them in. Good. That would be a useful supplement to the patch I posted here. Cheers - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Eric D. Mudama wrote: Actually, it's possibly worse, since each failure in libata will generate 3-4 retries. With existing ATA error recovery in the drives, that's about 3 seconds per retry on average, or 12 seconds per failure. Multiply that by the number of blocks past the error to complete the request.. It really beats the alternative of a forced reboot due to, say, superblock I/O failing because it happened to get merged with an unrelated I/O which then failed.. Etc.. Definitely an improvement. The number of retries is an entirely separate issue. If we really care about it, then we should fix SD_MAX_RETRIES. The current value of 5 is *way* too high. It should be zero or one. Cheers - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
First off, please send SCSI patches to the SCSI list: linux-scsi@vger.kernel.org On Tue, 2007-01-30 at 19:47 -0500, Mark Lord wrote: In ancient kernels, the SCSI disk code used to continue after encountering a MEDIUM_ERROR. It would complete the good sectors before the error, fail the bad sector/block, and then continue with the rest of the request. Kernels since about 2.6.16 or so have been broken in this regard. They complete the good sectors before the error, and then fail the entire remaining portions of the request. What was the commit that introduced the change? ... I have a vague memory of it being deliberate. This is very risky behaviour, as a request is often a merge of several bios, and just because one application hits a bad sector is no reason to pretend that (for example) an adjacent directly lookup also failed. This patch fixes the behaviour to be similar to what we had originally. When a bad sector is encounted, SCSI will now work around it again, failing *only* the bad sector itself. Erm, but the corollary is that if we get a large read failure because of a bad track, you're going to try and chunk up it a sector at a time forcing an individual error for each sector is going to annoy some people ... particularly removable medium ones which return this error if the medium isn't present ... Are you sure this is really what we want to do? Signed-off-by: Mark Lord [EMAIL PROTECTED] --- diff -u --recursive --new-file --exclude-from=linux_17//Documentation/dontdiff old/drivers/scsi/scsi_lib.c linux/drivers/scsi/scsi_lib.c --- old/drivers/scsi/scsi_lib.c 2007-01-30 13:58:05.0 -0500 +++ linux/drivers/scsi/scsi_lib.c 2007-01-30 18:30:01.0 -0500 @@ -865,6 +865,12 @@ */ if (sense_valid !sense_deferred) { switch (sshdr.sense_key) { + case MEDIUM_ERROR: + // Bad sector. Fail it, and then continue the rest of the request: + if (scsi_end_request(cmd, 0, cmd-device-sector_size, 1) == NULL) { The sense key may have come with additional information I think we want to parse that (if it exists) rather than just blindly failing the first sector of the request. + cmd-retries = 0; // go around again.. + return; + } This would drop through to the UNIT_ATTENTION case if scsi_end_request() fails ... I don't think that's correct. case UNIT_ATTENTION: if (cmd-device-removable) { /* Detected disc change. Set a bit - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
James Bottomley wrote: First off, please send SCSI patches to the SCSI list: linux-scsi@vger.kernel.org Fixed already, thanks! This patch fixes the behaviour to be similar to what we had originally. When a bad sector is encounted, SCSI will now work around it again, failing *only* the bad sector itself. Erm, but the corollary is that if we get a large read failure because of a bad track, you're going to try and chunk up it a sector at a time That's better than the huge data-loss scenario that we currently have for single-sector errors. MUCH better. forcing an individual error for each sector is going to annoy some people ... particularly removable medium ones which return this error if the medium isn't present ... Are you sure this is really what we want to do? No, for removed-medium everything just fails right away. This patch is *only* for media errors, not any other failures. Cheers - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Mark Lord wrote: Eric D. Mudama wrote: Actually, it's possibly worse, since each failure in libata will generate 3-4 retries. With existing ATA error recovery in the drives, that's about 3 seconds per retry on average, or 12 seconds per failure. Multiply that by the number of blocks past the error to complete the request.. It really beats the alternative of a forced reboot due to, say, superblock I/O failing because it happened to get merged with an unrelated I/O which then failed.. Etc.. Definitely an improvement. The number of retries is an entirely separate issue. If we really care about it, then we should fix SD_MAX_RETRIES. The current value of 5 is *way* too high. It should be zero or one. Cheers I think that drives retry enough, we should leave retry at zero for normal (non-removable) drives. Should this be a policy we can set like we do with NCQ queue depth via /sys ? We need to be able to layer things like MD on top of normal drive errors in a way that will produce a system that provides reasonable response time despite any possible IO error on a single component. Another case that we end up doing on a regular basis is drive recovery. Errors need to be limited in scope to just the impacted area and dispatched up to the application layer as quickly as we can so that you don't spend days watching a copy of huge drive (think 750GB or more) ;-) ric - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
On Tue, 2007-01-30 at 22:20 -0500, Ric Wheeler wrote: Mark Lord wrote: The number of retries is an entirely separate issue. If we really care about it, then we should fix SD_MAX_RETRIES. The current value of 5 is *way* too high. It should be zero or one. Cheers I think that drives retry enough, we should leave retry at zero for normal (non-removable) drives. Should this be a policy we can set like we do with NCQ queue depth via /sys ? I don't disagree that it should be settable. However, retries occur for other reasons than failures inside the device. The most standard ones are unit attentions generated because of other activity (target reset etc). The key to the problem is retrying only operations that are genuinely retryable, which the mid-layer doesn't do such a good job on. We need to be able to layer things like MD on top of normal drive errors in a way that will produce a system that provides reasonable response time despite any possible IO error on a single component. Another case that we end up doing on a regular basis is drive recovery. Errors need to be limited in scope to just the impacted area and dispatched up to the application layer as quickly as we can so that you don't spend days watching a copy of huge drive (think 750GB or more) ;-) For the MD case, this is what REQ_FAILFAST is for. James - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Ric Wheeler wrote: Mark Lord wrote: Eric D. Mudama wrote: Actually, it's possibly worse, since each failure in libata will generate 3-4 retries. With existing ATA error recovery in the drives, that's about 3 seconds per retry on average, or 12 seconds per failure. Multiply that by the number of blocks past the error to complete the request.. It really beats the alternative of a forced reboot due to, say, superblock I/O failing because it happened to get merged with an unrelated I/O which then failed.. Etc.. Definitely an improvement. The number of retries is an entirely separate issue. If we really care about it, then we should fix SD_MAX_RETRIES. The current value of 5 is *way* too high. It should be zero or one. Cheers I think that drives retry enough, we should leave retry at zero for normal (non-removable) drives. Should this be a policy we can set like we do with NCQ queue depth via /sys ? The transport might also want a say. I see ABORTED COMMAND errors often enough with SAS (e.g. due to expander congestion) to warrant at least one retry (which works in my testing). SATA disks behind SAS infrastructure would also be susceptible to the same random failures. Transport Layer Retries (TLR) in SAS should remove this class of transport errors but only SAS tape drives support TLR as far as I know. Doug Gilbert We need to be able to layer things like MD on top of normal drive errors in a way that will produce a system that provides reasonable response time despite any possible IO error on a single component. Another case that we end up doing on a regular basis is drive recovery. Errors need to be limited in scope to just the impacted area and dispatched up to the application layer as quickly as we can so that you don't spend days watching a copy of huge drive (think 750GB or more) ;-) ric - To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html