Re: end to end error recovery musings
On Wed, 2007-02-28 at 17:28 -0800, H. Peter Anvin wrote:
> James Bottomley wrote:
> > On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
> > > 4104. It's 8 bytes per hardware sector. At least for T10...
> >
> > Er ... that won't look good to the 512 ATA compatibility remapping ...
>
> Well, in that case you'd only see 8x512 data bytes, no metadata...

i.e. no support for block guard in the 512 byte sector emulation mode ...

James

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: end to end error recovery musings
James Bottomley wrote:
> On Wed, 2007-02-28 at 17:28 -0800, H. Peter Anvin wrote:
> > James Bottomley wrote:
> > > On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
> > > > 4104. It's 8 bytes per hardware sector. At least for T10...
> > >
> > > Er ... that won't look good to the 512 ATA compatibility remapping ...
> >
> > Well, in that case you'd only see 8x512 data bytes, no metadata...
>
> i.e. no support for block guard in the 512 byte sector emulation mode ...

That makes sense, though... if the raw sector size is 4096 bytes, that metadata would presumably not exist on a per-sector basis.

	-hpa
RE: end to end error recovery musings
>>>>> "Eric" == Moore, Eric <[EMAIL PROTECTED]> writes:

[Trimmed the worldwide broadcast CC: list down to linux-scsi]

Eric> From the scsi lld perspective, all we need is 32 byte cdbs, and
Eric> a mechanism to pass the tags down from above.

Ok, so your board only supports Type 2 protection?

Eric> It appears our driver to firmware interface is only providing
Eric> the reference and application tags.

My current code allows the submitter to specify which tags are valid between the OS and the HBA.  Your inbound scsi_cmnd will have a protection_tag_mask which tells you which fields are provided.  Similarly, there's a mask in scsi_host which allows the HBA to identify which protection types it supports.

I hadn't envisioned that an HBA might only provide a subset.  I'll ponder a bit.

Eric> I assume that for transfers greater than a sector, the
Eric> controller firmware updates the tags for all the other sectors
Eric> within the boundary.

In other words, you only support one app tag per request and not one per sector?

-- 
Martin K. Petersen
Oracle Linux Engineering
RE: end to end error recovery musings
On Tuesday, February 27, 2007 12:07 PM, Martin K. Petersen wrote:
> Not sure you're up-to-date on the T10 data integrity feature.
> Essentially it's an extension of the 520 byte sectors common in disk
> arrays.  For each 512 byte sector (or 4KB ditto) you get 8 bytes of
> protection data.  There's a 2 byte CRC (GUARD tag), a 2 byte
> user-defined tag (APP) and a 4-byte reference tag (REF).  Depending
> on how the drive is formatted, the REF tag usually needs to match
> the lower 32 bits of the target sector #.

From the scsi lld perspective, all we need is 32 byte cdbs, and a mechanism to pass the tags down from above.  It appears our driver to firmware interface is only providing the reference and application tags.  It seems the guard tag is not present, so I guess mpt fusion controller firmware is setting it (I will have to check with others).

I assume that for transfers greater than a sector, the controller firmware updates the tags for all the other sectors within the boundary.  I'm sure the flags probably tell whether EEDP is enabled or not.  I will have to check if there are some manufacturing pages that say whether the controller is capable of EEDP (as not all our controllers support it).

Here are the EEDP associated fields we provide in our scsi passthru, as well as target assist:

    u32 SecondaryReferenceTag
    u16 SecondaryApplicationTag
    u16 EEDPFlags
    u16 ApplicationTagTranslationMask
    u32 EEDPBlockSize
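[For reference, the 8-byte tuple Martin describes can be sketched in C. This is an illustrative layout only, not the mptfusion or kernel interface; the struct and function names are made up, and the bitwise CRC here uses the T10-DIF polynomial 0x8BB7 in a simple form rather than a production table-driven implementation.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The 8 bytes of T10 protection information appended to each
 * 512-byte data sector: 2-byte GUARD CRC, 2-byte APP tag,
 * 4-byte REF tag. */
struct t10_dif_tuple {
	uint16_t guard;   /* CRC of the 512 data bytes */
	uint16_t app;     /* user-defined application tag */
	uint32_t ref;     /* usually the low 32 bits of the target LBA */
};

/* CRC-16 over the data using the T10-DIF polynomial 0x8BB7,
 * computed bit by bit for clarity. */
static uint16_t t10_crc16(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;

	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)buf[i] << 8;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
					     : (uint16_t)(crc << 1);
	}
	return crc;
}

/* Verify one sector: the guard must match the data, and the ref
 * tag must match the low 32 bits of the sector number. */
static int t10_dif_verify(const uint8_t data[512],
			  const struct t10_dif_tuple *pi, uint64_t lba)
{
	return pi->guard == t10_crc16(data, 512) &&
	       pi->ref == (uint32_t)lba;
}
```

A host-side check like this is the point of exposing the protection data to host memory: corruption anywhere between the platter and RAM flips the guard, and a misdirected transfer flips the ref.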
Re: end to end error recovery musings
On Wed, 2007-02-28 at 12:16 -0500, Martin K. Petersen wrote:
> It's cool that it's on the radar in terms of the protocol.  That
> doesn't mean that drive manufacturers are going to implement it,
> though.  The ones I've talked to were unwilling to sacrifice
> capacity because that's the main competitive factor in the
> SATA/consumer space.  Maybe we'll see it in the nearline product
> ranges?  That would be a good start...

They wouldn't necessarily have to sacrifice capacity per se.  The current problem is that unlike SCSI disks, you can't seem to reformat SATA ones to arbitrary sector sizes.  However, I could see the SATA manufacturers selling capacity at 512 (or the new 4096) byte sectors but allowing their OEMs to reformat them to 520 (or 4160) and then implementing block guard on top of this.  The OEMs who did this would obviously lose 1.6% of the capacity, but that would be their choice ...

James
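[The ~1.6% figure follows directly from the tuple size: 8 bytes of protection per 512 data bytes is 8/512 ≈ 1.56%, or 8/520 ≈ 1.54% counted against the formatted sector. A quick back-of-the-envelope check; the function name is illustrative only:]

```c
#include <assert.h>

/* Fraction of formatted bytes that remain user data when each
 * data sector grows a protection tuple. */
static double usable_fraction(int data_bytes, int pi_bytes)
{
	return (double)data_bytes / (data_bytes + pi_bytes);
}
```

Note that the overhead shrinks with sector size: at 4096+8 (the 4104 Martin mentions downthread) the loss is only about 0.2%, which makes protection information much cheaper on 4K-sector drives.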
Re: end to end error recovery musings
>>>>> "James" == James Bottomley <[EMAIL PROTECTED]> writes:

James> However, I could see the SATA manufacturers selling capacity at
James> 512 (or the new 4096) sectors but allowing their OEMs to
James> reformat them 520 (or 4160)

4104.  It's 8 bytes per hardware sector.  At least for T10...

-- 
Martin K. Petersen
Oracle Linux Engineering
Re: end to end error recovery musings
James Bottomley wrote:
> On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
> > 4104. It's 8 bytes per hardware sector. At least for T10...
>
> Er ... that won't look good to the 512 ATA compatibility remapping ...

Well, in that case you'd only see 8x512 data bytes, no metadata...

	-hpa
Re: end to end error recovery musings
>>>>> "Eric" == Moore, Eric <[EMAIL PROTECTED]> writes:

Eric> Martin K. Petersen on Data Integrity Feature, which is also
Eric> called EEDP (End to End Data Protection), which he presented
Eric> some ideas/suggestions of adding an API in linux for this.

T10 DIF is interesting for a few things:

 - Ensuring that the data integrity is preserved when writing a
   buffer to disk

 - Ensuring that the write ends up on the right hardware sector

These features make the most sense in terms of WRITE.  Disks already have plenty of CRC on the data, so if a READ fails on a regular drive we already know about it.

We can, however, leverage DIF with my proposal to expose the protection data to host memory.  This will allow us to verify the data integrity information before passing it to the filesystem or application.  We can say: this is really the information the disk sent, and it hasn't been mangled along the way.  And by using the APP tag we can mark a sector as - say - metadata or data to ease putting the recovery puzzle back together.

It would be great if the app tag was more than 16 bits.  Ted mentioned that ideally he'd like to store the inode number in the app tag.  But as it stands there isn't room.

In any case this is all slightly orthogonal to Ric's original post about finding the right persistence heuristics in the error handling path...

-- 
Martin K. Petersen
Oracle Linux Engineering
Re: end to end error recovery musings
> These features make the most sense in terms of WRITE.  Disks already
> have plenty of CRC on the data so if a READ fails on a regular drive
> we already know about it.

Don't bet on it.  If you want to do this seriously you need an end to end (media to host RAM) checksum.  We do see bizarre and quite evil things happen to people occasionally because they rely on bus level protection - both faulty network cards and faulty disk or controller RAM can cause very bad things to happen in a critical environment, and are very very hard to detect and test for.

IDE has another hideously evil feature in this area.  Command blocks are sent by PIO cycles, and are therefore unprotected from corruption.  So while a data burst with corruption will error and retry, a command which corrupts the block number - although very much less likely (fewer bits and much lower speed) - will not be caught on a PATA system for read or for write, and will hit the wrong block.

With networking you can turn off hardware IP checksumming (and many cluster people do); with disks we don't yet have a proper end to end checksum-to-media system in the fs or block layers.

> It would be great if the app tag was more than 16 bits.  Ted
> mentioned that ideally he'd like to store the inode number in the
> app tag.  But as it stands there isn't room.

The lowest few bits are the most important with ext2/ext3, because you normally lose a sector of inodes, which means you've got dangly bits associated with a sequence of inodes with the same upper bits.  More problematic is losing indirect blocks, and being able to keep some kind of [inode low bits/block index] would help put stuff back together.

Alan
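[Alan's [inode low bits/block index] idea has to fit in the 16-bit APP tag. One hypothetical split - the 10/6 division, macro names, and helpers below are all made up for illustration; nothing in T10 mandates any layout - might look like:]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packing of [inode low bits/block index] into the
 * 16-bit APP tag: 10 low inode bits plus 6 low block-index bits.
 * The split is arbitrary. */
#define APP_INODE_BITS 10
#define APP_BLOCK_BITS 6

static uint16_t app_tag_pack(uint32_t ino, uint32_t blk_index)
{
	return (uint16_t)(((ino & ((1u << APP_INODE_BITS) - 1))
			   << APP_BLOCK_BITS) |
			  (blk_index & ((1u << APP_BLOCK_BITS) - 1)));
}

static uint32_t app_tag_inode_bits(uint16_t tag)
{
	return tag >> APP_BLOCK_BITS;
}

static uint32_t app_tag_block_bits(uint16_t tag)
{
	return tag & ((1u << APP_BLOCK_BITS) - 1);
}
```

Even with only the low bits, a recovery tool scanning orphaned sectors could bucket them by inode remainder and block index, which is exactly the "put stuff back together" aid Alan describes.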
Re: end to end error recovery musings
On Feb 27, 2007 19:02, Alan wrote:
> > It would be great if the app tag was more than 16 bits.  Ted
> > mentioned that ideally he'd like to store the inode number in the
> > app tag.  But as it stands there isn't room.
>
> The lowest few bits are the most important with ext2/ext3 because
> you normally lose a sector of inodes which means you've got dangly
> bits associated with a sequence of inodes with the same upper bits.
> More problematic is losing indirect blocks, and being able to keep
> some kind of [inode low bits/block index] would help put stuff back
> together.

In the ext4 extents format there is the ability (not implemented yet) to add some extra information into the extent index blocks (previously referred to as the ext3_extent_tail).  This is planned to be a checksum of the index block, and a back-pointer to the inode which is using this extent block.  This allows online detection of corrupt index blocks, and also detection of an index block that is written to the wrong location.

There is as yet no plan that I'm aware of to have in-filesystem checksums of the extent data.

Cheers, Andreas
-- 
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
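[The tail Andreas describes - a checksum plus a back-pointer to the owning inode - could look roughly like the sketch below. The struct layout, field names, and the use of a plain CRC-32 are guesses for illustration, not the actual on-disk ext4 format.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tail for an extent index block: lets a reader detect
 * both a corrupted block and a block written to the wrong location. */
struct extent_tail {
	uint32_t et_checksum;  /* CRC-32 of the block body */
	uint32_t et_inode;     /* inode that owns this index block */
};

/* Plain reflected CRC-32 (polynomial 0xEDB88320), bitwise for
 * clarity. */
static uint32_t crc32_le(const uint8_t *p, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
	}
	return ~crc;
}

/* Valid only if the checksum matches and the back-pointer names the
 * inode we reached this block from. */
static int extent_tail_verify(const uint8_t *body, size_t len,
			      const struct extent_tail *tail,
			      uint32_t expected_inode)
{
	return tail->et_checksum == crc32_le(body, len) &&
	       tail->et_inode == expected_inode;
}
```

The back-pointer is what catches the misdirected-write case: a block whose checksum is internally consistent but that landed at the wrong address still fails verification because it names a different inode.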
Re: end to end error recovery musings
Martin K. Petersen wrote:
> >>>>> "Eric" == Moore, Eric <[EMAIL PROTECTED]> writes:
>
> Eric> Martin K. Petersen on Data Intergrity Feature, which is also
> Eric> called EEDP(End to End Data Protection), which he presented
> Eric> some ideas/suggestions of adding an API in linux for this.
>
> T10 DIF is interesting for a few things:
>
>  - Ensuring that the data integrity is preserved when writing a
>    buffer to disk
>
>  - Ensuring that the write ends up on the right hardware sector
>
> These features make the most sense in terms of WRITE.  Disks already
> have plenty of CRC on the data so if a READ fails on a regular drive
> we already know about it.

There are paths through a read that could still benefit from the extra data integrity.  The CRC gets validated on the physical sector, but we don't have the same level of strict data checking once it is read into the disk's write cache or being transferred out of cache on the way to the transport...

> We can, however, leverage DIF with my proposal to expose the
> protection data to host memory.  This will allow us to verify the
> data integrity information before passing it to the filesystem or
> application.  We can say this is really the information the disk
> sent.  It hasn't been mangled along the way.  And by using the APP
> tag we can mark a sector as - say - metadata or data to ease putting
> the recovery puzzle back together.
>
> It would be great if the app tag was more than 16 bits.  Ted
> mentioned that ideally he'd like to store the inode number in the
> app tag.  But as it stands there isn't room.
>
> In any case this is all slightly orthogonal to Ric's original post
> about finding the right persistence heuristics in the error handling
> path...

Still all a very relevant discussion - I agree that we could really use more than just 16 bits...

ric
Re: end to end error recovery musings
>>>>> "Alan" == Alan <[EMAIL PROTECTED]> writes:

>> Not sure you're up-to-date on the T10 data integrity feature.
>> Essentially it's an extension of the 520 byte sectors common in disk
>> [...]

Alan> but here's a minor bit of passing bad news - quite a few older
Alan> ATA controllers can't issue DMA transfers that are not a
Alan> multiple of 512 bytes without crapping themselves (eg
Alan> READ_LONG).  Guess we may need to add ap->i_do_not_suck or
Alan> similar 8)

I'm afraid it stops even before you get that far.  There doesn't seem to be any interest in adopting the Data Integrity Feature (or anything similar) in the ATA camp.  So for now it's a SCSI-only thing.

I encourage people to lean on their favorite disk manufacturer.  This would be a great feature to have on SATA too...

-- 
Martin K. Petersen
Oracle Linux Engineering
Re: end to end error recovery musings
Alan wrote:
> > I think that this is mostly true, but we also need to balance this
> > against the need for higher levels to get a timely response.  In a
> > really large IO, a naive retry of a very large write could lead to
> > a non-responsive system for a very large time...
>
> And losing the I/O could result in a system that is non responsive
> until the tape restore completes two days later

Which brings us back to a recent discussion at the file system workshop on being more repair oriented in file system design so we can survive situations like this a bit more reliably ;-)

ric
Re: end to end error recovery musings
Theodore Tso wrote:
> In any case, the reason why I bring this up is that it would be
> really nice if there was a way with a single laptop drive to be able
> to do snapshots and background fsck's without having to use initrd's
> with device mapper.

This is a major part of why I've been trying to push integrated klibc, to have all that stuff as a unified kernel deliverable.  Unfortunately, as you know, Linus apparently rejected the concept, at least for now, at LKS last year.

With klibc this stuff could still be in one single wrapper without funny dependencies, but wouldn't have to be ported to kernel space.

	-hpa
Re: end to end error recovery musings
Jeff Garzik wrote:
> Theodore Tso wrote:
> > Can someone with knowledge of current disk drive behavior confirm
> > that for all drives that support bad block sparing, if an attempt
> > to write to a particular spot on disk results in an error due to
> > bad media at that spot, the disk drive will automatically rewrite
> > the sector to a sector in its spare pool, and automatically
> > redirect that sector to the new location.  I believe this should
> > always be true, so presumably with all modern disk drives a write
> > error should mean something very serious has happened.
>
> This is what will /probably/ happen.  The drive should indeed find a
> spare sector and remap it, if the write attempt encounters a bad
> spot on the media.
>
> However, with a large enough write, a large enough bad spot on the
> media, and firmware programmed to never take more than X seconds to
> complete their enterprise customers' I/O, it might just fail.
>
> IMO, somewhere in the kernel, when we receive a read-op or write-op
> media error, we should immediately try to plaster that area with
> small writes.  Sure, if it's a read-op you lost data, but this
> method will maximize the chance that you can refresh/reuse the
> logical sectors in question.
>
> 	Jeff

One interesting counter-example is a write smaller than a full page - say 512 bytes out of 4k.  If we need to do a read-modify-write and it just so happens that 1 of the 7 sectors we need to read is flaky, will this look like a write failure?

ric
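[Ric's numbers are easy to make concrete: a 4 KB page holds eight 512-byte sectors, so a one-sector write dirties one of them and the read-modify-write must first fetch the other seven - and a latent read error on any of those seven surfaces as a "write" failure. A minimal sketch of the arithmetic; the function name is illustrative and sector-aligned writes are assumed:]

```c
#include <assert.h>

#define SECTOR_SIZE 512
#define PAGE_SIZE   4096

/* Number of untouched sectors a read-modify-write must read back
 * when writing len bytes (sector-aligned, sector-multiple) within
 * one page. */
static int rmw_sectors_to_read(int len)
{
	int total = PAGE_SIZE / SECTOR_SIZE;  /* 8 sectors per page */
	int dirty = len / SECTOR_SIZE;        /* sectors being written */

	return total - dirty;
}
```

Only a full-page write avoids the read side entirely, which is why a sub-page write can fail on media the application never asked to touch.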
Re: end to end error recovery musings
> One interesting counter example is a smaller write than a full page
> - say 512 bytes out of 4k.  If we need to do a read-modify-write and
> it just so happens that 1 of the 7 sectors we need to read is flaky,
> will this look like a write failure?

The current core kernel code can't handle propagating sub-page sized errors up to the file system layers (there is nowhere in the page cache to store 'part of this page is missing').  This is a long standing (four year plus) problem with CD-RW support as well.

For ATA we can at least retrieve the true media sector size now, which may be helpful at the physical layer, but the page cache would need to grow some brains to do anything with it.
Re: end to end error recovery musings
H. Peter Anvin wrote:
> Ric Wheeler wrote:
> > We still have the following challenges:
> >
> > (1) read-ahead often means that we will retry every bad sector at
> > least twice from the file system level.  The first time, the fs
> > read ahead request triggers a speculative read that includes the
> > bad sector (triggering the error handling mechanisms) right before
> > the real application triggers a read that does the same thing.
> > Not sure what the answer is here since read-ahead is obviously a
> > huge win in the normal case.
>
> Probably the only sane thing to do is to remember the bad sectors
> and avoid attempting reading them; that would mean marking automatic
> versus explicitly requested requests to determine whether or not to
> filter them against a list of discovered bad blocks.

Some disks are doing their own read-ahead in the form of a background media scan.  Scans are done on request or periodically (e.g. once per day or once per week) and we have tools that can fetch the scan results from a disk (e.g. a list of unreadable sectors).  What we don't have is any way to feed such information to a file system that may be impacted.

Doug Gilbert
Re: end to end error recovery musings
On Fri, Feb 23, 2007 at 09:32:29PM -0500, Theodore Tso wrote:
> And having a way of making this list available to both the
> filesystem and to a userspace utility, so they can more easily deal
> with doing a forced rewrite of the bad sector, after determining
> which file is involved and perhaps doing something intelligent (up
> to and including automatically requesting a backup system to fetch a
> backup version of the file, and if it can be determined that the
> file shouldn't have been changed since the last backup,
> automatically fixing up the corrupted data block :-).

I had a small C program + perl script that would take a badblocks list and figure out which files on an XFS filesystem were trashed, though in the case of XFS it's somewhat easier because you can dump the extents for a file.  Something more generic wouldn't be hard to make work.  It also wouldn't be hard to extend this to inodes in some cases, though I'm not sure there is much you can do there beyond fsck.
Re: end to end error recovery musings
Ric Wheeler wrote:
> We still have the following challenges:
>
> (1) read-ahead often means that we will retry every bad sector at
> least twice from the file system level.  The first time, the fs read
> ahead request triggers a speculative read that includes the bad
> sector (triggering the error handling mechanisms) right before the
> real application triggers a read that does the same thing.  Not sure
> what the answer is here since read-ahead is obviously a huge win in
> the normal case.

Probably the only sane thing to do is to remember the bad sectors and avoid attempting to read them; that would mean marking automatic versus explicitly requested requests to determine whether or not to filter them against a list of discovered bad blocks.

	-hpa
Re: end to end error recovery musings
On Feb 23, 2007 16:03 -0800, H. Peter Anvin wrote:
> Ric Wheeler wrote:
> > (1) read-ahead often means that we will retry every bad sector at
> > least twice from the file system level.  The first time, the fs
> > read ahead request triggers a speculative read that includes the
> > bad sector (triggering the error handling mechanisms) right before
> > the real application triggers a read that does the same thing.
> > Not sure what the answer is here since read-ahead is obviously a
> > huge win in the normal case.
>
> Probably the only sane thing to do is to remember the bad sectors
> and avoid attempting reading them; that would mean marking automatic
> versus explicitly requested requests to determine whether or not to
> filter them against a list of discovered bad blocks.

And clearing this list when the sector is overwritten, as it will almost certainly be relocated at the disk level.

For that matter, a huge win would be to have the MD RAID layer rewrite only the bad sector (in hopes of the disk relocating it) instead of failing the whole disk.  Otherwise, a few read errors on different disks in a RAID set can take the whole system offline.  Apologies if this is already done in recent kernels...

Cheers, Andreas
-- 
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
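[The policy converged on here - remember sectors that failed a read, filter speculative read-ahead against the list, and forget a sector once an overwrite succeeds - can be sketched as a small table. Everything below is a toy illustration; a real implementation would live in MD or the block layer and need locking and a scalable structure:]

```c
#include <assert.h>
#include <stdint.h>

#define BADLIST_MAX 64

/* Toy bad-sector list. */
struct badlist {
	uint64_t sector[BADLIST_MAX];
	int count;
};

static int badlist_contains(const struct badlist *bl, uint64_t s)
{
	for (int i = 0; i < bl->count; i++)
		if (bl->sector[i] == s)
			return 1;
	return 0;
}

/* Called when a read against sector s reports a media error. */
static void badlist_record_read_error(struct badlist *bl, uint64_t s)
{
	if (bl->count < BADLIST_MAX && !badlist_contains(bl, s))
		bl->sector[bl->count++] = s;
}

/* Read-ahead is speculative, so skip known-bad sectors; an explicit
 * application read would still be attempted. */
static int badlist_allow_readahead(const struct badlist *bl, uint64_t s)
{
	return !badlist_contains(bl, s);
}

/* Called after a successful overwrite: the drive has very likely
 * remapped the sector, so drop it from the list. */
static void badlist_overwrite_ok(struct badlist *bl, uint64_t s)
{
	for (int i = 0; i < bl->count; i++)
		if (bl->sector[i] == s) {
			bl->sector[i] = bl->sector[--bl->count];
			return;
		}
}
```

The automatic-versus-explicit distinction matters: only speculative I/O consults the filter, so an application still gets a definitive error (or a recovered read) when it actually asks for the data.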
Re: end to end error recovery musings
Andreas Dilger wrote:
> And clearing this list when the sector is overwritten, as it will
> almost certainly be relocated at the disk level.

Certainly if the overwrite is successful.

	-hpa
Re: end to end error recovery musings
On Fri, Feb 23, 2007 at 05:37:23PM -0700, Andreas Dilger wrote:
> > Probably the only sane thing to do is to remember the bad sectors
> > and avoid attempting reading them; that would mean marking
> > automatic versus explicitly requested requests to determine
> > whether or not to filter them against a list of discovered bad
> > blocks.
>
> And clearing this list when the sector is overwritten, as it will
> almost certainly be relocated at the disk level.
>
> For that matter, a huge win would be to have the MD RAID layer
> rewrite only the bad sector (in hopes of the disk relocating it)
> instead of failing the whole disk.  Otherwise, a few read errors on
> different disks in a RAID set can take the whole system offline.
> Apologies if this is already done in recent kernels...

And having a way of making this list available to both the filesystem and to a userspace utility would help, so they can more easily deal with doing a forced rewrite of the bad sector after determining which file is involved, and perhaps do something intelligent (up to and including automatically requesting a backup system to fetch a backup version of the file - and, if it can be determined that the file shouldn't have been changed since the last backup, automatically fixing up the corrupted data block :-).

						- Ted