On Wed, Nov 19, 2014 at 8:11 AM, Phillip Susi <ps...@ubuntu.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 11/18/2014 9:40 PM, Chris Murphy wrote:
>> It’s well known on linux-raid@ that consumer drives have well over
>> 30 second "deep recoveries" when they lack SCT command support. The
>> WDC and Seagate “green” drives are over 2 minutes apparently. This
>> isn’t easy to test because it requires a sector with enough error
>> that it requires the ECC to do something, and yet not so much error
>> that it gives up in less than 30 seconds. So you have to track down
>> a drive model spec document (one of those 100 pagers).
>>
>> This makes sense, sorta, because the manufacturer use case is
>> typically single drive only, and most proscribe raid5/6 with such
>> products. So it’s a “recover data at all costs” behavior because
>> it’s assumed to be the only (immediately) available copy.
>
> It doesn't make sense to me. If it can't recover the data after one
> or two hundred retries in one or two seconds, it can keep trying until
> the cows come home and it just isn't ever going to work.
I'm not a hard drive engineer, so I can't argue either point. But consumer
drives clearly do behave this way. On Linux, the kernel's default 30 second
command timer eventually results in what look like link errors rather than
drive read errors. And instead of the problems being fixed by the normal md
and btrfs recovery mechanisms, the errors simply get worse and eventually
there's data loss. Exhibits A, B, C, D: the linux-raid list is full to the
brim with such reports and their solutions.

>
>> I don’t see how that’s possible because anything other than the
>> drive explicitly producing a read error (which includes the
>> affected LBAs), it’s ambiguous what the actual problem is as far
>> as the kernel is concerned. It has no way of knowing which of
>> possibly dozens of ata commands queued up in the drive have
>> actually hung up the drive. It has no idea why the drive is hung up
>> as well.
>
> IIRC, this is true when the drive returns failure as well. The whole
> bio is marked as failed, and the page cache layer then begins retrying
> with progressively smaller requests to see if it can get *some* data out.

Well, that's very coarse. It's not at a sector level, so as long as the
drive keeps trying to read a particular LBA, but fails to either succeed
or give up and report a read error within 30 seconds, you just get a bunch
of wonky system behavior.

Conversely, what I've observed on Windows in such a case is that it
tolerates these deep recoveries on consumer drives. So reads just get
really slow, but the drive does seem to eventually recover (until it
doesn't). But yeah, 2 minutes is a long time. So then the user gets annoyed
and reinstalls their system. Since that means writing to the affected
drive, the firmware logic causes bad sectors to be remapped when the write
error is persistent. Problem solved, faster system.
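For reference, the two knobs in play look roughly like this (a sketch,
assuming smartmontools is installed; /dev/sda and sda stand in for whatever
the actual array member is):

```shell
# Query whether the drive supports and permits SCT ERC (smartmontools):
smartctl -l scterc /devve/null 2>&1 || true
smartctl -l scterc /dev/sda

# Set ERC to 70 deciseconds (7.0 s) for reads and writes, so the drive
# reports an explicit read error well before the kernel's command timer:
smartctl -l scterc,70,70 /dev/sda

# The kernel's per-command timeout for this device, in seconds (default 30):
cat /sys/block/sda/device/timeout

# If the drive's ERC can't be configured, raising the kernel timeout above
# the drive's worst-case deep recovery avoids the spurious link resets:
echo 180 > /sys/block/sda/device/timeout
```

Note the scterc setting typically doesn't survive a power cycle, so it's
usually reapplied at boot (e.g. from a udev rule or rc script).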
>
>> No I think 30 is pretty sane for servers using SATA drives because
>> if the bus is reset all pending commands in the queue get
>> obliterated which is worse than just waiting up to 30 seconds. With
>> SAS drives maybe less time makes sense. But in either case you
>> still need configurable SCT ERC, or it needs to be a sane fixed
>> default like 70 deciseconds.
>
> Who cares if multiple commands in the queue are obliterated if they
> can all be retried on the other mirror?

Because now you have a member drive that's inconsistent. At least in the md
raid case, a certain number of read failures causes the drive to be ejected
from the array; any write failure ejects it too. What you want is for the
drive to give up sooner with an explicit read error, so md can help fix the
problem by writing good data back to the affected LBA. That doesn't happen
when there are a bunch of link resets happening.

> Better to fall back to the
> other mirror NOW instead of waiting 30 seconds ( or longer! ). Sure,
> you might end up recovering more than you really had to, but that
> won't hurt anything.

Again, if your drive's SCT ERC is configurable and set to something sane
like 70 deciseconds, that read failure happens at most 7 seconds after the
read attempt. And md is notified of *exactly* which sectors are affected;
it immediately goes to the mirrored data, or rebuilds it from parity, and
then writes the correct data to the previously reported bad sectors. And
that fixes the problem. So really, if you're going to play the multiple
device game, you need the drive's error timing to be shorter than the
kernel's.

Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html