Re: scrub implies failing drive - smartctl blissfully unaware

Chris Murphy Tue, 18 Nov 2014 10:58:23 -0800

On Nov 18, 2014, at 8:35 AM, Marc MERLIN <m...@merlins.org> wrote:

> On Tue, Nov 18, 2014 at 09:29:54AM +0200, Brendan Hide wrote:
>> Hey, guys
>> 
>> See further below extracted output from a daily scrub showing csum 
>> errors on sdb, part of a raid1 btrfs. Looking back, it has been getting 
>> errors like this for a few days now.
>> 
>> The disk is patently unreliable but smartctl's output implies there are 
>> no issues. Is this somehow standard faire for S.M.A.R.T. output?
> 
> Try running hdrecover on your drive, it'll scan all your blocks and try to
> rewrite the ones that are failing, if any:
> http://hdrecover.sourceforge.net/


The only way hdrecover can know if there is a bad sector is if the drive 
returns a read error, which will include the LBA for the affected sector(s). 
This is the same thing scrub does, except that scrub skips bad sectors that 
don’t contain data. A common problem in getting a drive to issue the read 
error, however, is a mismatch between the SCSI command timer setting (default 
30 seconds) and the SCT Error Recovery Control (ERC) setting for the drive. The 
drive’s SCT ERC value needs to be shorter than the SCSI command timer value; 
otherwise some bad sector errors will cause the drive to go into a recovery 
attempt that lasts beyond the SCSI command timer value. If that happens, the 
ATA link is reset, and there’s no possibility of finding out what the affected 
sector is.
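
The timing relationship above can be sketched in a few lines of Python. This 
is only an illustration, not part of any tool: the function names are 
hypothetical, and it assumes the kernel exposes the command timer in seconds at 
/sys/block/<dev>/device/timeout and that an ERC value of 0 means the drive has 
ERC disabled.

```python
from pathlib import Path


def read_command_timer(device, sysfs_root="/sys/block"):
    """Return the kernel's SCSI command timer for a device, in seconds."""
    # Assumed location, e.g. /sys/block/sdb/device/timeout (a plain integer).
    path = Path(sysfs_root, device, "device", "timeout")
    return int(path.read_text().strip())


def erc_fits_timer(erc_deciseconds, timer_seconds):
    """True if the drive should give up, and report the LBA, before the
    kernel's command timer expires and the link is reset."""
    if erc_deciseconds <= 0:
        # Assumption: 0 means ERC is disabled, so recovery time is unbounded.
        return False
    # SCT ERC is expressed in deciseconds, the command timer in seconds.
    return erc_deciseconds / 10 < timer_seconds
```

For example, erc_fits_timer(70, read_command_timer("sdb")) checks a 7 second 
ERC setting against whatever timer the kernel currently uses for sdb.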

So a.) use smartctl -l scterc to set the value below 30 seconds (300 
deciseconds), with 70 deciseconds being reasonable. If the drive doesn’t 
support SCT commands, then b.) change the Linux SCSI command timer to be 
greater than 120 seconds.
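
As a rough sketch of options a.) and b.), the Python below builds the smartctl 
call and writes the command timer. The helper names are hypothetical. It 
assumes smartctl’s "-l scterc,READTIME,WRITETIME" form with both values set 
equal, and the sysfs path /sys/block/<dev>/device/timeout for the timer. 
Actually running either step needs root and a real device.

```python
from pathlib import Path


def scterc_command(device, deciseconds=70):
    """Build the smartctl argument list for option a.)."""
    # Keep the value under the default 30 second (300 decisecond) timer.
    if not 0 < deciseconds < 300:
        raise ValueError("ERC must be between 1 and 299 deciseconds")
    # Read and write recovery limits are set to the same value here.
    return ["smartctl", "-l", f"scterc,{deciseconds},{deciseconds}", device]


def set_command_timer(device, seconds, sysfs_root="/sys/block"):
    """Option b.): raise the SCSI command timer for a drive without SCT."""
    if seconds <= 120:
        raise ValueError("timer should be greater than 120 seconds")
    # Assumed location, e.g. /sys/block/sdb/device/timeout.
    path = Path(sysfs_root, device, "device", "timeout")
    path.write_text(f"{seconds}\n")
    return path
```

On a live system this might be used as 
subprocess.run(scterc_command("/dev/sdb"), check=True), or as 
set_command_timer("sdb", 180) for a drive that rejects SCT commands. The 180 
is just an example value above the 120 second floor.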

Strictly speaking, the command timer should be set to a value that ensures 
there are no link reset messages in dmesg, i.e. long enough that the drive 
itself times out and actually reports a read error. This could be much shorter 
than 120 seconds. I don’t know if there are any consumer drives that try longer 
than 2 minutes to recover data from a marginally bad sector.

Ideally though, don’t use drives that lack SCT support in multiple device 
volume configurations. A storage stack hang of up to 2 minutes isn’t production 
compatible for most workflows.


Chris Murphy
