On Wed, Sep 10, 2014 at 11:51:19PM -0400, Zygo Blaxell wrote:
> This is a complex topic. 

I agree, and I make no claim to be an expert in any of this.

> Some disks have bugs in their firmware, and some of those bugs make the
> data sheets and most of this discussion entirely moot.  The firmware is
> gonna do what the firmware's gonna do.

Agreed. That's why I like the fact that btrfs provides another layer of
error checking / correction.

> It's a bad idea to try to rewrite a fading sector in some cases.
> If the drive is located in a climate-controlled data center then it
> should be OK; however, there are multiple causes of read failure and
> some of them will also cause writes to damage adjacent data on the disk.
> Spinning disks stop being able to position their heads properly around
> -10C or so, a fact that will be familiar to anyone who's tried to use a
> laptop outside in winter.  Maybe someone dropped the computer, and the
> read errors are due to the heads vibrating with the shock--a read retry
> a few milliseconds later would be OK, but a rewrite (without a delay,
> so the heads are still vibrating from the shock) would just wipe out
> some nearby data with no possibility of recovery.

Of course, the drive can't always know what's going on outside. It just
tries its best (we hope). 

> Most of the reallocations I've observed in the field happen when a
> sector is written, not read.

Very true. I believe what happens is that a sector is marked for
re-allocation when the read fails, and a write to that sector will
trigger the actual reallocation. Hence the "pending reallocations" SMART
attribute.
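
For anyone who wants to keep an eye on that counter, here is a rough
Python sketch that pulls raw values out of the attribute table printed
by "smartctl -A". It assumes the usual ten-column table layout and the
attribute names Current_Pending_Sector and Reallocated_Sector_Ct, which
are common but vary by vendor, so treat both as assumptions. The
function names are mine, not part of smartmontools.

```python
def parse_smart_attributes(text):
    """Map attribute name -> integer raw value from `smartctl -A` style text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # this skips the header row and any surrounding banner text.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            # Column 10 is RAW_VALUE; only keep rows where it is a plain integer.
            attrs[fields[1]] = int(fields[9])
        except ValueError:
            continue
    return attrs


def pending_sectors(text, name="Current_Pending_Sector"):
    """Return the pending-reallocation count, or None if the drive lacks it."""
    return parse_smart_attributes(text).get(name)
```

You would feed it the text captured from "smartctl -A /dev/sdX" (device
name is a placeholder). A nonzero result that goes back to zero after
the sectors are rewritten would match the behaviour described above.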

> Most disks can search for defects on their own, but the host has to issue a
> SMART command to initiate such a search.  They will also track defect
> rates and log recent error details (with varying degrees of bugginess).

And again, it's up to the questionable firmware's discretion as to how
that search is done / how thorough it is. And it has to be triggered by
the user / script. I don't consider that to really be "on its own," as
btrfs scrub requires the same level of input/scripting.
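
To illustrate how similar the scripting burden is, here is a small
Python sketch that just builds the two command lines a periodic job
would run. It only constructs and prints them; nothing is executed.
The mount point and device in the usage note are placeholders, and the
helper names are hypothetical.

```python
import shlex


def build_check_commands(mountpoint, device):
    """Return argv lists for a btrfs scrub and a long SMART self-test."""
    return [
        # -B keeps the scrub in the foreground so a script can wait on it.
        ["btrfs", "scrub", "start", "-B", mountpoint],
        # Ask the drive firmware to start its extended self-test.
        ["smartctl", "-t", "long", device],
    ]


def render(commands):
    """Quote each argv list as a shell-safe string for a log or a crontab."""
    return [shlex.join(cmd) for cmd in commands]
```

For example, render(build_check_commands("/mnt/backup", "/dev/sdb"))
gives the two lines you would put in a cron job or systemd timer,
ideally at different times so the scrub and the self-test don't
compete for the disk.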

> smartmontools is your friend.  It's not a replacement for btrfs scrub, but
> it collects occasionally useful complementary information about the
> health of the drive.

I can't find the link, but there was a study showing that an alarmingly
high percentage of failed disks reported no SMART errors before
failing.

> There used to be a firmware feature for drives to test themselves
> whenever they are spinning and idle for four continuous hours, but most
> modern disks will power themselves down if they are idle for much less
> time...and who has a disk that's idle for four hours at a time anyway?  ;)

My backup destination is touched once a day. It averages about 20 hours
a day idle. Though it probably doesn't need to be testing itself 80% of
the time. That would be a mite excessive =P

> > Scrub your disks, folks. A scrubbed disk is a happy disk.
> 
> Seconded.  Also remember that not all storage errors are due to disk
> failure.  There's a lot of RAM, high-speed signalling, and wire between
> the host CPU and a disk platter.  SMART self-tests won't detect failures
> in those, but scrubs will.

But we'll save the ECC RAM discussion for another day, perhaps.

--Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
