Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

Christoph Anton Mitterer Mon, 28 Nov 2016 17:53:16 -0800

On Mon, 2016-11-28 at 16:48 -0500, Zygo Blaxell wrote:
> If a drive's embedded controller RAM fails, you get corruption on the
> majority of reads from a single disk, and most writes will be corrupted
> (even if they were not before).


Having administered a multi-PiB Tier-2 site for the LHC Computing Grid,
with quite a number of disks, for nearly 10 years now, I have never come
across such a case of breakage so far...

Actually, most cases are as simple as the HDD failing outright, with the
failure properly signalled to the controller.



> If there's a transient failure due to environmental issues (e.g.
> short-term high-amplitude vibration or overheating) then writes may
> pause for mechanical retry loops.  If there is bitrot in SSDs
> (particularly in the address translation tables) it looks like a wall
> of random noise that only ends when the disk goes offline.  You can get
> combinations of these (e.g. RAM failures caused by transient
> overheating) where the drive's behavior changes over time.
> 
> When in doubt, don't write.

Sorry, but these cases, like any case of memory failure (be it in main
memory or in the HDD controller), would also affect any normal write.

So there's no point in protecting against this on the storage side...

Either never write at all... or have good backups for these rare cases.



Cheers,
Chris.
