On Fri, Oct 14, 2016 at 01:16:05AM -0600, Chris Murphy wrote:
> OK so we know for raid5 data block groups there can be RMW. And
> because of that, any interruption results in the write hole. On Btrfs
> though, the write hole is on disk only. If there's a lost strip
> (failed drive or UNC read), reconstruction from wrong parity results
> in a checksum error and EIO. That's good.
> 
> However, what happens in the metadata case? If metadata is raid5, and
> there's a crash or power failure during metadata RMW, same problem,
> wrong parity, bad reconstruction, csum mismatch, and EIO. So what's
> the effect of EIO when reading metadata? 

The effect is you can't access the page or anything referenced by
the page.  If the page happens to be a root or interior node of
something important, large parts of the filesystem are inaccessible,
or the filesystem is not mountable at all.  RAID device management and
balance operations don't work because they abort as soon as they find
the first unreadable metadata page.

In theory it's still possible to rebuild parts of the filesystem offline
using backrefs or brute-force search.  Using an old root might work too,
but in bad cases the newest viable root could be thousands of generations
old (i.e. it's more likely that no viable root exists at all).

> And how common is RMW for metadata operations?

RMW in metadata is the norm.  It happens on nearly all commits--the only
exception seems to be when both ends of a commit write happen to land
on stripe boundaries accidentally, which is less than 1% of the time on
3 disks.
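
For a rough sense of where that figure comes from, here's a
back-of-envelope sketch in Python.  The 64 KiB strip size, 4 KiB pages,
and uniformly placed commit endpoints are illustrative assumptions of
mine, not measurements:

    # Back-of-envelope: chance that a commit write avoids RMW entirely
    # on a 3-disk raid5 (assumed 64 KiB strips, 4 KiB pages).
    STRIP_SIZE = 64 * 1024            # bytes per disk per stripe
    DATA_DISKS = 3 - 1                # 3 disks, one strip holds parity
    PAGE_SIZE = 4 * 1024

    stripe_width = STRIP_SIZE * DATA_DISKS      # 128 KiB of data per stripe
    offsets = stripe_width // PAGE_SIZE         # 32 possible page offsets

    p_one_end = 1 / offsets                     # ~3.1% for one end to align
    p_both_ends = p_one_end ** 2                # ~0.1% for both ends

    print(f"P(commit needs no RMW) ~= {p_both_ends:.2%}")

which comes out around 0.1%, comfortably under 1%.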

> I wonder where all of these damn strange cases come from, where people
> can't do anything at all with a normally degraded raid5 - one device
> failed, and no other failures, but they can't mount due to a bunch of
> csum errors.

I'm *astonished* to hear about real-world successes with raid5 metadata.
The total-loss failure reports are the result I expect.

The current btrfs raid5 implementation is a thin layer of bugs on top
of code that is still missing critical pieces.  It combines the lack of
any mechanism to prevent RMW-related failures with zero tolerance for
those failures in metadata, so I expect a btrfs filesystem using raid5
metadata to be extremely fragile.  Failure is not just likely--it's
*inevitable*.

The non-RMW-aware allocator almost maximizes the risk of RMW data loss.
Every transaction commit contains multiple tree root pages, which
are the most critical metadata that could be lost due to RMW failure.
There is a window at least a few milliseconds wide, and potentially
several seconds wide, where some data on disk is in an unrecoverable
state due to RMW.  This happens twice a minute with the default commit
interval, and 99% of commits are affected.  That's a million opportunities
per machine-year to lose metadata.  If a crash lands on one of those,
boom, no more filesystem.
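
To spell out the arithmetic behind that number (assuming the default
30-second commit interval):

    # "A million opportunities per machine-year", roughly.
    commits_per_minute = 2                    # one commit every 30 seconds
    minutes_per_year = 60 * 24 * 365          # 525,600
    commits_per_year = commits_per_minute * minutes_per_year

    rmw_fraction = 0.99                       # ~99% of commits do RMW
    vulnerable = commits_per_year * rmw_fraction

    print(f"commits per machine-year:    {commits_per_year:,}")   # 1,051,200
    print(f"vulnerable commits per year: {vulnerable:,.0f}")      # ~1.04 million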

I expect one random crash (i.e. a crash that is not strongly correlated
to RMW activity) out of every 30-2000 (depending on filesystem size,
workload, rotation speed, and btrfs mount parameters) to destroy a
filesystem under typical conditions.  Real-world crashes tend not to be
random (i.e. they are strongly correlated to RMW activity), so filesystem
loss will be much more frequent in practice.
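
The 30-2000 range is just the ratio of the vulnerable window to the
commit interval.  A rough sketch, with window widths chosen to match the
"few milliseconds to several seconds" above (illustrative assumptions,
not measurements):

    # P(a random crash lands in the unrecoverable RMW window of a commit),
    # modelled as window_width / commit_interval.
    COMMIT_INTERVAL = 30.0                    # seconds, btrfs default

    for window in (0.015, 0.1, 1.0):          # 15 ms, 100 ms, 1 s
        p = window / COMMIT_INTERVAL
        print(f"window {window * 1000:6.0f} ms -> about 1 crash in {1 / p:,.0f}")

At a 15 ms window the odds are about 1 in 2000; at a full second they
are about 1 in 30.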


> 
> Chris Murphy
