On Wed, Oct 12, 2016 at 01:31:41PM -0400, Zygo Blaxell wrote:
> On Wed, Oct 12, 2016 at 12:25:51PM +0500, Roman Mamedov wrote:
> > Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:
> > 
> > > A btrfs -dsingle -mdup array on a mdadm raid[56] device might have a
> > > snowball's chance in hell of surviving a disk failure on a live array
> > > with only data losses.  This would work if mdadm and btrfs successfully
> > > arrange to have each dup copy of metadata updated separately, and one
> > > of the copies survives the raid5 write hole.  I've never tested this
> > > configuration, and I'd test the heck out of it before considering
> > > using it.
> > 
> > Not sure what you mean here; a non-fatal disk failure (i.e. one still
> > compensated for by redundancy) is invisible to the upper layers on
> > mdadm arrays.  They do not need to "arrange" anything: on such a
> > failure, from the point of view of Btrfs nothing whatsoever has
> > happened to the /dev/mdX block device; it's still perfectly and
> > correctly readable and writable.
> 
> btrfs hurls a bunch of writes for one metadata copy to mdadm, and mdadm
> forwards those writes to the disks.  btrfs then sends a barrier to mdadm;
> mdadm must properly forward that barrier to all the disks and wait until
> they're all done.  Repeat the above for the other metadata copy.

I'm not even sure btrfs does this; I haven't checked precisely what it
does in dup mode.  It could send both copies of metadata to the disks
followed by a single barrier, which would separate both metadata updates
from the superblock update but not from each other.  That would be bad
in this particular case.
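
To make the ordering concrete, here is a rough sketch in Python of the
two commit orderings I'm talking about.  This is purely illustrative (my
reading of the above, not btrfs source code); "Extent", "barrier" and
the function names are all made up:

import os
from collections import namedtuple

# Illustrative only: an "extent" here is just (offset, data) for one
# on-disk write.
Extent = namedtuple("Extent", ["offset", "data"])

def barrier(fd):
    # Stand-in for the block-layer flush/barrier: everything written
    # before this call must be durable on every member disk before
    # anything written after it.
    os.fsync(fd)

def commit_separate_barriers(fd, copy1, copy2, superblocks):
    # The ordering described in the quoted text: each dup metadata copy
    # is made durable before the other copy is touched, so a crash can
    # tear at most one copy at a time.
    for e in copy1:
        os.pwrite(fd, e.data, e.offset)
    barrier(fd)
    for e in copy2:
        os.pwrite(fd, e.data, e.offset)
    barrier(fd)
    for e in superblocks:
        os.pwrite(fd, e.data, e.offset)
    barrier(fd)

def commit_single_barrier(fd, copy1, copy2, superblocks):
    # The ordering I'm worried about: both metadata copies are in flight
    # behind a single barrier, so on degraded raid5/6 a badly timed
    # crash can leave both copies torn at the same time.
    for e in copy1 + copy2:
        os.pwrite(fd, e.data, e.offset)
    barrier(fd)
    for e in superblocks:
        os.pwrite(fd, e.data, e.offset)
    barrier(fd)

The first function is the behavior described in the quoted paragraph;
the second is the one I haven't ruled out.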

> If that's all implemented correctly in mdadm, all is well; otherwise,
> mdadm and btrfs fail to arrange to have each dup copy of metadata
> updated separately.

To be clearer about the consequences of this:

If both copies of metadata are updated at the same time (because btrfs
and mdadm failed to get the barriers right), it's possible for both
copies to end up in an inconsistent (unreadable) state at once, and that
is the end of the filesystem.

In degraded RAID5/6 mode, every write temporarily corrupts data in the
stripe it touches (contents reconstructed from parity are garbage until
the whole stripe has been updated), so if there is an interruption
(system crash, disk timeout, etc.) in degraded mode, one of the metadata
copies will be damaged.  The damage may not be limited to the current
commit, so we need the second copy of the metadata intact to recover
from broken changes to the first copy.  Usually metadata chunks are
larger than RAID5 stripes, so this works out for btrfs on mdadm RAID5
(maybe not if two metadata chunks are adjacent and not stripe-aligned,
but that's a rare case, and one that only affects array sizes that are
not a power of 2 + 1 disk for RAID5, or a power of 2 + 2 disks for
RAID6).
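
For anyone who wants to check that last bit of arithmetic, here is a
trivial script.  It assumes mdadm's default 512 KiB chunk and a 256 MiB
btrfs metadata chunk; both are tunable, and real chunk placement also
depends on where the allocator starts each chunk, so treat it only as an
illustration of the divisibility argument:

MDADM_CHUNK = 512 * 1024                  # bytes per data disk per stripe
BTRFS_METADATA_CHUNK = 256 * 1024 * 1024  # typical metadata chunk size

def full_stripe_bytes(disks, parity):
    # Data payload of one full stripe: (disks - parity) mdadm chunks.
    return (disks - parity) * MDADM_CHUNK

for disks in range(3, 11):
    for parity, name in ((1, "raid5"), (2, "raid6")):
        if disks - parity < 2:
            continue
        stripe = full_stripe_bytes(disks, parity)
        aligned = BTRFS_METADATA_CHUNK % stripe == 0
        print("%s, %d disks: full stripe %d KiB, metadata chunk %s a"
              " whole number of stripes"
              % (name, disks, stripe // 1024,
                 "is" if aligned else "is NOT"))

With 4+1 disks the full stripe is 2 MiB and divides 256 MiB evenly; with
3+1 disks it is 1.5 MiB and does not, which is the case where two
adjacent metadata chunks can end up sharing a stripe.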

> The present state of the disks is irrelevant.  The array could go
> degraded due to a disk failure at any time, so for practical failure
> analysis purposes, only the behavior in degraded mode is relevant.
> 
> > 
> > -- 
> > With respect,
> > Roman