On Thu, Oct 13, 2016 at 12:33:31AM +0500, Roman Mamedov wrote:
> On Wed, 12 Oct 2016 15:19:16 -0400
> Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:
>
> > I'm not even sure btrfs does this--I haven't checked precisely what
> > it does in dup mode. It could send both copies of metadata to the
> > disks with a single barrier to separate both metadata updates from
> > the superblock updates. That would be bad in this particular case.
>
> It would be bad in any case, including a single physical disk and no RAID, and

No, a single disk does not have these problems. On a single disk we don't
have to deal with temporarily corrupted metadata _outside_ the areas we
are writing, as the disk will confine damaged data to individual sectors.
On RAID5, data damage is only limited at the stripe level, a unit orders
of magnitude larger than a sector.

> I don't think there's any basis to speculate that mdadm doesn't implement
> write barriers properly.

btrfs and mdadm have to use them properly together. It's possible to get
it fatally wrong from the btrfs side even if mdadm does everything
perfectly. Single disks don't have stripe consistency requirements, so if
btrfs has single-disk assumptions about the behavior of writes then it can
do the wrong thing on multi-disk systems.

> > In degraded RAID5/6 mode, all writes temporarily corrupt data, so if there
> > is an interruption (system crash, a disk times out, etc) in degraded mode,
>
> Moreover, in any non-COW system writes temporarily corrupt data. So again,
> writing to a (degraded or not) mdadm RAID5 is not much different than writing
> to a single physical disk. However I believe in the Btrfs case metadata is
> always COW, so this particular problem may be not as relevant here in the
> first place.

Degraded RAID5 does not behave like a single disk. That's the point
people seem to keep missing when thinking about this. btrfs CoW relies
on single-disk behavior, and fails badly when it doesn't get it.

btrfs CoW requires that writes to one sector don't modify or jeopardize
data integrity in any other sectors. mdadm in degraded raid5/6 mode with
no stripe journal device cannot deliver this requirement. Writes always
temporarily disrupt data on other disks in the same RAID stripe. Each
individual disruption lasts only milliseconds, but there may be hundreds
or thousands of failure windows per second.

>
> --
> With respect,
> Roman
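
To make the degraded-mode failure window concrete, here is a toy model
of one 3-disk RAID5 stripe with XOR parity. It is only a sketch of the
arithmetic, not of how mdadm is actually implemented: the block
contents, the 4-byte block size, and the names xor_blocks() and
read_missing_d1() are all invented for illustration. It assumes disk 1
has failed and that there is no stripe journal, so the block that lived
on disk 1 exists only as (d0 XOR parity).

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR any number of equal-length blocks byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: two data blocks and one parity block (hypothetical contents).
d0 = b"AAAA"                  # on disk 0, about to be overwritten
d1 = b"BBBB"                  # on disk 1, never written in this example
parity = xor_blocks(d0, d1)   # on disk 2


def read_missing_d1(d0_on_disk: bytes, parity_on_disk: bytes) -> bytes:
    """Degraded read: disk 1 is gone, so rebuild its block from the survivors."""
    return xor_blocks(d0_on_disk, parity_on_disk)


# Before any write, the degraded array still returns the right data for d1.
assert read_missing_d1(d0, parity) == d1

# Overwrite d0 only. This takes two separate device writes: the new data
# block and the new parity (read-modify-write: old parity ^ old d0 ^ new d0).
new_d0 = b"CCCC"
new_parity = xor_blocks(parity, d0, new_d0)

# Interruption after the data write but before the parity write:
# disk 0 holds new_d0, disk 2 still holds the old parity.
torn_read = read_missing_d1(new_d0, parity)

# Both writes completed: the stripe is consistent again.
clean_read = read_missing_d1(new_d0, new_parity)

print("d1 after interrupted write:", torn_read)
print("d1 after completed write:  ", clean_read)
```

In this model the interrupted write changes what the array returns for
d1 even though nothing ever wrote to d1. That is the property a single
disk does not have: the damage lands in a block outside the one being
written, which is the case CoW cannot protect against.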