Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

Chris Murphy Thu, 28 Jun 2018 15:28:15 -0700

On Thu, Jun 28, 2018 at 11:37 AM, Goffredo Baroncelli
<kreij...@libero.it> wrote:


> Regarding your point 3), it must be pointed out that in the case of NOCOW files,
> even having the same transid is not enough. It is still possible that a
> copy is updated before a power failure prevents the super-block update.
> I think that the only way to prevent this from happening is:
>   1) using a data journal (which means that each data is copied two times)
> OR
>   2) using a cow filesystem (with cow enabled of course !)


There is no power failure in this example. So it's really off the
table when considering whether Btrfs or mdadm/lvm raid does better in
the same situation with a nodatacow file.

I think this is the problem in the Btrfs nodatacow case. Btrfs doesn't
have a way of distrusting nodatacow files on a previously missing drive
that hasn't been balanced. There is no such thing as nometadatacow, so
for metadata it always figures out there's a problem and uses the good
copy, but it never "marks" the previously missing device
as suspicious. When it comes time to read a nodatacow file, Btrfs just
blindly reads off one of the drives; without csums it has no mechanism
for questioning the formerly missing drive.

That is actually a really weird and unique kind of write hole for
Btrfs raid1 when the data is nodatacow.
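
To make the failure mode concrete, here is a toy model in Python. It is
not Btrfs code: the Mirror class, the write/read helpers, and the use of
zlib.crc32 as the checksum are all assumptions made for illustration. It
only shows why a stale copy is caught when a csum exists and silently
returned when it doesn't.

```python
# Toy model only, not Btrfs code. All names here are hypothetical.
import zlib


class Mirror:
    def __init__(self):
        self.blocks = {}    # block number -> bytes
        self.online = True


def write(mirrors, csums, blockno, data, with_csum):
    """Write to every online mirror; record a checksum only if with_csum."""
    for m in mirrors:
        if m.online:
            m.blocks[blockno] = data
    if with_csum:
        csums[blockno] = zlib.crc32(data)


def read(mirrors, csums, blockno, prefer):
    """Try the preferred mirror first; fall back only when a csum exists and fails."""
    order = [mirrors[prefer]] + [m for i, m in enumerate(mirrors) if i != prefer]
    for m in order:
        data = m.blocks[blockno]
        if blockno not in csums or zlib.crc32(data) == csums[blockno]:
            return data
    raise IOError("no good copy")


def demo(with_csum):
    a, b = Mirror(), Mirror()
    csums = {}
    write([a, b], csums, 0, b"old", with_csum)
    b.online = False                         # device drops briefly
    write([a, b], csums, 0, b"new", with_csum)
    b.online = True                          # device returns, no resync
    return read([a, b], csums, 0, prefer=1)  # read happens to hit mirror b
```

In this model demo(True) returns b"new" because the csum mismatch on the
stale mirror forces a fallback, while demo(False) returns b"old".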

I have to agree with Remi. This is a flaw in the design or a bad bug,
however you want to consider it. Because mdadm/lvm do not behave this
way in the exact same situation.

And an open question I have about scrub is whether it only ever
checks csums, meaning nodatacow files are never scrubbed, or if the
copies are at least compared to each other?
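
For what it's worth, here is a hypothetical sketch of what each option
could tell you. It does not describe what Btrfs scrub actually does; the
scrub_block function and its return strings are made up, and zlib.crc32
is only a stand-in checksum. The point is that comparing copies without a
csum can detect a mismatch but cannot say which copy is the good one.

```python
# Hypothetical sketch, not the Btrfs scrub implementation.
import zlib


def scrub_block(copy_a, copy_b, csum=None):
    """Classify a two-copy block, with or without a stored checksum."""
    if csum is not None:
        # With a checksum, each copy can be judged on its own.
        good = [c for c in (copy_a, copy_b) if zlib.crc32(c) == csum]
        if len(good) == 2:
            return "ok"
        if len(good) == 1:
            return "repair from good copy"
        return "unrecoverable"
    # Without a checksum, the copies can only be compared to each other.
    if copy_a == copy_b:
        return "ok"
    return "mismatch, cannot tell which copy is good"
```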

As for fixes:

- At mount time, if Btrfs sees from the supers that there is a transid
mismatch, it should not read nodatacow files from the lower transid
device until an auto balance has completed. Right now Btrfs doesn't have
an abbreviated balance that "replays" the events between two transids.
Basically it would work like send/receive but for balance, to catch up
a previously missing device. Right now we have to do a full balance,
which is a brutal penalty for a briefly missing drive. Again, mdadm
and lvm do better here by default.

- Fix the performance issues of COW with disk images. ZFS doesn't even
have a nodatacow option and they're running VM images on ZFS and it
doesn't sound like they're running into ridiculous performance
penalties that make it impractical to use.
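
A rough sketch of the first fix, as a read policy. This is an
illustration under assumptions, not a patch: Device, mark_suspects, and
pick_device_for_nocsum_read are hypothetical names, and the "catch-up
balance" that would clear the suspect set doesn't exist today.

```python
# Hypothetical sketch of the proposed policy, not Btrfs code.
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    super_transid: int  # generation recorded in this device's superblock


def mark_suspects(devices):
    """At mount, flag every device whose super transid is behind the newest."""
    newest = max(d.super_transid for d in devices)
    return {d.name for d in devices if d.super_transid < newest}


def pick_device_for_nocsum_read(devices, suspects):
    """Serve reads of csum-less (nodatacow) data only from non-suspect devices."""
    for d in devices:
        if d.name not in suspects:
            return d.name
    raise IOError("no trusted device for nodatacow read")


devices = [Device("sdb", 90), Device("sda", 100)]
suspects = mark_suspects(devices)                      # sdb is behind
first_read = pick_device_for_nocsum_read(devices, suspects)
suspects.clear()                                       # catch-up balance finished
later_read = pick_device_for_nocsum_read(devices, suspects)
```

Here first_read is "sda" even though sdb is listed first, and later_read
is "sdb" once the suspect set has been cleared.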



-- 
Chris Murphy