Re: List of known BTRFS Raid 5/6 Bugs?

erenthetitan Fri, 10 Aug 2018 16:32:48 -0700

Did I get you right?
Please correct me if I am wrong:

Scrubbing seems to have been fixed; you only have to run it once.


Hotplugging (temporary connection loss) is affected by the write hole bug, and 
will let through about one undetectable error per 16 TB of corrupted blocks 
(crc32 limitation).
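
A back-of-the-envelope check of where the 16 TB figure could come from. This is only a sketch: the 4 KiB block size and the idea that a corrupted block passes a 32-bit checksum with probability 2^-32 are my assumptions, not something stated in the thread.

```python
# Assumed: each checksummed block is 4 KiB and a corrupted block matches
# its 32-bit checksum purely by chance with probability 1 / 2**32.
BLOCK_SIZE = 4096   # bytes per checksummed block (assumption)
CSUM_BITS = 32      # width of a CRC32 checksum

# Expected number of corrupted blocks read before one slips through.
blocks_per_false_accept = 2 ** CSUM_BITS
bytes_per_false_accept = blocks_per_false_accept * BLOCK_SIZE

tib = bytes_per_false_accept / 2 ** 40
print(f"about one undetected error per {tib:.0f} TiB of corrupted blocks")
```

Note that this counts corrupted blocks only, not all data read from the array.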

The write hole bug can affect both old and new data.
Reason: BTRFS saves data in fixed-size stripes; if the write operation fails 
midway, the stripe is lost.
This does not matter much for Raid 1/10: data always uses a full stripe, and 
stripes are copied on write. Only new data could be lost.
However, for some reason Raid 5/6 works with partial stripes, meaning that new 
data is stored in stripes not completely filled by prior data, and stripes are 
modified in place (read-modify-write) rather than copied on write.
Result: If the operation fails midway, the stripe is lost, as is all data 
previously stored in it.
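
A minimal sketch of how I understand the stale-parity problem, following the explanation quoted further down. It assumes a toy 3-disk RAID5 stripe with plain XOR parity and 4-byte blocks; the names (`xor_blocks`, `old_a`, `new_b`) are made up for illustration and have nothing to do with the real btrfs code.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a 3-disk array: two data blocks plus one parity block.
old_a = b"AAAA"                     # old, committed data on disk 0
old_b = b"\x00\x00\x00\x00"         # unused slot in the same stripe on disk 1
parity = xor_blocks(old_a, old_b)   # parity block on disk 2

# New data is written into the partially filled stripe on disk 1, but power
# is lost before the matching parity update reaches disk 2.
new_b = b"bbbb"
disks = [old_a, new_b, parity]      # parity is now stale

# With all disks online the old data is still read directly and is intact.
assert disks[0] == old_a

# If disk 0 fails later, rebuilding it from disk 1 and the stale parity
# yields garbage instead of the old committed data.
rebuilt_a = xor_blocks(disks[1], disks[2])
print("rebuilt from stale parity:", rebuilt_a, "expected:", old_a)

# A scrub while all disks are online recomputes parity and closes the hole.
disks[2] = xor_blocks(disks[0], disks[1])
rebuilt_after_scrub = xor_blocks(disks[1], disks[2])
print("rebuilt after scrub:", rebuilt_after_scrub)
```

In this toy model the old block only becomes unreadable once a disk is missing, which would explain why a scrub after every unclean shutdown matters.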

Transid mismatch can silently corrupt data.
Reason: It is a separate metadata failure that is triggered by lost or incomplete 
writes, i.e. writes that are lost somewhere during transmission.
It can happen to all BTRFS configurations and is not triggered by the write 
hole.
It could happen due to brownout (temporary undersupply of voltage), faulty 
cables, faulty RAM, faulty disk cache, or faulty disks in general.
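
To make sure I understand the detection side, here is a rough sketch of a parent pointer that records the transid it expects to find in the child block. The `Block`, `Pointer`, `TransidMismatch` and `read_child` names are hypothetical and the model is heavily simplified; it only illustrates the idea of embedding transids in pointers, not the real on-disk format.

```python
from dataclasses import dataclass

@dataclass
class Block:
    transid: int    # generation stamped into the block's own header
    payload: str

@dataclass
class Pointer:
    address: int
    expected_transid: int   # generation the parent recorded for its child

class TransidMismatch(Exception):
    pass

def read_child(disk, ptr):
    """Return the block a parent points to, or raise if its generation is wrong."""
    block = disk[ptr.address]
    if block.transid != ptr.expected_transid:
        raise TransidMismatch(
            f"block {ptr.address}: wanted transid {ptr.expected_transid}, "
            f"found {block.transid}"
        )
    return block

# Address 200 still holds an old, valid, but deleted block from transaction 5.
disk = {200: Block(transid=5, payload="stale metadata")}

# Transaction 8 writes a new child to address 200 and a parent pointing at it.
parent_ptr = Pointer(address=200, expected_transid=8)
write_was_dropped = True    # the drive acknowledged the write but never stored it
if not write_was_dropped:
    disk[200] = Block(transid=8, payload="new metadata")

try:
    read_child(disk, parent_ptr)
    detected = False
except TransidMismatch as err:
    detected = True
    print("transid verify failed:", err)
```

The stale block is perfectly valid on its own, so only the generation recorded in the parent reveals that the newer write never arrived.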

Both bugs could damage metadata and trigger the following:
Data will be lost (0 to 100% unreadable), the filesystem will be readonly.
Reason: BTRFS saves metadata as a tree structure. The closer the error to the 
root, the more data cannot be read.
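
A toy illustration of the "closer to the root" effect, assuming a small made-up tree where leaves stand for file data and one block is unreadable. The layout and the `readable_leaves` helper are hypothetical.

```python
def readable_leaves(tree, node, bad):
    """Count leaves reachable from `node` when block `bad` cannot be read."""
    if node == bad:
        return 0
    children = tree.get(node, [])
    if not children:
        return 1    # a leaf that is still reachable
    return sum(readable_leaves(tree, child, bad) for child in children)

# Hypothetical metadata tree: one root, two interior nodes, four leaves.
tree = {
    "root": ["node1", "node2"],
    "node1": ["leaf1", "leaf2"],
    "node2": ["leaf3", "leaf4"],
}
total = readable_leaves(tree, "root", bad=None)

for bad in ("leaf1", "node1", "root"):
    ok = readable_leaves(tree, "root", bad)
    print(f"damage at {bad}: {ok}/{total} leaves readable")
```

The same single-block error costs one leaf, half the tree, or everything, depending only on its position.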

Transid Mismatch can happen up to once every 3 months per device,
depending on the drive hardware!

Question: Does this not make transid mismatch way more dangerous than
the write hole? What would happen to other filesystems, like ext4?

On 10-Aug-2018 09:12:21 +0200, ce3g8...@umail.furryterror.org wrote:
> > On Fri, Aug 10, 2018 at 03:40:23AM +0200, erentheti...@mail.de wrote:
> > > I am searching for more information regarding possible bugs related to
> > > BTRFS Raid 5/6. All sites i could find are incomplete and information
> > > contradicts itself:
> > >
> > > The Wiki Raid 5/6 Page (https://btrfs.wiki.kernel.org/index.php/RAID56)
> > > warns of the write hole bug, stating that your data remains safe
> > > (except data written during power loss, obviously) upon unclean shutdown
> > > unless your data gets corrupted by further issues like bit-rot, drive
> > > failure etc.
> > 
> > The raid5/6 write hole bug exists on btrfs (as of 4.14-4.16) and there are
> > no mitigations to prevent or avoid it in mainline kernels.
> > 
> > The write hole results from allowing a mixture of old (committed) and
> > new (uncommitted) writes to the same RAID5/6 stripe (i.e. a group of
> > blocks consisting of one related data or parity block from each disk
> > in the array, such that writes to any of the data blocks affect the
> > correctness of the parity block and vice versa). If the writes were
> > not completed and one or more of the data blocks are not online, the
> > data blocks reconstructed by the raid5/6 algorithm will be corrupt.
> > 
> > If all disks are online, the write hole does not immediately
> > damage user-visible data as the old data blocks can still be read
> > directly; however, should a drive failure occur later, old data may
> > not be recoverable because the parity block will not be correct for
> > reconstructing the missing data block. A scrub can fix write hole
> > errors if all disks are online, and a scrub should be performed after
> > any unclean shutdown to recompute parity data.
> > 
> > The write hole always puts both old and new data at risk of damage;
> > however, due to btrfs's copy-on-write behavior, only the old damaged
> > data can be observed after power loss. The damaged new data will have
> > no references to it written to the disk due to the power failure, so
> > there is no way to observe the new damaged data using the filesystem.
> > Not every interrupted write causes damage to old data, but some will.
> > 
> > Two possible mitigations for the write hole are:
> > 
> > - modify the btrfs allocator to prevent writes to partially filled
> > raid5/6 stripes (similar to what the ssd mount option does, except
> > with the correct parameters to match RAID5/6 stripe boundaries),
> > and advise users to run btrfs balance much more often to reclaim
> > free space in partially occupied raid stripes
> > 
> > - add a stripe write journal to the raid5/6 layer (either in
> > btrfs itself, or in a lower RAID5 layer).
> > 
> > There are assorted other ideas (e.g. copy the RAID-Z approach from zfs
> > to btrfs or dramatically increase the btrfs block size) that also solve
> > the write hole problem but are somewhat more invasive and less practical
> > for btrfs.
> > 
> > Note that the write hole also affects btrfs on top of other similar
> > raid5/6 implementations (e.g. mdadm raid5 without stripe journal).
> > The btrfs CoW layer does not understand how to allocate data to avoid RMW
> > raid5 stripe updates without corrupting existing committed data, and this
> > limitation applies to every combination of unjournalled raid5/6 and btrfs.
> > 
> > > The Wiki Gotchas Page (https://btrfs.wiki.kernel.org/index.php/Gotchas)
> > > warns of possible incorrigible "transid" mismatch, not stating which
> > > versions are affected or what transid mismatch means for your data. It
> > > does not mention the write hole at all.
> > 
> > Neither raid5 nor write hole are required to produce a transid mismatch
> > failure. transid mismatch usually occurs due to a lost write. Write hole
> > is a specific case of lost write, but write hole does not usually produce
> > transid failures (it produces header or csum failures instead).
> > 
> > During real disk failure events, multiple distinct failure modes can
> > occur concurrently. i.e. both transid failure and write hole can occur
> > at different places in the same filesystem as a result of attempting to
> > use a failing disk over a long period of time.
> > 
> > A transid verify failure is metadata damage. It will make the filesystem
> > readonly and make some data inaccessible as described below.
> > 
> > > This Mail Archive
> > > (https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg55161.html)
> > > states that scrubbing BTRFS Raid 5/6 will always repair Data Corruption,
> > > but may corrupt your Metadata while trying to do so - meaning you have
> > > to scrub twice in a row to ensure data integrity.
> > 
> > Simple corruption (without write hole errors) is fixed by scrubbing
> > as of the last...at least six months? Kernel v4.14.xx and later can
> > definitely do it these days. Both data and metadata.
> > 
> > If the metadata is damaged in any way (corruption, write hole, or transid
> > verify failure) on btrfs and btrfs cannot use the raid profile for
> > metadata to recover the damaged data, the filesystem is usually forever
> > readonly, and anywhere from 0 to 100% of the filesystem may be readable
> > depending on where in the metadata tree structure the error occurs (the
> > closer to the root, the more data is lost). This is the same for dup,
> > raid1, raid5, raid6, and raid10 profiles. raid0 and single profiles are
> > not a good idea for metadata if you want a filesystem that can persist
> > across reboots (some use cases don't require persistence, so they can
> > use -msingle/-mraid0 btrfs as a large-scale tmpfs).
> > 
> > For all metadata raid profiles, recovery can fail due to risks including
> > RAM corruption, multiple drives having defects in the same locations,
> > or multiple drives with identically-behaving firmware bugs. For raid5/6
> > metadata there is the *additional* risk of the write hole bug preventing
> > recovery of metadata.
> > 
> > If the filesystem is -draid5 -mraid1 then the metadata is not vulnerable
> > to the write hole, but data is. In this configuration you can determine
> > with high confidence which files you need to restore from backup, and
> > the filesystem will remain writable to replace the restored data, because
> > raid1 does not have the write hole bug.
> > 
> > More than one scrub for a single write hole event won't help (and never
> > did). If the first scrub doesn't fix all the errors then your kernel
> > probably also has a race condition bug or regression that will permanently
> > corrupt the data (this was true in 2016 when the referenced mailing
> > list post was written).
> > 
> > Current kernels don't have such bugs--if the first scrub can correct
> > the data, it does, and if the first scrub can't correct the data then
> > all future scrubs will produce identical results.
> > 
> > Older kernels (2016) had problems reconstructing data during read()
> > operations but could fix data during scrub or balance operations.
> > These bugs, as far as I am able to test, have been fixed by v4.17 and
> > backported to v4.14.
> > 
> > > The Bugzilla Entry
> > > (https://bugzilla.kernel.org/buglist.cgi?component=btrfs) contains
> > > mostly unanswered bugs, which may or may not still count (2013 - 2018).
> > 
> > I find that any open bug over three years old on b.k.o can be safely
> > ignored because it has either already been fixed or there is not enough
> > information provided to understand what is going on.
> > 
> > > This Spinics Discussion
> > > (https://www.spinics.net/lists/linux-btrfs/msg76471.html) states
> > > that the write hole can even damage old data eg. data that was not
> > > accessed during unclean shutdown, the opposite of what the Raid5/6
> > > Status Page states!
> > 
> > Correct, write hole can *only* damage old data as described above.
> > 
> > > This Spinics comment
> > > (https://www.spinics.net/lists/linux-btrfs/msg76412.html) informs that
> > > hot-plugging a device will trigger the write hole. Accessed data will
> > > therefore be corrupted. In case the earlier statement about old data
> > > corruption is true, random data could be permanently lost. This is even
> > > more dangerous if you are connecting your devices via USB, as USB can
> > > disconnect due to external influence, e.g. touching the cables, shaking...
> > 
> > Hot-unplugging a device can cause many lost write events at once, and
> > each lost write event is very bad.
> > 
> > btrfs does not reject and resynchronize a device from a raid array if a
> > write to the device fails (unlike every other working RAID implementation
> > on Earth...). If the device reconnects, btrfs will read a mixture of
> > old and new data and rely on checksums to determine which blocks are
> > out of date (as opposed to treating the departed disk as entirely out
> > of date and initiating a disk replace operation when it reconnects).
> > 
> > A scrub after a momentary disconnect can reconstruct most missing data,
> > but not all. CRC32 lets one error through per 16 TB of corrupted blocks,
> > and all nodatasum/nodatacow files modified while a drive was offline
> > will be corrupted without detection or recovery by btrfs.
> > 
> > Device replace is currently the best recovery option from this kind
> > of failure. Ideally btrfs would implement something like mdadm write
> > intent bitmaps so only those block groups that were modified while
> > the device was offline would be replaced, but this is the btrfs we want
> > not the btrfs we have.
> > 
> > > Lastly, this Superuser question
> > > (https://superuser.com/questions/1325245/btrfs-transid-failure#1344494)
> > > assumes that the transid mismatch bug could render your system
> > > unmountable. While it might be possible to restore your data using
> > > sudo btrfs restore, it is still unknown how the transid mismatch is
> > > even triggered, meaning that your file system could fail at any time!
> > 
> > Note that transid failure risk applies to all btrfs configurations.
> > It is not specific to raid5/6. The write hole errors from raid5/6 will
> > typically produce a header or csum failure (from reading garbage) not a
> > transid failure (from reading an old, valid, but deleted metadata block).
> > 
> > transid mismatch is pretty simple: one of your disk drives, or some
> > caching or translation layer between btrfs and your disk drives, dropped
> > a write (or, less likely, read from or wrote to the wrong sector address).
> > btrfs detects this by embedding transids into all data structures where
> > one object points to another object in a different block.
> > 
> > transid mismatch is also hard: you then have to figure out which layer
> > of your possibly quite complicated RAID setup is doing that, and make
> > it stop. This process almost never involves btrfs. Sometimes it's the
> > bottom layer (i.e. the drives themselves) but the more layers you add,
> > the more candidates need to be eliminated before the cause can be found.
> > Sometimes it's a *power supply* (i.e. the drive controller CPU browns
> > out and forgets it was writing something or corrupts its embedded RAM).
> > Sometimes it's host RAM going bad, corrupting and breaking everything
> > it touches.
> > 
> > I have a variety of test setups and the correlation between hardware
> > model (especially drive model, but also some SATA controller models)
> > and total filesystem loss due to transid verify failure is very strong.
> > Out of 10 drive models from 5 vendors, 2 models can't keep a filesystem
> > intact for more than a few months, while the other models average 3 years
> > old and still hold the first btrfs filesystem they were formatted with.
> > 
> > Disabling drive write caching sometimes helps, but some hardware eats
> > a filesystem every few months no matter what settings I change. If the
> > problem is a broken SATA controller or cable then changing drive settings
> > won't help.
> > 
> > It's fun and/or scary to put known good and bad hardware in the same
> > RAID1 array and watch btrfs autocorrecting the bad data after every
> > other power failure; however, the bad hardware is clearly not sufficient
> > to implement any sort of reliable data persistence, and arrays with bad
> > hardware in them will eventually fail.
> > 
> > The bad drives can still contribute to society as media cache servers or
> > point-of-sale terminals where the only response to any data integrity
> > issue is a full reformat and image reinstall. This seems to be the
> > target market that low-end consumer drives are aiming for, as they seem
> > to be useless for anything else.
> > 
> > Adopt a zero-tolerance policy for drive resets after the array is
> > mounted and active. A drive reset means a potential lost write leading
> > to a transid verify failure. Swap out both drive and SATA cable the
> > first time a reset occurs during a read or write operation, and consider
> > swapping out SATA controller, changing drive model, and upgrading power
> > supply if it happens twice.
> > 
> > > Do you know of any comprehensive and complete Bug list?
> > 
> > ...related to raid5/6:
> > 
> > - no write hole mitigation (at least two viable strategies
> > available)
> > 
> > - no device bouncing mitigation (mdadm had this working 20
> > years ago)
> > 
> > - probably slower than it could be
> > 
> > - no recovery strategy other than raid (btrfs check --repair is
> > useless on non-trivial filesystems, and a single-bit uncorrected
> > metadata error makes the filesystem unusable)
> > 
> > > Do you know more about the stated Bugs?
> > >
> > > Do you know further Bugs that are not addressed in any of these sites?
> > 
> > My testing on raid5/6 filesystems is producing pretty favorable results
> > these days. There do not seem to be many bugs left.
> > 
> > I have one test case where I write millions of errors into a raid5/6 and
> > the filesystem recovers every single one transparently while verifying
> > SHA1 hashes of test data. After years of rebuilding busted ext3 on
> > mdadm-raid5 filesystems, watching btrfs do it all automatically is
> > just...beautiful.
> > 
> > I think once the write hole and device bouncing mitigations are in place,
> > I'll start looking at migrating -draid1/-mraid1 setups to -draid5/-mraid1,
> > assuming the performance isn't too painful.