Chris Johnson posted on Sun, 29 May 2016 09:33:49 -0700 as excerpted:

> Situation: A six disk RAID5/6 array with a completely failed disk. The
> failed disk is removed and an identical replacement drive is plugged
> in.
First of all, be aware (as you already will be if you're following the
list) that there are currently two, possibly related, (semi-?)critical
known bugs still affecting raid56 mode. As a result, despite raid56
mode's nominal completion in 3.19 and fixes for a couple of even more
critical bugs by the 4.1 release, raid56 remains negatively recommended
for anything but testing.

The first bug is that restriping (as done by balance with the restripe
filters after adding devices, or as triggered automatically by device
delete) can, in SOME cases only, with the trigger variable unknown at
this point, take an order of magnitude (or even more) longer than it
should. We're talking over a week for a rebalance that would be
expected to finish in under a day, up to possibly months for the
multi-TB filesystems that are a common use-case for raid5/6, which
might be expected to take a day or two under normal circumstances.
This rises to critical because, beyond the impractical time involved,
once a restripe takes weeks to months, the danger of another device
failing and thereby killing the entire array increases unacceptably,
to the point that raid56 cannot be considered usable for the normal
things people use it for, thus the critical rating, even if in theory
the restripe completes correctly and the data isn't in immediate
danger. Obviously you're not hitting it if your results show balance
as significantly faster, but because we don't know what triggers the
problem yet, that's no guarantee you won't hit it later.

The second bug is equally alarming, but in a different way. A number
of people have reported that replacing (by one method or the other) a
first device appears to work, but if a second replace is attempted, it
kills the array(!!). So obviously something is going wrong with the
first replace, as it's not returning the array to full undegraded
functionality, even tho all the current tools, as well as all
operations before the second replace is attempted, suggest that it has
done just that. This one too remains untraced to an ultimate cause,
and while the two bugs appear quite different, because both are
critical and remain untraced, it remains possible that they are simply
two different symptoms of the same root bug.

So if you're using raid56 only for testing, as is recommended, great.
But if you're using it for live data, for sure have your backups
ready, as there remains an uncomfortably high chance that you'll need
them if something goes wrong with that raid56 and these bug(s) prevent
you from recovering the array. Alternatively, switch to the more
mature raid1 or raid10 modes if realistic for your use-case, or to
more traditional solutions such as md/dm-raid underneath btrfs or some
other filesystem.

(One very interesting solution is btrfs raid1 mode on top of a pair of
md/dm-raid0 virtual devices, each of which can in turn be composed of
multiple physical devices. This gives you btrfs raid1 mode's data and
metadata integrity checking and repair, which underlying raid modes
don't have, including the repair of detected checksum errors that
btrfs single mode can't manage, since it can detect problems but not
correct them. Meanwhile, the underlying raid0 helps make up somewhat
for btrfs raid1's poor optimization and performance, as btrfs tends to
serialize access to multiple devices that other raid solutions
parallelize.)
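For concreteness, setting up that raid1-over-raid0 layout could look
something like the sketch below. This is an untested illustration, not
a recipe; the device names and the two-disks-per-raid0 split are
placeholders for your actual hardware:

    # two md raid0 virtual devices, each striped over two physical
    # disks (sdb/sdc/sdd/sde are placeholder device names)
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
    mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdd /dev/sde

    # btrfs raid1 for both data and metadata across the two md devices
    mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1

Btrfs then checksums and can repair from its raid1 copies, while each
copy gets raid0 striping speed underneath.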
Of course, the more mature zfs on linux can be another alternative, if
you're prepared to overlook the licensing issues and have hardware up
to the task.

With that warning explained and alternatives provided, to your actual
question...

> Here I have two options for replacing the disk, assuming the old
> drive is device 6 in the superblock and the replacement disk is
> /dev/sda.
>
> 'btrfs replace start 6 /dev/sda /mnt'
> This will start a rebuild of the array using the new drive, copying
> data that would have been on device 6 to the new drive from the
> parity data.
>
> 'btrfs device add /dev/sda /mnt && btrfs device delete missing /mnt'
> This adds a new device (the replacement disk) to the array, and dev
> delete missing appears to trigger a rebalance before deleting the
> missing disk from the array. The end result appears to be identical
> to option 1.
>
> A few weeks back I recovered an array with a failed drive using
> 'delete missing' because 'replace' caused a kernel panic. I later
> discovered that this was not (just) a failed drive but some other
> failed hardware that I've yet to start diagnosing. Either motherboard
> or HBA. The drives are in a new server now and I am currently
> rebuilding the array with 'replace', which I believe is the "more
> correct" way to replace a bad drive in an array.
>
> Both work, but 'replace' seems to be slower, so I'm curious what the
> functional differences are between the two. I thought replace would
> be faster, as I assumed it would need to read fewer blocks: instead
> of a complete rebalance, it's just rebuilding a drive from parity
> data.

Replace should be faster if the existing device being replaced is
still at least mostly functional, as in that case it's a pretty direct
low-level rewrite of the data from one device to the other. If the
device is dead or missing, or simply so damaged that most content must
be reconstructed from parity anyway, then (absent the bug mentioned
above, at least) the add/delete method can indeed be faster.

Note that when reconstructing from parity, /all/ devices must be read
in order to deduce what the content of the missing device actually
was, so you are mistaken in regard to /reads/. However, a replace
should /write/ less data: while it needs to read all remaining devices
to reconstruct the content of the missing device, it only has to write
the replacement, whereas balance will rewrite most or all content on
all devices. Even so, balance is apparently more efficient when the
device is actually missing, because in that case (again, absent the
above mentioned bug) its algorithm is more efficient, despite having
to rewrite everything instead of just the single device. (I'm just a
user and list regular, not a dev, so I won't attempt to explain more
"under the hood" than that.)

Meanwhile, not directly apropos to the question, but there are a few
btrfs features that are known to increase balance times /dramatically/,
in part due to known scaling issues that are being worked on, but it's
a multi-year project.

Btrfs quotas are a huge factor in this regard. However, quotas on
btrfs continue to be buggy and not always reliable in any case, so the
best recommendation for now continues to be to simply turn off (or
leave off, if never activated) btrfs quotas if you don't actually need
them, and to use a more mature filesystem, where quotas are mature and
reliable, if you do need them. That will reduce balance (and btrfs
check) times /dramatically/.
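If quotas are on and you don't need them, turning them off is a
one-liner (with /mnt standing in for your actual mountpoint):

    # disable quota accounting on the mounted filesystem
    btrfs quota disable /mnt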
The other factor is snapshots and/or other forms of heavy reflinking
such as dedup. Keeping the number of snapshots per subvolume
reasonably low, say 250-300 and definitely under 500, helps
dramatically in this regard, as balance (and check) operations simply
don't scale well when there are thousands or tens of thousands of
reflinks per extent to account for. Unfortunately (in this regard)
it's incredibly easy and fast to create snapshots, deceptively so,
masking the work balance and check have to do to maintain them. So
scheduled snapshotting is fine, as long as you have scheduled snapshot
thinning in place as well, keeping the number of snapshots per
subvolume to a few hundred at most.
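A thinning job can be as simple as the hypothetical sketch below, here
keeping only the newest 200 snapshots. The /mnt/snapshots path, the
count, and the assumption that snapshot names sort oldest-first (e.g.
timestamped names) are all placeholders to adapt:

    # hypothetical example: delete all but the newest 200 snapshots;
    # assumes names under /mnt/snapshots sort oldest-first
    ls -1d /mnt/snapshots/* | head -n -200 | while read snap; do
        btrfs subvolume delete "$snap"
    done

Run it from cron or a timer alongside the snapshot job; tools such as
snapper can handle both the scheduled snapshotting and the cleanup for
you.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman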