On Tue, Nov 29, 2016 at 4:16 PM, Wilson Meier <wilson.me...@gmail.com> wrote:
>
>
> On 29.11.2016 23:52, Chris Murphy wrote:
>> On Tue, Nov 29, 2016 at 3:34 PM, Wilson Meier <wilson.me...@gmail.com> wrote:
>>> On 29.11.2016 18:54, Austin S. Hemmelgarn wrote:
>>>> On 2016-11-29 12:20, Florian Lindner wrote:
>>>>> Hello,
>>>>>
>>>>> I have 4 harddisks with 3TB capacity each. They are all used in a
>>>>> btrfs RAID 5. It has come to my attention that there
>>>>> seem to be major flaws in btrfs' raid 5 implementation. Because of
>>>>> that, I want to convert the raid 5 to a raid 10
>>>>> and I have several questions.
>>>>>
>>>>> * Is that possible as an online conversion?
>>>> Yes, as long as you have a complete array to begin with (converting from
>>>> a degraded raid5/6 array has the same issues as rebuilding a degraded
>>>> raid5/6 array).
>>>>>
>>>>> * Since my effective capacity will shrink during conversion, does
>>>>> btrfs check if there is enough free capacity to
>>>>> convert? As you see below, right now it's probably too full, but I'm
>>>>> going to delete some stuff.
>>>> No, you'll have to do the math yourself. This would be a great project
>>>> idea to place on the wiki, though.
>>>>>
>>>>> * I understand the command to convert is
>>>>>
>>>>> btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt
>>>>>
>>>>> Correct?
>>>> Yes, but I would personally convert the metadata first, then the data. The
>>>> raid10 profile gets better performance than raid5, so converting the
>>>> metadata first (by issuing a balance just covering the metadata) should
>>>> speed up the data conversion a bit.
>>>>>
>>>>> * What disks are allowed to fail? My understanding of a raid 10 is
>>>>> like this:
>>>>>
>>>>> disks = {a, b, c, d}
>>>>>
>>>>> raid0( raid1(a, b), raid1(c, d) )
>>>>>
>>>>> This way (a XOR b) AND (c XOR d) are allowed to fail without the raid
>>>>> failing (either a or b, and c or d, are allowed to fail).
>>>>>
>>>>> How is that with a btrfs raid 10?
>>>> A BTRFS raid10 can only sustain one disk failure. Ideally, it would
>>>> work like you show, but in practice it doesn't.
>>> I'm a little bit concerned right now. I migrated my 4 disk raid6 to
>>> raid10 because of the known raid5/6 problems. I assumed that btrfs
>>> raid10 can handle 2 disk failures as long as they occur in different
>>> stripes.
>>> Could you please point out why it cannot sustain 2 disk failures?
>>
>> Conventional raid10 has a fixed assignment of which drives are
>> mirrored pairs; with Btrfs this doesn't happen at the device level
>> but rather at the chunk level. And a chunk stripe number is not fixed to
>> a particular device, so it's possible a device will have more
>> than one chunk stripe number. What that means is that the loss of two
>> devices has a pretty decent chance of resulting in the loss of both
>> copies of a chunk, whereas a conventional RAID 10 only loses data when
>> both members of the same mirrored pair fail.
>>
>> With very cursory testing, what I've found is that btrfs-progs establishes
>> an initial stripe-number-to-device mapping that's different from the
>> kernel code's. The kernel code appears to be pretty consistent so long
>> as the member devices are identically sized. So it's probably not an
>> unfixable problem, but the effect is that right now the Btrfs raid10
>> profile is more like raid0+1.
>>
>> You can use
>> $ sudo btrfs insp dump-tr -t 3 /dev/
>>
>> That will dump the chunk tree, and you can see if any device has more
>> than one chunk stripe number associated with it.
>>
>>
> Huh, that makes sense. That probably should be fixed :)
>
> Given your advised command (extended it a bit for readability):
> # btrfs insp dump-tr -t 3 /dev/mapper/luks-2.1 | grep "stripe " | awk '{
> print $1" "$2" "$3" "$4 }' | sort -u
>
> I get:
> stripe 0 devid 1
> stripe 0 devid 4
> stripe 1 devid 2
> stripe 1 devid 3
> stripe 1 devid 4
> stripe 2 devid 1
> stripe 2 devid 2
> stripe 2 devid 3
> stripe 3 devid 1
> stripe 3 devid 2
> stripe 3 devid 3
> stripe 3 devid 4
>
> Now i'm even more concerned!
Uhh yeah, this is a four device raid10? I'm a little confused why it's
not consistently showing four stripes per chunk, which would mean the
same number of stripe 0's as stripe 3's. I don't know what that's about.
A full balance might make the mapping consistent.

> That said, btrfs shouldn't be used for anything other than raid1, as
> every other raid level has serious problems or at least doesn't work
> as the expected raid level (in terms of failure recovery).

Well, raid1 also only tolerates a single device failure. There is no
n-copy raid1.

--
Chris Murphy
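
For anyone wanting to follow the metadata-first advice in the thread
above, the two balances would look roughly like this (a sketch; /mnt is
a placeholder for the actual mount point):

$ sudo btrfs balance start -mconvert=raid10 /mnt
$ sudo btrfs balance start -dconvert=raid10 /mnt

The first pass rewrites only the metadata chunks to raid10, the second
rewrites the data chunks. Adding the soft filter (e.g.
-dconvert=raid10,soft) skips chunks that already carry the target
profile, which helps if an interrupted balance has to be restarted.
Raid10 keeps two copies of everything, so four 3TB disks give roughly
6TB of usable space; the data has to fit in that before converting.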
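
To put a number on the two-device-failure point: with four devices
there are six possible two-disk failure combinations. A conventional
raid10 with fixed pairs raid1(a,b) and raid1(c,d) loses data for only
two of them (a+b, or c+d). Because btrfs mirrors per chunk, a
reasonably full filesystem will almost certainly have some chunk whose
two copies sit on any given pair of devices, so in practice any of the
six combinations can cost data. To inspect a real filesystem, the chunk
tree can be dumped with each chunk item kept together with its stripe
lines instead of the deduplicated summary above (a sketch, reusing the
device path from the earlier mail):

$ sudo btrfs insp dump-tr -t 3 /dev/mapper/luks-2.1 | grep -E 'CHUNK_ITEM|stripe [0-9]'

As far as I understand the raid10 layout, consecutive stripes in a
chunk (0/1 and 2/3 here) hold the same data, so any devid pair that
shows up as such a couple in some chunk is a pair whose simultaneous
loss would take that chunk with it.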