On 30.11.2016 00:49, Chris Murphy wrote:
> On Tue, Nov 29, 2016 at 4:16 PM, Wilson Meier <wilson.me...@gmail.com> wrote:
>>
>>
>> On 29.11.2016 23:52, Chris Murphy wrote:
>>> On Tue, Nov 29, 2016 at 3:34 PM, Wilson Meier <wilson.me...@gmail.com>
>>> wrote:
>>>> On 29.11.2016 18:54, Austin S. Hemmelgarn wrote:
>>>>> On 2016-11-29 12:20, Florian Lindner wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I have 4 harddisks with 3TB capacity each. They are all used in a
>>>>>> btrfs RAID 5. It has come to my attention that there
>>>>>> seem to be major flaws in btrfs' raid 5 implementation. Because of
>>>>>> that, I want to convert the raid 5 to a raid 10,
>>>>>> and I have several questions.
>>>>>>
>>>>>> * Is that possible as an online conversion?
>>>>> Yes, as long as you have a complete array to begin with (converting from
>>>>> a degraded raid5/6 array has the same issues as rebuilding a degraded
>>>>> raid5/6 array).
>>>>>>
>>>>>> * Since my effective capacity will shrink during conversion, does
>>>>>> btrfs check if there is enough free capacity to
>>>>>> convert? As you see below, right now it's probably too full, but I'm
>>>>>> going to delete some stuff.
>>>>> No, you'll have to do the math yourself. This would be a great project
>>>>> idea to place on the wiki, though.
>>>>>>
>>>>>> * I understand the command to convert is
>>>>>>
>>>>>> btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt
>>>>>>
>>>>>> Correct?
>>>>> Yes, but I would personally convert first metadata, then data. The
>>>>> raid10 profile gets better performance than raid5, so converting the
>>>>> metadata first (by issuing a balance just covering the metadata) should
>>>>> speed up the data conversion a bit.
>>>>>>
>>>>>> * What disks are allowed to fail? My understanding of a raid 10 is
>>>>>> like this:
>>>>>>
>>>>>> disks = {a, b, c, d}
>>>>>>
>>>>>> raid0( raid1(a, b), raid1(c, d) )
>>>>>>
>>>>>> This way (a XOR b) AND (c XOR d) are allowed to fail without the raid
>>>>>> failing (either a or b, and c or d, may fail).
>>>>>>
>>>>>> How is that with a btrfs raid 10?
>>>>> A BTRFS raid10 can only sustain one disk failure. Ideally, it would
>>>>> work like you show, but in practice it doesn't.
>>>> I'm a little bit concerned right now. I migrated my 4 disk raid6 to
>>>> raid10 because of the known raid5/6 problems. I assumed that btrfs
>>>> raid10 can handle 2 disk failures as long as they occur in different
>>>> stripes.
>>>> Could you please point out why it cannot sustain 2 disk failures?
>>>
>>> Conventional raid10 has a fixed assignment of which drives are
>>> mirrored pairs, and this doesn't happen with Btrfs at the device level
>>> but rather at the chunk level. And a chunk stripe number is not fixed to
>>> a particular device, therefore it's possible a device will have more
>>> than one chunk stripe number. So what that means is the loss of two
>>> devices has a pretty decent chance of resulting in the loss of both
>>> copies of a chunk, whereas conventional RAID 10 must lose both
>>> members of a mirrored pair for data loss to happen.
>>>
>>> With very cursory testing, what I've found is btrfs-progs establishes
>>> an initial stripe number to device mapping that's different from the
>>> kernel code. The kernel code appears to be pretty consistent so long
>>> as the member devices are identically sized. So it's probably not an
>>> unfixable problem, but the effect is that right now the Btrfs raid10
>>> profile is more like raid0+1.
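
As an aside, for anyone wanting to try the metadata-first conversion
suggested above, a rough, untested sketch (the /mnt mount point comes
from Florian's mail and is just a placeholder; adjust it and check
"man btrfs-balance" first):

# convert the metadata chunks first ...
btrfs balance start -mconvert=raid10 /mnt
# ... then the (much larger) data chunks
btrfs balance start -dconvert=raid10 /mnt
# watch progress from another shell
btrfs balance status /mnt

On the "do the math yourself" point: raid10 keeps two copies of
everything, so 4 x 3TB gives roughly (4 * 3TB) / 2 = 6TB usable,
versus about (4 - 1) * 3TB = 9TB for raid5 on the same disks.
"btrfs filesystem usage /mnt" shows how much is currently used, and
that figure needs to fit under ~6TB before the conversion can finish.
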
>>>
>>> You can use
>>> $ sudo btrfs insp dump-tr -t 3 /dev/
>>>
>>> That will dump the chunk tree, and you can see if any device has more
>>> than one chunk stripe number associated with it.
>>>
>>>
>> Huh, that makes sense. That probably should be fixed :)
>>
>> Given your advised command (extended a bit for readability):
>> # btrfs insp dump-tr -t 3 /dev/mapper/luks-2.1 | grep "stripe " | awk '{
>> print $1" "$2" "$3" "$4 }' | sort -u
>>
>> I get:
>> stripe 0 devid 1
>> stripe 0 devid 4
>> stripe 1 devid 2
>> stripe 1 devid 3
>> stripe 1 devid 4
>> stripe 2 devid 1
>> stripe 2 devid 2
>> stripe 2 devid 3
>> stripe 3 devid 1
>> stripe 3 devid 2
>> stripe 3 devid 3
>> stripe 3 devid 4
>>
>> Now I'm even more concerned!
>
> Uhh yeah, this is a four device raid10? I'm a little confused why it's
> not consistently showing four stripes per chunk, which would mean the
> same number of stripe 0's as stripe 3's. I don't know what that's
> about.
>
Yes, 4 devices. It does show 4 stripes per chunk, but the command above
sorts the results and makes them unique (sort -u). This just gives a
quick overview of which devices appear under more than one stripe
number.
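
In case it's useful, here is a small (untested) extension of the same
pipeline that counts how many distinct stripe positions each devid
shows up in; anything above 1 is the raid0+1-like mapping Chris
described (same device path as above, adjust as needed):

# count distinct stripe positions per devid
btrfs insp dump-tr -t 3 /dev/mapper/luks-2.1 | grep "stripe " \
  | awk '{ print $2, $4 }' | sort -u \
  | awk '{ n[$2]++ } END { for (d in n) print "devid " d ": " n[d] " stripe positions" }'

In a conventional raid10 layout every devid would show exactly one
stripe position; the output above already shows devid 4 under stripes
0, 1 and 3.
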
> A full balance might make the mapping consistent.
>
Will give it a try (rough command sketch at the end of this mail).

>> That said, btrfs shouldn't be used for anything other than raid1, as
>> every other raid level has serious problems or at least doesn't work
>> as the expected raid level (in terms of failure recovery).
>
> Well, raid1 also only tolerates a single device failure.
> There is no n-copy raid1.
>
Sure, but this is the expected behaviour of raid1. So at least no
surprise here :)
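
PS: for reference, the full balance I'll try would be something like
this (no guarantee it actually makes the stripe-to-device mapping
consistent; /mnt again stands in for the real mount point):

# rewrite every chunk; expect this to take a long time on 4 x 3TB
# (--full-balance just skips the warning on newer btrfs-progs; a plain
# "btrfs balance start /mnt" does the same thing)
btrfs balance start --full-balance /mnt
# check on it, or cancel if needed
btrfs balance status /mnt
btrfs balance cancel /mnt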