On Sunday, August 14, 2016 8:04:14 PM CEST you wrote:
> On Sunday, August 14, 2016 10:20:39 AM CEST you wrote:
> > On Sat, Aug 13, 2016 at 9:39 AM, Wolfgang Mader
> > <wolfgang_ma...@brain-frog.de> wrote:
> > > Hi,
> > >
> > > I have two questions.
> > >
> > > 1) Layout of raid10 in btrfs
> > > btrfs pools all devices and then stripes and mirrors across this pool.
> > > Is it therefore correct that a raid10 layout consisting of 4 devices
> > > a,b,c,d is _not_
> > >
> > >              raid0
> > >                |
> > >        |---------------|
> > >        |               |
> > >   ------------   -------------
> > >   |a|      |b|   |c|      |d|
> > >        |               |
> > >      raid1           raid1
> > >
> > > Rather, there is no clear distinction at the device level between two
> > > devices which form a raid1 set and are then paired by raid0; instead,
> > > each bit is simply mirrored across two different devices. Is this
> > > correct?
> >
> > All of the profiles apply to block groups (chunks), and that includes
> > raid10. They only incidentally apply to devices since of course block
> > groups end up on those devices, but which stripe ends up on which
> > device is not consistent, and that ends up making Btrfs raid10 pretty
> > much only able to survive a single device loss.
> >
> > I don't know if this is really thoroughly understood. I just did a
> > test and I kinda wonder if the reason for this inconsistent assignment
> > is a difference between the initial stripe->devid pairing at mkfs time,
> > compared to subsequent pairings done by kernel code. For example, I
> > get this from mkfs:
> >
> >   item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520) itemoff 15715 itemsize 176
> >     chunk length 16777216 owner 2 stripe_len 65536
> >     type SYSTEM|RAID10 num_stripes 4
> >       stripe 0 devid 4 offset 1048576
> >       dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
> >       stripe 1 devid 3 offset 1048576
> >       dev uuid: af95126a-e674-425c-af01-2599d66d9d06
> >       stripe 2 devid 2 offset 1048576
> >       dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
> >       stripe 3 devid 1 offset 20971520
> >       dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
> >   item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 37748736) itemoff 15539 itemsize 176
> >     chunk length 2147483648 owner 2 stripe_len 65536
> >     type METADATA|RAID10 num_stripes 4
> >       stripe 0 devid 4 offset 9437184
> >       dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
> >       stripe 1 devid 3 offset 9437184
> >       dev uuid: af95126a-e674-425c-af01-2599d66d9d06
> >       stripe 2 devid 2 offset 9437184
> >       dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
> >       stripe 3 devid 1 offset 29360128
> >       dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
> >   item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 2185232384) itemoff 15363 itemsize 176
> >     chunk length 2147483648 owner 2 stripe_len 65536
> >     type DATA|RAID10 num_stripes 4
> >       stripe 0 devid 4 offset 1083179008
> >       dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
> >       stripe 1 devid 3 offset 1083179008
> >       dev uuid: af95126a-e674-425c-af01-2599d66d9d06
> >       stripe 2 devid 2 offset 1083179008
> >       dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
> >       stripe 3 devid 1 offset 1103101952
> >       dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
> >
> > Here you can see every chunk type has the same stripe to devid
> > pairing. But once the kernel starts to allocate more data chunks, the
> > pairing is different from mkfs, yet always (so far) consistent for
> > each additional kernel allocated chunk.
> >
> >   item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 4332716032) itemoff 15187 itemsize 176
> >     chunk length 2147483648 owner 2 stripe_len 65536
> >     type DATA|RAID10 num_stripes 4
> >       stripe 0 devid 2 offset 2156920832
> >       dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
> >       stripe 1 devid 3 offset 2156920832
> >       dev uuid: af95126a-e674-425c-af01-2599d66d9d06
> >       stripe 2 devid 4 offset 2156920832
> >       dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
> >       stripe 3 devid 1 offset 2176843776
> >       dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
> >
> > This volume now has about a dozen chunks created by kernel code, and
> > the stripe X to devid Y mapping is identical for all of them. Using dd
> > and hexdump, I'm finding that stripe 0 and 1 are mirrored pairs, they
> > contain identical information. And stripe 2 and 3 are mirrored pairs.
> > And the raid0 striping happens across 01 and 23 such that odd-numbered
> > 64KiB (default) stripe elements go on 01, and even-numbered stripe
> > elements go on 23. If the stripe to devid pairing were always
> > consistent, I could lose more than one device and still have a viable
> > volume, just like a conventional raid10. Of course you can't lose both
> > of any mirrored pair, but you could lose one of every mirrored pair.
> > That's why raid10 is considered scalable.
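A handy way to check this on one's own filesystem is to tabulate the
stripe->devid pairing of every RAID10 chunk. The following is a minimal
sketch, assuming the chunk tree dump comes from something like
`btrfs inspect-internal dump-tree -t chunk <device>` (older btrfs-progs
expose the same dump via btrfs-debug-tree) and that its output looks like
the excerpt quoted above; it is illustrative only, not a tested tool:

#!/usr/bin/env python3
# Tabulate the stripe -> devid pairing of each RAID10 chunk from a chunk
# tree dump fed on stdin, e.g.:
#   btrfs inspect-internal dump-tree -t chunk /dev/sdX | python3 pairing.py
import re
import sys

chunks = []        # (chunk logical offset, [devid of stripe 0, 1, ...])
current = None     # devid list of the RAID10 chunk currently being parsed
offset = None

for line in sys.stdin:
    m = re.search(r'CHUNK_ITEM (\d+)', line)
    if m:
        offset = m.group(1)
        current = None                 # chunk type not known yet
    if 'RAID10' in line and offset is not None:
        current = []
        chunks.append((offset, current))
    m = re.search(r'stripe (\d+) devid (\d+)', line)
    if m and current is not None:
        current.append(int(m.group(2)))

for offset, devids in chunks:
    print('chunk at %s: stripe -> devid %s' % (offset, devids))

# If every line shows the same devid order, the pairing is consistent across
# chunks; a differing order is the mkfs-vs-kernel inconsistency noted above.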
> Let me compare the btrfs raid10 to a conventional raid5. Assume a raid5
> across n disks. Then, for each set of chunks on n-1 disks (I don't know
> the unit of such a chunk), a parity chunk is written to the remaining
> disk using xor. Parity chunks are distributed across all disks. In case
> the data of a failed disk has to be restored from the degraded array,
> the entirety of the n-1 remaining disks has to be read in order to
> reconstruct the data using xor. Is this correct? Again, in order to
> restore a failed disk in raid5, all data on all remaining disks is
> needed, otherwise the array can not be restored. Correct?
>
> For btrfs raid10, I can only lose a single device, but in order to
> rebuild it, I only need to read the amount of data which was stored on
> the failed device, as no parity is used, but mirroring. Correct?
> Therefore, the amount of bits I need to read successfully for a rebuild
> is independent of the number of devices included in the raid10, while
> the amount of read data scales with the number of devices in a raid5.
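To put rough numbers on that difference, here is a minimal back-of-the-
envelope sketch, assuming every device carries the same amount of live
data and ignoring that btrfs copies only allocated extents rather than
whole devices:

# Data that must be read to rebuild one failed device, assuming each of
# the n devices carried d TiB of live data.
def rebuild_read_tib(n_devices, d_tib, profile):
    if profile == 'raid5':
        return (n_devices - 1) * d_tib   # xor needs every surviving device
    if profile == 'raid10':
        return d_tib                     # only the mirror copies are read
    raise ValueError(profile)

for n in (4, 8, 12):
    print('%2d devices: raid5 reads %5.1f TiB, raid10 reads %4.1f TiB'
          % (n, rebuild_read_tib(n, 2.0, 'raid5'),
             rebuild_read_tib(n, 2.0, 'raid10')))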
> Still, I think it is unfortunate that btrfs raid10 does not stick to a
> fixed layout, as then the entire array must be available. If you have
> your devices attached by more than one controller, in more than one
> case powered by different power supplies etc., the probability for
> their failure has to be summed up,

This formulation might be a bit vague. For m devices of which none is
allowed to fail, the total failure probability should be
p_tot = 1 - (1 - p_f)^m, where p_f is the probability of failure of a
single device, assuming p_f is the same for all m devices. For small p_f
this is approximately m * p_f, which is what "summed up" was meant to
express.

> as no component is allowed to fail. Is work under way to change this,
> or is this something out of reach for btrfs, as it is an implementation
> detail of the kernel?
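As a quick numerical illustration of that formula (with arbitrary example
values for p_f), a minimal sketch comparing the exact expression with the
summed approximation:

# Probability that at least one of m equally reliable devices fails,
# given a per-device failure probability p_f over the period of interest.
def p_any_failure(m, p_f):
    return 1.0 - (1.0 - p_f) ** m

for m in (2, 4, 8):
    for p_f in (0.01, 0.05):
        exact = p_any_failure(m, p_f)
        approx = m * p_f                 # the "summed up" estimate
        print('m=%d p_f=%.2f  exact=%.4f  approx=%.4f'
              % (m, p_f, exact, approx))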
> > But apparently the pairing is different between mkfs and kernel code.
> > And due to that I can't reliably lose more than one device. There is
> > an edge case where I could lose two:
> >
> >   stripe 0 devid 4
> >   stripe 1 devid 3
> >   stripe 2 devid 2
> >   stripe 3 devid 1
> >
> >   stripe 0 devid 2
> >   stripe 1 devid 3
> >   stripe 2 devid 4
> >   stripe 3 devid 1
> >
> > I could, in theory, lose devid 3 and devid 1 and still have one of
> > each stripe's copies for all block groups, but kernel code doesn't
> > permit this:
> >
> >   [352467.557960] BTRFS warning (device dm-9): missing devices (2)
> >   exceeds the limit (1), writeable mount is not allowed
> >
> > > 2) Recover raid10 from a failed disk
> > > Raid10 inherits its redundancy from the raid1 scheme. If I build a
> > > raid10 from n devices, each bit is mirrored across two devices.
> > > Therefore, in order to restore a raid10 from a single failed device,
> > > I need to read the amount of data worth this device from the
> > > remaining n-1 devices.
> >
> > Maybe? In a traditional raid10, rebuild of a faulty device means
> > reading 100% of its mirror device and that's it. For Btrfs the same
> > could be true, it just depends on where the block group copies are
> > located: they could all be on just one other device, or they could be
> > spread across more than one device. Also, Btrfs only copies extents;
> > it's not doing a sector-level rebuild, so it will skip the empty
> > space.
> >
> > > In case the amount of data on the failed disk is in the order of
> > > the number of bits for which I can expect an unrecoverable read
> > > error from a device, I will most likely not be able to recover from
> > > the disk failure. Is this conclusion correct, or am I missing
> > > something here?
> >
> > I think you're overestimating the probability of a URE. They're
> > pretty rare, and it's far less likely if you're doing regular scrubs.
> >
> > I haven't actually tested this, but if a URE or even a checksum
> > mismatch were to happen on a data block group during rebuild
> > following replacement of a failed device, I'd like to think Btrfs
> > just complains and doesn't stop the remainder of the rebuild. If it
> > happens on a metadata or system chunk, well, that's bad and could be
> > fatal.
> >
> > As an aside, I'm finding the size information for the data chunk in
> > 'fi us' confusing...
> >
> > The sample file system contains one file:
> >   [root@f24s ~]# ls -lh /mnt/0
> >   total 1.4G
> >   -rw-r--r--. 1 root root 1.4G Aug 13 19:24
> >   Fedora-Workstation-Live-x86_64-25-20160810.n.0.iso
> >
> >   [root@f24s ~]# btrfs fi us /mnt/0
> >   Overall:
> >       Device size:                 400.00GiB
> >       Device allocated:              8.03GiB
> >       Device unallocated:          391.97GiB
> >       Device missing:                  0.00B
> >       Used:                          2.66GiB
> >       Free (estimated):            196.66GiB    (min: 196.66GiB)
> >       Data ratio:                       2.00
> >       Metadata ratio:                   2.00
> >       Global reserve:               16.00MiB    (used: 0.00B)
> >
> > ## "Device size" is the total volume or pool size, "Used" shows
> > actual usage accounting for the replication of raid1, and yet "Free"
> > shows 1/2. This can't work long term: by the time I have 100GiB in
> > the volume, Used will report 200GiB while Free will report 100GiB,
> > for a total of 300GiB, which does not match the device size. So
> > that's a bug in my opinion.
> >
> >   Data,RAID10: Size:2.00GiB, Used:1.33GiB
> >      /dev/mapper/VG-1      512.00MiB
> >      /dev/mapper/VG-2      512.00MiB
> >      /dev/mapper/VG-3      512.00MiB
> >      /dev/mapper/VG-4      512.00MiB
> >
> > ## The file is 1.4GiB but the Used reported is 1.33GiB? That's weird.
> > And now in this area the user is somehow expected to know that all of
> > these values are 1/2 their actual value due to the RAID10. I don't
> > like this inconsistency, for one. But it's made worse by using the
> > secret decoder ring method of usage when it comes to individual
> > device allocations. Very clearly Size is really 4GiB, and each device
> > has a 1GiB chunk. So why not say that? This is consistent with the
> > earlier "Device allocated" value of 8GiB.
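One way to make sense of the 196.66GiB "Free (estimated)" value quoted
above: it can be reproduced from the other numbers in the same output if
one assumes it is reported as usable space after replication, i.e. raw
unallocated space divided by the data ratio plus the unused part of the
allocated data block groups. A minimal sketch of that arithmetic, using
the figures quoted above (how btrfs-progs actually derives the number is
an assumption here):

# Reproduce "Free (estimated)" from the quoted `btrfs fi us` output.
device_unallocated_gib = 391.97
data_ratio = 2.0                 # raid10 keeps two copies of the data
data_size_gib = 2.00             # Data,RAID10: Size
data_used_gib = 1.33             # Data,RAID10: Used

free_estimated_gib = (device_unallocated_gib / data_ratio
                      + (data_size_gib - data_used_gib))
print('Free (estimated) ~= %.2f GiB' % free_estimated_gib)
# Prints roughly 196.66 GiB, matching the value reported above (the inputs
# are themselves rounded, so the last digit may differ slightly).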