On Sun, Jun 26, 2016 at 1:54 AM, Andrei Borzenkov <arvidj...@gmail.com> wrote:
> On 26.06.2016 00:52, Chris Murphy wrote:
>> Interestingly enough, so far I'm finding with full stripe writes, i.e.
>> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
>> is raid4.
>
> That's not what the code suggests and what I see in practice - parity seems
> to be distributed across all disks; each new 128KiB file (extent) has
> parity on a new disk. At least as long as we can trust btrfs-map-logical
> to always show parity as "mirror 2".
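As an illustration of the rotation Andrei describes - a hedged sketch only, not the actual btrfs code, which computes placement from block group geometry and logical stripe offset - round-robin parity placement on a 3-disk raid5 can be modeled like this (disk indices and the simple modulo rotation are assumptions for the example):

```python
# Hypothetical sketch of round-robin parity placement across a
# 3-disk RAID5-style layout. NOT the btrfs implementation; btrfs
# derives this from block group geometry and logical stripe offset.

NUM_DISKS = 3

def parity_disk(stripe_nr, num_disks=NUM_DISKS):
    """Return the disk index holding parity for a given full-stripe number."""
    # Each consecutive full stripe moves parity to the next disk.
    # The raid4 behavior suspected above would instead return a constant.
    return stripe_nr % num_disks

# Nine consecutive full stripes: parity cycles through all disks evenly.
rotation = [parity_disk(n) for n in range(9)]
print(rotation)  # [0, 1, 2, 0, 1, 2, 0, 1, 2]
```

The point of the sketch: with strictly consecutive full stripe writes every disk holds parity equally often, which is what the mapping results below end up showing for the consecutive runs.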
tl;dr: Andrei is correct, there's no raid4 behavior here. It looks like mirror 2 is always parity; more on that below.

> Do you see consecutive full stripes in your tests? Or how do you
> determine which devid has parity for a given full stripe?

I do see consecutive full stripe writes, but it doesn't always happen, and not checking for consecutivity is where I became confused.

[root@f24s ~]# filefrag -v /mnt/5/ab*
Filesystem type is: 9123683e
File size of /mnt/5/ab128_2.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:     length: expected: flags:
   0:       0..   31: 3456128.. 3456159:       32:           last,eof
/mnt/5/ab128_2.txt: 1 extent found
File size of /mnt/5/ab128_3.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:     length: expected: flags:
   0:       0..   31: 3456224.. 3456255:       32:           last,eof
/mnt/5/ab128_3.txt: 1 extent found
File size of /mnt/5/ab128_4.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:     length: expected: flags:
   0:       0..   31: 3456320.. 3456351:       32:           last,eof
/mnt/5/ab128_4.txt: 1 extent found
File size of /mnt/5/ab128_5.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:     length: expected: flags:
   0:       0..   31: 3456352.. 3456383:       32:           last,eof
/mnt/5/ab128_5.txt: 1 extent found
File size of /mnt/5/ab128_6.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:     length: expected: flags:
   0:       0..   31: 3456384.. 3456415:       32:           last,eof
/mnt/5/ab128_6.txt: 1 extent found
File size of /mnt/5/ab128_7.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:     length: expected: flags:
   0:       0..   31: 3456416.. 3456447:       32:           last,eof
/mnt/5/ab128_7.txt: 1 extent found
File size of /mnt/5/ab128_8.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:     length: expected: flags:
   0:       0..   31: 3456448.. 3456479:       32:           last,eof
/mnt/5/ab128_8.txt: 1 extent found
File size of /mnt/5/ab128_9.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:     length: expected: flags:
   0:       0..   31: 3456480.. 3456511:       32:           last,eof
/mnt/5/ab128_9.txt: 1 extent found
File size of /mnt/5/ab128.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:     length: expected: flags:
   0:       0..   31: 3456096.. 3456127:       32:           last,eof
/mnt/5/ab128.txt: 1 extent found

Starting with the bottom file, then from the top, so they're in 4096-byte-block order; the 2nd column is the difference from the previous value:

3456096
3456128    32
3456224    96
3456320    96
3456352    32
3456384    32
3456416    32
3456448    32
3456480    32

So the first two files are consecutive full stripe writes. The next two aren't. The next five are. They were all copied at the same time; I don't know why they aren't always consecutive writes.

[root@f24s ~]# btrfs-map-logical -l $[4096*3456096] /dev/VG/a
mirror 1 logical 14156169216 physical 1108541440 device /dev/mapper/VG-a
mirror 2 logical 14156169216 physical 2182283264 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*3456128] /dev/VG/a
mirror 1 logical 14156300288 physical 1075052544 device /dev/mapper/VG-b
mirror 2 logical 14156300288 physical 1108606976 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456224] /dev/VG/a
mirror 1 logical 14156693504 physical 1075249152 device /dev/mapper/VG-b
mirror 2 logical 14156693504 physical 1108803584 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456320] /dev/VG/a
mirror 1 logical 14157086720 physical 1075445760 device /dev/mapper/VG-b
mirror 2 logical 14157086720 physical 1109000192 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456352] /dev/VG/a
mirror 1 logical 14157217792 physical 2182807552 device /dev/mapper/VG-c
mirror 2 logical 14157217792 physical 1075511296 device /dev/mapper/VG-b
[root@f24s ~]# btrfs-map-logical -l $[4096*3456384] /dev/VG/a
mirror 1 logical 14157348864 physical 1109131264 device /dev/mapper/VG-a
mirror 2 logical 14157348864 physical 2182873088 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*3456416] /dev/VG/a
mirror 1 logical 14157479936 physical 1075642368 device /dev/mapper/VG-b
mirror 2 logical 14157479936 physical 1109196800 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456448] /dev/VG/a
mirror 1 logical 14157611008 physical 2183004160 device /dev/mapper/VG-c
mirror 2 logical 14157611008 physical 1075707904 device /dev/mapper/VG-b
[root@f24s ~]# btrfs-map-logical -l $[4096*3456480] /dev/VG/a
mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c

To confirm or deny that mirror 2 is parity: each 128KiB file is 64KiB of "a" followed by 64KiB of "b", so the expected parity is 0x03. (If a file were 128KiB of the same value, parity would be 0x00, which can result in confusion/mistakes with unwritten free space.)

[root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182283264 2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000
[root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108606976 2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000
[root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108803584 2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000
[root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109000192 2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000
[root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075511296 2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000
[root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182873088 2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000
[root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109196800 2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000
[root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075707904 2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000
[root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2183069696 2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000

OK, so in particular for the last five stripes, parity is on device b, c, a, b, c - which suggests parity is being distributed across devices on consecutive full stripe writes. Where I became confused is that writes are not always consecutive, and that's what causes parity to land on one device less often than the others. In the above example, parity goes 4x on VG/a, 3x on VG/c, and 2x on VG/b.

Basically it's a bad test: the sample size is too small. I'd need to increase the sample size by a lot to know for sure whether this is really a problem.

> This information is not actually stored anywhere, it is computed based on
> block group geometry and logical stripe offset.

I think you're right. A better test would be a scrub or balance on a raid5 that's exhibiting slowness, to find out whether there's disk contention on that system, and whether it's the result of parity not being distributed enough.

> P.S. usage of "stripe" to mean "stripe element" actually adds to
> confusion when reading code :)

It's confusing everywhere. mdadm chunk = strip = stripe element. And then LVM introduces -i/--stripes, which means "data strips", i.e. if you choose -i 3 with the raid6 segment type, you get 5 strips per stripe (3 data, 2 parity). It's horrible.

--
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
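For reference, the 0x03 expectation used in the dd/hexdump check above follows directly from XOR parity over the assumed file layout (64KiB of ASCII "a" on one data strip, 64KiB of "b" on the other); a minimal sketch:

```python
# Sketch of the XOR parity expectation for the test files above:
# each full stripe has one 64KiB data strip of "a" (0x61) and one
# of "b" (0x62), so parity should be 0x61 ^ 0x62 = 0x03 throughout.

STRIP = 64 * 1024  # 64KiB strip size, matching the test layout

data1 = b"a" * STRIP
data2 = b"b" * STRIP

parity = bytes(x ^ y for x, y in zip(data1, data2))
print(hex(parity[0]))              # 0x3
print(parity == b"\x03" * STRIP)   # True

# And the caveat mentioned above: two identical strips XOR to all
# zeros, indistinguishable from unwritten free space.
same = bytes(x ^ y for x, y in zip(data1, data1))
print(same == b"\x00" * STRIP)     # True
```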