Re: Adventures in btrfs raid5 disk recovery

Andrei Borzenkov Fri, 24 Jun 2016 03:17:01 -0700

On Fri, Jun 24, 2016 at 8:20 AM, Chris Murphy <li...@colorremedies.com> wrote:


> [root@f24s ~]# filefrag -v /mnt/5/*
> Filesystem type is: 9123683e
> File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..       3:    2931712..   2931715:      4:             last,eof

Hmm ... I wonder what is wrong here (openSUSE Tumbleweed)

nohostname:~ # filefrag -v /mnt/1
Filesystem type is: 9123683e
File size of /mnt/1 is 3072 (1 block of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:     269376..    269376:      1:             last,eof
/mnt/1: 1 extent found

But!

nohostname:~ # filefrag -v /etc/passwd
Filesystem type is: 9123683e
File size of /etc/passwd is 1527 (1 block of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    4095:          0..      4095:   4096:             last,not_aligned,inline,eof
/etc/passwd: 1 extent found
nohostname:~ #

Why does it work for one filesystem but not for another?
...
>
> So at the old address, it shows the "aaaaa..." is still there. And at
> the added single block for this file at new logical and physical
> addresses, is the modification substituting the first "a" for "g".
>
> In this case, no rmw, no partial stripe modification, and no data
> already on-disk is at risk.

You misunderstand the nature of the problem. What is put at risk is data
that is already on disk and "shares" parity with new data.

As an example, here are the first 64K in several extents on a 4-disk RAID5
with, so far, a single data chunk:

        item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff 15491 itemsize 176
                chunk length 3221225472 owner 2 stripe_len 65536
                type DATA|RAID5 num_stripes 4
                        stripe 0 devid 4 offset 9437184
                        dev uuid: ed13e42e-1633-4230-891c-897e86d1c0be
                        stripe 1 devid 3 offset 9437184
                        dev uuid: 10032b95-3f48-4ea0-a9ee-90064c53da1f
                        stripe 2 devid 2 offset 1074790400
                        dev uuid: cd749bd9-3d72-43b4-89a8-45e4a92658cf
                        stripe 3 devid 1 offset 1094713344
                        dev uuid: 41538b9f-3869-4c32-b3e2-30aa2ea1534e
                dev extent chunk_tree 3
                chunk objectid 256 chunk offset 1103101952 length 1073741824


        item 5 key (1 DEV_EXTENT 1094713344) itemoff 16027 itemsize 48
                dev extent chunk_tree 3
                chunk objectid 256 chunk offset 1103101952 length 1073741824
        item 7 key (2 DEV_EXTENT 1074790400) itemoff 15931 itemsize 48
                dev extent chunk_tree 3
                chunk objectid 256 chunk offset 1103101952 length 1073741824
        item 9 key (3 DEV_EXTENT 9437184) itemoff 15835 itemsize 48
                dev extent chunk_tree 3
                chunk objectid 256 chunk offset 1103101952 length 1073741824
        item 11 key (4 DEV_EXTENT 9437184) itemoff 15739 itemsize 48
                dev extent chunk_tree 3
                chunk objectid 256 chunk offset 1103101952 length 1073741824

where devid 1 = sdb1, 2 = sdc1 etc.

Now let's write some data (I created several files) up to 64K in size:

mirror 1 logical 1103364096 physical 1074855936 device /dev/sdc1
mirror 2 logical 1103364096 physical 9502720 device /dev/sde1
mirror 1 logical 1103368192 physical 1074860032 device /dev/sdc1
mirror 2 logical 1103368192 physical 9506816 device /dev/sde1
mirror 1 logical 1103372288 physical 1074864128 device /dev/sdc1
mirror 2 logical 1103372288 physical 9510912 device /dev/sde1
mirror 1 logical 1103376384 physical 1074868224 device /dev/sdc1
mirror 2 logical 1103376384 physical 9515008 device /dev/sde1
mirror 1 logical 1103380480 physical 1074872320 device /dev/sdc1
mirror 2 logical 1103380480 physical 9519104 device /dev/sde1

Note that btrfs allocates 64K on the same device before switching to
the next one. What is a bit misleading here is the "mirror" label:
sdc1 holds the data and sde1 holds the parity (you can see this in the
checksum tree, where only items for sdc1 exist).
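
The physical addresses above can be reproduced from the chunk item with a
little arithmetic. This is only a minimal sketch: the rotation rule (data
slot = (row + index) mod num_stripes, parity slot = (row + data stripes)
mod num_stripes) is an assumption that happens to match the output shown
here, the mapping devid 3 = sdd1 and devid 4 = sde1 is inferred from "etc.",
and map_logical is a made-up helper name, not a btrfs tool or API.

```python
# Sketch of RAID5 address arithmetic using values from the chunk item above.
# The rotation rule and the devid -> device names are assumptions (see text).

STRIPE_LEN = 65536
CHUNK_LOGICAL = 1103101952
# (devid, physical offset of the dev extent), in chunk stripe order 0..3
STRIPES = [(4, 9437184), (3, 9437184), (2, 1074790400), (1, 1094713344)]
DEVICE = {1: "/dev/sdb1", 2: "/dev/sdc1", 3: "/dev/sdd1", 4: "/dev/sde1"}


def map_logical(logical):
    """Return ((data_dev, data_phys), (parity_dev, parity_phys))."""
    num_stripes = len(STRIPES)
    data_stripes = num_stripes - 1          # RAID5: one slot per row is parity
    offset = logical - CHUNK_LOGICAL
    stripe_nr, in_stripe = divmod(offset, STRIPE_LEN)
    row, index = divmod(stripe_nr, data_stripes)
    data_slot = (row + index) % num_stripes
    parity_slot = (row + data_stripes) % num_stripes
    row_offset = row * STRIPE_LEN + in_stripe   # same offset on every device

    def locate(slot):
        devid, base = STRIPES[slot]
        return DEVICE[devid], base + row_offset

    return locate(data_slot), locate(parity_slot)


for logical in (1103364096, 1103368192, 1103380480):
    data, parity = map_logical(logical)
    print(logical, "data", data, "parity", parity)
```

Every address in the same 64K stripe lands on the same data device, and the
parity offset is the data offset within the row applied to the parity
device's dev extent.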

Now let's write the next 64K and see what happens:

nohostname:~ # btrfs-map-logical -l 1103429632 -b 65536 /dev/sdb1
mirror 1 logical 1103429632 physical 1094778880 device /dev/sdb1
mirror 2 logical 1103429632 physical 9502720 device /dev/sde1

See? btrfs now allocates a new stripe on sdb1; this stripe is at the
same offset as the previous one on sdc1 (64K) and so shares the same
parity stripe on sde1. If you compare the 64K on sde1 at offset 9502720
before and after, you will see that it has changed. IN PLACE. Without
CoW. This is exactly what puts existing data on sdc1 at risk: if sdb1
is updated but sde1 is not, an attempt to reconstruct data on sdc1 will
either fail (if we have checksums) or result in silent corruption.
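
The failure mode can be illustrated with plain XOR parity. This is a toy
model, not btrfs code: the stripe contents are made up, the third data
stripe (assumed to live on sdd1) is filled with zeros, and zlib.crc32 is
only a stand-in for the crc32c data checksum btrfs uses.

```python
import zlib

STRIPE_LEN = 65536


def xor_blocks(*blocks):
    """XOR equal-sized byte blocks together (RAID5 parity)."""
    acc = 0
    for block in blocks:
        acc ^= int.from_bytes(block, "big")
    return acc.to_bytes(len(blocks[0]), "big")


# One full stripe row: three data stripes plus parity on sde1.
sdd1 = bytes(STRIPE_LEN)        # hypothetical content of the third data stripe
sdc1 = b"a" * STRIPE_LEN        # existing, committed data
sdb1_old = bytes(STRIPE_LEN)    # not yet allocated (content is made up)
sde1_old = xor_blocks(sdd1, sdc1, sdb1_old)

csum_sdc1 = zlib.crc32(sdc1)    # stand-in for the stored data checksum

# New 64K lands on sdb1; parity has to be rewritten in place to match.
sdb1_new = b"g" * STRIPE_LEN
sde1_new = xor_blocks(sdd1, sdc1, sdb1_new)

# Case 1: both writes reached disk, then sdc1 is lost.
rebuilt_ok = xor_blocks(sdd1, sdb1_new, sde1_new)

# Case 2: sdb1 was updated but sde1 was not, then sdc1 is lost.
rebuilt_stale = xor_blocks(sdd1, sdb1_new, sde1_old)

print("consistent parity rebuilds sdc1:", rebuilt_ok == sdc1)
print("stale parity rebuilds sdc1:     ", rebuilt_stale == sdc1)
print("checksum catches the damage:    ", zlib.crc32(rebuilt_stale) != csum_sdc1)
```

In this model the data on sdc1 was never rewritten, yet it cannot be
rebuilt once parity is stale; the checksum only turns silent corruption
into a detected read failure.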
