Re: Confused by btrfs quota group accounting

2019-06-23 Thread Qu Wenruo


On 2019/6/22 11:11 PM, Andrei Borzenkov wrote:
[snip]
> 
> 10:/mnt # dd if=/dev/urandom of=test/file bs=1M count=100 seek=0
> conv=notrunc
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB, 100 MiB) copied, 0.685532 s, 153 MB/s
> 10:/mnt # sync
> 10:/mnt # btrfs qgroup show .
> qgroupid         rfer         excl
> --------         ----         ----
> 0/5          16.00KiB     16.00KiB
> 0/258         1.01GiB    100.02MiB
> 0/263         1.00GiB     85.02MiB

Sorry, I can't really reproduce it.

5.1.12 kernel, using the following script:
---
#!/bin/bash

dev="/dev/data/btrfs"
mnt="/mnt/btrfs"

umount $dev &> /dev/null
mkfs.btrfs -f $dev > /dev/null

mount $dev $mnt
btrfs sub create $mnt/subv1
btrfs quota enable $mnt
btrfs quota rescan -w $mnt

xfs_io -f -c "pwrite 0 1G" $mnt/subv1/file1
sync
btrfs sub snapshot $mnt/subv1 $mnt/subv2
sync
btrfs qgroup show -prce $mnt

xfs_io -c "pwrite 0 100m" $mnt/subv1/file1
sync
btrfs qgroup show -prce $mnt
---

The result is:
---
Create subvolume '/mnt/btrfs/subv1'
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 262144 ops; 0.5902 sec (1.694 GiB/sec and 444134.2107 ops/sec)
Create a snapshot of '/mnt/btrfs/subv1' in '/mnt/btrfs/subv2'
qgroupid         rfer         excl     max_rfer     max_excl parent  child
--------         ----         ----     --------     -------- ------  -----
0/5          16.00KiB     16.00KiB         none         none ---     ---
0/256         1.00GiB     16.00KiB         none         none ---     ---
0/259         1.00GiB     16.00KiB         none         none ---     ---
wrote 104857600/104857600 bytes at offset 0
100 MiB, 25600 ops; 0.0694 sec (1.406 GiB/sec and 368652.9766 ops/sec)
qgroupid         rfer         excl     max_rfer     max_excl parent  child
--------         ----         ----     --------     -------- ------  -----
0/5          16.00KiB     16.00KiB         none         none ---     ---
0/256         1.10GiB    100.02MiB         none         none ---     ---
0/259         1.00GiB     16.00KiB         none         none ---     ---
---

> 10:/mnt # filefrag -v test/file
> Filesystem type is: 9123683e
> File size of test/file is 1073741824 (262144 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   22463: 315424..337887:  22464:

There is an initial 10.9MiB extent; I'm not sure how it was created.

>1:22464..   25599:  76896.. 80031:   3136: 337888:

Also here comes another 1.5MiB extent.

From the fiemap result, it's definitely not 100MiB that was written,
only about 12.5MiB.

The fiemap result doesn't match your dd command.

Any clue how this happened?

Thanks,
Qu

>2:25600..   43135:  59264.. 76799:  17536:  80032: shared
>3:43136..   97279:  86048..140191:  54144:  76800: shared
>4:97280..  151551: 143392..197663:  54272: 140192: shared
>5:   151552..  207359: 200736..256543:  55808: 197664: shared
>6:   207360..  262143: 258080..312863:  54784: 256544:
> last,shared,eof
> test/file: 7 extents found
> 10:/mnt # filefrag -v snap1/file
> Filesystem type is: 9123683e
> File size of snap1/file is 1073741824 (262144 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   43135:  33664.. 76799:  43136:
>1:43136..   97279:  86048..140191:  54144:  76800: shared
>2:97280..  151551: 143392..197663:  54272: 140192: shared
>3:   151552..  207359: 200736..256543:  55808: 197664: shared
>4:   207360..  262143: 258080..312863:  54784: 256544:
> last,shared,eof
> snap1/file: 5 extents found
> 
> 
> Oops. Where 85MiB exclusive usage in snapshot comes from? I would expect
> one of
> 
> - 0 exclusive, because original first extent is still referenced by test
> (even though partially), so if qgroup counts physical space usage, snap1
> effectively refers to the same physical extents as test.
> 
> - 100MiB exclusive if qgroup counts logical space consumption, because
> snapshot now has 100MiB different data.
> 
> But 85MiB? It does not match any observed value. Judging by 1.01GiB of
> referenced space for subvolume test, qgroup counts physical usage, at
> which point snapshot exclusive space consumption remains 0.
> 





Re: Confused by btrfs quota group accounting

2019-06-23 Thread Qu Wenruo


On 2019/6/23 3:55 PM, Qu Wenruo wrote:
> 
> 
> On 2019/6/22 11:11 PM, Andrei Borzenkov wrote:
> [snip]
>>
>> 10:/mnt # dd if=/dev/urandom of=test/file bs=1M count=100 seek=0
>> conv=notrunc
>> 100+0 records in
>> 100+0 records out
>> 104857600 bytes (105 MB, 100 MiB) copied, 0.685532 s, 153 MB/s
>> 10:/mnt # sync
>> 10:/mnt # btrfs qgroup show .
>> qgroupid         rfer         excl
>> --------         ----         ----
>> 0/5          16.00KiB     16.00KiB
>> 0/258         1.01GiB    100.02MiB
>> 0/263         1.00GiB     85.02MiB
> 
> Sorry, I can't really reproduce it.
> 
> 5.1.12 kernel, using the following script:
> ---
> #!/bin/bash
> 
> dev="/dev/data/btrfs"
> mnt="/mnt/btrfs"
> 
> umount $dev &> /dev/null
> mkfs.btrfs -f $dev > /dev/null
> 
> mount $dev $mnt
> btrfs sub create $mnt/subv1
> btrfs quota enable $mnt
> btrfs quota rescan -w $mnt
> 
> xfs_io -f -c "pwrite 0 1G" $mnt/subv1/file1
> sync
> btrfs sub snapshot $mnt/subv1 $mnt/subv2
> sync
> btrfs qgroup show -prce $mnt
> 
> xfs_io -c "pwrite 0 100m" $mnt/subv1/file1
> sync
> btrfs qgroup show -prce $mnt
> ---
> 
> The result is:
> ---
> Create subvolume '/mnt/btrfs/subv1'
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 262144 ops; 0.5902 sec (1.694 GiB/sec and 444134.2107 ops/sec)
> Create a snapshot of '/mnt/btrfs/subv1' in '/mnt/btrfs/subv2'
> qgroupid         rfer         excl     max_rfer     max_excl parent  child
> --------         ----         ----     --------     -------- ------  -----
> 0/5          16.00KiB     16.00KiB         none         none ---     ---
> 0/256         1.00GiB     16.00KiB         none         none ---     ---
> 0/259         1.00GiB     16.00KiB         none         none ---     ---
> wrote 104857600/104857600 bytes at offset 0
> 100 MiB, 25600 ops; 0.0694 sec (1.406 GiB/sec and 368652.9766 ops/sec)
> qgroupid         rfer         excl     max_rfer     max_excl parent  child
> --------         ----         ----     --------     -------- ------  -----
> 0/5          16.00KiB     16.00KiB         none         none ---     ---
> 0/256         1.10GiB    100.02MiB         none         none ---     ---
> 0/259         1.00GiB     16.00KiB         none         none ---     ---
> ---
> 
>> 10:/mnt # filefrag -v test/file
>> Filesystem type is: 9123683e
>> File size of test/file is 1073741824 (262144 blocks of 4096 bytes)

My bad, I was still assuming 512 bytes as the blocksize.
With a 4K blocksize, the fiemap result matches.

Then please discard my previous comment.

Then we need to check the data extent layout to see what's going on.

Would you please provide the following output?
# btrfs ins dump-tree -t 258 /dev/vdb
# btrfs ins dump-tree -t 263 /dev/vdb
# btrfs check /dev/vdb

If the last command reports a qgroup mismatch, then it means the qgroup
numbers are indeed incorrect.

Also, I see your subvolume IDs are not contiguous; did you create/remove
some other subvolumes during your test?

Thanks,
Qu

>> Oops. Where 85MiB exclusive usage in snapshot comes from? I would expect
>> one of
>>
>> - 0 exclusive, because original first extent is still referenced by test
>> (even though partially), so if qgroup counts physical space usage, snap1
>> effectively refers to the same physical extents as test.
>>
>> - 100MiB exclusive if qgroup counts logical space consumption, because
>> snapshot now has 100MiB different data.
>>
>> But 85MiB? It does not match any observed value. Judging by 1.01GiB of
>> referenced space for subvolume test, qgroup counts physical usage, at
>> which point snapshot exclusive space consumption remains 0.
>>
> 





Re: Confused by btrfs quota group accounting

2019-06-23 Thread Andrei Borzenkov
23.06.2019 11:08, Qu Wenruo wrote:
> 
> 
>> On 2019/6/23 3:55 PM, Qu Wenruo wrote:
>>
>>
>> On 2019/6/22 11:11 PM, Andrei Borzenkov wrote:
>> [snip]
>>>
>>> 10:/mnt # dd if=/dev/urandom of=test/file bs=1M count=100 seek=0
>>> conv=notrunc
>>> 100+0 records in
>>> 100+0 records out
>>> 104857600 bytes (105 MB, 100 MiB) copied, 0.685532 s, 153 MB/s
>>> 10:/mnt # sync
>>> 10:/mnt # btrfs qgroup show .
>>> qgroupid         rfer         excl
>>> --------         ----         ----
>>> 0/5          16.00KiB     16.00KiB
>>> 0/258         1.01GiB    100.02MiB
>>> 0/263         1.00GiB     85.02MiB
>>
>> Sorry, I can't really reproduce it.
>>
>> 5.1.12 kernel, using the following script:
>> ---
>> #!/bin/bash
>>
>> dev="/dev/data/btrfs"
>> mnt="/mnt/btrfs"
>>
>> umount $dev &> /dev/null
>> mkfs.btrfs -f $dev > /dev/null
>>
>> mount $dev $mnt
>> btrfs sub create $mnt/subv1
>> btrfs quota enable $mnt
>> btrfs quota rescan -w $mnt
>>
>> xfs_io -f -c "pwrite 0 1G" $mnt/subv1/file1
>> sync
>> btrfs sub snapshot $mnt/subv1 $mnt/subv2
>> sync
>> btrfs qgroup show -prce $mnt
>>
>> xfs_io -c "pwrite 0 100m" $mnt/subv1/file1
>> sync
>> btrfs qgroup show -prce $mnt
>> ---
>>
>> The result is:
>> ---
>> Create subvolume '/mnt/btrfs/subv1'
>> wrote 1073741824/1073741824 bytes at offset 0
>> 1 GiB, 262144 ops; 0.5902 sec (1.694 GiB/sec and 444134.2107 ops/sec)
>> Create a snapshot of '/mnt/btrfs/subv1' in '/mnt/btrfs/subv2'
>> qgroupid         rfer         excl     max_rfer     max_excl parent  child
>> --------         ----         ----     --------     -------- ------  -----
>> 0/5          16.00KiB     16.00KiB         none         none ---     ---
>> 0/256         1.00GiB     16.00KiB         none         none ---     ---
>> 0/259         1.00GiB     16.00KiB         none         none ---     ---
>> wrote 104857600/104857600 bytes at offset 0
>> 100 MiB, 25600 ops; 0.0694 sec (1.406 GiB/sec and 368652.9766 ops/sec)
>> qgroupid         rfer         excl     max_rfer     max_excl parent  child
>> --------         ----         ----     --------     -------- ------  -----
>> 0/5          16.00KiB     16.00KiB         none         none ---     ---
>> 0/256         1.10GiB    100.02MiB         none         none ---     ---
>> 0/259         1.00GiB     16.00KiB         none         none ---     ---
>> ---
>>
>>> 10:/mnt # filefrag -v test/file
>>> Filesystem type is: 9123683e
>>> File size of test/file is 1073741824 (262144 blocks of 4096 bytes)
> 
> My bad, I was still assuming 512 bytes as the blocksize.
> With a 4K blocksize, the fiemap result matches.
> 
> Then please discard my previous comment.
> 
> Then we need to check the data extent layout to see what's going on.
> 
> Would you please provide the following output?
> # btrfs ins dump-tree -t 258 /dev/vdb
> # btrfs ins dump-tree -t 263 /dev/vdb
> # btrfs check /dev/vdb
> 
> If the last command reports a qgroup mismatch, then it means the qgroup
> numbers are indeed incorrect.
> 

No error reported.
10:/home/bor # btrfs ins dump-tree -t 258 /dev/vdb
btrfs-progs v5.1
file tree key (258 ROOT_ITEM 0)
leaf 32505856 items 45 free space 12677 generation 11 owner 258
leaf 32505856 flags 0x1(WRITTEN) backref revision 1
fs uuid d10df0fa-25aa-4d80-89d9-16033ae3392d
chunk uuid 1bf7922a-9f98-4c76-8511-77e5605b8112
item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
generation 8 transid 9 size 8 nbytes 0
block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0
sequence 13 flags 0x0(none)
atime 1561214496.783636682 (2019-06-22 17:41:36)
ctime 1561214504.7643132 (2019-06-22 17:41:44)
mtime 1561214504.7643132 (2019-06-22 17:41:44)
otime 1561214496.783636682 (2019-06-22 17:41:36)
item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
index 0 namelen 2 name: ..
item 2 key (256 DIR_ITEM 1847562484) itemoff 16077 itemsize 34
location key (257 INODE_ITEM 0) type FILE
transid 9 data_len 0 name_len 4
name: file
item 3 key (256 DIR_INDEX 2) itemoff 16043 itemsize 34
location key (257 INODE_ITEM 0) type FILE
transid 9 data_len 0 name_len 4
name: file
item 4 key (257 INODE_ITEM 0) itemoff 15883 itemsize 160
generation 9 transid 11 size 1073741824 nbytes 1073741824
block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
sequence 1136 flags 0x0(none)
atime 1561214504.7643132 (2019-06-22 17:41:44)
ctime 1561214563.71728522 (2019-06-22 17:42:43)
mtime 1561214563.71728522 (2019-06-22 17:42:43)
otime 1561214504.7643132 (2019-06-22 17:41:44)
item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
index 2 namelen 4 name: file
item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
generation 11 type 1 (regular)

[PATCH] btrfs-progs: misc-tests/029: exit manually after run_mayfail()

2019-06-23 Thread damenly . su
From: Su Yue 

Since commit 8dd3e5dc2df5
("btrfs-progs: tests: fix misc-tests/029 to run on NFS") added NFS
compatibility, run_mayfail() is called at the end of the test.

However, run_mayfail() always returns the original exit code. If the test
case is not running on NFS, the last `run_mayfail rmdir "$SUBVOL_MNT"`
fails with return value 1 and then the test fails:

== RUN MAYFAIL rmdir btrfs-progs/tests/misc-tests/029-send-p-different-mountpoints/subvol_mnt
rmdir: failed to remove 'btrfs-progs/tests/misc-tests/029-send-p-different-mountpoints/subvol_mnt': No such file or directory
failed (ignored, ret=1): rmdir btrfs-progs/tests/misc-tests/029-send-p-different-mountpoints/subvol_mnt
test failed for case 029-send-p-different-mountpoints
=

Every command in this script handles its own errors well, so just do an
explicit exit 0 at the end.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=202645
Fixes: 8dd3e5dc2df5 ("btrfs-progs: tests: fix misc-tests/029 to run on NFS")
Signed-off-by: Su Yue 
---
 tests/misc-tests/029-send-p-different-mountpoints/test.sh | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tests/misc-tests/029-send-p-different-mountpoints/test.sh 
b/tests/misc-tests/029-send-p-different-mountpoints/test.sh
index e092f8bba31e..d2b5e693f2d7 100755
--- a/tests/misc-tests/029-send-p-different-mountpoints/test.sh
+++ b/tests/misc-tests/029-send-p-different-mountpoints/test.sh
@@ -49,3 +49,6 @@ run_check_umount_test_dev "$TEST_MNT"
 
 run_mayfail $SUDO_HELPER rmdir "$SUBVOL_MNT"
 run_mayfail rmdir "$SUBVOL_MNT"
+# run_mayfail() may fail with nonzero value returned which causes failure
+# of this case. Do exit manually.
+exit 0
-- 
2.22.0



Re: Confused by btrfs quota group accounting

2019-06-23 Thread Qu Wenruo


On 2019/6/23 6:15 PM, Andrei Borzenkov wrote:
[snip]
>> If the last command reports qgroup mismatch, then it means qgroup is
>> indeed incorrect.
>>
> 
> no error reported.

Then it's not a bug; it should be caused by btrfs extent bookkeeping behavior.

> 10:/home/bor # btrfs ins dump-tree -t 258 /dev/vdb
> btrfs-progs v5.1
> file tree key (258 ROOT_ITEM 0)
>   item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
>   index 2 namelen 4 name: file

The inode we care about.

>   item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
>   generation 11 type 1 (regular)
>   extent data disk byte 1291976704 nr 46137344
>   extent data offset 0 nr 46137344 ram 46137344

A 44MiB extent; this should be exclusive to subvol 258.

>   item 7 key (257 EXTENT_DATA 46137344) itemoff 15763 itemsize 53
>   generation 11 type 1 (regular)
>   extent data disk byte 1338114048 nr 45875200
>   extent data offset 0 nr 45875200 ram 45875200

Another 43.75MiB extent, also exclusive to 258.

>   item 8 key (257 EXTENT_DATA 92012544) itemoff 15710 itemsize 53
>   generation 11 type 1 (regular)
>   extent data disk byte 314966016 nr 262144
>   extent data offset 0 nr 262144 ram 262144

Another 0.25MiB extent. Also exclusive.

>   item 9 key (257 EXTENT_DATA 92274688) itemoff 15657 itemsize 53
>   generation 11 type 1 (regular)
>   extent data disk byte 315228160 nr 12582912
>   extent data offset 0 nr 12582912 ram 12582912

Another 12.0 MiB extent, also exclusive.


BTW, so many fragmented extents normally means your system is under very
high memory pressure, is short of memory, or is short of on-disk space.
The above 100MiB should be one large extent, not split into so many small
ones.

So 258 has 100MiB of exclusive extents. No problem so far.

>   item 10 key (257 EXTENT_DATA 104857600) itemoff 15604 itemsize 53
>   generation 9 type 1 (regular)
>   extent data disk byte 227016704 nr 43515904
>   extent data offset 15728640 nr 27787264 ram 43515904

From this extent on, the data extent at 227016704 (len 41.5MiB) is
shared with the other subvolume.

You can just search for the bytenr 227016704, which also shows up in subvol 263.

[snip]

> file tree key (263 ROOT_ITEM 10)
>   item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
>   index 2 namelen 4 name: file

Starting from here, that's the inode we care about.

>   item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
>   generation 9 type 1 (regular)
>   extent data disk byte 137887744 nr 43778048
>   extent data offset 0 nr 43778048 ram 43778048

Exclusive, 41.75 MiB.

>   item 7 key (257 EXTENT_DATA 43778048) itemoff 15763 itemsize 53
>   generation 9 type 1 (regular)
>   extent data disk byte 181665792 nr 1310720
>   extent data offset 0 nr 1310720 ram 1310720

Exclusive 1.25MiB.

>   item 8 key (257 EXTENT_DATA 45088768) itemoff 15710 itemsize 53
>   generation 9 type 1 (regular)
>   extent data disk byte 182976512 nr 43778048
>   extent data offset 0 nr 43778048 ram 43778048

Exclusive, 41.75 MiB.

>   item 9 key (257 EXTENT_DATA 88866816) itemoff 15657 itemsize 53
>   generation 9 type 1 (regular)
>   extent data disk byte 226754560 nr 262144
>   extent data offset 0 nr 262144 ram 262144
>   extent compression 0 (none)

This data extent is shared between subvols 258 and 263.
The difference is that subvol 258 only references part of the extent,
while 263 uses the full extent.
Btrfs qgroup calculates exclusive space based on extents, not bytes, so
even if only part of an extent is shared, the whole extent is counted as
shared.

So for subvol 263, your exclusive is 41.75 + 1.25 + 41.75 = 84.75 MiB.

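If you want to double-check that figure, summing the three exclusive
extent lengths straight from the dump-tree items above gives the same
number (a minimal sketch; the byte values are copied from the items quoted
above):
---
# 43778048 + 1310720 + 43778048 bytes, converted to MiB
awk 'BEGIN { printf "%.2f MiB\n", (43778048 + 1310720 + 43778048) / 1048576 }'
---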

In short, because qgroup works at the extent level, not the byte level,
you'll see such strange-looking numbers.

E.g. for my previous script, on a system with enough free memory, if you
only write 100MiB, which is smaller than the data extent size limit
(128MiB), only one subvolume will get 100MiB exclusive while the other
one has no exclusive space (except the 16K leaf).

But if you write 128MiB, exactly at the extent size limit, then both
subvolumes will have 128MiB exclusive.
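
For example, replacing the last write in my earlier script with a 128MiB
one should show both subvolumes ending up with the space counted as
exclusive (a minimal sketch reusing the same commands; I haven't re-run it
here):
---
xfs_io -c "pwrite 0 128m" $mnt/subv1/file1
sync
# with a whole 128MiB extent overwritten, neither subvolume shares it:
# subv1 owns the new extent, subv2 owns the old one
btrfs qgroup show -prce $mnt
---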

Thanks,
Qu


>   item 10 key (257 EXTENT_DATA 89128960) itemoff 15604 itemsize 53
>   generation 9 type 1 (regular)
>   extent data disk byte 227016704 nr 43515904
>   extent data offset 0 nr 43515904 ram 43515904
>   extent compression 0 (none)
[snip]

> 
>> Also, I see your subvolume IDs are not contiguous; did you create/remove
>> some other subvolumes during your test?
>>
> 
> No. At least on this filesystem. I have recreated it several times, but
since the last mkfs these were the only two subvolumes.

Recover files from broken btrfs

2019-06-23 Thread Robert
Hi all

I have a ReadyNAS device with 4 4TB disks. It was working all right
for a couple of years. At one point the system became read-only, and
after a reboot the data is inaccessible.
Can anyone give some advice on how to recover data from the file system?

system details are
root@Dyskietka:~# uname -a
Linux Dyskietka 4.4.116.armada.1 #1 SMP Mon Feb 19 22:05:00 PST 2018
armv7l GNU/Linux
root@Dyskietka:~# btrfs --version
btrfs-progs v4.12
root@Dyskietka:~# btrfs fi show
Label: '2fe4f8e6:data'  uuid: 0970e8c4-fd47-43d3-aa93-593006e3d0c3
Total devices 1 FS bytes used 8.11TiB
devid1 size 10.90TiB used 8.11TiB path /dev/md127

root@Dyskietka:~# btrfs fi df /dev/md127
ERROR: not a btrfs filesystem: /dev/md127

dmesg
[Sat Jun  1 13:18:48 2019] md/raid:md1: device sdc2 operational as raid disk 0
[Sat Jun  1 13:18:48 2019] md/raid:md1: device sda2 operational as raid disk 3
[Sat Jun  1 13:18:48 2019] md/raid:md1: device sdb2 operational as raid disk 2
[Sat Jun  1 13:18:48 2019] md/raid:md1: device sdd2 operational as raid disk 1
[Sat Jun  1 13:18:48 2019] md/raid:md1: allocated 4294kB
[Sat Jun  1 13:18:48 2019] md/raid:md1: raid level 6 active with 4 out
of 4 devices, algorithm 2
[Sat Jun  1 13:18:48 2019] RAID conf printout:
[Sat Jun  1 13:18:48 2019]  --- level:6 rd:4 wd:4
[Sat Jun  1 13:18:48 2019]  disk 0, o:1, dev:sdc2
[Sat Jun  1 13:18:48 2019]  disk 1, o:1, dev:sdd2
[Sat Jun  1 13:18:48 2019]  disk 2, o:1, dev:sdb2
[Sat Jun  1 13:18:48 2019]  disk 3, o:1, dev:sda2
[Sat Jun  1 13:18:48 2019] md1: detected capacity change from 0 to 1071644672
[Sat Jun  1 13:18:49 2019] EXT4-fs (md0): mounted filesystem with
ordered data mode. Opts: (null)
[Sat Jun  1 13:18:49 2019] systemd[1]: Failed to insert module
'kdbus': Function not implemented
[Sat Jun  1 13:18:49 2019] systemd[1]: systemd 230 running in system
mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP
+LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS
+KMOD +IDN)
[Sat Jun  1 13:18:49 2019] systemd[1]: Detected architecture arm.
[Sat Jun  1 13:18:49 2019] systemd[1]: Set hostname to .
[Sat Jun  1 13:18:50 2019] systemd[1]: Started Dispatch Password
Requests to Console Directory Watch.
[Sat Jun  1 13:18:50 2019] systemd[1]: Listening on /dev/initctl
Compatibility Named Pipe.
[Sat Jun  1 13:18:50 2019] systemd[1]: Listening on Journal Socket (/dev/log).
[Sat Jun  1 13:18:50 2019] systemd[1]: Listening on udev Kernel Socket.
[Sat Jun  1 13:18:50 2019] systemd[1]: Created slice System Slice.
[Sat Jun  1 13:18:50 2019] systemd[1]: Created slice
system-serial\x2dgetty.slice.
[Sat Jun  1 13:18:50 2019] systemd[1]: Reached target Encrypted Volumes.
[Sat Jun  1 13:18:50 2019] systemd[1]: Started Forward Password
Requests to Wall Directory Watch.
[Sat Jun  1 13:18:50 2019] systemd[1]: Listening on Journal Socket.
[Sat Jun  1 13:18:50 2019] systemd[1]: Started ReadyNAS LCD splasher.
[Sat Jun  1 13:18:50 2019] systemd[1]: Starting ReadyNASOS system prep...
[Sat Jun  1 13:18:50 2019] systemd[1]: Mounting RPC Pipe File System...
[Sat Jun  1 13:18:50 2019] systemd[1]: Starting Remount Root and
Kernel File Systems...
[Sat Jun  1 13:18:50 2019] systemd[1]: Starting Create list of
required static device nodes for the current kernel...
[Sat Jun  1 13:18:50 2019] systemd[1]: Mounting POSIX Message Queue
File System...
[Sat Jun  1 13:18:50 2019] systemd[1]: Created slice User and Session Slice.
[Sat Jun  1 13:18:50 2019] systemd[1]: Reached target Slices.
[Sat Jun  1 13:18:50 2019] systemd[1]: Created slice system-getty.slice.
[Sat Jun  1 13:18:50 2019] systemd[1]: Starting Load Kernel Modules...
[Sat Jun  1 13:18:50 2019] systemd[1]: Mounting RPC Pipe File System...
[Sat Jun  1 13:18:50 2019] systemd[1]: Starting Journal Service...
[Sat Jun  1 13:18:50 2019] systemd[1]: Listening on udev Control Socket.
[Sat Jun  1 13:18:50 2019] systemd[1]: Reached target Paths.
[Sat Jun  1 13:18:51 2019] systemd[1]: Mounted RPC Pipe File System.
[Sat Jun  1 13:18:51 2019] systemd[1]: Mounted RPC Pipe File System.
[Sat Jun  1 13:18:51 2019] systemd[1]: Mounted POSIX Message Queue File System.
[Sat Jun  1 13:18:51 2019] systemd[1]: Started ReadyNASOS system prep.
[Sat Jun  1 13:18:51 2019] systemd[1]: Started Remount Root and Kernel
File Systems.
[Sat Jun  1 13:18:51 2019] systemd[1]: Started Create list of required
static device nodes for the current kernel.
[Sat Jun  1 13:18:51 2019] systemd[1]: Started Load Kernel Modules.
[Sat Jun  1 13:18:51 2019] systemd[1]: Starting Apply Kernel Variables...
[Sat Jun  1 13:18:51 2019] systemd[1]: Mounting FUSE Control File System...
[Sat Jun  1 13:18:51 2019] systemd[1]: Mounting Configuration File System...
[Sat Jun  1 13:18:51 2019] systemd[1]: Starting Create Static Device
Nodes in /dev...
[Sat Jun  1 13:18:51 2019] systemd[1]: Starting Load/Save Random Seed...
[Sat Jun  1 13:18:51 2019] systemd[1]: Starting Rebuild Hardware Database...
[Sat Jun  1 13:18:51 2019] systemd[1]: Mounted Configuration File System.

Re: Confused by btrfs quota group accounting

2019-06-23 Thread Andrei Borzenkov
23.06.2019 14:29, Qu Wenruo wrote:
> 
> 
> BTW, so many fragmented extents, this normally means your system has
> very high memory pressure or lack of memory, or lack of on-disk space.

It is a 1GiB QEMU VM running vanilla Tumbleweed with a GNOME desktop;
nothing runs except the user GNOME session. Does that fit the "high
memory pressure" definition?

> Above 100MiB should be in one large extent, not split into so many small
> ones.
> 

OK, so this is where I was confused. I was sure that filefrag returns
the true "physical" extent layout. It seems that in the filefrag output
consecutive extents are merged, giving the false picture of a few large
extents instead of many small ones. Filefrag shows 5 extents of ~200MiB,
not over 30 smaller ones.

Is this how the generic FIEMAP ioctl is designed to work, or is it
something that btrfs does internally? This *is* confusing.

In any case, thank you for clarification, this makes sense now.





Re: Confused by btrfs quota group accounting

2019-06-23 Thread Qu Wenruo


On 2019/6/23 9:42 PM, Andrei Borzenkov wrote:
> 23.06.2019 14:29, Qu Wenruo wrote:
>>
>>
>> BTW, so many fragmented extents, this normally means your system has
>> very high memory pressure or lack of memory, or lack of on-disk space.
> 
> It is 1GiB QEMU VM with vanilla Tumbleweed with GNOME desktop; nothing
> runs except user GNOME session. Does it fit "high memory pressure"
> definition?

1GiB of VM RAM? Then it's very easy to trigger memory pressure.
I'm not 100% sure about the percentage of memory the page cache can use
for dirty pages, but 1/8 would be a safe guess.
Which means you can only write at most 128M before triggering writeback.
Considering other programs use some page cache, you have even less
available for the filesystem.
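
If you want to check the actual thresholds on that VM instead of guessing,
the writeback knobs are visible via sysctl (a sketch; the exact defaults
depend on the distribution):
---
# ratios are percentages of dirtyable memory; on a 1GiB VM a 10-20% ratio
# means background writeback starts after roughly 100-200MiB of dirty pages
sysctl vm.dirty_background_ratio vm.dirty_ratio
sysctl vm.dirty_background_bytes vm.dirty_bytes
---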

> 
>> Above 100MiB should be in one large extent, not split into so many small
>> ones.
>>
> 
> OK, so this is where I was confused. I was sure that filefrag returns
> true "physical" extent layout. It seems that in filefrag output
> consecutive extents are merged giving false picture of large extent
> instead of many small ones. Filefrag shows 5 ~200MiB extents, not over
> 30 smaller ones.

Btrfs merges file extent mappings at fiemap reporting time.
Personally speaking, I don't know why a user would expect the real extent
mapping. As long as all these extents have the same flags and contiguous
addresses, merging them shouldn't be a problem.

In fact, after viewing the real on-disk extent mapping, the behavior can
also be explained just from the fiemap result, but using the fiemap
result alone makes it indeed much harder to expose such a case.

Your use case already goes deep into the implementation, so I recommend
low-level tools like "btrfs ins dump-tree" to discover the underlying
extent mapping.
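
For example, to look at just the file extent items of the inode discussed
in this thread (a sketch; the subvolume ID, inode number and device are
the ones from this thread and will differ on other setups):
---
btrfs ins dump-tree -t 263 /dev/vdb | grep -A3 '(257 EXTENT_DATA'
---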

> 
> Is it how generic IOCTL is designed to work or is it something that
> btrfs does internally? This *is* confusing.
> 
> In any case, thank you for clarification, this makes sense now.

AFAIK, we should put this into the btrfs(5) man page as a special use
case, along with btrfs extent bookkeeping.

Thanks,
Qu





Re: Global reserve and ENOSPC while deleting snapshots on 5.0.9 - still happens on 5.1.11

2019-06-23 Thread Zygo Blaxell
On Tue, Apr 23, 2019 at 07:06:51PM -0400, Zygo Blaxell wrote:
> I had a test filesystem that ran out of unallocated space, then ran
> out of metadata space during a snapshot delete, and forced readonly.
> The workload before the failure was a lot of rsync and bees dedupe
> combined with random snapshot creates and deletes.

Had this happen again on a production filesystem, this time on 5.1.11,
and it happened during orphan inode cleanup instead of snapshot delete:

[14303.076134][T20882] BTRFS: error (device dm-21) in add_to_free_space_tree:1037: errno=-28 No space left
[14303.076144][T20882] BTRFS: error (device dm-21) in __btrfs_free_extent:7196: errno=-28 No space left
[14303.076157][T20882] BTRFS: error (device dm-21) in btrfs_run_delayed_refs:3008: errno=-28 No space left
[14303.076203][T20882] BTRFS error (device dm-21): Error removing orphan entry, stopping orphan cleanup
[14303.076210][T20882] BTRFS error (device dm-21): could not do orphan cleanup -22
[14303.076281][T20882] BTRFS error (device dm-21): commit super ret -30
[14303.357337][T20882] BTRFS error (device dm-21): open_ctree failed

Same fix:  I bumped the reserved size limit from 512M to 2G and mounted
normally.  (OK, technically, I booted my old 5.0.21 kernel--but my 5.0.21
kernel has the 2G reserved space patch below in it.)
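
If you just want to see how close a filesystem is to this cliff without
patching anything, the current global reserve size and usage are visible
from userspace (a sketch; the mount point is whatever yours is):
---
btrfs filesystem usage /mnt | grep -i globalreserve
---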

I've not been able to repeat this ENOSPC behavior under test conditions
in the last two months of trying, but it's now happened twice in different
places, so it has non-zero repeatability.

> I tried the usual fix strategies:
> 
>   1.  Immediately after mount, try to balance to free space for
>   metadata
> 
>   2.  Immediately after mount, add additional disks to provide
>   unallocated space for metadata
> 
>   3.  Mount -o nossd to increase metadata density
> 
> #3 had no effect.  #1 failed consistently.
> 
> #2 was successful, but the additional space was not used because
> btrfs couldn't allocate chunks for metadata because it ran out of
> metadata space for new metadata chunks.
> 
> When btrfs-cleaner tried to remove the first pending deleted snapshot,
> it started a transaction that failed due to lack of metadata space.
> Since the transaction failed, the filesystem reverts to its earlier state,
> and exactly the same thing happens on the next mount.  The 'btrfs dev
> add' in #2 is successful only if it is executed immediately after mount,
> before the btrfs-cleaner thread wakes up.
> 
> Here's what the kernel said during one of the attempts:
> 
>   [41263.822252] BTRFS info (device dm-3): use zstd compression, level 0
>   [41263.825135] BTRFS info (device dm-3): using free space tree
>   [41263.827319] BTRFS info (device dm-3): has skinny extents
>   [42046.463356] [ cut here ]
>   [42046.463387] BTRFS: error (device dm-3) in __btrfs_free_extent:7056: 
> errno=-28 No space left
>   [42046.463404] BTRFS: error (device dm-3) in __btrfs_free_extent:7056: 
> errno=-28 No space left
>   [42046.463407] BTRFS info (device dm-3): forced readonly
>   [42046.463414] BTRFS: error (device dm-3) in 
> btrfs_run_delayed_refs:3011: errno=-28 No space left
>   [42046.463429] BTRFS: error (device dm-3) in 
> btrfs_create_pending_block_groups:10517: errno=-28 No space left
>   [42046.463548] BTRFS: error (device dm-3) in 
> btrfs_create_pending_block_groups:10520: errno=-28 No space left
>   [42046.471363] BTRFS: error (device dm-3) in 
> btrfs_run_delayed_refs:3011: errno=-28 No space left
>   [42046.471475] BTRFS: error (device dm-3) in 
> btrfs_create_pending_block_groups:10517: errno=-28 No space left
>   [42046.471506] BTRFS: error (device dm-3) in 
> btrfs_create_pending_block_groups:10520: errno=-28 No space left
>   [42046.473672] BTRFS: error (device dm-3) in btrfs_drop_snapshot:9489: 
> errno=-28 No space left
>   [42046.475643] WARNING: CPU: 0 PID: 10187 at 
> fs/btrfs/extent-tree.c:7056 __btrfs_free_extent+0x364/0xf60
>   [42046.475645] Modules linked in: mq_deadline bfq dm_cache_smq dm_cache 
> dm_persistent_data dm_bio_prison dm_bufio joydev ppdev crct10dif_pclmul 
> crc32_pclmul crc32c_intel ghash_clmulni_intel dm_mod snd_pcm aesni_intel 
> aes_x86_64 snd_timer crypto_simd cryptd glue_helper sr_mod snd cdrom psmouse 
> sg soundcore input_leds pcspkr serio_raw ide_pci_generic i2c_piix4 bochs_drm 
> parport_pc piix rtc_cmos floppy parport pcc_cpufreq ide_core qemu_fw_cfg 
> evbug evdev ip_tables x_tables ipv6 crc_ccitt autofs4
>   [42046.475677] CPU: 0 PID: 10187 Comm: btrfs-transacti Tainted: GB  
>  W 5.0.8-zb64-10a85e8a1569+ #1
>   [42046.475678] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS 1.10.2-1 04/01/2014
>   [42046.475681] RIP: 0010:__btrfs_free_extent+0x364/0xf60
>   [42046.475684] Code: 50 f0 48 0f ba a8 90 22 00 00 02 72 1f 8b 85 88 fe 
> ff ff 83 f8 fb 0f 84 59 04 00 00 89 c6 48 c7 c7 00

Per-entry or per-subvolume physical location settings

2019-06-23 Thread Valery Plotnikov
Greetings!

When using btrfs with multiple devices in a "single" mode, is it
possible to force some files and directories onto one drive and some to
the other? Or at least specify "single" mode on a specific device for
some directories and "DUP" for some others.

The following scenario, if it is possible, would make btrfs even cooler
than bcachefs:

- Single mode, multiple devices - an HDD and an SSD
- Force system folders to be located on the SSD, or on both - those see
few writes and need speed
- And still manage snapshots easily in one place, instead of two
top-level directories and a lot of symlinks!

Maybe it is at least possible to do it on a per-subvolume basis? This
would still be better than mounting subvolumes from different devices.


I remember reading somewhere in the early days of btrfs that the main
advantage of btrfs over LVM is that btrfs knows about your files, which
makes such things possible - even at the file structure level. However, I
could not find anything about it, while LVM allows per-volume physical
location settings, if I remember correctly.


-- 
Valery Plotnikov



btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)

2019-06-23 Thread Zygo Blaxell
On Thu, Jun 20, 2019 at 01:00:50PM +0800, Qu Wenruo wrote:
> On 2019/6/20 7:45 AM, Zygo Blaxell wrote:
> > On Sun, Jun 16, 2019 at 12:05:21AM +0200, Claudius Winkel wrote:
> >> What should I do now ... to use btrfs safely? Should i not use it with
> >> DM-crypt
> > 
> > You might need to disable write caching on your drives, i.e. hdparm -W0.
> 
> This is quite troublesome.
> 
> Disabling write cache normally means performance impact.

The drives I've found that need write cache disabled aren't particularly
fast to begin with, so disabling write cache doesn't harm their
performance very much.  All the speed gains of write caching are lost
when someone has to spend time doing a forced restore from backup after
transid-verify failure.  If you really do need performance, there are
drives with working firmware available that don't cost much more.
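
For anyone who wants to apply that workaround now, write caching can be
turned off per drive and made persistent across reboots with a udev rule
(a rough sketch; the model match string and the hdparm path are
assumptions to adjust for the affected drive and distribution):
---
# one-off, lasts until the next power cycle
hdparm -W0 /dev/sdX

# /etc/udev/rules.d/99-disable-write-cache.rules
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTRS{model}=="WDC WD20EZRX*", RUN+="/usr/sbin/hdparm -W0 /dev/%k"
---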

> And disabling it normally would hide the true cause (if it's something
> btrfs' fault).

This is true; however, even if a hypothetical btrfs bug existed,
disabling write caching is an immediately deployable workaround, and
there's currently no solution other than avoiding drives with
bad firmware.

There could be improvements possible for btrfs to work around bad
firmware...if someone's willing to donate their sanity to get inside
the heads of firmware bugs, and can find a way to fix it that doesn't
make things worse for everyone with working firmware.

> > I have a few drives in my collection that don't have working write cache.
> > They are usually fine, but when otherwise minor failure events occur (e.g.
> > bad cables, bad power supply, failing UNC sectors) then the write cache
> > doesn't behave correctly, and any filesystem or database on the drive
> > gets trashed.
> 
> Normally this shouldn't be the case, as long as the fs has correct
> journal and flush/barrier.

If you are asking the question:

"Are there some currently shipping retail hard drives that are
orders of magnitude more likely to corrupt data after simple
power failures than other drives?"

then the answer is:

"Hell, yes!  How could there NOT be?"

It wouldn't take very much capital investment or time to find this out
in lab conditions.  Just killing power every 25 minutes while running a
btrfs stress-test should do it--or have a UPS hardware failure in ops,
the effect is the same.  Bad drives will show up in a few hours, good
drives take much longer--long enough that, statistically, the good drives
will probably fail outright before btrfs gets corrupted.

> If it's really the hardware to blame, then it means its flush/fua is not
> implemented properly at all, thus the possibility of a single power loss
> leading to corruption should be VERY VERY high.

That exactly matches my observations.  Only a few disks fail at all,
but the ones that do fail do so very often:  60% of corruptions at
10 power failures or less, 100% at 30 power failures or more.

> >  This isn't normal behavior, but the problem does affect
> > the default configuration of some popular mid-range drive models from
> > top-3 hard disk vendors, so it's quite common.
> 
> Would you like to share the info and test methodology to determine it's
> the device to blame? (maybe in another thread)

It's basic data mining on operations failure event logs.

We track events like filesystem corruption, data loss, other hardware
failure, operator errors, power failures, system crashes, dmesg error
messages, etc., and count how many times each failure occurs in systems
with which hardware components.  When a failure occurs, we break the
affected system apart and place its components into other systems or
test machines to isolate which component is causing the failure (e.g. a
failing power supply could create RAM corruption events and disk failure
events, so we move the hardware around to see where the failure goes).
If the same component is involved in repeatable failure events, the
correlation jumps out of the data and we know that component is bad.
We can also do correlations by attributes of the components, i.e. vendor,
model, size, firmware revision, manufacturing date, and correlate
vendor-model-size-firmware to btrfs transid verify failures across
a fleet of different systems.

I can go to the data and get a list of all the drive model and firmware
revisions that have been installed in machines with 0 "parent transid
verify failed" events since 2014, and are still online today:

Device Model: CT240BX500SSD1 Firmware Version: M6CR013
Device Model: Crucial_CT1050MX300SSD1 Firmware Version: M0CR060
Device Model: HP SSD S700 Pro 256GB Firmware Version: Q0824G
Device Model: INTEL SSDSC2KW256G8 Firmware Version: LHF002C
Device Model: KINGSTON SA400S37240G Firmware Version: R0105A
Device Model: ST12000VN0007-2GS116 Firmware Version: SC60
Device Model: ST5000VN0001-1SF17X Firmware Version: AN02
Device Model: ST8000VN0002-1Z8112 Firmware Version: SC6

Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)

2019-06-23 Thread Qu Wenruo


On 2019/6/24 4:45 AM, Zygo Blaxell wrote:
> On Thu, Jun 20, 2019 at 01:00:50PM +0800, Qu Wenruo wrote:
>> On 2019/6/20 7:45 AM, Zygo Blaxell wrote:
>>> On Sun, Jun 16, 2019 at 12:05:21AM +0200, Claudius Winkel wrote:
 What should I do now ... to use btrfs safely? Should i not use it with
 DM-crypt
>>>
>>> You might need to disable write caching on your drives, i.e. hdparm -W0.
>>
>> This is quite troublesome.
>>
>> Disabling write cache normally means performance impact.
> 
> The drives I've found that need write cache disabled aren't particularly
> fast to begin with, so disabling write cache doesn't harm their
> performance very much.  All the speed gains of write caching are lost
> when someone has to spend time doing a forced restore from backup after
> transid-verify failure.  If you really do need performance, there are
> drives with working firmware available that don't cost much more.
> 
>> And disabling it normally would hide the true cause (if it's something
>> btrfs' fault).
> 
> This is true; however, even if a hypothetical btrfs bug existed,
> disabling write caching is an immediately deployable workaround, and
> there's currently no other solution other than avoiding drives with
> bad firmware.
> 
> There could be improvements possible for btrfs to work around bad
> firmware...if someone's willing to donate their sanity to get inside
> the heads of firmware bugs, and can find a way to fix it that doesn't
> make things worse for everyone with working firmware.
> 
>>> I have a few drives in my collection that don't have working write cache.
>>> They are usually fine, but when otherwise minor failure events occur (e.g.
>>> bad cables, bad power supply, failing UNC sectors) then the write cache
>>> doesn't behave correctly, and any filesystem or database on the drive
>>> gets trashed.
>>
>> Normally this shouldn't be the case, as long as the fs has correct
>> journal and flush/barrier.
> 
> If you are asking the question:
> 
> "Are there some currently shipping retail hard drives that are
> orders of magnitude more likely to corrupt data after simple
> power failures than other drives?"
> 
> then the answer is:
> 
>   "Hell, yes!  How could there NOT be?"
> 
> It wouldn't take very much capital investment or time to find this out
> in lab conditions.  Just kill power every 25 minutes while running a
> btrfs stress-test should do it--or have a UPS hardware failure in ops,
> the effect is the same.  Bad drives will show up in a few hours, good
> drives take much longer--long enough that, statistically, the good drives
> will probably fail outright before btrfs gets corrupted.

Now it sounds like we really need some good way to do such a test (more
elegant than just random power failures; a more controlled system).

> 
>> If it's really the hardware to blame, then it means its flush/fua is not
>> implemented properly at all, thus the possibility of a single power loss
>> leading to corruption should be VERY VERY high.
> 
> That exactly matches my observations.  Only a few disks fail at all,
> but the ones that do fail do so very often:  60% of corruptions at
> 10 power failures or less, 100% at 30 power failures or more.
> 
>>>  This isn't normal behavior, but the problem does affect
>>> the default configuration of some popular mid-range drive models from
>>> top-3 hard disk vendors, so it's quite common.
>>
>> Would you like to share the info and test methodology to determine it's
>> the device to blame? (maybe in another thread)
> 
> It's basic data mining on operations failure event logs.
> 
> We track events like filesystem corruption, data loss, other hardware
> failure, operator errors, power failures, system crashes, dmesg error
> messages, etc., and count how many times each failure occurs in systems
> with which hardware components.  When a failure occurs, we break the
> affected system apart and place its components into other systems or
> test machines to isolate which component is causing the failure (e.g. a
> failing power supply could create RAM corruption events and disk failure
> events, so we move the hardware around to see where the failure goes).
> If the same component is involved in repeatable failure events, the
> correlation jumps out of the data and we know that component is bad.
> We can also do correlations by attributes of the components, i.e. vendor,
> model, size, firmware revision, manufacturing date, and correlate
> vendor-model-size-firmware to btrfs transid verify failures across
> a fleet of different systems.
> 
> I can go to the data and get a list of all the drive model and firmware
> revisions that have been installed in machines with 0 "parent transid
> verify failed" events since 2014, and are still online today:
> 
> Device Model: CT240BX500SSD1 Firmware Version: M6CR013
> Device Model: Crucial_CT1050MX300SSD1 Firmware Version: M0CR060
> Device Model: HP SSD S700 Pro 256GB Firmware Version: Q0824G
> 

Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)

2019-06-23 Thread Remi Gauvin
On 2019-06-23 4:45 p.m., Zygo Blaxell wrote:

>   Model Family: Western Digital Green Device Model: WDC WD20EZRX-00DC0B0 
> Firmware Version: 80.00A80
> 
> Change the query to 1-30 power cycles, and we get another model with
> the same firmware version string:
> 
>   Model Family: Western Digital Red Device Model: WDC WD40EFRX-68WT0N0 
> Firmware Version: 80.00A80
> 

> 
> These drives have 0 power fail events between mkfs and "parent transid
> verify failed" events, i.e. it's not necessary to have a power failure
> at all for these drives to unrecoverably corrupt btrfs.  In all cases the
> failure occurs on the same days as "Current Pending Sector" and "Offline
> UNC sector" SMART events.  The WD Black firmware seems to be OK with write
> cache enabled most of the time (there's years in the log data without any
> transid-verify failures), but the WD Black will drop its write cache when
> it sees a UNC sector, and btrfs notices the failure a few hours later.
> 

First, thank you very much for sharing.  I've seen you mention problems
with common consumer drives several times before, but seeing one
specifically identified problem firmware version is *very* valuable info.

I have a question about the Black drives dropping the cache on UNC
error.  If a transid error like that occurred on a BTRFS RAID 1,
would BTRFS find the correct metadata on the 2nd drive, or does it stop
dead on 1 transid failure?




Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)

2019-06-23 Thread Zygo Blaxell
On Mon, Jun 24, 2019 at 08:46:06AM +0800, Qu Wenruo wrote:
> On 2019/6/24 上午4:45, Zygo Blaxell wrote:
> > On Thu, Jun 20, 2019 at 01:00:50PM +0800, Qu Wenruo wrote:
> >> On 2019/6/20 7:45 AM, Zygo Blaxell wrote:
[...]
> So the worst scenario really happens in real world, badly implemented
> flush/fua from firmware.
> Btrfs has no way to fix such low level problem.
> 
> BTW, do you have any corruption using the bad drives (with write cache)
> with a traditional journal-based fs like XFS/EXT4?

Those filesystems don't make full-filesystem data integrity guarantees
like btrfs does, and there's no ext4 equivalent of dup metadata for
self-repair (even metadata csums in ext4 are a recent invention).
Ops didn't record failure events when e2fsck quietly repairs unexpected
filesystem inconsistencies.  On ext3, maybe data corruption happens
because of drive firmware bugs, or maybe the application just didn't use
fsync properly.  Maybe two disks in md-RAID1 have different contents
because they had slightly different IO timings.  Who knows?  There's no
way to tell from passive ops failure monitoring.

On btrfs with flushoncommit, every data anomaly (e.g. backups not
matching origin hosts, obviously corrupted files, scrub failures, etc)
is a distinct failure event.  Differences between disk contents in RAID1
arrays are failure events.  We can put disks with two different firmware
versions in a RAID1 pair, and btrfs will tell us if they disagree, use
the correct one to fix the broken one, or tell us they're both wrong
and it's time to warm up the backups.

In 2013 I had some big RAID10 arrays of WD Green 2TB disks using ext3/4
and mdadm, and there were a *lot* of data corruption events.  So many
events that we didn't have the capacity to investigate them before new
ones came in.  File restore requests for corrupted data were piling up
faster than they could be processed, and we had no systematic way to tell
whether the origin or backup file was correct when they were different.
Those problems eventually expedited our migration to btrfs, because btrfs
let us do deeper and more uniform data collection to see where all the
corruption was coming from.  While changing filesystems, we moved all
the data onto new disks that happened to not have firmware bugs, and all
the corruption abruptly disappeared (well, except for data corrupted by
bugs in btrfs itself, but now those are fixed too).  We didn't know
what was happening until years later when the smaller/cheaper systems
had enough failures to make noticeable patterns.

I would not be surprised if we were having firmware corruption problems
with ext3/ext4 the whole time those RAID10 arrays existed.  Alas, we were
not capturing firmware revision data at the time (only vendor/model),
and we only started capturing firmware revisions after all the old
drives were recycled.  I don't know exactly what firmware versions were
in those arrays...though I do have a short list of suspects.  ;)

> Btrfs is relying more the hardware to implement barrier/flush properly,
> or CoW can be easily ruined.
> If the firmware is only tested (if tested) against such fs, it may be
> the problem of the vendor.
[...]
> > WD Green and Black are low-cost consumer hard drives under $250.
> > One drive of each size in both product ranges comes to a total price
> > of around $1200 on Amazon.  Lots of end users will have these drives,
> > and some of them will want to use btrfs, but some of the drives apparently
> > do not have working write caching.  We should at least know which ones
> > those are, maybe make a kernel blacklist to disable the write caching
> > feature on some firmware versions by default.
> 
> To me, the problem isn't for anyone to test these drives, but how
> convincing the test methodology is and how accessible the test device
> would be.
> 
> Your statistic has a lot of weight, but it takes you years and tons of
> disks to expose it, not something can be reproduced easily.
>
> On the other hand, if we're going to reproduce power failure quickly and
> reliably in a lab enivronment, then how?
> Software based SATA power cutoff? Or hardware controllable SATA power cable?

You might be overthinking this a bit.  Software-controlled switched
PDUs (or if you're a DIY enthusiast, some PowerSwitch Tails and a
Raspberry Pi) can turn the AC power on and off on a test box.  Get a
cheap desktop machine, put as many different drives into it as it can
hold, start writing test patterns, kill mains power to the whole thing,
power it back up, analyze the data that is now present on disk, log the
result over the network, repeat.  This is the most accurate simulation,
since it replicates all the things that happen during a typical end-user's
power failure, only much more often.  Hopefully all the hardware involved
is designed to handle this situation already.  A standard office PC is
theoretically designed for 1000 cycles (200 working days over 5 years)
and should be able to test 60 drives (6 SATA ports, 10 sets of drives
tested 100 cycles each).  The hardware is all standard equipment in any
IT department.
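
A very rough sketch of that loop, assuming a hypothetical pdu_off/pdu_on
helper wrapping whatever switched PDU or PowerSwitch Tail controls the
test box's mains power, and hypothetical write/verify scripts on the box
itself:
---
while true; do
    # start a write workload on the test filesystem (hypothetical script)
    ssh testbox 'mount /dev/sdX /mnt/test && write-test-patterns /mnt/test' &
    sleep $(( (RANDOM % 25 + 1) * 60 ))   # let it run for 1-25 minutes
    pdu_off testbox; sleep 15; pdu_on testbox
    until ssh -o ConnectTimeout=5 testbox true; do sleep 10; done
    # check what actually survived on disk and log it (hypothetical script)
    ssh testbox 'verify-test-patterns /dev/sdX' | tee -a powercycle-results.log
done
---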

Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)

2019-06-23 Thread Zygo Blaxell
On Sun, Jun 23, 2019 at 10:45:50PM -0400, Remi Gauvin wrote:
> On 2019-06-23 4:45 p.m., Zygo Blaxell wrote:
> 
> > Model Family: Western Digital Green Device Model: WDC WD20EZRX-00DC0B0 
> > Firmware Version: 80.00A80
> > 
> > Change the query to 1-30 power cycles, and we get another model with
> > the same firmware version string:
> > 
> > Model Family: Western Digital Red Device Model: WDC WD40EFRX-68WT0N0 
> > Firmware Version: 80.00A80
> > 
> 
> > 
> > These drives have 0 power fail events between mkfs and "parent transid
> > verify failed" events, i.e. it's not necessary to have a power failure
> > at all for these drives to unrecoverably corrupt btrfs.  In all cases the
> > failure occurs on the same days as "Current Pending Sector" and "Offline
> > UNC sector" SMART events.  The WD Black firmware seems to be OK with write
> > cache enabled most of the time (there's years in the log data without any
> > transid-verify failures), but the WD Black will drop its write cache when
> > it sees a UNC sector, and btrfs notices the failure a few hours later.
> > 
> 
> First, thank you very much for sharing.  I've seen you mention several
> times before problems with common consumer drives, but seeing one
> specific identified problem firmware version is *very* valuable info.
> 
> I have a question about the Black Drives dropping the cache on UNC
> error.  If a transid id error like that occurred on a BTRFS RAID 1,
> would BTRFS find the correct metadata on the 2nd drive, or does it stop
> dead on 1 transid failure?

Well, the 2nd drive has to have correct metadata--if you are mirroring
a pair of disks with the same firmware bug, that's not likely to happen.

There is a bench test that will demonstrate the transid verify self-repair
procedure: disconnect one half of a RAID1 array, write for a while, then
reconnect and do a scrub.  btrfs should self-repair all the metadata on
the disconnected drive until it all matches the connected one.  Some of
the data blocks might be hosed though (due to CRC32 collisions), so
don't do this test on data you care about.
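
A rough way to run that bench test without touching real hardware is to
approximate the disconnect with loop devices (a sketch; it assumes
/mnt/scratch exists and the image files hold nothing of value, and it
does not fully reproduce pulling the cable on a live array):
---
truncate -s 2G /tmp/d1.img /tmp/d2.img
d1=$(losetup -f --show /tmp/d1.img)
d2=$(losetup -f --show /tmp/d2.img)
mkfs.btrfs -f -d raid1 -m raid1 "$d1" "$d2"
btrfs device scan

# populate the healthy array first
mount "$d1" /mnt/scratch
dd if=/dev/urandom of=/mnt/scratch/base bs=1M count=128
umount /mnt/scratch

# "disconnect" one half, then write for a while on the degraded array
losetup -d "$d2"
mount -o degraded "$d1" /mnt/scratch
dd if=/dev/urandom of=/mnt/scratch/churn bs=1M count=128
umount /mnt/scratch

# reconnect the stale half and let scrub repair it
d2=$(losetup -f --show /tmp/d2.img)
btrfs device scan
mount "$d1" /mnt/scratch
btrfs scrub start -B /mnt/scratch
---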

> 
> 




Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)

2019-06-23 Thread Zygo Blaxell
On Mon, Jun 24, 2019 at 12:37:51AM -0400, Zygo Blaxell wrote:
> On Sun, Jun 23, 2019 at 10:45:50PM -0400, Remi Gauvin wrote:
> > On 2019-06-23 4:45 p.m., Zygo Blaxell wrote:
> > 
> > >   Model Family: Western Digital Green Device Model: WDC WD20EZRX-00DC0B0 
> > > Firmware Version: 80.00A80
> > > 
> > > Change the query to 1-30 power cycles, and we get another model with
> > > the same firmware version string:
> > > 
> > >   Model Family: Western Digital Red Device Model: WDC WD40EFRX-68WT0N0 
> > > Firmware Version: 80.00A80
> > > 
> > 
> > > 
> > > These drives have 0 power fail events between mkfs and "parent transid
> > > verify failed" events, i.e. it's not necessary to have a power failure
> > > at all for these drives to unrecoverably corrupt btrfs.  In all cases the
> > > failure occurs on the same days as "Current Pending Sector" and "Offline
> > > UNC sector" SMART events.  The WD Black firmware seems to be OK with write
> > > cache enabled most of the time (there's years in the log data without any
> > > transid-verify failures), but the WD Black will drop its write cache when
> > > it sees a UNC sector, and btrfs notices the failure a few hours later.
> > > 
> > 
> > First, thank you very much for sharing.  I've seen you mention several
> > times before problems with common consumer drives, but seeing one
> > specific identified problem firmware version is *very* valuable info.
> > 
> > I have a question about the Black Drives dropping the cache on UNC
> > error.  If a transid id error like that occurred on a BTRFS RAID 1,
> > would BTRFS find the correct metadata on the 2nd drive, or does it stop
> > dead on 1 transid failure?
> 
> Well, the 2nd drive has to have correct metadata--if you are mirroring
> a pair of disks with the same firmware bug, that's not likely to happen.

OK, I forgot the Black case is a little complicated...

I guess if you had two WD Black drives and they had all their UNC sector
events at different times, then the btrfs RAID1 repair should still
work with write cache enabled.  That seems kind of risky, though--what
if something bumps the machine and both disks get UNC sectors at once?

Alternatives in roughly decreasing order of risk:

1.  Disable write caching on both Blacks in the pair

2.  Replace both Blacks with drives in the 0-failure list

3.  Replace one Black with a Seagate Firecuda or WD Red Pro
(any other 0-failure drive will do, but these have similar
performance specs to Black) to ensure firmware diversity

4.  Find some Black drives with different firmware that have UNC
sectors and see what happens with write caching during sector
remap events:  if they behave well, enable write caching on
all drives with matching firmware, disable if not

5.  Leave write caching on for now, but as soon as any Black
reports UNC sectors or reallocation events in SMART data, turn
write caching off for the remainder of the drive's service life.

> There is a bench test that will demonstrate the transid verify self-repair
> procedure: disconnect one half of a RAID1 array, write for a while, then
> reconnect and do a scrub.  btrfs should self-repair all the metadata on
> the disconnected drive until it all matches the connected one.  Some of
> the data blocks might be hosed though (due to CRC32 collisions), so
> don't do this test on data you care about.
> 
> > 
> > 






Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)

2019-06-23 Thread Qu Wenruo


On 2019/6/24 12:29 PM, Zygo Blaxell wrote:
[...]
> 
>> Btrfs is relying more the hardware to implement barrier/flush properly,
>> or CoW can be easily ruined.
>> If the firmware is only tested (if tested) against such fs, it may be
>> the problem of the vendor.
> [...]
>>> WD Green and Black are low-cost consumer hard drives under $250.
>>> One drive of each size in both product ranges comes to a total price
>>> of around $1200 on Amazon.  Lots of end users will have these drives,
>>> and some of them will want to use btrfs, but some of the drives apparently
>>> do not have working write caching.  We should at least know which ones
>>> those are, maybe make a kernel blacklist to disable the write caching
>>> feature on some firmware versions by default.
>>
>> To me, the problem isn't for anyone to test these drives, but how
>> convincing the test methodology is and how accessible the test device
>> would be.
>>
>> Your statistic has a lot of weight, but it takes you years and tons of
>> disks to expose it, not something can be reproduced easily.
>>
>> On the other hand, if we're going to reproduce power failure quickly and
>> reliably in a lab enivronment, then how?
>> Software based SATA power cutoff? Or hardware controllable SATA power cable?
> 
> You might be overthinking this a bit.  Software-controlled switched
> PDUs (or if you're a DIY enthusiast, some PowerSwitch Tails and a
> Raspberry Pi) can turn the AC power on and off on a test box.  Get a
> cheap desktop machine, put as many different drives into it as it can
> hold, start writing test patterns, kill mains power to the whole thing,
> power it back up, analyze the data that is now present on disk, log the
> result over the network, repeat.  This is the most accurate simulation,
> since it replicates all the things that happen during a typical end-user's
> power failure, only much more often.

To me, this is not as good a methodology as I'd hoped.
It simulates the most common real-world power loss case, but I'd say
it's less reliable for pinning down the incorrect behavior.
(And there's extra time wasted on POST, booting into the OS and things
like that.)

My idea is an SBC-based controller controlling the power cable of the
disk, and another system (or the same SBC if it supports SATA) running a
regular workload, with dm-log-writes recording every write operation.
Then kill the power to the disk.

Then compare the on-disk data against the dm-log-writes log to see how
the data differs.

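Something along these lines, borrowing the dm-log-writes flow xfstests
already uses (a sketch; it assumes the disk under test is /dev/sdb, the
log device is /dev/sdc, a scratch device for replay is /dev/sdd, and the
replay-log tool from xfstests is available):
---
# record every bio submitted to the disk under test
dmsetup create log --table "0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
mkfs.btrfs -f /dev/mapper/log
mount /dev/mapper/log /mnt/test
# ... run the regular workload here, then cut power to /dev/sdb via the GPIO switch ...

# afterwards, replay the recorded writes onto the scratch device and compare
# the replayed state at each flush/FUA point with what is actually on /dev/sdb
replay-log --log /dev/sdc --replay /dev/sdd
---
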
From the viewpoint of an end user, this is definitely overkill, but at
least to me, this could prove how bad the firmware is, leaving no excuse
for the vendor to dodge the bullet, and maybe do them a favor by pinning
down the sequence leading to the corruption.

Although there are a lot of untested things which can go wrong:
- How does the kernel handle an unresponsive disk?
- Will dm-log-writes record and handle errors correctly?
- Is there anything special the SATA controller will do?

But at least this is going to be a very interesting project.
I already have a rockpro64 SBC with a SATA PCIe card; I just need to
craft a GPIO-controlled switch to kill SATA power.

>  Hopefully all the hardware involved
> is designed to handle this situation already.  A standard office PC is
> theoretically designed for 1000 cycles (200 working days over 5 years)
> and should be able to test 60 drives (6 SATA ports, 10 sets of drives
> tested 100 cycles each).  The hardware is all standard equipment in any
> IT department.
> 
> You only need special-purpose hardware if the general-purpose stuff
> is failing in ways that aren't interesting (e.g. host RAM is corrupted
> during writes so the drive writes garbage, or the power supply breaks
> before 1000 cycles).  Some people build elaborate hard disk torture
> rigs that mess with input voltages, control temperature and vibration,
> etc. to try to replicate the effects of aging, but these setups
> aren't representative of typical end-user environments and the results
> will only be interesting to hardware makers.
> 
> We expect most drives to work and it seems that they do most of the
> time--it is the drives that fail most frequently that are interesting.
> The drives that fail most frequently are also the easiest to identify
> in testing--by definition, they will reproduce failures faster than
> the others.
> 
> Even if there is an intermittent firmware bug that only appears under
> rare conditions, if it happens with lower probability than drive hardware
> failure then it's not particularly important.  The target hardware failure
> rate for hard drives is 0.1% over the warranty period according to the
> specs for many models.  If one drive's hardware is going to fail
> with p < 0.001, then maybe the firmware bug makes it lose data at p =
> 0.00075 instead of p = 0.00050.  Users won't care about this--they'll
> use RAID to contain the damage, or just accept the failure risks of a
> single-disk system.  Filesystem failures that occur after the drive has
> degraded to