Re: Confused by btrfs quota group accounting
On 2019/6/22 下午11:11, Andrei Borzenkov wrote:
[snip]
>
> 10:/mnt # dd if=/dev/urandom of=test/file bs=1M count=100 seek=0 conv=notrunc
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB, 100 MiB) copied, 0.685532 s, 153 MB/s
> 10:/mnt # sync
> 10:/mnt # btrfs qgroup show .
> qgroupid         rfer         excl
> --------         ----         ----
> 0/5          16.00KiB     16.00KiB
> 0/258         1.01GiB    100.02MiB
> 0/263         1.00GiB     85.02MiB

Sorry, I can't really reproduce it.

5.1.12 kernel, using the following script:
---
#!/bin/bash

dev="/dev/data/btrfs"
mnt="/mnt/btrfs"

umount $dev &> /dev/null
mkfs.btrfs -f $dev > /dev/null

mount $dev $mnt
btrfs sub create $mnt/subv1
btrfs quota enable $mnt
btrfs quota rescan -w $mnt

xfs_io -f -c "pwrite 0 1G" $mnt/subv1/file1
sync
btrfs sub snapshot $mnt/subv1 $mnt/subv2
sync
btrfs qgroup show -prce $mnt

xfs_io -c "pwrite 0 100m" $mnt/subv1/file1
sync
btrfs qgroup show -prce $mnt
---

The result is:
---
Create subvolume '/mnt/btrfs/subv1'
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 262144 ops; 0.5902 sec (1.694 GiB/sec and 444134.2107 ops/sec)
Create a snapshot of '/mnt/btrfs/subv1' in '/mnt/btrfs/subv2'
qgroupid         rfer         excl     max_rfer     max_excl  parent  child
--------         ----         ----     --------     --------  ------  -----
0/5          16.00KiB     16.00KiB         none         none  ---     ---
0/256         1.00GiB     16.00KiB         none         none  ---     ---
0/259         1.00GiB     16.00KiB         none         none  ---     ---
wrote 104857600/104857600 bytes at offset 0
100 MiB, 25600 ops; 0.0694 sec (1.406 GiB/sec and 368652.9766 ops/sec)
qgroupid         rfer         excl     max_rfer     max_excl  parent  child
--------         ----         ----     --------     --------  ------  -----
0/5          16.00KiB     16.00KiB         none         none  ---     ---
0/256         1.10GiB    100.02MiB         none         none  ---     ---
0/259         1.00GiB     16.00KiB         none         none  ---     ---
---

> 10:/mnt # filefrag -v test/file
> Filesystem type is: 9123683e
> File size of test/file is 1073741824 (262144 blocks of 4096 bytes)
>  ext:   logical_offset:   physical_offset:  length:  expected: flags:
>    0:        0..   22463:  315424..  337887:   22464:

There is an initial 10.9MiB extent; I'm not sure how it was created.

>    1:    22464..   25599:   76896..   80031:    3136:    337888:

And here comes another 1.5MiB extent.

From the result of the fiemap, it's definitely not 100MiB written, only about 12.5MiB. The fiemap result doesn't match your dd command.

Any clue how this happened?

Thanks,
Qu

>    2:    25600..   43135:   59264..   76799:   17536:     80032: shared
>    3:    43136..   97279:   86048..  140191:   54144:     76800: shared
>    4:    97280..  151551:  143392..  197663:   54272:    140192: shared
>    5:   151552..  207359:  200736..  256543:   55808:    197664: shared
>    6:   207360..  262143:  258080..  312863:   54784:    256544: last,shared,eof
> test/file: 7 extents found
> 10:/mnt # filefrag -v snap1/file
> Filesystem type is: 9123683e
> File size of snap1/file is 1073741824 (262144 blocks of 4096 bytes)
>  ext:   logical_offset:   physical_offset:  length:  expected: flags:
>    0:        0..   43135:   33664..   76799:   43136:
>    1:    43136..   97279:   86048..  140191:   54144:     76800: shared
>    2:    97280..  151551:  143392..  197663:   54272:    140192: shared
>    3:   151552..  207359:  200736..  256543:   55808:    197664: shared
>    4:   207360..  262143:  258080..  312863:   54784:    256544: last,shared,eof
> snap1/file: 5 extents found
>
>
> Oops. Where does the 85MiB exclusive usage in the snapshot come from? I would expect one of
>
> - 0 exclusive, because the original first extent is still referenced by test (even though partially), so if qgroup counts physical space usage, snap1 effectively refers to the same physical extents as test.
>
> - 100MiB exclusive if qgroup counts logical space consumption, because the snapshot now has 100MiB of different data.
>
> But 85MiB? It does not match any observed value.
> Judging by 1.01GiB of referenced space for subvolume test, qgroup counts physical usage, at which point snapshot exclusive space consumption remains 0.
Re: Confused by btrfs quota group accounting
On 2019/6/23 下午3:55, Qu Wenruo wrote: > > > On 2019/6/22 下午11:11, Andrei Borzenkov wrote: > [snip] >> >> 10:/mnt # dd if=/dev/urandom of=test/file bs=1M count=100 seek=0 >> conv=notrunc >> 100+0 records in >> 100+0 records out >> 104857600 bytes (105 MB, 100 MiB) copied, 0.685532 s, 153 MB/s >> 10:/mnt # sync >> 10:/mnt # btrfs qgroup show . >> qgroupid rfer excl >> >> 0/5 16.00KiB 16.00KiB >> 0/258 1.01GiB100.02MiB >> 0/263 1.00GiB 85.02MiB > > Sorry, I can't really reproduce it. > > 5.1.12 kernel, using the following script: > --- > #!/bin/bash > > dev="/dev/data/btrfs" > mnt="/mnt/btrfs" > > umount $dev &> /dev/null > mkfs.btrfs -f $dev > /dev/null > > mount $dev $mnt > btrfs sub create $mnt/subv1 > btrfs quota enable $mnt > btrfs quota rescan -w $mnt > > xfs_io -f -c "pwrite 0 1G" $mnt/subv1/file1 > sync > btrfs sub snapshot $mnt/subv1 $mnt/subv2 > sync > btrfs qgroup show -prce $mnt > > xfs_io -c "pwrite 0 100m" $mnt/subv1/file1 > sync > btrfs qgroup show -prce $mnt > --- > > The result is: > --- > Create subvolume '/mnt/btrfs/subv1' > wrote 1073741824/1073741824 bytes at offset 0 > 1 GiB, 262144 ops; 0.5902 sec (1.694 GiB/sec and 444134.2107 ops/sec) > Create a snapshot of '/mnt/btrfs/subv1' in '/mnt/btrfs/subv2' > qgroupid rfer excl max_rfer max_excl parent child > -- - > 0/5 16.00KiB 16.00KiB none none --- --- > 0/256 1.00GiB 16.00KiB none none --- --- > 0/259 1.00GiB 16.00KiB none none --- --- > wrote 104857600/104857600 bytes at offset 0 > 100 MiB, 25600 ops; 0.0694 sec (1.406 GiB/sec and 368652.9766 ops/sec) > qgroupid rfer excl max_rfer max_excl parent child > -- - > 0/5 16.00KiB 16.00KiB none none --- --- > 0/256 1.10GiB100.02MiB none none --- --- > 0/259 1.00GiB 16.00KiB none none --- --- > --- > >> 10:/mnt # filefrag -v test/file >> Filesystem type is: 9123683e >> File size of test/file is 1073741824 (262144 blocks of 4096 bytes) My bad, I'm still using 512 bytes as blocksize. If using 4K blocksize, then the fiemap result matches. Then please discard my previous comment. Then we need to check data extents layout to make sure what's going on. Would you please provide the following output? # btrfs ins dump-tree -t 258 /dev/vdb # btrfs ins dump-tree -t 263 /dev/vdb # btrfs check /dev/vdb If the last command reports qgroup mismatch, then it means qgroup is indeed incorrect. Also, I see your subvolume id is not continuous, did you created/removed some other subvolumes during your test? Thanks, Qu >> Oops. Where 85MiB exclusive usage in snapshot comes from? I would expect >> one of >> >> - 0 exclusive, because original first extent is still referenced by test >> (even though partially), so if qgroup counts physical space usage, snap1 >> effectively refers to the same physical extents as test. >> >> - 100MiB exclusive if qgroup counts logical space consumption, because >> snapshot now has 100MiB different data. >> >> But 85MiB? It does not match any observed value. Judging by 1.01GiB of >> referenced space for subvolume test, qrgoup counts physical usage, at >> which point snapshot exclusive space consumption remains 0. >> > signature.asc Description: OpenPGP digital signature
Re: Confused by btrfs quota group accounting
23.06.2019 11:08, Qu Wenruo пишет: > > > On 2019/6/23 下午3:55, Qu Wenruo wrote: >> >> >> On 2019/6/22 下午11:11, Andrei Borzenkov wrote: >> [snip] >>> >>> 10:/mnt # dd if=/dev/urandom of=test/file bs=1M count=100 seek=0 >>> conv=notrunc >>> 100+0 records in >>> 100+0 records out >>> 104857600 bytes (105 MB, 100 MiB) copied, 0.685532 s, 153 MB/s >>> 10:/mnt # sync >>> 10:/mnt # btrfs qgroup show . >>> qgroupid rfer excl >>> >>> 0/5 16.00KiB 16.00KiB >>> 0/258 1.01GiB100.02MiB >>> 0/263 1.00GiB 85.02MiB >> >> Sorry, I can't really reproduce it. >> >> 5.1.12 kernel, using the following script: >> --- >> #!/bin/bash >> >> dev="/dev/data/btrfs" >> mnt="/mnt/btrfs" >> >> umount $dev &> /dev/null >> mkfs.btrfs -f $dev > /dev/null >> >> mount $dev $mnt >> btrfs sub create $mnt/subv1 >> btrfs quota enable $mnt >> btrfs quota rescan -w $mnt >> >> xfs_io -f -c "pwrite 0 1G" $mnt/subv1/file1 >> sync >> btrfs sub snapshot $mnt/subv1 $mnt/subv2 >> sync >> btrfs qgroup show -prce $mnt >> >> xfs_io -c "pwrite 0 100m" $mnt/subv1/file1 >> sync >> btrfs qgroup show -prce $mnt >> --- >> >> The result is: >> --- >> Create subvolume '/mnt/btrfs/subv1' >> wrote 1073741824/1073741824 bytes at offset 0 >> 1 GiB, 262144 ops; 0.5902 sec (1.694 GiB/sec and 444134.2107 ops/sec) >> Create a snapshot of '/mnt/btrfs/subv1' in '/mnt/btrfs/subv2' >> qgroupid rfer excl max_rfer max_excl parent child >> -- - >> 0/5 16.00KiB 16.00KiB none none --- --- >> 0/256 1.00GiB 16.00KiB none none --- --- >> 0/259 1.00GiB 16.00KiB none none --- --- >> wrote 104857600/104857600 bytes at offset 0 >> 100 MiB, 25600 ops; 0.0694 sec (1.406 GiB/sec and 368652.9766 ops/sec) >> qgroupid rfer excl max_rfer max_excl parent child >> -- - >> 0/5 16.00KiB 16.00KiB none none --- --- >> 0/256 1.10GiB100.02MiB none none --- --- >> 0/259 1.00GiB 16.00KiB none none --- --- >> --- >> >>> 10:/mnt # filefrag -v test/file >>> Filesystem type is: 9123683e >>> File size of test/file is 1073741824 (262144 blocks of 4096 bytes) > > My bad, I'm still using 512 bytes as blocksize. > If using 4K blocksize, then the fiemap result matches. > > Then please discard my previous comment. > > Then we need to check data extents layout to make sure what's going on. > > Would you please provide the following output? > # btrfs ins dump-tree -t 258 /dev/vdb > # btrfs ins dump-tree -t 263 /dev/vdb > # btrfs check /dev/vdb > > If the last command reports qgroup mismatch, then it means qgroup is > indeed incorrect. > no error reported. 10:/home/bor # btrfs ins dump-tree -t 258 /dev/vdb btrfs-progs v5.1 file tree key (258 ROOT_ITEM 0) leaf 32505856 items 45 free space 12677 generation 11 owner 258 leaf 32505856 flags 0x1(WRITTEN) backref revision 1 fs uuid d10df0fa-25aa-4d80-89d9-16033ae3392d chunk uuid 1bf7922a-9f98-4c76-8511-77e5605b8112 item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 8 transid 9 size 8 nbytes 0 block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 sequence 13 flags 0x0(none) atime 1561214496.783636682 (2019-06-22 17:41:36) ctime 1561214504.7643132 (2019-06-22 17:41:44) mtime 1561214504.7643132 (2019-06-22 17:41:44) otime 1561214496.783636682 (2019-06-22 17:41:36) item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 index 0 namelen 2 name: .. 
item 2 key (256 DIR_ITEM 1847562484) itemoff 16077 itemsize 34 location key (257 INODE_ITEM 0) type FILE transid 9 data_len 0 name_len 4 name: file item 3 key (256 DIR_INDEX 2) itemoff 16043 itemsize 34 location key (257 INODE_ITEM 0) type FILE transid 9 data_len 0 name_len 4 name: file item 4 key (257 INODE_ITEM 0) itemoff 15883 itemsize 160 generation 9 transid 11 size 1073741824 nbytes 1073741824 block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0 sequence 1136 flags 0x0(none) atime 1561214504.7643132 (2019-06-22 17:41:44) ctime 1561214563.71728522 (2019-06-22 17:42:43) mtime 1561214563.71728522 (2019-06-22 17:42:43) otime 1561214504.7643132 (2019-06-22 17:41:44) item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14 index 2 namelen 4 name: file item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53 generation 11 type 1 (regul
[PATCH] btrfs-progs: misc-tests/029: exit manually after run_mayfail()
From: Su Yue

Since commit 8dd3e5dc2df5 ("btrfs-progs: tests: fix misc-tests/029 to run on NFS") added compatibility with NFS, the test calls run_mayfail() at the end. However, run_mayfail() always returns the original return code. If the test case is not running on NFS, the last `run_mayfail rmdir "$SUBVOL_MNT"` fails with return value 1 and the whole test fails:

====== RUN MAYFAIL rmdir btrfs-progs/tests/misc-tests/029-send-p-different-mountpoints/subvol_mnt
rmdir: failed to remove 'btrfs-progs/tests/misc-tests/029-send-p-different-mountpoints/subvol_mnt': No such file or directory
failed (ignored, ret=1): rmdir btrfs-progs/tests/misc-tests/029-send-p-different-mountpoints/subvol_mnt
test failed for case 029-send-p-different-mountpoints

Every command in this script handles its errors well, so do `exit 0` manually at the end.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=202645
Fixes: 8dd3e5dc2df5 ("btrfs-progs: tests: fix misc-tests/029 to run on NFS")
Signed-off-by: Su Yue
---
 tests/misc-tests/029-send-p-different-mountpoints/test.sh | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tests/misc-tests/029-send-p-different-mountpoints/test.sh b/tests/misc-tests/029-send-p-different-mountpoints/test.sh
index e092f8bba31e..d2b5e693f2d7 100755
--- a/tests/misc-tests/029-send-p-different-mountpoints/test.sh
+++ b/tests/misc-tests/029-send-p-different-mountpoints/test.sh
@@ -49,3 +49,6 @@
 run_check_umount_test_dev "$TEST_MNT"
 run_mayfail $SUDO_HELPER rmdir "$SUBVOL_MNT"
 run_mayfail rmdir "$SUBVOL_MNT"
+# run_mayfail() may fail with nonzero value returned which causes failure
+# of this case. Do exit manually.
+exit 0
--
2.22.0
Re: Confused by btrfs quota group accounting
On 2019/6/23 下午6:15, Andrei Borzenkov wrote:
[snip]
>> If the last command reports qgroup mismatch, then it means qgroup is indeed incorrect.
>
> no error reported.

Then it's not a bug, and should be caused by btrfs extent booking behavior.

> 10:/home/bor # btrfs ins dump-tree -t 258 /dev/vdb
> btrfs-progs v5.1
> file tree key (258 ROOT_ITEM 0)

> item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
>     index 2 namelen 4 name: file

The inode we care about.

> item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
>     generation 11 type 1 (regular)
>     extent data disk byte 1291976704 nr 46137344
>     extent data offset 0 nr 46137344 ram 46137344

A 44MiB extent; this should be exclusive to subvol 258.

> item 7 key (257 EXTENT_DATA 46137344) itemoff 15763 itemsize 53
>     generation 11 type 1 (regular)
>     extent data disk byte 1338114048 nr 45875200
>     extent data offset 0 nr 45875200 ram 45875200

Another 43.75MiB extent, also exclusive to 258.

> item 8 key (257 EXTENT_DATA 92012544) itemoff 15710 itemsize 53
>     generation 11 type 1 (regular)
>     extent data disk byte 314966016 nr 262144
>     extent data offset 0 nr 262144 ram 262144

Another 0.25MiB extent, also exclusive.

> item 9 key (257 EXTENT_DATA 92274688) itemoff 15657 itemsize 53
>     generation 11 type 1 (regular)
>     extent data disk byte 315228160 nr 12582912
>     extent data offset 0 nr 12582912 ram 12582912

Another 12.0MiB extent, also exclusive.

BTW, so many fragmented extents normally means your system is under very high memory pressure, short on memory, or short on on-disk space. The 100MiB above should be one large extent, not split into so many small ones.

So 258 has 100MiB of exclusive extents. No problem so far.

> item 10 key (257 EXTENT_DATA 104857600) itemoff 15604 itemsize 53
>     generation 9 type 1 (regular)
>     extent data disk byte 227016704 nr 43515904
>     extent data offset 15728640 nr 27787264 ram 43515904

From this item on, the data extent at 227016704 (len 41.5MiB) and the following ones are all shared with another subvolume. You can just search for the bytenr 227016704, which also shows up in subvol 263.

[snip]

> file tree key (263 ROOT_ITEM 10)

> item 5 key (257 INODE_REF 256) itemoff 15869 itemsize 14
>     index 2 namelen 4 name: file

Starting from here, that's the inode we care about.

> item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
>     generation 9 type 1 (regular)
>     extent data disk byte 137887744 nr 43778048
>     extent data offset 0 nr 43778048 ram 43778048

Exclusive, 41.75MiB.

> item 7 key (257 EXTENT_DATA 43778048) itemoff 15763 itemsize 53
>     generation 9 type 1 (regular)
>     extent data disk byte 181665792 nr 1310720
>     extent data offset 0 nr 1310720 ram 1310720

Exclusive, 1.25MiB.

> item 8 key (257 EXTENT_DATA 45088768) itemoff 15710 itemsize 53
>     generation 9 type 1 (regular)
>     extent data disk byte 182976512 nr 43778048
>     extent data offset 0 nr 43778048 ram 43778048

Exclusive, 41.75MiB.

> item 9 key (257 EXTENT_DATA 88866816) itemoff 15657 itemsize 53
>     generation 9 type 1 (regular)
>     extent data disk byte 226754560 nr 262144
>     extent data offset 0 nr 262144 ram 262144
>     extent compression 0 (none)

This data extent is shared between subvols 258 and 263. The difference is that subvol 258 only shares part of the extent, while 263 uses the full extent.

Btrfs qgroup calculates exclusive based on extents, not bytes, so even if only part of an extent is shared, the whole extent is counted as shared.

So for subvol 263, your exclusive is 41.75 + 1.25 + 41.75 = 84.75 MiB.

In short, because qgroup works at the extent level, not the byte level, you will see strange-looking numbers. E.g. for my previous script, on a system with enough free memory, if you only write 100MiB, which is smaller than the data extent size limit (128MiB), only one subvolume gets 100MiB exclusive while the other one has no exclusive space (except the 16K leaf). But if you write 128MiB, exactly at the extent size limit, then both subvolumes end up with 128MiB exclusive.

Thanks,
Qu

> item 10 key (257 EXTENT_DATA 89128960) itemoff 15604 itemsize 53
>     generation 9 type 1 (regular)
>     extent data disk byte 227016704 nr 43515904
>     extent data offset 0 nr 43515904 ram 43515904
>     extent compression 0 (none)

[snip]

>
>> Also, I see your subvolume ids are not contiguous; did you create/remove some other subvolumes during your test?
>
> No. At least on this filesystem. I have recreated it several times, but since the last mkfs these were the only two subvolu
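To see the difference in practice, the reproducer script above can be rerun with only the final write size changed. This is a minimal sketch, reusing the same mount point placeholder as the earlier script:

---
mnt="/mnt/btrfs"   # same mount point as in the reproducer script above

# Overwrite exactly 128MiB - the data extent size limit - so the first
# original extent of file1 is fully replaced instead of partially shared.
xfs_io -c "pwrite 0 128m" $mnt/subv1/file1
sync
btrfs qgroup show -prce $mnt

# Per the explanation above, both subvolumes should now show roughly
# 128MiB exclusive, instead of 100MiB for subv1 and 16KiB for subv2.
---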
Recover files from broken btrfs
Hi all I have a ReadyNAS device with 4 4TB disks. It was working all right for couple of years. At one point the system became read-only, and after reboot data is inaccessible. Can anyone give some advise how to recover data from the file system? system details are root@Dyskietka:~# uname -a Linux Dyskietka 4.4.116.armada.1 #1 SMP Mon Feb 19 22:05:00 PST 2018 armv7l GNU/Linux root@Dyskietka:~# btrfs --version btrfs-progs v4.12 root@Dyskietka:~# btrfs fi show Label: '2fe4f8e6:data' uuid: 0970e8c4-fd47-43d3-aa93-593006e3d0c3 Total devices 1 FS bytes used 8.11TiB devid1 size 10.90TiB used 8.11TiB path /dev/md127 root@Dyskietka:~# btrfs fi df /dev/md127 ERROR: not a btrfs filesystem: /dev/md127 dmesg [Sat Jun 1 13:18:48 2019] md/raid:md1: device sdc2 operational as raid disk 0 [Sat Jun 1 13:18:48 2019] md/raid:md1: device sda2 operational as raid disk 3 [Sat Jun 1 13:18:48 2019] md/raid:md1: device sdb2 operational as raid disk 2 [Sat Jun 1 13:18:48 2019] md/raid:md1: device sdd2 operational as raid disk 1 [Sat Jun 1 13:18:48 2019] md/raid:md1: allocated 4294kB [Sat Jun 1 13:18:48 2019] md/raid:md1: raid level 6 active with 4 out of 4 devices, algorithm 2 [Sat Jun 1 13:18:48 2019] RAID conf printout: [Sat Jun 1 13:18:48 2019] --- level:6 rd:4 wd:4 [Sat Jun 1 13:18:48 2019] disk 0, o:1, dev:sdc2 [Sat Jun 1 13:18:48 2019] disk 1, o:1, dev:sdd2 [Sat Jun 1 13:18:48 2019] disk 2, o:1, dev:sdb2 [Sat Jun 1 13:18:48 2019] disk 3, o:1, dev:sda2 [Sat Jun 1 13:18:48 2019] md1: detected capacity change from 0 to 1071644672 [Sat Jun 1 13:18:49 2019] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null) [Sat Jun 1 13:18:49 2019] systemd[1]: Failed to insert module 'kdbus': Function not implemented [Sat Jun 1 13:18:49 2019] systemd[1]: systemd 230 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN) [Sat Jun 1 13:18:49 2019] systemd[1]: Detected architecture arm. [Sat Jun 1 13:18:49 2019] systemd[1]: Set hostname to . [Sat Jun 1 13:18:50 2019] systemd[1]: Started Dispatch Password Requests to Console Directory Watch. [Sat Jun 1 13:18:50 2019] systemd[1]: Listening on /dev/initctl Compatibility Named Pipe. [Sat Jun 1 13:18:50 2019] systemd[1]: Listening on Journal Socket (/dev/log). [Sat Jun 1 13:18:50 2019] systemd[1]: Listening on udev Kernel Socket. [Sat Jun 1 13:18:50 2019] systemd[1]: Created slice System Slice. [Sat Jun 1 13:18:50 2019] systemd[1]: Created slice system-serial\x2dgetty.slice. [Sat Jun 1 13:18:50 2019] systemd[1]: Reached target Encrypted Volumes. [Sat Jun 1 13:18:50 2019] systemd[1]: Started Forward Password Requests to Wall Directory Watch. [Sat Jun 1 13:18:50 2019] systemd[1]: Listening on Journal Socket. [Sat Jun 1 13:18:50 2019] systemd[1]: Started ReadyNAS LCD splasher. [Sat Jun 1 13:18:50 2019] systemd[1]: Starting ReadyNASOS system prep... [Sat Jun 1 13:18:50 2019] systemd[1]: Mounting RPC Pipe File System... [Sat Jun 1 13:18:50 2019] systemd[1]: Starting Remount Root and Kernel File Systems... [Sat Jun 1 13:18:50 2019] systemd[1]: Starting Create list of required static device nodes for the current kernel... [Sat Jun 1 13:18:50 2019] systemd[1]: Mounting POSIX Message Queue File System... [Sat Jun 1 13:18:50 2019] systemd[1]: Created slice User and Session Slice. [Sat Jun 1 13:18:50 2019] systemd[1]: Reached target Slices. [Sat Jun 1 13:18:50 2019] systemd[1]: Created slice system-getty.slice. 
[Sat Jun 1 13:18:50 2019] systemd[1]: Starting Load Kernel Modules... [Sat Jun 1 13:18:50 2019] systemd[1]: Mounting RPC Pipe File System... [Sat Jun 1 13:18:50 2019] systemd[1]: Starting Journal Service... [Sat Jun 1 13:18:50 2019] systemd[1]: Listening on udev Control Socket. [Sat Jun 1 13:18:50 2019] systemd[1]: Reached target Paths. [Sat Jun 1 13:18:51 2019] systemd[1]: Mounted RPC Pipe File System. [Sat Jun 1 13:18:51 2019] systemd[1]: Mounted RPC Pipe File System. [Sat Jun 1 13:18:51 2019] systemd[1]: Mounted POSIX Message Queue File System. [Sat Jun 1 13:18:51 2019] systemd[1]: Started ReadyNASOS system prep. [Sat Jun 1 13:18:51 2019] systemd[1]: Started Remount Root and Kernel File Systems. [Sat Jun 1 13:18:51 2019] systemd[1]: Started Create list of required static device nodes for the current kernel. [Sat Jun 1 13:18:51 2019] systemd[1]: Started Load Kernel Modules. [Sat Jun 1 13:18:51 2019] systemd[1]: Starting Apply Kernel Variables... [Sat Jun 1 13:18:51 2019] systemd[1]: Mounting FUSE Control File System... [Sat Jun 1 13:18:51 2019] systemd[1]: Mounting Configuration File System... [Sat Jun 1 13:18:51 2019] systemd[1]: Starting Create Static Device Nodes in /dev... [Sat Jun 1 13:18:51 2019] systemd[1]: Starting Load/Save Random Seed... [Sat Jun 1 13:18:51 2019] systemd[1]: Starting Rebuild Hardware Database... [Sat Jun 1 13:18:51 2019] systemd[1]: Mounted Configuration File System.
Re: Confused by btrfs quota group accounting
23.06.2019 14:29, Qu Wenruo wrote:
>
> BTW, so many fragmented extents, this normally means your system has very high memory pressure or lack of memory, or lack of on-disk space.

It is a 1GiB QEMU VM with vanilla Tumbleweed and a GNOME desktop; nothing runs except the user GNOME session. Does that fit the "high memory pressure" definition?

> Above 100MiB should be in one large extent, not split into so many small ones.

OK, so this is where I was confused. I was sure that filefrag returns the true "physical" extent layout. It seems that in filefrag output consecutive extents are merged, giving the false picture of a few large extents instead of many small ones. Filefrag shows 5 ~200MiB extents, not over 30 smaller ones.

Is this how the generic ioctl is designed to work, or is it something that btrfs does internally? This *is* confusing.

In any case, thank you for the clarification; this makes sense now.
Re: Confused by btrfs quota group accounting
On 2019/6/23 下午9:42, Andrei Borzenkov wrote:
> 23.06.2019 14:29, Qu Wenruo wrote:
>>
>> BTW, so many fragmented extents, this normally means your system has very high memory pressure or lack of memory, or lack of on-disk space.
>
> It is a 1GiB QEMU VM with vanilla Tumbleweed and a GNOME desktop; nothing runs except the user GNOME session. Does that fit the "high memory pressure" definition?

1GiB of RAM for the VM? That makes it very easy to trigger memory pressure.

I'm not 100% sure about the percentage of memory the page cache can use, but 1/8 would be a safe guess. That means you can only write at most about 128M before triggering writeback, and considering other programs use some page cache, you have even less available for the filesystem.

>
>> Above 100MiB should be in one large extent, not split into so many small ones.
>
> OK, so this is where I was confused. I was sure that filefrag returns the true "physical" extent layout. It seems that in filefrag output consecutive extents are merged, giving the false picture of a few large extents instead of many small ones. Filefrag shows 5 ~200MiB extents, not over 30 smaller ones.

Btrfs merges file extent mappings at fiemap reporting time.

Personally speaking, I don't know why a user would expect the real extent mapping. As long as all these extents have the same flags and contiguous addresses, merging them shouldn't be a problem.

In fact, after viewing the real on-disk extent mapping, it's possible to explain the accounting just from the fiemap result, but the fiemap result alone makes such a case much harder to expose.

For your use case, which already digs deep into the implementation, I recommend low-level tools like "btrfs ins dump-tree" to discover the underlying extent mapping.

>
> Is this how the generic ioctl is designed to work, or is it something that btrfs does internally? This *is* confusing.
>
> In any case, thank you for the clarification; this makes sense now.

AFAIK we should put this into the btrfs(5) man page as a special case, along with btrfs extent booking.

Thanks,
Qu
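As a quick illustration of the two views, the merged fiemap output and the on-disk extent items can be compared directly. The paths and the subvolume id below are taken from the reproducer earlier in this thread; adjust them for your filesystem:

---
# Merged view: btrfs may merge adjacent, same-flag mappings at fiemap time.
filefrag -v /mnt/btrfs/subv1/file1

# On-disk view: one EXTENT_DATA item per real file extent (256 is the
# subvolume id from the reproducer; the device path is the one used there).
btrfs inspect-internal dump-tree -t 256 /dev/data/btrfs | grep -A3 'EXTENT_DATA'
---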
Re: Global reserve and ENOSPC while deleting snapshots on 5.0.9 - still happens on 5.1.11
On Tue, Apr 23, 2019 at 07:06:51PM -0400, Zygo Blaxell wrote: > I had a test filesystem that ran out of unallocated space, then ran > out of metadata space during a snapshot delete, and forced readonly. > The workload before the failure was a lot of rsync and bees dedupe > combined with random snapshot creates and deletes. Had this happen again on a production filesystem, this time on 5.1.11, and it happened during orphan inode cleanup instead of snapshot delete: [14303.076134][T20882] BTRFS: error (device dm-21) in add_to_free_space_tree:1037: errno=-28 No space left [14303.076144][T20882] BTRFS: error (device dm-21) in __btrfs_free_extent:7196: errno=-28 No space left [14303.076157][T20882] BTRFS: error (device dm-21) in btrfs_run_delayed_refs:3008: errno=-28 No space left [14303.076203][T20882] BTRFS error (device dm-21): Error removing orphan entry, stopping orphan cleanup [14303.076210][T20882] BTRFS error (device dm-21): could not do orphan cleanup -22 [14303.076281][T20882] BTRFS error (device dm-21): commit super ret -30 [14303.357337][T20882] BTRFS error (device dm-21): open_ctree failed Same fix: I bumped the reserved size limit from 512M to 2G and mounted normally. (OK, technically, I booted my old 5.0.21 kernel--but my 5.0.21 kernel has the 2G reserved space patch below in it.) I've not been able to repeat this ENOSPC behavior under test conditions in the last two months of trying, but it's now happened twice in different places, so it has non-zero repeatability. > I tried the usual fix strategies: > > 1. Immediately after mount, try to balance to free space for > metadata > > 2. Immediately after mount, add additional disks to provide > unallocated space for metadata > > 3. Mount -o nossd to increase metadata density > > #3 had no effect. #1 failed consistently. > > #2 was successful, but the additional space was not used because > btrfs couldn't allocate chunks for metadata because it ran out of > metadata space for new metadata chunks. > > When btrfs-cleaner tried to remove the first pending deleted snapshot, > it started a transaction that failed due to lack of metadata space. > Since the transaction failed, the filesystem reverts to its earlier state, > and exactly the same thing happens on the next mount. The 'btrfs dev > add' in #2 is successful only if it is executed immediately after mount, > before the btrfs-cleaner thread wakes up. 
> > Here's what the kernel said during one of the attempts: > > [41263.822252] BTRFS info (device dm-3): use zstd compression, level 0 > [41263.825135] BTRFS info (device dm-3): using free space tree > [41263.827319] BTRFS info (device dm-3): has skinny extents > [42046.463356] [ cut here ] > [42046.463387] BTRFS: error (device dm-3) in __btrfs_free_extent:7056: > errno=-28 No space left > [42046.463404] BTRFS: error (device dm-3) in __btrfs_free_extent:7056: > errno=-28 No space left > [42046.463407] BTRFS info (device dm-3): forced readonly > [42046.463414] BTRFS: error (device dm-3) in > btrfs_run_delayed_refs:3011: errno=-28 No space left > [42046.463429] BTRFS: error (device dm-3) in > btrfs_create_pending_block_groups:10517: errno=-28 No space left > [42046.463548] BTRFS: error (device dm-3) in > btrfs_create_pending_block_groups:10520: errno=-28 No space left > [42046.471363] BTRFS: error (device dm-3) in > btrfs_run_delayed_refs:3011: errno=-28 No space left > [42046.471475] BTRFS: error (device dm-3) in > btrfs_create_pending_block_groups:10517: errno=-28 No space left > [42046.471506] BTRFS: error (device dm-3) in > btrfs_create_pending_block_groups:10520: errno=-28 No space left > [42046.473672] BTRFS: error (device dm-3) in btrfs_drop_snapshot:9489: > errno=-28 No space left > [42046.475643] WARNING: CPU: 0 PID: 10187 at > fs/btrfs/extent-tree.c:7056 __btrfs_free_extent+0x364/0xf60 > [42046.475645] Modules linked in: mq_deadline bfq dm_cache_smq dm_cache > dm_persistent_data dm_bio_prison dm_bufio joydev ppdev crct10dif_pclmul > crc32_pclmul crc32c_intel ghash_clmulni_intel dm_mod snd_pcm aesni_intel > aes_x86_64 snd_timer crypto_simd cryptd glue_helper sr_mod snd cdrom psmouse > sg soundcore input_leds pcspkr serio_raw ide_pci_generic i2c_piix4 bochs_drm > parport_pc piix rtc_cmos floppy parport pcc_cpufreq ide_core qemu_fw_cfg > evbug evdev ip_tables x_tables ipv6 crc_ccitt autofs4 > [42046.475677] CPU: 0 PID: 10187 Comm: btrfs-transacti Tainted: GB > W 5.0.8-zb64-10a85e8a1569+ #1 > [42046.475678] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS 1.10.2-1 04/01/2014 > [42046.475681] RIP: 0010:__btrfs_free_extent+0x364/0xf60 > [42046.475684] Code: 50 f0 48 0f ba a8 90 22 00 00 02 72 1f 8b 85 88 fe > ff ff 83 f8 fb 0f 84 59 04 00 00 89 c6 48 c7 c7 00
Per-entry or per-subvolume physical location settings
Greetings!

When using btrfs with multiple devices in "single" mode, is it possible to force some files and directories onto one drive and some onto the other? Or at least specify "single" mode on a specific device for some directories and "DUP" for some others?

The following scenario, if it is possible, would make btrfs even cooler than bcachefs:

- Single mode, multiple devices - a HDD and an SSD
- Force system folders to be located on the SSD, or on both - those see few writes and need speed
- And still manage snapshots easily in one place, instead of two top-level directories and a lot of symlinks!

Maybe it is at least possible to do it on a per-subvolume basis? This would still be better than mounting subvolumes from different devices.

I remember reading somewhere in the early days of btrfs that the main advantage of btrfs over LVM is that btrfs knows about your files, which makes such things possible - even at the file structure level. However, I could not find anything about it, while LVM allows per-volume physical location settings, if I remember correctly.

--
Valery Plotnikov
btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
On Thu, Jun 20, 2019 at 01:00:50PM +0800, Qu Wenruo wrote: > On 2019/6/20 上午7:45, Zygo Blaxell wrote: > > On Sun, Jun 16, 2019 at 12:05:21AM +0200, Claudius Winkel wrote: > >> What should I do now ... to use btrfs safely? Should i not use it with > >> DM-crypt > > > > You might need to disable write caching on your drives, i.e. hdparm -W0. > > This is quite troublesome. > > Disabling write cache normally means performance impact. The drives I've found that need write cache disabled aren't particularly fast to begin with, so disabling write cache doesn't harm their performance very much. All the speed gains of write caching are lost when someone has to spend time doing a forced restore from backup after transid-verify failure. If you really do need performance, there are drives with working firmware available that don't cost much more. > And disabling it normally would hide the true cause (if it's something > btrfs' fault). This is true; however, even if a hypothetical btrfs bug existed, disabling write caching is an immediately deployable workaround, and there's currently no other solution other than avoiding drives with bad firmware. There could be improvements possible for btrfs to work around bad firmware...if someone's willing to donate their sanity to get inside the heads of firmware bugs, and can find a way to fix it that doesn't make things worse for everyone with working firmware. > > I have a few drives in my collection that don't have working write cache. > > They are usually fine, but when otherwise minor failure events occur (e.g. > > bad cables, bad power supply, failing UNC sectors) then the write cache > > doesn't behave correctly, and any filesystem or database on the drive > > gets trashed. > > Normally this shouldn't be the case, as long as the fs has correct > journal and flush/barrier. If you are asking the question: "Are there some currently shipping retail hard drives that are orders of magnitude more likely to corrupt data after simple power failures than other drives?" then the answer is: "Hell, yes! How could there NOT be?" It wouldn't take very much capital investment or time to find this out in lab conditions. Just kill power every 25 minutes while running a btrfs stress-test should do it--or have a UPS hardware failure in ops, the effect is the same. Bad drives will show up in a few hours, good drives take much longer--long enough that, statistically, the good drives will probably fail outright before btrfs gets corrupted. > If it's really the hardware to blame, then it means its flush/fua is not > implemented properly at all, thus the possibility of a single power loss > leading to corruption should be VERY VERY high. That exactly matches my observations. Only a few disks fail at all, but the ones that do fail do so very often: 60% of corruptions at 10 power failures or less, 100% at 30 power failures or more. > > This isn't normal behavior, but the problem does affect > > the default configuration of some popular mid-range drive models from > > top-3 hard disk vendors, so it's quite common. > > Would you like to share the info and test methodology to determine it's > the device to blame? (maybe in another thread) It's basic data mining on operations failure event logs. We track events like filesystem corruption, data loss, other hardware failure, operator errors, power failures, system crashes, dmesg error messages, etc., and count how many times each failure occurs in systems with which hardware components. 
When a failure occurs, we break the affected system apart and place its components into other systems or test machines to isolate which component is causing the failure (e.g. a failing power supply could create RAM corruption events and disk failure events, so we move the hardware around to see where the failure goes). If the same component is involved in repeatable failure events, the correlation jumps out of the data and we know that component is bad. We can also do correlations by attributes of the components, i.e. vendor, model, size, firmware revision, manufacturing date, and correlate vendor-model-size-firmware to btrfs transid verify failures across a fleet of different systems.

I can go to the data and get a list of all the drive model and firmware revisions that have been installed in machines with 0 "parent transid verify failed" events since 2014, and are still online today:

	Device Model:     CT240BX500SSD1            Firmware Version: M6CR013
	Device Model:     Crucial_CT1050MX300SSD1   Firmware Version: M0CR060
	Device Model:     HP SSD S700 Pro 256GB     Firmware Version: Q0824G
	Device Model:     INTEL SSDSC2KW256G8       Firmware Version: LHF002C
	Device Model:     KINGSTON SA400S37240G     Firmware Version: R0105A
	Device Model:     ST12000VN0007-2GS116      Firmware Version: SC60
	Device Model:     ST5000VN0001-1SF17X       Firmware Version: AN02
	Device Model:     ST8000VN0002-1Z8112       Firmware Version: SC6
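A minimal sketch of how such an inventory can be collected on each host for central correlation. The device glob, the awk parsing, and the use of journalctl are assumptions; smartctl prints the "Device Model" and "Firmware Version" fields exactly as listed above:

---
#!/bin/bash
# Emit one "device<TAB>model<TAB>firmware" line per disk on this host.
for dev in /dev/sd?; do
    smartctl -i "$dev" | awk -v dev="$dev" -F': *' '
        /^Device Model/     { model = $2 }
        /^Firmware Version/ { fw = $2 }
        END { printf "%s\t%s\t%s\n", dev, model, fw }'
done

# Count transid failures seen in the kernel log since boot, to be joined
# against the inventory in whatever event database is used.
journalctl -k | grep -c 'parent transid verify failed'
---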
Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
On 2019/6/24 上午4:45, Zygo Blaxell wrote: > On Thu, Jun 20, 2019 at 01:00:50PM +0800, Qu Wenruo wrote: >> On 2019/6/20 上午7:45, Zygo Blaxell wrote: >>> On Sun, Jun 16, 2019 at 12:05:21AM +0200, Claudius Winkel wrote: What should I do now ... to use btrfs safely? Should i not use it with DM-crypt >>> >>> You might need to disable write caching on your drives, i.e. hdparm -W0. >> >> This is quite troublesome. >> >> Disabling write cache normally means performance impact. > > The drives I've found that need write cache disabled aren't particularly > fast to begin with, so disabling write cache doesn't harm their > performance very much. All the speed gains of write caching are lost > when someone has to spend time doing a forced restore from backup after > transid-verify failure. If you really do need performance, there are > drives with working firmware available that don't cost much more. > >> And disabling it normally would hide the true cause (if it's something >> btrfs' fault). > > This is true; however, even if a hypothetical btrfs bug existed, > disabling write caching is an immediately deployable workaround, and > there's currently no other solution other than avoiding drives with > bad firmware. > > There could be improvements possible for btrfs to work around bad > firmware...if someone's willing to donate their sanity to get inside > the heads of firmware bugs, and can find a way to fix it that doesn't > make things worse for everyone with working firmware. > >>> I have a few drives in my collection that don't have working write cache. >>> They are usually fine, but when otherwise minor failure events occur (e.g. >>> bad cables, bad power supply, failing UNC sectors) then the write cache >>> doesn't behave correctly, and any filesystem or database on the drive >>> gets trashed. >> >> Normally this shouldn't be the case, as long as the fs has correct >> journal and flush/barrier. > > If you are asking the question: > > "Are there some currently shipping retail hard drives that are > orders of magnitude more likely to corrupt data after simple > power failures than other drives?" > > then the answer is: > > "Hell, yes! How could there NOT be?" > > It wouldn't take very much capital investment or time to find this out > in lab conditions. Just kill power every 25 minutes while running a > btrfs stress-test should do it--or have a UPS hardware failure in ops, > the effect is the same. Bad drives will show up in a few hours, good > drives take much longer--long enough that, statistically, the good drives > will probably fail outright before btrfs gets corrupted. Now it sounds like we really need some good (more elegant than just random power failure, but more controlled system) way to do such test. > >> If it's really the hardware to blame, then it means its flush/fua is not >> implemented properly at all, thus the possibility of a single power loss >> leading to corruption should be VERY VERY high. > > That exactly matches my observations. Only a few disks fail at all, > but the ones that do fail do so very often: 60% of corruptions at > 10 power failures or less, 100% at 30 power failures or more. > >>> This isn't normal behavior, but the problem does affect >>> the default configuration of some popular mid-range drive models from >>> top-3 hard disk vendors, so it's quite common. >> >> Would you like to share the info and test methodology to determine it's >> the device to blame? (maybe in another thread) > > It's basic data mining on operations failure event logs. 
> > We track events like filesystem corruption, data loss, other hardware > failure, operator errors, power failures, system crashes, dmesg error > messages, etc., and count how many times each failure occurs in systems > with which hardware components. When a failure occurs, we break the > affected system apart and place its components into other systems or > test machines to isolate which component is causing the failure (e.g. a > failing power supply could create RAM corruption events and disk failure > events, so we move the hardware around to see where the failure goes). > If the same component is involved in repeatable failure events, the > correlation jumps out of the data and we know that component is bad. > We can also do correlations by attributes of the components, i.e. vendor, > model, size, firmware revision, manufacturing date, and correlate > vendor-model-size-firmware to btrfs transid verify failures across > a fleet of different systems. > > I can go to the data and get a list of all the drive model and firmware > revisions that have been installed in machines with 0 "parent transid > verify failed" events since 2014, and are still online today: > > Device Model: CT240BX500SSD1 Firmware Version: M6CR013 > Device Model: Crucial_CT1050MX300SSD1 Firmware Version: M0CR060 > Device Model: HP SSD S700 Pro 256GB Firmware Version: Q0824G >
Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
On 2019-06-23 4:45 p.m., Zygo Blaxell wrote: > Model Family: Western Digital Green Device Model: WDC WD20EZRX-00DC0B0 > Firmware Version: 80.00A80 > > Change the query to 1-30 power cycles, and we get another model with > the same firmware version string: > > Model Family: Western Digital Red Device Model: WDC WD40EFRX-68WT0N0 > Firmware Version: 80.00A80 > > > These drives have 0 power fail events between mkfs and "parent transid > verify failed" events, i.e. it's not necessary to have a power failure > at all for these drives to unrecoverably corrupt btrfs. In all cases the > failure occurs on the same days as "Current Pending Sector" and "Offline > UNC sector" SMART events. The WD Black firmware seems to be OK with write > cache enabled most of the time (there's years in the log data without any > transid-verify failures), but the WD Black will drop its write cache when > it sees a UNC sector, and btrfs notices the failure a few hours later. > First, thank you very much for sharing. I've seen you mention several times before problems with common consumer drives, but seeing one specific identified problem firmware version is *very* valuable info. I have a question about the Black Drives dropping the cache on UNC error. If a transid id error like that occurred on a BTRFS RAID 1, would BTRFS find the correct metadata on the 2nd drive, or does it stop dead on 1 transid failure?
Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
On Mon, Jun 24, 2019 at 08:46:06AM +0800, Qu Wenruo wrote: > On 2019/6/24 上午4:45, Zygo Blaxell wrote: > > On Thu, Jun 20, 2019 at 01:00:50PM +0800, Qu Wenruo wrote: > >> On 2019/6/20 上午7:45, Zygo Blaxell wrote: [...] > So the worst scenario really happens in real world, badly implemented > flush/fua from firmware. > Btrfs has no way to fix such low level problem. > > BTW, do you have any corruption using the bad drivers (with write cache) > with traditional journal based fs like XFS/EXT4? Those filesystems don't make full-filesystem data integrity guarantees like btrfs does, and there's no ext4 equivalent of dup metadata for self-repair (even metadata csums in ext4 are a recent invention). Ops didn't record failure events when e2fsck quietly repairs unexpected filesystem inconsistencies. On ext3, maybe data corruption happens because of drive firmware bugs, or maybe the application just didn't use fsync properly. Maybe two disks in md-RAID1 have different contents because they had slightly different IO timings. Who knows? There's no way to tell from passive ops failure monitoring. On btrfs with flushoncommit, every data anomaly (e.g. backups not matching origin hosts, obviously corrupted files, scrub failures, etc) is a distinct failure event. Differences between disk contents in RAID1 arrays are failure events. We can put disks with two different firmware versions in a RAID1 pair, and btrfs will tell us if they disagree, use the correct one to fix the broken one, or tell us they're both wrong and it's time to warm up the backups. In 2013 I had some big RAID10 arrays of WD Green 2TB disks using ext3/4 and mdadm, and there were a *lot* of data corruption events. So many events that we didn't have the capacity to investigate them before new ones came in. File restore requests for corrupted data were piling up faster than they could be processed, and we had no systematic way to tell whether the origin or backup file was correct when they were different. Those problems eventually expedited our migration to btrfs, because btrfs let us do deeper and more uniform data collection to see where all the corruption was coming from. While changing filesystems, we moved all the data onto new disks that happened to not have firmware bugs, and all the corruption abruptly disappeared (well, except for data corrupted by bugs in btrfs itself, but now those are fixed too). We didn't know what was happening until years later when the smaller/cheaper systems had enough failures to make noticeable patterns. I would not be surprised if we were having firmware corruption problems with ext3/ext4 the whole time those RAID10 arrays existed. Alas, we were not capturing firmware revision data at the time (only vendor/model), and we only started capturing firmware revisions after all the old drives were recycled. I don't know exactly what firmware versions were in those arrays...though I do have a short list of suspects. ;) > Btrfs is relying more the hardware to implement barrier/flush properly, > or CoW can be easily ruined. > If the firmware is only tested (if tested) against such fs, it may be > the problem of the vendor. [...] > > WD Green and Black are low-cost consumer hard drives under $250. > > One drive of each size in both product ranges comes to a total price > > of around $1200 on Amazon. Lots of end users will have these drives, > > and some of them will want to use btrfs, but some of the drives apparently > > do not have working write caching. 
We should at least know which ones > > those are, maybe make a kernel blacklist to disable the write caching > > feature on some firmware versions by default. > > To me, the problem isn't for anyone to test these drivers, but how > convincing the test methodology is and how accessible the test device > would be. > > Your statistic has a lot of weight, but it takes you years and tons of > disks to expose it, not something can be reproduced easily. > > On the other hand, if we're going to reproduce power failure quickly and > reliably in a lab enivronment, then how? > Software based SATA power cutoff? Or hardware controllable SATA power cable? You might be overthinking this a bit. Software-controlled switched PDUs (or if you're a DIY enthusiast, some PowerSwitch Tails and a Raspberry Pi) can turn the AC power on and off on a test box. Get a cheap desktop machine, put as many different drives into it as it can hold, start writing test patterns, kill mains power to the whole thing, power it back up, analyze the data that is now present on disk, log the result over the network, repeat. This is the most accurate simulation, since it replicates all the things that happen during a typical end-user's power failure, only much more often. Hopefully all the hardware involved is designed to handle this situation already. A standard office PC is theoretically designed for 1000 cycles (200 working days over 5 years) and should be able to test 60 drives (6 SATA ports, 1
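A rough sketch of the controller side of such a rig. Everything here is a placeholder: "pdu" stands in for whatever switched-PDU CLI is available, "testbox" is the machine holding the drives under test, and /mnt/test is assumed to be mounted automatically at boot:

---
#!/bin/bash
# Power-cycle stress loop: let the test box write for a while, cut mains
# power, bring it back up, then check the filesystem and log the result.
for cycle in $(seq 1 1000); do
    # Start a background write workload on the test box.
    ssh root@testbox 'nohup fio --name=churn --directory=/mnt/test \
        --rw=randwrite --size=4G --loops=1000 >/dev/null 2>&1 &'

    sleep $(( (RANDOM % 1200) + 300 ))   # write for 5-25 minutes

    pdu off testbox-outlet               # hard power cut, like a real outage
    sleep 30
    pdu on testbox-outlet

    # Wait for the box to come back, then scrub and count transid failures.
    until ssh -o ConnectTimeout=5 root@testbox true; do sleep 10; done
    ssh root@testbox 'btrfs scrub start -Bd /mnt/test;
        dmesg | grep -c "parent transid verify failed"' >> results.log
done
---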
Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
On Sun, Jun 23, 2019 at 10:45:50PM -0400, Remi Gauvin wrote: > On 2019-06-23 4:45 p.m., Zygo Blaxell wrote: > > > Model Family: Western Digital Green Device Model: WDC WD20EZRX-00DC0B0 > > Firmware Version: 80.00A80 > > > > Change the query to 1-30 power cycles, and we get another model with > > the same firmware version string: > > > > Model Family: Western Digital Red Device Model: WDC WD40EFRX-68WT0N0 > > Firmware Version: 80.00A80 > > > > > > > These drives have 0 power fail events between mkfs and "parent transid > > verify failed" events, i.e. it's not necessary to have a power failure > > at all for these drives to unrecoverably corrupt btrfs. In all cases the > > failure occurs on the same days as "Current Pending Sector" and "Offline > > UNC sector" SMART events. The WD Black firmware seems to be OK with write > > cache enabled most of the time (there's years in the log data without any > > transid-verify failures), but the WD Black will drop its write cache when > > it sees a UNC sector, and btrfs notices the failure a few hours later. > > > > First, thank you very much for sharing. I've seen you mention several > times before problems with common consumer drives, but seeing one > specific identified problem firmware version is *very* valuable info. > > I have a question about the Black Drives dropping the cache on UNC > error. If a transid id error like that occurred on a BTRFS RAID 1, > would BTRFS find the correct metadata on the 2nd drive, or does it stop > dead on 1 transid failure? Well, the 2nd drive has to have correct metadata--if you are mirroring a pair of disks with the same firmware bug, that's not likely to happen. There is a bench test that will demonstrate the transid verify self-repair procedure: disconnect one half of a RAID1 array, write for a while, then reconnect and do a scrub. btrfs should self-repair all the metadata on the disconnected drive until it all matches the connected one. Some of the data blocks might be hosed though (due to CRC32 collisions), so don't do this test on data you care about. > > signature.asc Description: PGP signature
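As a concrete outline of that bench test (device names are placeholders, and this should only ever be done on scratch data):

---
# 1. With /dev/sdb physically disconnected, keep writing to the degraded array:
mount -o degraded /dev/sda /mnt
dd if=/dev/urandom of=/mnt/churn bs=1M count=512 conv=fsync
umount /mnt

# 2. Reconnect /dev/sdb, mount normally, and let scrub repair the stale copy:
mount /dev/sda /mnt
btrfs scrub start -B /mnt
btrfs dev stats /mnt   # per-device error counters after the repair
---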
Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
On Mon, Jun 24, 2019 at 12:37:51AM -0400, Zygo Blaxell wrote: > On Sun, Jun 23, 2019 at 10:45:50PM -0400, Remi Gauvin wrote: > > On 2019-06-23 4:45 p.m., Zygo Blaxell wrote: > > > > > Model Family: Western Digital Green Device Model: WDC WD20EZRX-00DC0B0 > > > Firmware Version: 80.00A80 > > > > > > Change the query to 1-30 power cycles, and we get another model with > > > the same firmware version string: > > > > > > Model Family: Western Digital Red Device Model: WDC WD40EFRX-68WT0N0 > > > Firmware Version: 80.00A80 > > > > > > > > > > > These drives have 0 power fail events between mkfs and "parent transid > > > verify failed" events, i.e. it's not necessary to have a power failure > > > at all for these drives to unrecoverably corrupt btrfs. In all cases the > > > failure occurs on the same days as "Current Pending Sector" and "Offline > > > UNC sector" SMART events. The WD Black firmware seems to be OK with write > > > cache enabled most of the time (there's years in the log data without any > > > transid-verify failures), but the WD Black will drop its write cache when > > > it sees a UNC sector, and btrfs notices the failure a few hours later. > > > > > > > First, thank you very much for sharing. I've seen you mention several > > times before problems with common consumer drives, but seeing one > > specific identified problem firmware version is *very* valuable info. > > > > I have a question about the Black Drives dropping the cache on UNC > > error. If a transid id error like that occurred on a BTRFS RAID 1, > > would BTRFS find the correct metadata on the 2nd drive, or does it stop > > dead on 1 transid failure? > > Well, the 2nd drive has to have correct metadata--if you are mirroring > a pair of disks with the same firmware bug, that's not likely to happen. OK, I forgot the Black case is a little complicated... I guess if you had two WD Black drives and they had all their UNC sector events at different times, then the btrfs RAID1 repair should still work with write cache enabled. That seems kind of risky, though--what if something bumps the machine and both disks get UNC sectors at once? Alternatives in roughly decreasing order of risk: 1. Disable write caching on both Blacks in the pair 2. Replace both Blacks with drives in the 0-failure list 3. Replace one Black with a Seagate Firecuda or WD Red Pro (any other 0-failure drive will do, but these have similar performance specs to Black) to ensure firmware diversity 4. Find some Black drives with different firmware that have UNC sectors and see what happens with write caching during sector remap events: if they behave well, enable write caching on all drives with matching firmware, disable if not 5. Leave write caching on for now, but as soon as any Black reports UNC sectors or reallocation events in SMART data, turn write caching off for the remainder of the drive's service life. > There is a bench test that will demonstrate the transid verify self-repair > procedure: disconnect one half of a RAID1 array, write for a while, then > reconnect and do a scrub. btrfs should self-repair all the metadata on > the disconnected drive until it all matches the connected one. Some of > the data blocks might be hosed though (due to CRC32 collisions), so > don't do this test on data you care about. > > > > > signature.asc Description: PGP signature
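For options 1 and 5 above (disabling write caching), hdparm works for a one-off change, and a udev rule can make it persist across reboots. This is only a sketch; the model string is the one identified earlier in the thread, and the hdparm path may differ per distribution:

---
# One-off, lost at the next power cycle:
hdparm -W0 /dev/sdX

# /etc/udev/rules.d/69-disable-write-cache.rules (sketch):
# ACTION=="add", SUBSYSTEM=="block", ENV{ID_MODEL}=="WDC_WD20EZRX-00DC0B0", \
#   RUN+="/usr/sbin/hdparm -W0 /dev/%k"
---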
Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
On 2019/6/24 下午12:29, Zygo Blaxell wrote:
[...]
>> Btrfs relies more on the hardware to implement barrier/flush properly, or CoW can easily be ruined.
>> If the firmware is only tested (if tested) against such filesystems, it may be the vendor's problem.
> [...]
>>> WD Green and Black are low-cost consumer hard drives under $250. One drive of each size in both product ranges comes to a total price of around $1200 on Amazon. Lots of end users will have these drives, and some of them will want to use btrfs, but some of the drives apparently do not have working write caching. We should at least know which ones those are, maybe make a kernel blacklist to disable the write caching feature on some firmware versions by default.
>>
>> To me, the problem isn't for anyone to test these drives, but how convincing the test methodology is and how accessible the test device would be.
>>
>> Your statistic has a lot of weight, but it takes you years and tons of disks to expose it, not something that can be reproduced easily.
>>
>> On the other hand, if we're going to reproduce power failure quickly and reliably in a lab environment, then how? Software-based SATA power cutoff? Or a hardware-controllable SATA power cable?
>
> You might be overthinking this a bit. Software-controlled switched PDUs (or if you're a DIY enthusiast, some PowerSwitch Tails and a Raspberry Pi) can turn the AC power on and off on a test box. Get a cheap desktop machine, put as many different drives into it as it can hold, start writing test patterns, kill mains power to the whole thing, power it back up, analyze the data that is now present on disk, log the result over the network, repeat. This is the most accurate simulation, since it replicates all the things that happen during a typical end-user's power failure, only much more often.

To me, this is not an ideal methodology. It simulates the most common real-world power loss case, but I'd say it's less reliable at pinning down the incorrect behavior. (And there is extra time wasted on POST, booting into the OS, and things like that.)

My idea is an SBC-based controller controlling the power cable of the disk, and another system (or the same SBC, if it supports SATA) running a regular workload, with dm-log-writes recording every write operation. Then kill the power to the disk, and compare the on-disk data against the dm-log-writes log to see how the data differs.

From the viewpoint of an end user this is definitely overkill, but at least to me it could prove how bad the firmware is, leaving no excuse for the vendor to dodge the bullet, and maybe do them a favor by pinning down the sequence leading to corruption.

Although there are a lot of untested things which can go wrong:
- How does the kernel handle an unresponsive disk?
- Will dm-log-writes record and handle errors correctly?
- Is there anything special the SATA controller will do?

But at least this is going to be a very interesting project. I already have a rockpro64 SBC with a SATA PCIe card; I just need to craft a GPIO-controlled switch to kill SATA power.

> Hopefully all the hardware involved is designed to handle this situation already. A standard office PC is theoretically designed for 1000 cycles (200 working days over 5 years) and should be able to test 60 drives (6 SATA ports, 10 sets of drives tested 100 cycles each). The hardware is all standard equipment in any IT department.
> > You only need special-purpose hardware if the general-purpose stuff > is failing in ways that aren't interesting (e.g. host RAM is corrupted > during writes so the drive writes garbage, or the power supply breaks > before 1000 cycles). Some people build elaborate hard disk torture > rigs that mess with input voltages, control temperature and vibration, > etc. to try to replicate the effects effects of aging, but these setups > aren't representative of typical end-user environments and the results > will only be interesting to hardware makers. > > We expect most drives to work and it seems that they do most of the > time--it is the drives that fail most frequently that are interesting. > The drives that fail most frequently are also the easiest to identify > in testing--by definition, they will reproduce failures faster than > the others. > > Even if there is an intermittent firmware bug that only appears under > rare conditions, if it happens with lower probability than drive hardware > failure then it's not particularly important. The target hardware failure > rate for hard drives is 0.1% over the warranty period according to the > specs for many models. If one drive's hardware is going to fail > with p < 0.001, then maybe the firmware bug makes it lose data at p = > 0.00075 instead of p = 0.00050. Users won't care about this--they'll > use RAID to contain the damage, or just accept the failure risks of a > single-disk system. Filesystem failures that occur after the drive has > degraded to
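Returning to the dm-log-writes idea above, a minimal sketch of the recording setup. The device paths are assumptions; the table format is the documented one for the dm-log-writes target:

---
#!/bin/bash
# Route all writes through a dm-log-writes target so the on-disk state of
# the disk under test can later be compared against the recorded log.
DATA_DEV=/dev/sdb    # disk under test (assumption)
LOG_DEV=/dev/sdc     # disk holding the write log (assumption)
SECTORS=$(blockdev --getsz "$DATA_DEV")

dmsetup create logged --table "0 $SECTORS log-writes $DATA_DEV $LOG_DEV"
mkfs.btrfs -f /dev/mapper/logged
mount /dev/mapper/logged /mnt

# ... run the regular workload here, then cut power to $DATA_DEV ...

# Afterwards, the replay-log tool shipped with xfstests (src/log-writes/)
# can replay the recorded writes up to any point and compare the result
# with what actually ended up on the disk.
---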