Re: Help recover from btrfs error
On Sat, Apr 17, 2021 at 4:03 PM Florian Franzeck wrote:
>
> Dear users,
>
> I need help to recover from a btrfs error after a power cut
>
> btrfs-progs v5.4.1
>
> Linux banana 5.4.0-72-generic #80-Ubuntu SMP Mon Apr 12 17:35:00 UTC
> 2021 x86_64 x86_64 x86_64 GNU/Linux
>
> dmesg output:
>
> [ 30.330824] BTRFS info (device md1): disk space caching is enabled
> [ 30.330826] BTRFS info (device md1): has skinny extents
> [ 30.341269] BTRFS error (device md1): parent transid verify failed on
> 201818112 wanted 147946 found 147960
> [ 30.342887] BTRFS error (device md1): parent transid verify failed on
> 201818112 wanted 147946 found 147960
> [ 30.344154] BTRFS warning (device md1): failed to read root
> (objectid=4): -5
> [ 30.375400] BTRFS error (device md1): open_ctree failed
>
> Please advise what to do next to recover data on this disk
>
> Thank a lot

https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#parent_transid_verify_failed

This might be repairable with 'btrfs check --repair --init-extent-tree' but it's really slow. It's almost always faster to just mkfs and restore from backups. If you don't have current backups, you shouldn't use this option first because there's a chance it makes things worse and then it's harder to recover the data.

These are safer if you need to first update backups:

Try 'mount -o usebackuproot'

If that doesn't work, there is a very small chance 5.11 or newer will allow you to mount the file system using 'mount -o rescue=usebackuproot,ignorebadroots' which is a lot easier to do recovery on because you can use normal tools to update your backups.

Try btrfs restore:
https://btrfs.wiki.kernel.org/index.php/Restore

This tool is quite dense with features to help isolate what you want to recover. But the most simple command that tries to recover everything that isn't a snapshot:

btrfs restore -vi -D /dev/ /path/to/save/files

It is also possible to use 'btrfs-find-root' and plug in the address for roots (try most recent first, and then go older) into the 'btrfs restore -t' option. Basically you're pointing it to an older root that hopefully doesn't have damage. The further back you go, though, the more stale the trees are and they could have been overwritten. So you pretty much have to try roots in order from most recent, one by one.

Might be easier to ask on irc.freenode.net, #btrfs.

--
Chris Murphy
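Putting the safer options above in one place, a rough sequence (the destination path is a placeholder, not from your report; drop -D once the dry run looks sane):

# read-only mounts first, to refresh backups with normal tools
mount -o ro,usebackuproot /dev/md1 /mnt
# 5.11 or newer only:
mount -o ro,rescue=usebackuproot,ignorebadroots /dev/md1 /mnt

# offline restore; -D is a dry run
btrfs restore -vi -D /dev/md1 /path/to/save/files

# if that fails, list older tree roots and hand them to restore, newest first
btrfs-find-root /dev/md1
btrfs restore -vi -t <bytenr> /dev/md1 /path/to/save/files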
5.12-rc7 occasional btrfs splat when rebooting
I'm not sure with which rc I first saw this appear. I don't recall seeing it with the 5.11 series. There's nothing unusual reported during the subsequent reboot.

[16212.441466] kernel: dnf (7568) used greatest stack depth: 10752 bytes left
[16332.569785] kernel: Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[16337.349525] kernel: rfkill: input handler enabled
[16339.203377] kernel: BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!
[16339.203439] kernel: turning off the locking correctness validator.
[16339.203491] kernel: Please attach the output of /proc/lock_stat to the bug report
[16339.203555] kernel: CPU: 2 PID: 5625 Comm: signal-desktop Not tainted 5.12.0-0.rc7.189.fc35.x86_64+debug #1
[16339.203636] kernel: Hardware name: HP HP Spectre Notebook/81A0, BIOS F.44 11/25/2019
[16339.203698] kernel: Call Trace:
[16339.203723] kernel: dump_stack+0x7f/0xa1
[16339.203762] kernel: __lock_acquire.cold+0x1a9/0x2bf
[16339.203810] kernel: lock_acquire+0xc4/0x3a0
[16339.203850] kernel: ? __delayacct_thrashing_end+0x36/0x60
[16339.203898] kernel: ? mark_held_locks+0x50/0x80
[16339.203938] kernel: _raw_spin_lock_irqsave+0x4d/0x90
[16339.203981] kernel: ? __delayacct_thrashing_end+0x36/0x60
[16339.204030] kernel: __delayacct_thrashing_end+0x36/0x60
[16339.204077] kernel: wait_on_page_bit_common+0x38e/0x490
[16339.204125] kernel: ? add_page_wait_queue+0xf0/0xf0
[16339.204170] kernel: read_extent_buffer_pages+0x55e/0x610
[16339.204222] kernel: btree_read_extent_buffer_pages+0x97/0x110
[16339.204277] kernel: read_tree_block+0x39/0x60
[16339.204314] kernel: btrfs_read_node_slot+0xe3/0x130
[16339.204358] kernel: push_leaf_left+0x98/0x190
[16339.204400] kernel: btrfs_del_items+0x2ba/0x440
[16339.204446] kernel: btrfs_truncate_inode_items+0x254/0xfc0
[16339.204499] kernel: ? _raw_spin_unlock+0x1f/0x30
[16339.204542] kernel: ? btrfs_block_rsv_migrate+0x6d/0xb0
[16339.204589] kernel: btrfs_evict_inode+0x3fe/0x4e0
[16339.204631] kernel: evict+0xcf/0x1d0
[16339.204662] kernel: __dentry_kill+0xe8/0x190
[16339.204697] kernel: ? dput+0x20/0x480
[16339.204729] kernel: dput+0x2b8/0x480
[16339.204758] kernel: __fput+0x102/0x260
[16339.204792] kernel: task_work_run+0x5c/0xa0
[16339.204830] kernel: do_exit+0x3e1/0xc20
[16339.204864] kernel: ? find_held_lock+0x32/0x90
[16339.204903] kernel: ? sched_clock+0x5/0x10
[16339.204938] kernel: ? sched_clock_cpu+0xc/0xb0
[16339.204977] kernel: do_group_exit+0x39/0xb0
[16339.205008] kernel: get_signal+0x16f/0xb00
[16339.205037] kernel: arch_do_signal_or_restart+0xfc/0x750
[16339.205075] kernel: ? finish_task_switch.isra.0+0xa0/0x2c0
[16339.205120] kernel: ? finish_task_switch.isra.0+0x6a/0x2c0
[16339.205165] kernel: ? do_user_addr_fault+0x1ea/0x6b0
[16339.205208] kernel: exit_to_user_mode_prepare+0x15d/0x240
[16339.205253] kernel: ? asm_exc_page_fault+0x8/0x30
[16339.205296] kernel: irqentry_exit_to_user_mode+0x5/0x40
[16339.205343] kernel: asm_exc_page_fault+0x1e/0x30
[16339.205383] kernel: RIP: 0033:0x7f49d11b6674
[16339.205421] kernel: Code: Unable to access opcode bytes at RIP 0x7f49d11b664a.
[16339.205481] kernel: RSP: 002b:7f49ce07f250 EFLAGS: 00010206
[16339.205530] kernel: RAX: 55593f9bc088 RBX: 7f49d11d9140 RCX: 084e
[16339.205602] kernel: RDX: 0c4e RSI: 0099c84e RDI: 267213a2
[16339.205664] kernel: RBP: R08: 7f49ce07f390 R09: 7f49d11d9400
[16339.205720] kernel: R10: 7f49d11aa540 R11: 005a R12: 005a
[16339.205781] kernel: R13: 7f49ce1c5688 R14: 0001 R15:
[16339.626109] kernel: wlp108s0: deauthenticating from f8:a0:97:6e:c7:e8 by local choice (Reason: 3=DEAUTH_LEAVING)
[16340.238863] kernel: kauditd_printk_skb: 93 callbacks suppressed

--
Chris Murphy
Re: Design strangeness of incremental btrfs send/receive
On Fri, Apr 16, 2021 at 9:03 PM Alexandru Stan wrote:
>
> # sending back incrementally (eg: without sending back file-0) fails
> alex@alex-desktop:/mnt% sudo btrfs send bigfs/myvolume-1 -p
> bigfs/myvolume-3|sudo btrfs receive ssdfs/
> At subvol bigfs/myvolume-1
> At snapshot myvolume-1
> ERROR: cannot find parent subvolume

What about using -c instead of -p?

--
Chris Murphy
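If it helps, here's what that would look like against the command quoted above (untested, just swapping -p for -c, with the snapshot that already exists on the destination as the clone source):

sudo btrfs send -c bigfs/myvolume-3 bigfs/myvolume-1 | sudo btrfs receive ssdfs/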
Re: Dead fs on 2 Fedora systems: block=57084067840 write time tree block corruption detected
On Thu, Apr 15, 2021 at 2:04 AM Niccolò Belli wrote:
>
> Full dmesg: https://pastebin.com/pNBhAPS5

This is at initial ro mount time during boot:

[ 4.035226] BTRFS info (device nvme0n1p8): bdev /dev/nvme0n1p8 errs: wr 0, rd 0, flush 0, corrupt 41, gen 0

There are previously detected corruption events. This is just a simple counter. It could be the same corruption encountered 41 times, or it could be 41 separate corrupt blocks. In other words, older logs might have a clue about what first started going wrong.

> I have another laptop with Arch Linux and btrfs, should I be worried
> about it? Maybe it's a Fedora thing?

Both are using upstream stable Btrfs code. I think the focus at this point is on tracking down a hardware cause for the two problems, however unusual that bad luck is; but also there could be a bug (e.g. repair shouldn't crash).

The correct reaction to corruption on Btrfs is to update backups while you still can, while it's still mounted or can be mounted. Then try repair once the underlying problem has been rectified.

--
Chris Murphy
Re: Dead fs on 2 Fedora systems: block=57084067840 write time tree block corruption detected
First computer/file system (from the photo):

[ 136.259984] BTRFS critical (device nvme0n1p8): corrupt leaf: root=257 block=31259951104 slot=9 ino=3244515, name hash mismatch with key, have 0xF22F547D expect 0x92294C62

This is not obviously a bit flip. I'm not sure what's going on here.

Second computer/file system:

[30177.298027] BTRFS critical (device nvme0n1p8): corrupt leaf: root=791 block=57084067840 slot=64 ino=1537855, name hash mismatch with key, have 0xa461adfd expect 0xa461adf5

This is clearly a bit flip. It's likely some kind of hardware-related problem; despite the memory checking already done, it's just rare enough to evade detection with a typical memory tester like memtest86(+). You could try 'memtester' or '7z b 100' and see if you can trigger it.

It's a catch-22 with such a straightforward problem like a bit flip, that it's risky to attempt a repair which can end up causing worse corruption.

What about the mount options for both file systems? (cat /proc/mounts or /etc/fstab)

--
Chris Murphy
Re: Parent transid verify failed (and more): BTRFS for data storage in Xen VM setup
On Sat, Apr 10, 2021 at 8:49 AM Roman Mamedov wrote:
>
> On Sat, 10 Apr 2021 13:38:57 +
> Paul Leiber wrote:
>
> > d) Perhaps the complete BTRFS setup (Xen, VMs, pass through the partition,
> > Samba share) is flawed?
>
> I kept reading and reading to find where you say you unmounted in on the host,
> and then... :)
>
> > e) Perhaps it is wrong to mount the BTRFS root first in the Dom0 and then
> > accessing the subvolumes in the DomU?
>
> Absolutely O.o
>
> Subvolumes are very much like directories, not any kind of subpartitions.

Right. The block device (partition containing the Btrfs file system) must be exclusively used by one kernel, host or guest. Dom0 or DomU. Can't be both.

The only exception I'm aware of is virtiofs or virtio-9p, but I haven't messed with that stuff yet.

--
Chris Murphy
Re: btrfs+overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
Keeping everything else the same, and only reverting to kernel 5.9.16-200.fc33.x86_64, this kernel message

> overlayfs: upper fs does not support xattr, falling back to index=off and
> metacopy=off

no longer appears when I 'podman system reset' or when 'podman build' bolt, using the overlay driver.

However, I do still get

Bail out! ERROR:../tests/test-common.c:1413:test_io_dir_is_empty: 'empty' should be FALSE

--
Chris Murphy
Re: btrfs+overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
On Sat, Apr 10, 2021 at 1:43 PM Chris Murphy wrote:
>
> On Sat, Apr 10, 2021 at 1:42 PM Chris Murphy wrote:
> >
> > On Sat, Apr 10, 2021 at 1:36 PM Chris Murphy
> > wrote:
> > >
> > > $ sudo mount -o remount,userxattr /home
> > > mount: /home: mount point not mounted or bad option.
> > >
> > > [ 92.573364] BTRFS error (device sda6): unrecognized mount option
> > > 'userxattr'
> > >
> >
> > [ 63.320831] BTRFS error (device sda6): unrecognized mount option
> > 'user_xattr'
> >
> > And if I try it with rootflags at boot, boot fails due to mount
> > failure due to unrecognized mount option.
>
> These are all with kernel 5.12-rc6

Ohhh to tmpfs. Hmmm. I have no idea how to do that with this test suite. I'll ask bolt folks.

I'm just good at bumping into walls, obviously.

--
Chris Murphy
Re: btrfs+overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
On Sat, Apr 10, 2021 at 1:42 PM Chris Murphy wrote:
>
> On Sat, Apr 10, 2021 at 1:36 PM Chris Murphy wrote:
> >
> > $ sudo mount -o remount,userxattr /home
> > mount: /home: mount point not mounted or bad option.
> >
> > [ 92.573364] BTRFS error (device sda6): unrecognized mount option
> > 'userxattr'
> >
>
> [ 63.320831] BTRFS error (device sda6): unrecognized mount option
> 'user_xattr'
>
> And if I try it with rootflags at boot, boot fails due to mount
> failure due to unrecognized mount option.

These are all with kernel 5.12-rc6

--
Chris Murphy
Re: btrfs+overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
On Sat, Apr 10, 2021 at 1:36 PM Chris Murphy wrote:
>
> $ sudo mount -o remount,userxattr /home
> mount: /home: mount point not mounted or bad option.
>
> [ 92.573364] BTRFS error (device sda6): unrecognized mount option
> 'userxattr'

[ 63.320831] BTRFS error (device sda6): unrecognized mount option 'user_xattr'

And if I try it with rootflags at boot, boot fails due to mount failure due to unrecognized mount option.

--
Chris Murphy
Re: btrfs+overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
On Sat, Apr 10, 2021 at 11:55 AM Amir Goldstein wrote:
>
> On Sat, Apr 10, 2021 at 8:36 PM Chris Murphy wrote:
> >
> > I can reproduce the bolt testcase problem in a podman container, with
> > overlay driver, using ext4, xfs, and btrfs. So I think I can drop
> > linux-btrfs@ from this thread.
> >
> > Also I can reproduce the title of this thread simply by 'podman system
> > reset' and see the kernel messages before doing the actual reset. I
> > have a strace here of what it's doing:
> >
> > https://drive.google.com/file/d/1L9lEm5n4-d9qemgCq3ijqoBstM-PP1By/view?usp=sharing
> >
>
> I'm confused. The error in the title of the page is from overlayfs mount().
> I see no mount in the strace.
> I feel that I am missing some info.
> Can you provide the overlayfs mount arguments
> and more information about the underlying layers?

Not really? There are none if a container isn't running, and in this case no containers are running, in fact there are no upper or lower dirs because I had already reset podman before doing 'strace podman system reset' - I get the kernel message twice every time I merely do 'podman system reset'

overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off
overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off

This part of the issue might be something of a goose chase. I don't know if it's relevant or distracting.

> > Yep. I think tmpfs supports xattr but not user xattr? And this example
> > is rootless podman, so it's all unprivileged.
> >
>
> OK, so unprivileged overlayfs mount support was added in v5.11
> and it requires opt-in with mount option "userxattr", which could
> explain the problem if tmpfs is used as upper layer.
>
> Do you know if that is the case?
> I sounds to me like it may not be a kernel regression per-se,
> but a regression in the container runtime that started to use
> a new kernel feature?
> Need more context to understand.
>
> Perhaps the solution will be to add user xattr support to tmpfs..

$ sudo mount -o remount,userxattr /home
mount: /home: mount point not mounted or bad option.

[ 92.573364] BTRFS error (device sda6): unrecognized mount option 'userxattr'

/home is effectively a bind mount because it is backed by a btrfs subvolume...

/dev/sda6 on /home type btrfs (rw,noatime,seclabel,compress=zstd:1,ssd,space_cache=v2,subvolid=586,subvol=/home)

...which is mounted via fstab using -o subvol=home

Is it supported to remount,userxattr? If not then maybe this is needed: rootflags=subvol=root,userxattr

--
Chris Murphy
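For what it's worth, a sketch of where userxattr normally applies - it is an overlayfs mount option (added in 5.11 for unprivileged mounts), not a btrfs one, which would be consistent with btrfs rejecting it above; the paths here are made up:

mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work,userxattr /merged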
Re: btrfs+overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
I can reproduce the bolt testcase problem in a podman container, with overlay driver, using ext4, xfs, and btrfs. So I think I can drop linux-btrfs@ from this thread.

Also I can reproduce the title of this thread simply by 'podman system reset' and see the kernel messages before doing the actual reset. I have a strace here of what it's doing:

https://drive.google.com/file/d/1L9lEm5n4-d9qemgCq3ijqoBstM-PP1By/view?usp=sharing

It may be something intentional. The failing testcase, :../tests/test-common.c:1413:test_io_dir_is_empty, also has more instances of this line, but I don't know if they are related. So I'll keep looking into that.

On Sat, Apr 10, 2021 at 2:04 AM Amir Goldstein wrote:

> As the first step, can you try the suggested fix to ovl_dentry_version_inc()
> and/or adding the missing pr_debug() and including those prints in
> your report?

I'll work with bolt upstream and try to further narrow down when it is and isn't happening.

> > I can reproduce this with 5.12.0-0.rc6.184.fc35.x86_64+debug and at
> > approximately the same time I see one, sometimes more, kernel
> > messages:
> >
> > [ 6295.379283] overlayfs: upper fs does not support xattr, falling
> > back to index=off and metacopy=off.
> >
>
> Can you say why there is no xattr support?

I'm not sure. It could be podman specific or fuse-overlayfs related. Maybe something is using /tmp in one case and not another for some reason?

> Is the overlayfs mount executed without privileges to create trusted.* xattrs?
> The answer to that may be the key to understanding the bug.

Yep. I think tmpfs supports xattr but not user xattr? And this example is rootless podman, so it's all unprivileged.

> My guess is it has to do with changes related to mounting overlayfs
> inside userns, but I couldn't find any immediate suspects.
>
> Do you have any idea since when the regression appeared?
> A bisect would have been helpful here.

Yep. All good ideas. Thanks for the fast reply. I'll report back once this has been narrowed down further.

--
Chris Murphy
btrfs+overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
Hi,

The primary problem is Bolt (Thunderbolt 3) tests that are experiencing a regression when run in a container using overlayfs, failing at:

Bail out! ERROR:../tests/test-common.c:1413:test_io_dir_is_empty: 'empty' should be FALSE

https://gitlab.freedesktop.org/bolt/bolt/-/issues/171#note_872119

I can reproduce this with 5.12.0-0.rc6.184.fc35.x86_64+debug and at approximately the same time I see one, sometimes more, kernel messages:

[ 6295.379283] overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.

But I don't know if that kernel message relates to the bolt test failure.

If I run the test outside of a container, it doesn't fail. If I run the test in a podman container using the btrfs driver instead of the overlay driver, it doesn't fail. So it seems like this is an overlayfs bug, but could be some kind of overlayfs+btrfs interaction.

Could this be related and just not yet merged?
https://lore.kernel.org/linux-unionfs/20210309162654.243184-1-amir7...@gmail.com/

Thanks,

--
Chris Murphy
5.12-rc6 splat, MAX_LOCKDEP_CHAIN_HLOCKS too low, Workqueue: btrfs-delalloc btrfs_work_helper
Got this while building bolt in a podman container. I've got reproduce steps and test files here:

https://bugzilla.redhat.com/show_bug.cgi?id=1948054

[ 3229.119497] overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
[ 3229.155339] overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
[ 3238.380647] BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!
[ 3238.380654] turning off the locking correctness validator.
[ 3238.380656] Please attach the output of /proc/lock_stat to the bug report
[ 3238.380657] CPU: 4 PID: 9115 Comm: kworker/u16:20 Not tainted 5.12.0-0.rc6.184.fc35.x86_64+debug #1
[ 3238.380660] Hardware name: Apple Inc. MacBookPro8,2/Mac-94245A3940C91C80, BIOS MBP81.88Z.0050.B00.1804101331 04/10/18
[ 3238.380663] Workqueue: btrfs-delalloc btrfs_work_helper
[ 3238.380670] Call Trace:
[ 3238.380674] dump_stack+0x7f/0xa1
[ 3238.380680] __lock_acquire.cold+0x1a9/0x2bf
[ 3238.380686] ? __lock_acquire+0x3ac/0x1e10
[ 3238.380691] lock_acquire+0xc4/0x3a0
[ 3238.380695] ? percpu_counter_add_batch+0x45/0x60
[ 3238.380699] ? lock_acquire+0xc4/0x3a0
[ 3238.380702] ? lock_is_held_type+0xa7/0x120
[ 3238.380706] ? __set_page_dirty_nobuffers+0x6b/0x1e0
[ 3238.380711] _raw_spin_lock_irqsave+0x4d/0x90
[ 3238.380715] ? percpu_counter_add_batch+0x45/0x60
[ 3238.380718] percpu_counter_add_batch+0x45/0x60
[ 3238.380721] account_page_dirtied+0x102/0x320
[ 3238.380724] __set_page_dirty_nobuffers+0xa2/0x1e0
[ 3238.380727] set_extent_buffer_dirty+0x63/0x80
[ 3238.380732] btrfs_mark_buffer_dirty+0x60/0x80
[ 3238.380737] copy_for_split+0x29e/0x360
[ 3238.380741] split_leaf+0x1c2/0x5e0
[ 3238.380746] btrfs_search_slot+0x99a/0x9f0
[ 3238.380751] btrfs_insert_empty_items+0x58/0xa0
[ 3238.380754] cow_file_range_inline.constprop.0+0x1cf/0x760
[ 3238.380758] ? __local_bh_enable_ip+0x82/0xd0
[ 3238.380762] ? zstd_put_workspace+0x82/0x160
[ 3238.380765] ? __local_bh_enable_ip+0x82/0xd0
[ 3238.380769] compress_file_range+0x471/0x830
[ 3238.380774] async_cow_start+0x12/0x30
[ 3238.380777] ? submit_compressed_extents+0x410/0x410
[ 3238.380779] btrfs_work_helper+0x105/0x400
[ 3238.380782] ? lock_is_held_type+0xa7/0x120
[ 3238.380786] process_one_work+0x2b0/0x5e0
[ 3238.380791] worker_thread+0x55/0x3c0
[ 3238.380793] ? process_one_work+0x5e0/0x5e0
[ 3238.380796] kthread+0x13a/0x150
[ 3238.380799] ? __kthread_bind_mask+0x60/0x60
[ 3238.380801] ret_from_fork+0x1f/0x30

The /proc/lock_stat is in the downstream bug as an attachment.

There's possibly three things going on here, the bogus overlayfs warning, the lockdep bug, and the call trace with btrfs bits in it. No idea if they are related.
Re: Any ideas what these warnings are about?
> >> knlGS:
> >> CS: 0010 DS: ES: CR0: 80050033
> >> CR2: 7f654cf39010 CR3: 03884000 CR4: 003506f0
> >> Call Trace:
> >> btrfs_commit_transaction+0x448/0xbc0 [btrfs]
> >> ? btrfs_wait_ordered_range+0x1b8/0x210 [btrfs]
> >> ? btrfs_sync_file+0x2b8/0x4e0 [btrfs]
> >> btrfs_sync_file+0x343/0x4e0 [btrfs]
> >> __x64_sys_fsync+0x34/0x60
> >> do_syscall_64+0x33/0x40
> >
> > Normally you need to mount -o flushoncommit to trigger this warning.
> > Maybe sync is triggering it too?
>
> I've looked again and yes, this "special" filesystem is mounted
> flushoncommit and discard=async. Would it be better to not set these
> options, for now?

Flushoncommit is safe but noisy in dmesg, and can make things slow; it just depends on the workload. And discard=async is also considered safe, though relatively new. The only way to know for sure is to disable it, and only it, run for some time period to establish "normative" behavior, and then enable only this option and see if behavior changes from the baseline.

If you don't have a heavy write and delete workload, you may not really need discard=async anyway, and a weekly fstrim is generally sufficient for the vast majority of workloads. Conversely, a heavy write and delete workload translates into a backlog of trim that gets issued all at once, once a week, and can make an SSD bog down after it's issued. So you just have to test it with your particular workload to know.

Discard=async exists because a weekly fstrim, and discard=sync, can supply way too much hinting all at once to the drive about what blocks are no longer needed and are ready for garbage collection. But again, it's workload specific, and even hardware specific. Some hardware is sufficiently overprovisioned that there's no benefit to issuing discards at all, and normal usage gives the drive firmware all it needs to know about what blocks are ready for garbage collection (and erasing blocks to prepare them for future writes).

--
Chris Murphy
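If it helps, a hedged example of dropping only the discard option for such a test, then reintroducing just that one change once you have a baseline (the mountpoint is a placeholder):

mount -o remount,nodiscard /path/to/filesystem
# later, to reintroduce only this one option:
mount -o remount,discard=async /path/to/filesystem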
Re: Re[4]: Filesystem sometimes Hangs
On Wed, Mar 31, 2021 at 8:03 AM Hendrik Friedel wrote:
>
> >>[Mo Mär 29 09:29:21 2021] BTRFS info (device sdc2): turning on sync discard
> >
> >Remove the discard mount option for this file system and see if that
> >fixes the problem. Run it for a week or two, or until you're certain
> >the problem is still happening (or certain it's gone). Some drives
> >just can't handle sync discards, they become really slow and hang,
> >just like you're reporting.
>
> In fstab, this option is not set:
> /dev/disk/by-label/DataPool1 /srv/dev-disk-by-label-DataPool1
> btrfs noatime,defaults,nofail 0 2

You have more than one btrfs file system. I'm suggesting not using discard on any of them to try and narrow down the problem. Something is turning on discards for sdc2, find it and don't use it for a while.

> How do I deactivate discard then?
> These drives are spinning disks. I thought that discard is only relevant
> for SSDs?

It's relevant for thin provisioning and sparse files too. But if sdc2 is a HDD then the sync discard message isn't related to the problem, but it also makes me wonder why something is enabling sync discards on a HDD?

Anyway, I think you're on the right track to try 5.11.11, and if you experience a hang again, use sysrq+w and that will dump the blocked task trace into dmesg. Also include a description of the workload at the time of the hang, and recent commands issued.

--
Chris Murphy
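For reference, one way to capture that trace when the hang happens (run as root; sysrq may need to be enabled first on your distro):

echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
# the blocked task traces then show up in dmesg
dmesg > blocked-tasks.txt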
Re: Re[2]: Filesystem sometimes Hangs
On Tue, Mar 30, 2021 at 6:50 AM Hendrik Friedel wrote:
>
> Next
> >'btrfs check --readonly' (must be done offline ie booted from usb
> >stick). And if it all comes up without errors or problems, you can
> >zero the statistics with 'btrfs dev stats -z'.
> No error found. Neither in btrfs check, nor in scrub.
> So, shall I reset the stats then?

Up to you. It's probably better to zero them because it's obvious if the numbers change from 0, there's a problem.

> 5.10.0-0.bpo.3-amd64

It's probably OK. I'm not sure what upstream stable version this translates into, but current stable are 5.10.27 and 5.11.11. There have been multiple btrfs bug fixes since 5.10.0 was released.

I missed in your first email this line:

> [Mo Mär 29 09:29:21 2021] BTRFS info (device sdc2): turning on sync discard

Remove the discard mount option for this file system and see if that fixes the problem. Run it for a week or two, or until you're certain the problem is still happening (or certain it's gone). Some drives just can't handle sync discards, they become really slow and hang, just like you're reporting.

It's probably adequate to just enable the fstrim.timer, part of util-linux, which runs once per week. If you have really heavy write and delete workloads, you might benefit from the discard=async mount option (async instead of sync). But first you should just not do any discards at all for a while to see if that's the problem, and then deliberately re-introduce just that one single change so you can monitor it for problems.

--
Chris Murphy
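A minimal sketch of those two steps (the mount point is an example, substitute yours):

# zero the per-device error counters so any new errors stand out
btrfs device stats -z /srv/dev-disk-by-label-DataPool1
# weekly trim instead of a discard mount option
systemctl enable --now fstrim.timer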
Re: Support demand on Btrfs crashed fs.
I'm going to fill in some details from the multiday conversation with IRC regulars. We couldn't figure out a way forward.

* WDC Red with Firmware Version: 80.00A80, which is highly suspected to deal with power fail and write caching incorrectly, and at least on Btrfs apparently pretty much always drops writes for critical metadata.
* A power fail / reset happened
* No snapshots
* --repair and --init-extent-tree may not have done anything because they didn't complete
* Less than 10% needs to be recovered and it's accepted that it can't be repaired. The focus is just on a limited restore, but we can't get past the transid failures.

zapan@UBUNTU-SERVER:~$ sudo btrfs check --readonly /dev/md0
Opening filesystem to check...
parent transid verify failed on 23079040831488 wanted 524940 found 524941
parent transid verify failed on 23079040831488 wanted 524940 found 524941
Ignoring transid failure
parent transid verify failed on 23079040319488 wanted 524931 found 524939
Ignoring transid failure
Checking filesystem on /dev/md0
UUID: f4f04e16-ce38-4a57-8434-67562a0790bd
[1/7] checking root items
parent transid verify failed on 23079042863104 wanted 423153 found 524931
parent transid verify failed on 23079042863104 wanted 423153 found 524931
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=23079040999424 item=11 parent level=2 child bytenr=23079042863104 child level=0
ERROR: failed to repair root items: Input/output error
[2/7] checking extents
parent transid verify failed on 23079042863104 wanted 423153 found 524931
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=23079040999424 item=11 parent level=2 child bytenr=23079042863104 child level=0
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
root 5 root dir 256 not found
parent transid verify failed on 23079042863104 wanted 423153 found 524931
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=23079040999424 item=11 parent level=2 child bytenr=23079042863104 child level=0
ERROR: errors found in fs roots
found 0 bytes used, error(s) found
total csum bytes: 0
total tree bytes: 0
total fs tree bytes: 0
total extent tree bytes: 0
btree space waste bytes: 0
file data blocks allocated: 0
 referenced 0

btrfs-find-root doesn't find many options to work with, and all of them fail with 'btrfs restore -t'

zapan@UBUNTU-SERVER:~$ sudo btrfs-find-root /dev/md0
parent transid verify failed on 23079040831488 wanted 524940 found 524941
parent transid verify failed on 23079040831488 wanted 524940 found 524941
Ignoring transid failure
parent transid verify failed on 23079040319488 wanted 524931 found 524939
Ignoring transid failure
Superblock thinks the generation is 524941
Superblock thinks the level is 2
Found tree root at 23079040999424 gen 524941 level 2
Well block 23079040327680(gen: 524940 level: 2) seems good, but generation/level doesn't match, want gen: 524941 level: 2
Well block 23079040389120(gen: 524939 level: 2) seems good, but generation/level doesn't match, want gen: 524941 level: 2

zapan@UBUNTU-SERVER:~$ sudo btrfs restore -viD -t 23079040389120 /dev/md0 /mnt/raid1/restore/
parent transid verify failed on 23079040389120 wanted 524941 found 524939
parent transid verify failed on 23079040389120 wanted 524941 found 524939
Ignoring transid failure
parent transid verify failed on 23079040323584 wanted 524939 found 524941
parent transid verify failed on 23079040323584 wanted 524939 found 524941
Ignoring transid failure
parent transid verify failed on 23079040319488 wanted 524931 found 524939
Ignoring transid failure
This is a dry-run, no files are going to be restored
Reached the end of the tree searching the directory

zapan@UBUNTU-SERVER:~$ sudo btrfs restore -viD -t 23079040327680 /dev/md0 /mnt/raid1/restore/
parent transid verify failed on 23079040327680 wanted 524941 found 524940
parent transid verify failed on 23079040327680 wanted 524941 found 524940
Ignoring transid failure
parent transid verify failed on 23079040831488 wanted 524940 found 524941
parent transid verify failed on 23079040831488 wanted 524940 found 524941
Ignoring transid failure
parent transid verify failed on 23079040319488 wanted 524931 found 524939
Ignoring transid failure
This is a dry-run, no files are going to be restored
Reached the end of the tree searching the directory

--
Chris Murphy
Re: Re: Help needed with filesystem errors: parent transid verify failed
On Tue, Mar 30, 2021 at 2:44 AM B A wrote:
>
> > Gesendet: Dienstag, 30. März 2021 um 00:07 Uhr
> > Von: "Chris Murphy"
> > An: "B A"
> > Cc: "Btrfs BTRFS"
> > Betreff: Re: Help needed with filesystem errors: parent transid verify
> > failed
> >
> > On Sun, Mar 28, 2021 at 9:41 AM B A wrote:
> > >
> > > * Samsung 840 series SSD (SMART data looks fine)
> >
> > EVO or PRO? And what does its /proc/mounts line look like?
>
> Model is MZ-7TD500, which seems to be an EVO. Firmware is DXT08B0Q.

For me smartctl reports:

Device Model: Samsung SSD 840 EVO 250GB
Firmware Version: EXT0DB6Q

Yours might be a PRO or it could just be a different era EVO. Last I checked, Samsung had no firmware updates on their website for the 840 EVO. While I'm aware of some minor firmware bugs related to smartctl testing, so far I've done well over 100 pull-the-power-cord tests while doing heavy writes (with Btrfs), and have never had a problem. So I'd say there's probably not a "per se" problem with this model.

Best guess is that since the leaves pass checksum, it's not corruption, but some SSD equivalent of a misdirected write (?) if that's possible. It just looks like these two leaves are in the wrong place.

> > Total_LBAs_Written?
>
> Raw value: 92857573119

OK, I'm at 33063832698.

Well hopefully --repair will fix it (let us know either way) and if not, then we'll see what Josef can come up with, or alternatively you can just mkfs and restore from backups which will surely be faster.

--
Chris Murphy
Re: Help needed with filesystem errors: parent transid verify failed
On Sun, Mar 28, 2021 at 9:41 AM B A wrote:
>
> * Samsung 840 series SSD (SMART data looks fine)

EVO or PRO? And what does its /proc/mounts line look like?

Total_LBAs_Written?

--
Chris Murphy
Re: help needed with raid 6 filesystem with errors
On Mon, Mar 29, 2021 at 4:22 AM Bas Hulsken wrote:
>
> Dear list,
>
> due to a disk intermittently failing in my 4 disk array, I'm getting
> "transid verify failed" errors on my btrfs filesystem (see attached
> dmesg | grep -i btrfs dump in btrfs_dmesg.txt). When I run a scrub, the
> bad disk (/dev/sdd) becomes unresponsive, so I'm hesitant to try that
> again (happened 3 times now, and was the root cause of the transid
> verify failed errors possibly, at least they did not show up earlier
> than the failed scrub).

Is the dmesg filtered? An unfiltered dmesg might help understand what might be going on with the drive being unresponsive, if it's spitting out any kind of errors itself or if there are kernel link reset messages.

Check if the drive supports SCT ERC:

smartctl -l scterc /dev/sdX

If it does but it isn't enabled, enable it. This is true for all the drives.

smartctl -l scterc,70,70 /dev/sdX

That will result in the drive giving up on errors much sooner rather than doing the very slow "deep recovery" on reads. If this goes beyond 30 seconds, the kernel's command timer will think the device is unresponsive and issue a link reset which is ... bad for this use case. You really want the drive to error out quickly and allow Btrfs to do the fixups.

If you can't configure the SCT ERC on the drives, you'll need to increase the kernel command timeout, which is a per-device value in /sys/block/sdX/device/timeout - default is 30 and chances are 180 is enough (which sounds terribly high, and it is, but reportedly some consumer drives can have such high timeouts). Basically you want the drive timeout to be shorter than the kernel's.

> A new disk is on it's way to use btrfs replace,
> but I'm not sure whehter that will be a wise choice for a filesystem
> with errors. There was never a crash/power failure, so the filesystem
> was unmounted at every reboot, but as said on 3 occasions (after a
> scrub), that unmount was with on of the four drives unresponsive.

The least amount of risk is to not change anything. When you do the replace, make sure you use recent btrfs-progs and use 'btrfs replace' instead of 'btrfs device add/remove'.

https://lore.kernel.org/linux-btrfs/20200627032414.gx10...@hungrycats.org/

If metadata is raid5 too, or if it's not already using space_cache v2, I'd probably leave it alone until after the flakey device is replaced.

> Funnily enough, after a reboot every time the filesystem gets mounted
> without issues (the unresponsive drive is back online), and btrfs check
> --readonly claims the filesystem has no errors (see attached
> btrfs_sdd_check.txt).

I'd take advantage of its cooperative moment by making sure backups are fresh in case things get worse.

> Not sure what to do next, so seeking your advice! The important data on
> the drive is backed up, and I'll be running a verify to see if there
> are any corruptions overnight. Would still like to try to save the
> filesystem if possible though.

--
Chris Murphy
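Putting those two knobs in one place, a sketch (the device name is an example; neither setting survives a reboot, so a udev rule or boot script is needed to keep them applied):

# set SCT ERC to 7 seconds (values are in tenths of a second); repeat per drive
smartctl -l scterc,70,70 /dev/sdd
# fallback if a drive doesn't support SCT ERC: raise the kernel command timeout
echo 180 > /sys/block/sdd/device/timeout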
Re: Filesystem sometimes Hangs
> Mar 28 20:26:20 homeserver kernel: [1298220.030331]
> start_transaction+0x46d/0x540 [btrfs]
> Mar 28 20:26:20 homeserver kernel: [1298220.030361]
> btrfs_create+0x58/0x1f0 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.854109] task:btrfs-cleaner
> state:D stack:0 pid:20078 ppid: 2 flags:0x4000
> Mar 28 20:28:21 homeserver kernel: [1298340.854151]
> wait_current_trans+0xc2/0x120 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.854169]
> start_transaction+0x46d/0x540 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.854183]
> btrfs_drop_snapshot+0x90/0x7f0 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.854202] ?
> btrfs_delete_unused_bgs+0x3e/0x850 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.854218]
> btrfs_clean_one_deleted_snapshot+0xd7/0x130 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.854232]
> cleaner_kthread+0xfa/0x120 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.854247] ?
> btrfs_alloc_root+0x3d0/0x3d0 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.857610]
> wait_current_trans+0xc2/0x120 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.857627]
> start_transaction+0x46d/0x540 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.857643]
> btrfs_create+0x58/0x1f0 [btrfs]
> Mar 28 20:58:34 homeserver kernel: [1300153.336160] task:btrfs-transacti
> state:D stack:0 pid:20080 ppid: 2 flags:0x4000
> Mar 28 20:58:34 homeserver kernel: [1300153.336215]
> btrfs_commit_transaction+0x92b/0xa50 [btrfs]
> Mar 28 20:58:34 homeserver kernel: [1300153.336246]
> transaction_kthread+0x15d/0x180 [btrfs]
> Mar 28 20:58:34 homeserver kernel: [1300153.336273] ?
> btrfs_cleanup_transaction+0x590/0x590 [btrfs]
>
>
> What could I do to find the cause?

What kernel version?

--
Chris Murphy
Re: Re: Help needed with filesystem errors: parent transid verify failed
On Mon, Mar 29, 2021 at 1:34 AM B A wrote:
>
> This is a very old BTRFS filesystem created with Fedora *23* i.e. a linux
> kernel and btrfs-progs around version 4.2. It was probably created 2015-10-31
> with Fedora 23 beta and kernel 4.2.4 or 4.2.5.
>
> I ran `btrfs scrub` about a month ago without issues. I ran `btrfs check`
> maybe a year ago without issues. I also run `btrfs filesystem balance` from
> time to time (~once a year). None of these have shown the issue before. Does
> that mean that the issue has not been present for a long time (>1 year)?

Maybe. The generation on these two leaves looks recent. But kernels since ~5.3 have a write time tree checker designed to catch metadata errors before they are written.

What do you get for:

btrfs insp dump-s -f /dev/dm-0

Hopefully Qu or Josef will have an idea.

--
Chris Murphy
Re: Help needed with filesystem errors: parent transid verify failed
On Sun, Mar 28, 2021 at 7:02 PM Chris Murphy wrote:
>
> Can you post the output from both:
>
> btrfs insp dump-t -b 1144783093760 /dev/dm-0
> btrfs insp dump-t -b 1144881201152 /dev/dm-0

I'm not sure if those dumps will contain filenames, so check them. It's ok to remove filenames before posting the output. You can also use the option --hide-names.

btrfs insp dump-t --hide-names -b 1144783093760 /dev/dm-0

It may be a good idea to do a memory test as well.

--
Chris Murphy
Re: Help needed with filesystem errors: parent transid verify failed
On Sun, Mar 28, 2021 at 9:41 AM B A wrote:
>
> Dear btrfs experts,
>
> On my desktop PC, I have 1 btrfs partition on a single SSD device with 3
> subvolumes (/, /home, /var). Whenever I boot my PC, after logging in to
> GNOME, the btrfs partition is being remounted as ro due to errors. This is
> the dmesg output at that time:
>
> > [ 616.155392] BTRFS error (device dm-0): parent transid verify failed on
> > 1144783093760 wanted 2734307 found 2734305
> > [ 616.155650] BTRFS error (device dm-0): parent transid verify failed on
> > 1144783093760 wanted 2734307 found 2734305
> > [ 616.155657] BTRFS: error (device dm-0) in __btrfs_free_extent:3054:
> > errno=-5 IO failure
> > [ 616.155662] BTRFS info (device dm-0): forced readonly
> > [ 616.155665] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2124:
> > errno=-5 IO failure

transid error usually means something below Btrfs got the write ordering wrong and one or more writes dropped, but the problem isn't detected until later, which means it's an older problem.

What's the oldest kernel this file system has been written with? That is, is it a new Fedora 33 file system? Or older? Fedora 33 came with 5.8.15.

ERROR: child eb corrupted: parent bytenr=1144783093760 item=14 parent level=1 child level=2
ERROR: child eb corrupted: parent bytenr=1144881201152 item=14 parent level=1 child level=2

Can you post the output from both:

btrfs insp dump-t -b 1144783093760 /dev/dm-0
btrfs insp dump-t -b 1144881201152 /dev/dm-0

> What shall I do now? Do I need any of the invasive methods (`btrfs rescue` or
> `btrfs check --repair`) and if yes, which method do I choose?

No repairs yet until we know what's wrong and if it's safe to try to repair it.

In the meantime I highly recommend refreshing backups of /home in case this can't be repaired. It might be easier to do this with a Live USB boot of Fedora 33, and use 'mount -o ro,subvol=home /dev/dm-0 /mnt/home' to mount your home read-only to get a backup. The live environment will be more cooperative.

--
Chris Murphy
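A rough sketch of that backup step from the live environment (the destination path is made up, use whatever external disk you have):

mkdir -p /mnt/home
mount -o ro,subvol=home /dev/dm-0 /mnt/home
rsync -aHAX /mnt/home/ /run/media/liveuser/backupdisk/home/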
Re: 5.12-rc4: rm directory hangs for > 1m on an idle system
Fresh boot, this time no compression, everything else the same. Time to delete both directories takes as long as it takes to copy one of them ~1m17s.

This time I took an early and late sysrq t pair, and maybe caught some extra stuff.

[ 1190.094618] kernel: Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space
[ 1190.094633] kernel: Call Trace:
[ 1190.094641] kernel: ? find_extent_buffer+0x5/0x200
[ 1190.094656] kernel: ? find_held_lock+0x32/0x90
[ 1190.094683] kernel: ? __lock_acquire+0x172/0x1e10
[ 1190.094694] kernel: ? lock_is_held_type+0xa7/0x120
[ 1190.094714] kernel: ? btrfs_search_slot+0x6d2/0x9f0
[ 1190.094729] kernel: ? btrfs_get_64+0x5e/0x100
[ 1190.094751] kernel: ? lock_acquire+0xc2/0x3a0
[ 1190.094768] kernel: ? _raw_spin_unlock+0x1f/0x30
[ 1190.094779] kernel: ? rcu_read_lock_sched_held+0x3f/0x80
[ 1190.094798] kernel: ? __lock_acquire+0x172/0x1e10
[ 1190.094811] kernel: ? lookup_extent_backref+0x43/0xd0
[ 1190.094829] kernel: ? release_extent_buffer+0xa3/0xe0
[ 1190.094846] kernel: ? __btrfs_free_extent+0x49c/0x8f0
[ 1190.094878] kernel: ? __btrfs_run_delayed_refs+0x29a/0x1270
[ 1190.094912] kernel: ? _raw_spin_unlock+0x1f/0x30
[ 1190.094934] kernel: ? btrfs_run_delayed_refs+0x86/0x210
[ 1190.094954] kernel: ? flush_space+0x570/0x6d0
[ 1190.094966] kernel: ? lock_release+0x280/0x410
[ 1190.094987] kernel: ? btrfs_preempt_reclaim_metadata_space+0x170/0x2f0
[ 1190.095007] kernel: ? process_one_work+0x2b0/0x5e0
[ 1190.095035] kernel: ? worker_thread+0x55/0x3c0
[ 1190.095045] kernel: ? process_one_work+0x5e0/0x5e0
[ 1190.095060] kernel: ? kthread+0x13a/0x150
[ 1190.095070] kernel: ? __kthread_bind_mask+0x60/0x60
[ 1190.095085] kernel: ? ret_from_fork+0x1f/0x30

dmesg:
https://drive.google.com/file/d/1VQNAVynVTJo6VqsRX9K5-Z0dMsLmb-vH/view?usp=sharing
5.12-rc4: rm directory hangs for > 1m on an idle system
5.12.0-0.rc4.175.fc35.x86_64+debug

/dev/sdb1 on /srv/extra type btrfs (rw,relatime,seclabel,compress=zstd:1,space_cache=v2,subvolid=5,subvol=/)

The directories being deleted are on a separate drive (HDD) from / (SSD). It's an unpacked Firefox source tarball, ~2.7G. I had two separate copies, so the rm command was merely:

rm -rf firefox1 firefox2

And that command did not return to a prompt for over a minute, with no disk activity at all, on an otherwise idle laptop. sysrq+w shows nothing, sysrq+t shows some things.

[ 9638.375968] kernel: task:rm state:R running task stack:13176 pid: 2275 ppid: 1892 flags:0x
[ 9638.375986] kernel: Call Trace:
[ 9638.375998] kernel: ? __lock_acquire+0x3ac/0x1e10
[ 9638.376014] kernel: ? __lock_acquire+0x3ac/0x1e10
[ 9638.376036] kernel: ? lock_acquire+0xc2/0x3a0
[ 9638.376051] kernel: ? lock_acquire+0xc2/0x3a0
[ 9638.376069] kernel: ? lock_acquire+0xc2/0x3a0
[ 9638.376081] kernel: ? lock_is_held_type+0xa7/0x120
[ 9638.376090] kernel: ? rcu_read_lock_sched_held+0x3f/0x80
[ 9638.376099] kernel: ? __btrfs_tree_lock+0x27/0x120
[ 9638.376111] kernel: ? __clear_extent_bit+0x274/0x560
[ 9638.376120] kernel: ? _raw_spin_lock_irqsave+0x67/0x90
[ 9638.376139] kernel: ? __lock_acquire+0x3ac/0x1e10
[ 9638.376153] kernel: ? lock_acquire+0xc2/0x3a0
[ 9638.376161] kernel: ? __lock_acquire+0x3ac/0x1e10
[ 9638.376189] kernel: ? lock_is_held_type+0xa7/0x120
[ 9638.376208] kernel: ? release_extent_buffer+0xa3/0xe0
[ 9638.376224] kernel: ? btrfs_update_root_times+0x2a/0x60
[ 9638.376237] kernel: ? btrfs_insert_orphan_item+0x62/0x80
[ 9638.376246] kernel: ? _atomic_dec_and_lock+0x31/0x50
[ 9638.376264] kernel: ? btrfs_evict_inode+0x16b/0x4e0
[ 9638.376273] kernel: ? btrfs_evict_inode+0x370/0x4e0
[ 9638.376293] kernel: ? evict+0xcf/0x1d0
[ 9638.376305] kernel: ? do_unlinkat+0x1b2/0x2c0
[ 9638.376329] kernel: ? do_syscall_64+0x33/0x40
[ 9638.376338] kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xae

The entire dmesg is here:
https://drive.google.com/file/d/1gyyp59Ju1aRIz3FCZU-kmu05-W1NN89A/view?usp=sharing

It isn't nearly as bad deleting one directory at once, ~15s.

--
Chris Murphy
Re: parent transid verify failed / ERROR: could not setup extent tree
On Tue, Mar 23, 2021 at 12:50 AM Dave T wrote:
>
> > d. Just skip the testing and try usebackuproot with a read-write
> > mount. It might make things worse, but at least it's fast to test. If
> > it messes things up, you'll have to recreate this backup from scratch.
>
> I took this approach. My command was simply:
>
> mount -o usebackuproot /dev/mapper/xzy /backup
>
> It appears to have succeeded because it mounted without errors. I
> completed a new incremental backup (with btrbk) and it finished
> without errors.
> I'll be pleased if my backup history is preserved, as appears to be the case.
>
> I will run some checks on those backup subvolumes tomorrow. Are there
> specific checks you would recommend?

It will have replaced all the root nodes and super blocks within a minute, or immediately upon umount. So you can just do a 'btrfs check' and see if that comes up clean now. It's basically a kind of rollback and if it worked, there will be no inconsistencies found by btrfs check.

--
Chris Murphy
Re: parent transid verify failed / ERROR: could not setup extent tree
On Mon, Mar 22, 2021 at 12:32 AM Dave T wrote:
>
> On Sun, Mar 21, 2021 at 2:03 PM Chris Murphy wrote:
> >
> > On Sat, Mar 20, 2021 at 11:54 PM Dave T wrote:
> > >
> > > # btrfs check -r 2853787942912 /dev/mapper/xyz
> > > Opening filesystem to check...
> > > parent transid verify failed on 2853787942912 wanted 29436 found 29433
> > > parent transid verify failed on 2853787942912 wanted 29436 found 29433
> > > parent transid verify failed on 2853787942912 wanted 29436 found 29433
> > > Ignoring transid failure
> > > parent transid verify failed on 2853827723264 wanted 29433 found 29435
> > > parent transid verify failed on 2853827723264 wanted 29433 found 29435
> > > parent transid verify failed on 2853827723264 wanted 29433 found 29435
> > > Ignoring transid failure
> > > leaf parent key incorrect 2853827723264
> > > ERROR: could not setup extent tree
> > > ERROR: cannot open file system
> >
> > btrfs insp dump-t -t 2853827723264 /dev/
>
> # btrfs insp dump-t -t 2853827723264 /dev/mapper/xzy
> btrfs-progs v5.11
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> Ignoring transid failure
> leaf parent key incorrect 2853827608576
> WARNING: could not setup extent tree, skipping it
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> Ignoring transid failure
> leaf parent key incorrect 2853827608576
> Couldn't setup device tree
> ERROR: unable to open /dev/mapper/xzy
>
> # btrfs insp dump-t -t 2853787942912 /dev/mapper/xzy
> btrfs-progs v5.11
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> Ignoring transid failure
> leaf parent key incorrect 2853827608576
> WARNING: could not setup extent tree, skipping it
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> Ignoring transid failure
> leaf parent key incorrect 2853827608576
> Couldn't setup device tree
> ERROR: unable to open /dev/mapper/xzy
>
> # btrfs insp dump-t -t 2853827608576 /dev/mapper/xzy
> btrfs-progs v5.11
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> Ignoring transid failure
> leaf parent key incorrect 2853827608576
> WARNING: could not setup extent tree, skipping it
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> Ignoring transid failure
> leaf parent key incorrect 2853827608576
> Couldn't setup device tree
> ERROR: unable to open /dev/mapper/xzy

That does not look promising. I don't know whether a read-write mount with usebackuproot will recover, or end up with problems.

Options:

a. btrfs check --repair

This probably fails on the same problem, it can't setup the extent tree.

b. btrfs check --init-extent-tree

This is a heavy hammer, it might succeed, but takes a long time. On 5T it might take double digit hours or even single digit days. It's generally faster to just wipe the drive and restore from backups than use init-extent-tree (I understand this *is* your backup).

c. Setup an overlay file on device mapper, to redirect the writes from a read-write mount with usebackuproot. I think it's sufficient to just mount, optionally write some files (empty or not), and umount. Then do a btrfs check to see if the current tree is healthy.

https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file

That guide is a bit complex because it deals with many drives in an mdadm raid, so you can simplify it for just one drive (a rough single-drive sketch follows at the end of this message). The gist is no writes go to the drive itself, it's treated as read-only by device-mapper (in fact you can optionally add a pre-step with the blockdev command and --setro to make sure the entire drive is read-only; just make sure to make it rw once you're done testing). All the writes with this overlay go into a loop mounted file which you intentionally just throw away after testing.

d. Just skip the testing and try usebackuproot with a read-write mount. It might make things worse, but at least it's fast to test. If it messes things up, you'll have to recreate this backup from scratch.

As for how to prevent this? I'm not sure. About the best we can do is disable the drive write cache with a udev rule, and/or raid1 with another make/model drive, and let Btrfs de
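As referenced in option c above, a rough single-drive version of the overlay approach (device name and overlay size are examples; all writes during testing land in the throwaway overlay file, not on the drive):

blockdev --setro /dev/sdX
truncate -s 10G /tmp/overlay.img
loop=$(losetup -f --show /tmp/overlay.img)
size=$(blockdev --getsz /dev/sdX)
dmsetup create overlay-test --table "0 $size snapshot /dev/sdX $loop P 8"
# run the test mounts/repairs against /dev/mapper/overlay-test, then tear it down:
dmsetup remove overlay-test
losetup -d "$loop"
blockdev --setrw /dev/sdX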
Re: parent transid verify failed / ERROR: could not setup extent tree
On Sat, Mar 20, 2021 at 11:54 PM Dave T wrote:
>
> # btrfs check -r 2853787942912 /dev/mapper/xyz
> Opening filesystem to check...
> parent transid verify failed on 2853787942912 wanted 29436 found 29433
> parent transid verify failed on 2853787942912 wanted 29436 found 29433
> parent transid verify failed on 2853787942912 wanted 29436 found 29433
> Ignoring transid failure
> parent transid verify failed on 2853827723264 wanted 29433 found 29435
> parent transid verify failed on 2853827723264 wanted 29433 found 29435
> parent transid verify failed on 2853827723264 wanted 29433 found 29435
> Ignoring transid failure
> leaf parent key incorrect 2853827723264
> ERROR: could not setup extent tree
> ERROR: cannot open file system

btrfs insp dump-t -t 2853827723264 /dev/

> It appears the backup root is already stale.

I'm not sure. If you can post the contents of that leaf (I don't think it will contain filenames but double check), Qu might have an idea if it's safe to try a read-write mount with -o usebackuproot without causing problems later.

> > What you eventually need to look at is what precipitated the transid
> > failures, and avoid it.
>
> The USB drive was disconnected by the user (an accident). I have other
> devices with the same hardware that have never experienced this issue.
>
> Do you have further ideas or suggestions I can try? Thank you for your
> time and for sharing your expertise.

The drive could be getting write ordering wrong all the time, and it only turns into a problem with a crash, power fail, or accidental disconnect. More common is the write ordering is only sometimes wrong, and a crash or powerfail is usually survivable, but leads to a false sense of security about the drive.

The simple theory of write order is data->metadata->sync->super->sync. It shouldn't ever be the case that a newer superblock generation is on stable media before the metadata it points to.

--
Chris Murphy
Re: parent transid verify failed / ERROR: could not setup extent tree
On Sat, Mar 20, 2021 at 5:15 AM Dave T wrote:
>
> I hope to get some expert advice before I proceed. I don't want to
> make things worse. Here's my situation now:
>
> This problem is with an external USB drive and it is encrypted.
> cryptsetup open succeeds. But mount fails.
>
> mount /backup
> mount: /backup: wrong fs type, bad option, bad superblock on
> /dev/mapper/xusbluks, missing codepage or helper program, or other
> error.
>
> Next the following command succeeds:
>
> mount -o ro,recovery /dev/mapper/xusbluks /backup
>
> This is my backup disk (5TB), and I don't have another 5TB disk to
> copy all the data to. I hope I can fix the issue without losing my
> backups.
>
> Next step I did:
>
> # btrfs check /dev/mapper/xyz
> Opening filesystem to check...
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> Ignoring transid failure
> leaf parent key incorrect 2853827608576
> ERROR: could not setup extent tree
> ERROR: cannot open file system

From your superblock:

backup 2:
backup_tree_root: 2853787942912 gen: 29433 level: 1

Do this:

btrfs check -r 2853787942912 /dev/xyz

If it comes up clean it's safe to do: mount -o usebackuproot, without needing to use ro. And in that case it'll self recover. You will lose some data, between the commits. It is possible there's partial loss, so it's not enough to just do a scrub, you'll want to freshen the backups as well - if that's what was happening at the time that the trouble happened (the trouble causing the subsequent transid failures).

Sometimes backup roots are already stale and inconsistent due to overwrites, so the btrfs check might find problems with that older root.

What you eventually need to look at is what precipitated the transid failures, and avoid it. Typical is a drive firmware bug where it gets write ordering wrong and then there's a crash or power fail. Possibly one way to work around the bug is disabling the drive's write cache (use a udev rule to make sure it's always applied). Another way is to add a different make/model drive to it, and convert to raid1 profile. And hopefully they won't have overlapping firmware bugs.

--
Chris Murphy
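For the write cache idea, a hedged sketch of such a udev rule (the serial match is a placeholder for your drive's ID_SERIAL; some USB-SATA bridges ignore hdparm and need sdparm or the sysfs cache_type knob instead):

# /etc/udev/rules.d/99-disable-write-cache.rules
ACTION=="add|change", KERNEL=="sd?", ENV{ID_SERIAL}=="Vendor_Model_SERIAL123", RUN+="/usr/sbin/hdparm -W 0 /dev/%k"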
Re: All files are damaged after btrfs restore
On Tue, Mar 16, 2021 at 7:39 PM Qu Wenruo wrote:
>
> > Using that restore I was able to restore approx. 7 TB of the
> > originally stored 22 TB under that directory.
> > Unfortunately nearly all the files are damaged. Small text files are
> > still OK. But every larger binary file is useless.
> > Is there any possibility to fix the filesystem in a way, that I get
> > the data less damaged?
>
> From the result, it looks like the on-disk data get (partially) wiped out.
> I doubt if it's just simple controller failure, but more likely
> something not really reaching disk or something more weird.

Hey Qu, thanks for the reply. So it's not clear until further downthread that it's bcache in writeback mode with an SSD that failed. And I've probably underestimated the significance of how much data (in this case both Btrfs metadata and user data) and for how long it can stay *only* on the SSD with this policy.

https://bcache.evilpiepirate.org/ says it straight up: if using writeback, it is not at all safe for the cache and backing devices to be separated. If the cache device fails, everything on it is gone.

By my reading, for example, if the writeback percent is 50%, and the cache device is 128G, at any given time 64G is *only* on the SSD. There's no idle time flushing to the backing device that eventually makes the backing device possibly a self sufficient storage device on its own; it always needs the cache device.

--
Chris Murphy
Re: All files are damaged after btrfs restore
Hi,

The problem exceeds my knowledge of both Btrfs and bcache/ssd failure modes. I'm not sure what professional data recovery can really do, other than throw a bunch of people at stitching things back together again without any help from the file system. I know that the state of the repair tools is not great, and it is confusing what to use in what order.

I don't know if a support contract from one of the distros supporting Btrfs (most likely SUSE) is a better way to get assistance with this kind of recovery while also supporting development. But that's a question for SUSE sales :)

Most of the emphasis of upstream development has been on preventing problems from happening to critical Btrfs metadata in the first place. Its ability to self-heal really depends on it having independent block devices to write to, e.g. metadata raid 1. Metadata DUP might normally help with only spinning drives, but with a cache device, it's going to cache all of these concurrent metadata writes.

If critical metadata is seriously damaged or missing, it's probably impossible to fix or even skip over with the current state of the tools. Current code needs an entry point into the chunk tree in order to make the logical to physical mapping; and then needs an entry point to the root tree to get to the proper snapshot file tree. If all the recent and critical metadata is lost on the failed bcache caching device, then a totally different strategy is needed. The file btree for the snapshot you want should be on the backing device, as well as its data chunks, and the mapping in the ~94% of the chunk tree that's on disk.

I won't be surprised if the file system is broken beyond repair, but I'd be a little surprised if someone more knowledgeable can't figure out a way to get the data out of a week old snapshot. But that's speculation on my part. I really have no idea how long it could take for bcache in writeback mode to flush to the backing device.

--
Chris Murphy

On Tue, Mar 16, 2021 at 3:35 AM Sebastian Roller wrote:
>
> Hi again.
>
> > Looks like the answer is no. The chunk tree really has to be correct
> > first before anything else because it's central to doing all the
> > logical to physical address translation. And if it's busted and can't
> > be repaired then nothing else is likely to work or be repairable. It's
> > that critical.
> >
> > > I already ran chunk-recover. It needs two days to finish. But I used
> > > btrfs-tools version 4.14 and it failed.
> >
> > I'd have to go dig in git history to even know if there's been
> > improvements in chunk recover since then. But I pretty much consider
> > any file system's tool obsolete within a year. I think it's total
> > nonsense that distributions are intentionally using old tools.
> >
> >
> > > root@hikitty:/mnt$ btrfs rescue chunk-recover /dev/sdf1
> > > Scanning: DONE in dev0
> > > checksum verify failed on 99593231630336 found E4E3BDB6 wanted
> > > checksum verify failed on 99593231630336 found E4E3BDB6 wanted
> > > checksum verify failed on 124762809384960 found E4E3BDB6 wanted
> > > checksum verify failed on 124762809384960 found E4E3BDB6 wanted
> > > checksum verify failed on 124762809384960 found E4E3BDB6 wanted
> > > checksum verify failed on 124762809384960 found E4E3BDB6 wanted
> > > bytenr mismatch, want=124762809384960, have=0
> > > open with broken chunk error
> > > Chunk tree recovery failed
> > >
> > > I could try again with a newer version. (?) Because with version 4.14
> > > also btrfs restore failed.
> >
> > It is entirely possible that 5.11 fails exactly the same way because
> > it's just too badly damaged for the current state of the recovery
> > tools to deal with damage of this kind. But it's also possible it'll
> > work. It's a coin toss unless someone else a lot more familiar with
> > the restore code speaks up. But looking at just the summary change
> > log, it looks like no work has happened in chunk recover for a while.
> >
> > https://btrfs.wiki.kernel.org/index.php/Changelog
>
> So I ran another chunk-recover with btrfs-progs version 5.11. This is
> part of the output. (The list doesn't allow me attach the whole output
> to this mail (5 mb zipped). But if you let me know what's important I
> can send that.)
>
> root@hikitty:~$ nohup /root/install/btrfs-progs-5.11/btrfs -v rescue
> chunk-recover /dev/sdi1 >
> /transfer/sebroll/btrfs-rescue-chunk-recover.out.txt 2>&1 &
> nohup: ignoring input
> All Devices:
> Device: id = 2, name = /dev/sdi1
>
Re: BTRFS error (device sda1): bdev /dev/sdb1 errs: wr 2702175, rd 2719033, flush 0, corrupt 6, gen 0
On Sat, Mar 13, 2021 at 5:22 AM Thomas <74cmo...@gmail.com> wrote: > Gerät Boot Anfang Ende Sektoren Größe Kn Typ > /dev/sdb1 2048 496093750 496091703 236,6G 83 Linux > However the output of btrfs insp dump-s is different: > thomas@pc1-desktop:~ > $ sudo btrfs insp dump-s /dev/sdb1 | grep dev_item.total_bytes > dev_item.total_bytes 256059465728 sdb1 has 253998951936 bytes which is *less* than the btrfs super block is saying it should be. 1.919 GiB less. I'm going to guess that the sdb1 partition was reduced without first shrinking the file system. The most common way this happens is not realizing that each member device of a btrfs file system must be separately shrunk. If you do not specify a devid, then devid 1 is assumed. man btrfs filesystem "The devid can be found in the output of btrfs filesystem show and defaults to 1 if not specified." I bet that the file system was shrunk one time; this shrunk only devid 1, which is also /dev/sda1. But then both partitions were shrunk, thereby truncating sdb1, resulting in these errors. If that's correct, you need to change the sdb1 partition back to its original size (matching the size in the sdb1 btrfs superblock). Scrub the file system so sdb1 can be repaired from any prior damage from the mistake. Then shrink this devid to match the size of the other devid, and then change the partition. > Gerät BootAnfang Ende Sektoren Größe Kn Typ > /dev/sda1 * 2048 496093750 496091703 236,6G 83 Linux > > thomas@pc1-desktop:~ > $ sudo btrfs insp dump-s /dev/sda1 | grep dev_item.total_bytes > dev_item.total_bytes 253998948352 This is fine. The file system is 3584 bytes less than the partition. I'm not sure why it doesn't end on a 4KiB block boundary or why there's a gap before the start of sda2...but at least it's benign. -- Chris Murphy
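For reference, the correct multi-device shrink is one resize per devid, before any partitions are touched. A rough sketch, with a made-up mount point and size (not the values from this report):

  btrfs filesystem show /mnt            # note each devid and its current size
  btrfs filesystem resize 1:230G /mnt   # shrink devid 1
  btrfs filesystem resize 2:230G /mnt   # shrink devid 2; without the "2:" prefix devid 1 is assumed

Only after both resizes succeed should the corresponding partitions be shrunk.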
Re: BTRFS error (device sda1): bdev /dev/sdb1 errs: wr 2702175, rd 2719033, flush 0, corrupt 6, gen 0
[4.365859] usb 8-1: device not accepting address 5, error -71 [4.365920] usb usb8-port1: unable to enumerate USB device [4.433539] BTRFS info (device sda1): bdev /dev/sdb1 errs: wr 2701995, rd 2718862, flush 0, corrupt 6, gen 0 /dev/sdb is dropping a lot of reads and writes. Is /dev/sdb in a SATA-USB enclosure of some kind? [ 16.914959] blk_update_request: I/O error, dev fd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 [ 16.914963] floppy: error 10 while reading block 0 Curious but I don't think it's related. [ 20.685589] attempt to access beyond end of device sdb1: rw=524288, want=496544128, limit=496091703 [ 20.685798] attempt to access beyond end of device sdb1: rw=2049, want=496544128, limit=496091703 [ 20.685804] BTRFS error (device sda1): bdev /dev/sdb1 errs: wr 2701996, rd 2718862, flush 0, corrupt 6, gen 0 Something is definitely confused but I'm not sure what or why. $ sudo btrfs insp dump-s /dev/sdb1 | grep dev_item.total_bytes Compare that value with (Sectors * 512) from: $ sudo fdisk -l /dev/sdb The fdisk number of bytes should be the same as or more than the btrfs bytes. $ sudo smartctl -x /dev/sdb That might require installing the smartmontools package. -- Chris Murphy
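An equivalent way to make that comparison, using the device from this report (blockdev is simply another way to get the partition size in bytes):

  sudo btrfs insp dump-s /dev/sdb1 | grep dev_item.total_bytes
  sudo blockdev --getsize64 /dev/sdb1   # partition size in bytes; should be equal to or larger than the value above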
Re: All files are damaged after btrfs restore
On Tue, Mar 9, 2021 at 10:03 AM Sebastian Roller wrote: > I found 12 of these 'tree roots' on the volume. All the snapshots are > under the same tree root. This seems to be the subvolume where I put > the snapshots. Snapshots are subvolumes. All of them will appear in the root tree, even if they're organized as being in a directory or in some other subvolume. >So for the snapshots there is only one option to use > with btrfs restore -r. It can be done by its own root node address using -f or by subvolid using -r. The latter needs to be looked up in a reliable root tree. But I think the distinction may not matter here because really it's the chunk tree that's messed up, and that's what's used to find everything. The addresses in the file tree (the subvolume/snapshot tree that contains file listings, inodes, metadata, and the address of the file) are all logical addresses in btrfs linear space. That means nothing without the translation to physical device and blocks, which is the job of the chunk tree. >But I also found the data I'm looking for under > some other of these tree roots. One of them is clearly the subvolume > the backup went to (the source of the snapshots). But there is also a > very old snapshot (4 years old) that has a tree root on its own. The > files I restored from there are different -- regarding checksums. > They are also corrupted, but different. I have to do some more > hexdumps to figure out, if it's better. Unfortunately when things are messed up badly, the recovery tools may be looking at a wrong or partial checksum tree and it just spits out checksum complaints as a matter of course. You'd have to inspect the file contents themselves, the checksum warnings might be real or bogus. > > OK this is interesting. There's two chunk trees to choose from. So is > > the restore problem because older roots point to the older chunk tree > > which is already going stale, and just isn't assembling blocks > > correctly anymore? Or is it because the new chunk tree is bad? > > Is there a way to choose the chunk tree I'm using for operations like > btrfs restore? Looks like the answer is no. The chunk tree really has to be correct first before anything else because it's central to doing all the logical to physical address translation. And if it's busted and can't be repaired then nothing else is likely to work or be repairable. It's that critical. > I already ran chunk-recover. It needs two days to finish. But I used > btrfs-tools version 4.14 and it failed. I'd have to go dig in git history to even know if there's been improvements in chunk recover since then. But I pretty much consider any file system's tool obsolete within a year. I think it's total nonsense that distributions are intentionally using old tools. > > root@hikitty:/mnt$ btrfs rescue chunk-recover /dev/sdf1 > Scanning: DONE in dev0 > checksum verify failed on 99593231630336 found E4E3BDB6 wanted > checksum verify failed on 99593231630336 found E4E3BDB6 wanted > checksum verify failed on 124762809384960 found E4E3BDB6 wanted > checksum verify failed on 124762809384960 found E4E3BDB6 wanted > checksum verify failed on 124762809384960 found E4E3BDB6 wanted > checksum verify failed on 124762809384960 found E4E3BDB6 wanted > bytenr mismatch, want=124762809384960, have=0 > open with broken chunk error > Chunk tree recovery failed > > I could try again with a newer version. (?) Because with version 4.14 > also btrfs restore failed. 
It is entirely possible that 5.11 fails exactly the same way because it's just too badly damaged for the current state of the recovery tools to deal with damage of this kind. But it's also possible it'll work. It's a coin toss unless someone else a lot more familiar with the restore code speaks up. But looking at just the summary change log, it looks like no work has happened in chunk recover for a while. https://btrfs.wiki.kernel.org/index.php/Changelog > > btrfs insp dump-t -t 1 /dev/sdi1 > > > > And you'll need to look for a snapshot name in there, find its bytenr, > > and let's first see if just using that works. If it doesn't then maybe > > combining it with the next most recent root tree will work. > > I am working backwards right now using btrfs restore -f in combination > with -t. So far no success. Yep. I think it comes down to the chunk tree needing to be reasonable first, before anything else is possible. -- Chris Murphy
Re: btrfs fails to mount on kernel 5.11.4 but works on 5.10.19
On Sun, Mar 7, 2021 at 7:18 PM Norbert Preining wrote: > > Hi Chris, > > once more .. > > > > Does the initrd on this system contain? > > > /usr/lib/udev/rules.d/64-btrfs.rules > > No, it didn't. > > Now I added it, and with 64-btrfs.rules available in the initrd I still > get the same error (see previous screenshot) :-( I suspect something is wrong with devid 9 in that case. If it's a dracut system, then it waits indefinitely for sysroot. You'll need to boot with something like rd.break=pre-mount and see first if you can mount normally to /sysroot, but if devid 9 is still missing then mount degraded and replace that device. Or otherwise find out why it's missing. I don't think the scrub helps right now, the issue is the device is missing. Where scrub does help is if the device reappears for normal mount following previous degraded mount - the scrub is needed to get the missing device caught up with the rest. -- Chris Murphy
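If devid 9 really is gone, the recovery from that pre-mount shell would go roughly like this; the replacement device path is a placeholder, and any surviving member device (e.g. /dev/sdb3 from the fi show output) can be named in the mount command:

  mount -o degraded /dev/sdb3 /sysroot            # degraded mount using a surviving member
  btrfs replace start 9 /dev/nvme3n1p1 /sysroot   # replace missing devid 9 with the new device
  btrfs replace status /sysroot                   # watch progress

This is a sketch of the degraded-mount-and-replace path, not a guarantee it applies here.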
Re: btrfs fails to mount on kernel 5.11.4 but works on 5.10.19
On Sun, Mar 7, 2021 at 5:25 PM Norbert Preining wrote: > > Hi > > (please cc) > > thanks for your email. First some additional information. Since this > happened I searched and realized that there seem to have been a problem > with 5.12-rc1, which I tried for short time (checking whether AMD-GPU > hangs are fixed). Now I read that -rc1 is a btrfs-killer. I have swap > partition, not swap file, and 64G or RAM, so normally swap is not used, > though. That bug should not have affected the dedicated swap partition case. -- Chris Murphy
Re: btrfs fails to mount on kernel 5.11.4 but works on 5.10.19
On Sun, Mar 7, 2021 at 4:28 PM Norbert Preining wrote: > > Dear all > > (please cc) > > not sure this is the right mailing list, but I cannot boot into 5.11.4 > it gives me > devid 9 uui . > failed to read the system array: -2 > open_ctree failed > (only partial, typed in from photo) Post the photo? This is a generic message and we need to see more information. Is devid 9 missing? Does the initrd on this system contain /usr/lib/udev/rules.d/64-btrfs.rules? That will wait until all devices are available before attempting to mount. If it's not in the initrd, it won't wait and it's prone to races, and you can often get mount failures because not all devices are ready to be mounted. > > OTOH, 5.10.19 boots without a hinch > $ btrfs fi show / > Label: none uuid: 911600cb-bd76-4299-9445-666382e8ad20 > Total devices 8 FS bytes used 3.28TiB > devid1 size 899.01GiB used 670.00GiB path /dev/sdb3 > devid2 size 489.05GiB used 271.00GiB path /dev/sdd > devid3 size 1.82TiB used 1.58TiB path /dev/sde1 > devid4 size 931.51GiB used 708.00GiB path /dev/sdf1 > devid5 size 1.82TiB used 1.58TiB path /dev/sdc1 > devid7 size 931.51GiB used 675.00GiB path /dev/nvme2n1p1 > devid8 size 931.51GiB used 680.03GiB path /dev/nvme1n1p1 > devid9 size 931.51GiB used 678.03GiB path /dev/nvme0n1p1 This seems to be a somewhat risky setup, or at least highly variable in performance. Any single device that fails will result in boot failure. -- Chris Murphy
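If this is a dracut-built initramfs, the check and the fix are roughly (a sketch, not a diagnosis; if the rule is already present the problem is elsewhere, e.g. the device really is missing):

  lsinitrd | grep 64-btrfs.rules   # is the rule inside the current initramfs?
  dracut -f                        # if not, rebuild the initramfs for the running kernel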
Re: All files are damaged after btrfs restore
On Sun, Mar 7, 2021 at 6:58 AM Sebastian Roller wrote: > > Would it make sense to just try restore -t on any root I got with > btrfs-find-root with all of the snapshots? Yes but I think you've tried this and you only got corrupt files or files with holes, so that suggests very recent roots are just bad due to the corruption, and older ones are pointing to a mix of valid and stale blocks and it just ends up in confusion. I think what you're after is 'btrfs restore -f' -f only restore files that are under specified subvolume root pointed by You can get this value from each 'tree root' a.k.a. the root of roots tree, what the super calls simply 'root'. That contains references for all the other trees' roots. For example: item 12 key (257 ROOT_ITEM 0) itemoff 12936 itemsize 439 generation 97406 root_dirid 256 bytenr 30752768 level 1 refs 1 lastsnap 93151 byte_limit 0 bytes_used 2818048 flags 0x0(none) uuid 4a0fa0d3-783c-bc42-bee1-ffcbe7325753 ctransid 97406 otransid 7 stransid 0 rtransid 0 ctime 1615103595.233916841 (2021-03-07 00:53:15) otime 1603562604.21506964 (2020-10-24 12:03:24) drop key (0 UNKNOWN.0 0) level 0 item 13 key (257 ROOT_BACKREF 5) itemoff 12911 itemsize 25 root backref key dirid 256 sequence 2 name newpool The name of this subvolume is newpool, the subvolid is 257, and its address is bytenr 30752768. That's the value to plug into btrfs restore -f The thing is, it needs an intact chunk tree, i.e. not damaged and not too old, in order to translate that logical address into a physical device and physical address. > > > OK so you said there's an original and backup file system, are they > > both in equally bad shape, having been on the same controller? Are > > they both btrfs? > > The original / live file system was not btrfs but xfs. It is in a > different but equally bad state than the backup. We used bcache with a > write-back cache on a ssd which is now completely dead (does not get > recognized by any server anymore). To get the file system mounted I > ran xfs-repair. After that only 6% of the data was left and this is > nearly completely in lost+found. I'm now trying to sort these files by > type, since the data itself looks OK. Unfortunately the surviving > files seem to be the oldest ones. Yeah writeback means the bcache device must survive and be healthy before any repair attempts should be made, even restore attempts. It also means you need hardware isolation, one SSD per HDD. Otherwise one SSD failing means the whole thing falls apart. The mode to use for read caching is writethrough. > backup 0: > backup_tree_root: 122583415865344 gen: 825256 > level: 2 > backup_chunk_root: 141944043454464 gen: 825256 > level: 2 > backup 1: > backup_tree_root: 122343302234112 gen: 825253 > level: 2 > backup_chunk_root: 141944034426880 gen: 825251 > level: 2 > backup 2: > backup_tree_root: 122343762804736 gen: 825254 > level: 2 > backup_chunk_root: 141944034426880 gen: 825251 > level: 2 > backup 3: > backup_tree_root: 122574011269120 gen: 825255 > level: 2 > backup_chunk_root: 141944034426880 gen: 825251 > level: 2 OK this is interesting. There's two chunk trees to choose from. So is the restore problem because older roots point to the older chunk tree which is already going stale, and just isn't assembling blocks correctly anymore? Or is it because the new chunk tree is bad? On 72 TB, the last thing I want to recommend is chunk-recover. That'll take forever but it'd be interesting to know which of these chunk trees is good. The chunk tree is in the system block group. 
It's pretty tiny so it's a small target for being overwritten...and it's cow. So there isn't a reason to immediately start overwriting it. I'm thinking maybe the new one got interrupted by the failure and the old one is intact. Ok so the next step is to find a snapshot you want to restore. btrfs insp dump-t -t 1 /dev/sdi1 And you'll need to look for a snapshot name in there, find its bytenr, and let's first see if just using that works. If it doesn't then maybe combining it with the next most recent root tree will work. -- Chris Murphy
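Putting those two steps together, it would look roughly like this; 30752768 is the bytenr from the example ROOT_ITEM shown earlier in this message, not a value from this file system, and the destination path is a placeholder:

  btrfs insp dump-t -t 1 /dev/sdi1 | less                # find the snapshot's ROOT_ITEM and note its bytenr
  btrfs restore -f 30752768 -v /dev/sdi1 /mnt/recovery   # restore only the subvolume rooted at that bytenr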
convert and scrub: spanning stripes, attempt to access beyond end of device
Hi, Downstream user is running into this bug: https://github.com/kdave/btrfs-progs/issues/349 But additionally the scrub of this converted file system, which still has ext2_saved/image, produces this message: [36365.549230] BTRFS error (device sda8): scrub: tree block 1777055424512 spanning stripes, ignored. logical=1777055367168 [36365.549262] attempt to access beyond end of device sda8: rw=0, want=3470811376, limit=3470811312 Is this a known artifact of the conversion process? Will it go away once the ext2_saved/image is removed? Should I ask the user to create an e2image -Q from the loop mounted rollback image file for inspection? Thanks -- Chris Murphy
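If it helps, what I'd be asking the user for is roughly this, assuming the saved image sits at the usual <mountpoint>/ext2_saved/image location that btrfs-convert leaves behind:

  losetup -f --show /mnt/ext2_saved/image     # attach the saved ext2 image; prints the loop device, e.g. /dev/loop0
  e2image -Q /dev/loop0 ext2-metadata.qcow2   # metadata-only QCOW2 image suitable for inspection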
Re: All files are damaged after btrfs restore
On Thu, Mar 4, 2021 at 8:35 AM Sebastian Roller wrote: > > > I don't know. The exact nature of the damage of a failing controller > > is adding a significant unknown component to it. If it was just a > > matter of not writing anything at all, then there'd be no problem. But > > it sounds like it wrote spurious or corrupt data, possibly into > > locations that weren't even supposed to be written to. > > Unfortunately I cannot figure out exactly what happened. Logs end > Friday night while the backup script was running -- which also > includes a finalizing balancing of the device. Monday morning after > some exchange of hardware the machine came up being unable to mount > the device. It's probably not discernible with logs anyway. What hardware does when it goes berserk? It's chaos. And all file systems have write order requirements. It's fine if at a certain point writes just abruptly stop going to stable media. But if things are written out of order, or if the hardware acknowledges critical metadata writes are written but were actually dropped, it's bad. For all file systems. > OK -- I now had the chance to temporarily switch to 5.11.2. Output > looks cleaner, but the error stays the same. > > root@hikitty:/mnt$ mount -o ro,rescue=all /dev/sdi1 hist/ > > [ 3937.815083] BTRFS info (device sdi1): enabling all of the rescue options > [ 3937.815090] BTRFS info (device sdi1): ignoring data csums > [ 3937.815093] BTRFS info (device sdi1): ignoring bad roots > [ 3937.815095] BTRFS info (device sdi1): disabling log replay at mount time > [ 3937.815098] BTRFS info (device sdi1): disk space caching is enabled > [ 3937.815100] BTRFS info (device sdi1): has skinny extents > [ 3938.903454] BTRFS error (device sdi1): bad tree block start, want > 122583416078336 have 0 > [ 3938.994662] BTRFS error (device sdi1): bad tree block start, want > 99593231630336 have 0 > [ 3939.201321] BTRFS error (device sdi1): bad tree block start, want > 124762809384960 have 0 > [ 3939.221395] BTRFS error (device sdi1): bad tree block start, want > 124762809384960 have 0 > [ 3939.221476] BTRFS error (device sdi1): failed to read block groups: -5 > [ 3939.268928] BTRFS error (device sdi1): open_ctree failed This looks like a super is expecting something that just isn't there at all. If spurious behavior lasted only briefly during the hardware failure, there's a chance of recovery. But this diminishes greatly if the chaotic behavior was on-going for a while, many seconds or a few minutes. > I still hope that there might be some error in the fs created by the > crash, which can be resolved instead of real damage to all the data in > the FS trees. I used a lot of snapshots and deduplication on that > device, so that I expect some damage by a hardware error. But I find > it hard to believe that every file got damaged. Correct. They aren't actually damaged. However, there's maybe 5-15 MiB of critical metadata on Btrfs, and if it gets corrupt, the keys to the maze are lost. And it becomes difficult, sometimes impossible, to "bootstrap" the file system. There are backup entry points, but depending on the workload, they go stale in seconds to a few minutes, and can be subject to being overwritten. 
When 'btrfs restore' does a partial recovery that ends up with a lot of damage and holes, that tells me it's found stale parts of the file system - it's on old rails, so to speak; there's nothing available to tell it that this portion of the tree is just old and not valid anymore (or only partially valid), but also the restore code is designed to be more tolerant of errors because otherwise it would just do nothing at all. I think if you're able to find the most recent root node for a snapshot you want to restore, along with an intact chunk tree, it should be possible to get data out of that snapshot. The difficulty is finding it, because it could be almost anywhere. OK, so you said there's an original and a backup file system; are they both in equally bad shape, having been on the same controller? Are they both btrfs? What do you get for btrfs insp dump-s -f /dev/sdXY? There might be a backup tree root in there that can be used with btrfs restore -t Also, sometimes easier to do this on IRC on freenode.net in the channel #btrfs -- Chris Murphy
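Concretely, the backup roots are visible in the full super dump, and any of those bytenrs can be fed to restore; the device and destination paths below are placeholders:

  btrfs insp dump-s -f /dev/sdXY | grep backup_tree_root
  btrfs restore -t <backup_tree_root bytenr> -v -i /dev/sdXY /path/to/save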
Re: [report] lockdep warning when mounting seed device
On Wed, Feb 24, 2021 at 9:40 PM Su Yue wrote: > > > While playing with seed device(misc/next and v5.11), lockdep > complains the following: > > To reproduce: > > dev1=/dev/sdb1 > dev2=/dev/sdb2 > > umount /mnt > > mkfs.btrfs -f $dev1 > > btrfstune -S 1 $dev1 No mount or copying data to the file system after mkfs and before setting the seed flag? I wonder if that's related to the splat, even though it shouldn't happen. -- Chris Murphy
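For clarity, the variation being asked about would look something like this; same device variable as in the quoted reproducer, and the populate step in the middle is the part that differs:

  mkfs.btrfs -f $dev1
  mount $dev1 /mnt
  cp -a /usr/share/doc /mnt    # put some data on it before sealing it (any data works; this path is just an example)
  umount /mnt
  btrfstune -S 1 $dev1         # now flag it as a seed device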
Re: All files are damaged after btrfs restore
On Fri, Feb 26, 2021 at 9:01 AM Sebastian Roller wrote: > > > > I think you best chance is to start out trying to restore from a > > > recent snapshot. As long as the failed controller wasn't writing > > > totally spurious data in random locations, that snapshot should be > > > intact. > > > > i.e. the strategy for this is btrfs restore -r option > > > > That only takes subvolid. You can get a subvolid listing with -l > > option but this doesn't show the subvolume names yet (patch is > > pending) > > https://github.com/kdave/btrfs-progs/issues/289 > > > > As an alternative to applying that and building yourself, you can > > approximate it with: > > > > sudo btrfs insp dump-t -t 1 /dev/sda6 | grep -A 1 ROOT_REF > > > > e.g. > > item 9 key (FS_TREE ROOT_REF 631) itemoff 14799 itemsize 26 > > root ref key dirid 256 sequence 54 name varlog34 > > > > Using this command I got a complete list of all the snapshots back to > 2016 with full name. > I tried to restore from different snapshots and using btrfs restore -t > from some other older roots. > Unfortunately no matter which root I restore from, the files are > always the same. I selected a list of some larger files, namely ppts > and sgmls from one of our own tools, and restored them from different > roots. Then I compared the files by checksums. They are the same from > all roots I could find the files. > The output of btrfs restore gives me some errors for checksums and > deflate, but most of the files are just listed as restored. > > Errors look like this: > > Restoring > /mnt/dumpo/recover/transfer/Hardware_Software/ABAQUS/AWI/AWI_6.14-2_2015.zip > Restoring > /mnt/dumpo/recover/transfer/Hardware_Software/ABAQUS/AWI/installInstructions.txt > Done searching /Hardware_Software/ABAQUS/AWI > checksum verify failed on 57937054842880 found 00B6 wanted > ERROR: lzo decompress failed: -4 > Error copying data for > /mnt/dumpo/recover/transfer/Hardware_Software/ABAQUS/CM/CMA_win86_32_2012.0928.3/setup.exe > Error searching > /mnt/dumpo/recover/transfer/Hardware_Software/ABAQUS/CM/CMA_win86_32_2012.0928.3/setup.exe > ERROR: lzo decompress failed: -4 > Error copying data for > /mnt/dumpo/recover/transfer/Hardware_Software/ABAQUS/CM/CMAInstaller.msi > ERROR: lzo decompress failed: -4 > Error copying data for > /mnt/dumpo/recover/transfer/Hardware_Software/ABAQUS/CM/setup.exe > Error searching > /mnt/dumpo/recover/transfer/Hardware_Software/ABAQUS/CM/setup.exe > > Most of the files are just listed as "Restoring ...". Still they are > severely damaged afterwards. They seem to contain "holes" filled with > 0x00 (this is from some rudimentary hexdump examination of the files.) > > Any chance to recover/restore from that? Thanks. I don't know. The exact nature of the damage of a failing controller is adding a significant unknown component to it. If it was just a matter of not writing anything at all, then there'd be no problem. But it sounds like it wrote spurious or corrupt data, possibly into locations that weren't even supposed to be written to. I think if the snapshot b-tree is ok, and the chunk b-tree is ok, then it should be possible to recover the data correctly without needing any other tree. I'm not sure if that's how btrfs restore already works. Kernel 5.11 has a new feature, mount -o ro,rescue=all that is more tolerant of mounting when there are various kinds of problems. But there's another thread where a failed controller is thwarting recovery, and that code is being looked at for further enhancement. 
https://lore.kernel.org/linux-btrfs/CAEg-Je-DJW3saYKA2OBLwgyLU6j0JOF7NzXzECi0HJ5hft_5=a...@mail.gmail.com/ -- Chris Murphy
Re: All files are damaged after btrfs restore
On Wed, Feb 24, 2021 at 10:40 PM Chris Murphy wrote: > > I think you best chance is to start out trying to restore from a > recent snapshot. As long as the failed controller wasn't writing > totally spurious data in random locations, that snapshot should be > intact. i.e. the strategy for this is btrfs restore -r option That only takes subvolid. You can get a subvolid listing with -l option but this doesn't show the subvolume names yet (patch is pending) https://github.com/kdave/btrfs-progs/issues/289 As an alternative to applying that and building yourself, you can approximate it with: sudo btrfs insp dump-t -t 1 /dev/sda6 | grep -A 1 ROOT_REF e.g. item 9 key (FS_TREE ROOT_REF 631) itemoff 14799 itemsize 26 root ref key dirid 256 sequence 54 name varlog34 The subvolume varlog34 is subvolid 631. It's the same for snapshots. So the restore command will use -r 631 to restore only from that subvolume. -- Chris Murphy
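Written out in full for that example, the command would be something like this; the destination directory is a placeholder:

  btrfs restore -r 631 -v /dev/sda6 /path/to/save/files   # restore only subvolid 631 (varlog34 in the example above)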
Re: All files are damaged after btrfs restore
108864 > > device name = /dev/sdh1 > superblock bytenr = 274877906944 > > [All bad supers]: > > All supers are valid, no need to recover > > > root@hikitty:/mnt$ btrfs rescue chunk-recover /dev/sdf1 > Scanning: DONE in dev0 > checksum verify failed on 99593231630336 found E4E3BDB6 wanted > checksum verify failed on 99593231630336 found E4E3BDB6 wanted > checksum verify failed on 124762809384960 found E4E3BDB6 wanted > checksum verify failed on 124762809384960 found E4E3BDB6 wanted > checksum verify failed on 124762809384960 found E4E3BDB6 wanted > checksum verify failed on 124762809384960 found E4E3BDB6 wanted > bytenr mismatch, want=124762809384960, have=0 > open with broken chunk error > Chunk tree recovery failed > > ^^ This was btrfs v4.14 > > > root@hikitty:~$ install/btrfs-progs-5.9/btrfs check --readonly /dev/sdi1 > Opening filesystem to check... > checksum verify failed on 99593231630336 found 00B6 wanted > checksum verify failed on 124762809384960 found 00B6 wanted > checksum verify failed on 124762809384960 found 00B6 wanted > checksum verify failed on 124762809384960 found 00B6 wanted > bad tree block 124762809384960, bytenr mismatch, want=124762809384960, have=0 > ERROR: failed to read block groups: Input/output error > ERROR: cannot open file system > > > FIRST MOUNT AT BOOT TIME AFTER DESASTER > Feb 15 08:05:11 hikitty kernel: BTRFS info (device sdf1): disk space > caching is enabled > Feb 15 08:05:11 hikitty kernel: BTRFS info (device sdf1): has skinny extents > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944039161856 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944039161856 (dev /dev/sdf1 sector 3974114336) > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944039165952 (dev /dev/sdf1 sector 3974114344) > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944039170048 (dev /dev/sdf1 sector 3974114352) > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944039174144 (dev /dev/sdf1 sector 3974114360) > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944037851136 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944037851136 (dev /dev/sdf1 sector 3974111776) > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944037855232 (dev /dev/sdf1 sector 3974111784) > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944037859328 (dev /dev/sdf1 sector 3974111792) > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944037863424 (dev /dev/sdf1 sector 3974111800) > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944040767488 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944040767488 (dev /dev/sdf1 sector 3974117472) > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944040771584 (dev /dev/sdf1 sector 3974117480) > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944035147776 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944035115008 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS error 
(device sdf1): bad tree > block start, want 141944035131392 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944036327424 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944036278272 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944035164160 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944036294656 have 0 > Feb 15 08:05:16 hikitty kernel: BTRFS error (device sdf1): failed to > verify dev extents against chunks: -5 > Feb 15 08:05:16 hikitty kernel: BTRFS error (device sdf1): open_ctree failed I think your best chance is to start out trying to restore from a recent snapshot. As long as the failed controller wasn't writing totally spurious data in random locations, that snapshot should be intact. If there are no recent snapshots, and it's unknown what the controller was doing while it was failing or how long it was failing for, recovery can be difficult. Try using btrfs-find-root to find older roots, and use that value with the btrfs restore -t option. These are not as tidy as snapshots though; the older they are, the more they dead-end into more recent overwrites. So you want to start out with the most recent roots you can and work backwards in time. -- Chris Murphy
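Mechanically, working backwards through roots looks roughly like this; the bytenr and destination are placeholders, and sdf1 is the device from this thread:

  btrfs-find-root /dev/sdf1                                 # lists candidate tree roots; try the newest generation first
  btrfs restore -t <bytenr> -v -i /dev/sdf1 /path/to/save   # repeat with progressively older bytenrs until the output looks sane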
Re: 5.11 free space tree remount warning
On Sat, Feb 20, 2021 at 5:26 PM Wang Yugui wrote: > 1, this warning [*1] is not loged in /var/log/messages > because it happened after the ro remount of / >my server is a dell PowerEdge T640, this log can be confirmed by > iDRAC console. This is a fair point. The systemd journal is also not logging this for the same reason. I see it on the console on reboots when there's enough of a delay to notice it, and "warning" pretty much always catches my eye. -- Chris Murphy
5.11 free space tree remount warning
Hi, systemd does remount ro at reboot/shutdown time, and if the free space tree exists, this is always logged: [ 27.476941] systemd-shutdown[1]: Unmounting file systems. [ 27.479756] [1601]: Remounting '/' read-only in with options 'seclabel,compress=zstd:1,space_cache=v2,subvolid=258,subvol=/root'. [ 27.489196] BTRFS info (device vda3): using free space tree [ 27.492009] BTRFS warning (device vda3): remount supports changing free space tree only from ro to rw Is there a way to better detect that this isn't an attempt to change to v2? If there's no v1 present, it's not a change. -- Chris Murphy
Re: ERROR: failed to read block groups: Input/output error
(421 ROOT_ITEM 0) 21060222500864 level 2 > > tree key (427 ROOT_ITEM 0) 21061262114816 level 2 > > tree key (428 ROOT_ITEM 0) 21061278040064 level 2 > > tree key (440 ROOT_ITEM 0) 21061362417664 level 2 > > tree key (451 ROOT_ITEM 0) 21061017174016 level 2 > > tree key (454 ROOT_ITEM 0) 21559581114368 level 1 > > tree key (455 ROOT_ITEM 0) 21079314776064 level 1 > > tree key (456 ROOT_ITEM 0) 21058026831872 level 2 > > tree key (457 ROOT_ITEM 0) 21060907909120 level 3 > > tree key (497 ROOT_ITEM 0) 21058120990720 level 2 > > tree key (571 ROOT_ITEM 0) 21058195668992 level 2 > > tree key (599 ROOT_ITEM 0) 21058818015232 level 2 > > tree key (635 ROOT_ITEM 0) 21056973766656 level 2 > > tree key (638 ROOT_ITEM 0) 21061023072256 level 0 > > tree key (676 ROOT_ITEM 0) 21061314330624 level 2 > > tree key (3937 ROOT_ITEM 0) 21061408686080 level 0 > > tree key (3938 ROOT_ITEM 0) 21079315841024 level 1 > > tree key (3957 ROOT_ITEM 0) 21061419139072 level 2 > > tree key (6128 ROOT_ITEM 0) 21061400018944 level 1 > > tree key (8575 ROOT_ITEM 0) 21061023055872 level 0 > > tree key (18949 ROOT_ITEM 1728623) 21080421875712 level 1 > > tree key (18950 ROOT_ITEM 1728624) 21080424726528 level 2 > > tree key (18951 ROOT_ITEM 1728625) 21080424824832 level 2 > > tree key (18952 ROOT_ITEM 1728626) 21080426004480 level 3 > > tree key (18953 ROOT_ITEM 1728627) 21080422105088 level 2 > > tree key (18954 ROOT_ITEM 1728628) 21080424497152 level 2 > > tree key (18955 ROOT_ITEM 1728629) 21080426332160 level 2 > > tree key (18956 ROOT_ITEM 1728631) 21080423645184 level 2 > > tree key (18957 ROOT_ITEM 1728632) 21080425316352 level 2 > > tree key (18958 ROOT_ITEM 1728633) 21080423972864 level 2 > > tree key (18959 ROOT_ITEM 1728634) 2108042240 level 2 > > tree key (18960 ROOT_ITEM 1728635) 21080422662144 level 2 > > tree key (18961 ROOT_ITEM 1728636) 21080423153664 level 2 > > tree key (18962 ROOT_ITEM 1728637) 21080425414656 level 2 > > tree key (18963 ROOT_ITEM 1728638) 21080421171200 level 1 > > tree key (18964 ROOT_ITEM 1728639) 21080423481344 level 2 > > tree key (19721 ROOT_ITEM 0) 21076937326592 level 2 > > checksum verify failed on 21057125580800 found 0026 wanted 0035 > > checksum verify failed on 21057108082688 found 0074 wanted FFC5 > > checksum verify failed on 21057108082688 found 00ED wanted FFC5 > > checksum verify failed on 21057108082688 found 0074 wanted FFC5 > > Csum didn't match > > From what I understand it seems that some EXTENT_ITEM is corrupted and > when mount tries to read block groups it encounters csum mismatch for > it and immediatly aborts. > Is there some tool I could use to check this EXTENT_ITEM and see if it > can be fixed or maybe just removed? > Basically I guess I need to find physical location on disk from this > block number. > Also I think ignoring csum for btrfs inspect would be useful. > > $ btrfs inspect dump-tree -b 21057050689536 /dev/sda > btrfs-progs v5.10.1 > node 21057050689536 level 1 items 281 free space 212 generation > 2262739 owner EXTENT_TREE > node 21057050689536 flags 0x1(WRITTEN) backref revision 1 > fs uuid 8aef11a9-beb6-49ea-9b2d-7876611a39e5 > chunk uuid 4ffec48c-28ed-419d-ba87-229c0adb2ab9 > [...] > key (19264654909440 EXTENT_ITEM 524288) block 21057101103104 gen 2262739 > [...] 
> > > > $ btrfs inspect dump-tree -b 21057101103104 /dev/sda > btrfs-progs v5.10.1 > checksum verify failed on 21057101103104 found 00B9 wanted 0075 > checksum verify failed on 21057101103104 found 009C wanted 0075 > checksum verify failed on 21057101103104 found 00B9 wanted 0075 > Csum didn't match > ERROR: failed to read tree block 21057101103104 > > > Thanks! What do you get for: btrfs rescue super-recover -v /dev/ and btrfs check -b /dev/ You might try kernel 5.11 which has a new mount option that will skip bad roots and csums. It's 'mount -o ro,rescue=all' and while it won't let you fix it, on the off chance it mounts, it'll let you get data out before trying to repair the file system (repair sometimes makes things worse). -- Chris Murphy
Re: corrupt leaf, unexpected item end, unmountable
On Thu, Feb 18, 2021 at 6:12 PM Daniel Dawson wrote: > > On 2/18/21 3:57 PM, Chris Murphy wrote: > > metadata raid6 as well? > > Yes. Once everything else is figured out, you should consider converting metadata to raid1c3. https://lore.kernel.org/linux-btrfs/20200627032414.gx10...@hungrycats.org/ > > What replacement command(s) are you using? > > For this drive, it was "btrfs replace start -r 3 /dev/sda3 /" OK replace is good. > > Do a RAM test for as long as you can tolerate it, or it finds the > > defect. Sometimes they show up quickly, other times days. > I didn't think of a flipped bit. Thanks. > >> devid0 size 457.64GiB used 39.53GiB path /dev/sdc3 > >> devid1 size 457.64GiB used 39.56GiB path /dev/sda3 > >> devid2 size 457.64GiB used 39.56GiB path /dev/sdb3 > >> devid4 size 457.64GiB used 39.53GiB path /dev/sdd3 > > > > This is confusing. devid 3 is claimed to be missing, but fi show isn't > > showing any missing devices. If none of sd[abcd] are devid 3, then > > what dev node is devid 3 and where is it? > It looks to me like btrfs is temporarily assigning devid 0 to the new > device being used as a replacement. That is what I observed before; once > the replace operation was complete, it went back to the normal number. > Since the replacement didn't finish this time, sdc3 is still devid 0. The new replacement is devid 0 during the replacement. The drive being replaced keeps its devid until the end, and then there's a switch: that device is removed, and the signature on the old drive is wiped. Sooo something is still wrong with the above, because there's no devid 3 in that listing, yet there are kernel and btrfs check messages saying devid 3 is missing. It doesn't seem likely that /dev/sdc3 is devid 3 because it can't be both missing and be the mounted dev node. >[ 202.676601] BTRFS warning (device sdc3): devid 3 uuid >911a642e-0a4c-4483-9a1f-cde7b87c5519 is missing Try a reboot, and use blkid to check you've got all devices + 1 (the new one that failed replacement). Verify all supers with 'btrfs rescue super-recover -v', and check that it all correlates with 'btrfs filesystem show' as well. What should be true is the replace will resume upon being normally mounted. But for that to happen, all the drives + 1 must be available. If a tree log is damaged and prevents mount, then you need to make a calculation. You can try to mount with ro,nologreplay and freshen backups for anything you'd rather not lose - just in case things get worse. And then you can zero the log and see if that'll let you normally mount the device (i.e. rw and not degraded). But some of it will depend on what's wrong. -- Chris Murphy
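For the record, the raid1c3 conversion mentioned at the top is a single balance with a convert filter, to be run only once the array is healthy again; the mount point is a placeholder, and raid1c3 needs kernel 5.5 or newer:

  btrfs balance start -mconvert=raid1c3 /mnt   # convert metadata chunks to raid1c3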
Re: corrupt leaf, unexpected item end, unmountable
On Wed, Feb 17, 2021 at 7:43 PM Daniel Dawson wrote: > > I was attempting to replace the drives in an array with RAID6 profile. metadata raid6 as well? What replacement command(s) are you using? > The first replacement was seemingly successful (and there was a scrub > afterward, with no errors). However, about 0.6% into the second > replacement (sdc), something went wrong, and it went read-only (I should > have copied the log of that somehow). Now it refuses to mount, and a > (readonly) check cannot get started. > > > # mount -o ro,degraded /dev/sda3 /mnt > mount: /mnt: can't read superblock on /dev/sda3. > # btrfs rescue super-recover /dev/sda3 > All supers are valid, no need to recover > > > For this, dmesg shows: > > [ 202.675384] BTRFS info (device sdc3): allowing degraded mounts > [ 202.675387] BTRFS info (device sdc3): disk space caching is enabled > [ 202.675389] BTRFS info (device sdc3): has skinny extents > [ 202.676302] BTRFS warning (device sdc3): devid 3 uuid > 911a642e-0a4c-4483-9a1f-cde7b87c5519 is missing > [ 202.676601] BTRFS warning (device sdc3): devid 3 uuid > 911a642e-0a4c-4483-9a1f-cde7b87c5519 is missing What device is devid 3? > [ 202.985528] BTRFS info (device sdc3): bdev /dev/sdb3 errs: wr 0, rd > 0, flush 0, corrupt 26, gen 0 > [ 202.985533] BTRFS info (device sdc3): bdev /dev/sdd3 errs: wr 0, rd > 0, flush 0, corrupt 98, gen 0 > [ 203.278131] BTRFS info (device sdc3): start tree-log replay > [ 203.454496] BTRFS critical (device sdc3): corrupt leaf: root=7 > block=371567214592 slot=0, unexpected item end, have 16315 expect 16283 > [ 203.454499] BTRFS error (device sdc3): block=371567214592 read time > tree block corruption detected > [ 203.454634] BTRFS critical (device sdc3): corrupt leaf: root=7 > block=371567214592 slot=0, unexpected item end, have 16315 expect 16283 > [ 203.454636] BTRFS error (device sdc3): block=371567214592 read time > tree block corruption detected > [ 203.455794] BTRFS critical (device sdc3): corrupt leaf: root=7 > block=371567214592 slot=0, unexpected item end, have 16315 expect 16283 16315 = 0x3fbb, 16283 = 0x3f9b, and 16315 ^ 16283 = 32 (0x20): the two values differ by a single bit (...1110111011 vs ...1110011011). Do a RAM test for as long as you can tolerate it, or until it finds the defect. Sometimes they show up quickly, other times days. > [ 203.455796] BTRFS error (device sdc3): block=371567214592 read time > tree block corruption detected > [ 203.455820] BTRFS: error (device sdc3) in __btrfs_free_extent:3105: > errno=-5 IO failure > [ 203.455823] BTRFS: error (device sdc3) in > btrfs_run_delayed_refs:2208: errno=-5 IO failure > [ 203.455833] BTRFS: error (device sdc3) in btrfs_replay_log:2287: > errno=-5 IO failure (Failed to recover log tree) > [ 203.747758] BTRFS error (device sdc3): open_ctree failed > > > I've looked for, but can't find, any bad blocks on the devices. Also, if > it adds any info... > > # btrfs check --readonly /dev/sda3 > Opening filesystem to check... > warning, device 3 is missing > checksum verify failed on 371587727360 found 00FF wanted 0049 > checksum verify failed on 371587727360 found 0005 wanted 0010 > checksum verify failed on 371587727360 found 0005 wanted 0010 > bad tree block 371587727360, bytenr mismatch, want=371587727360, > have=1076190010624 > ERROR: could not setup extent tree > ERROR: cannot open file system > > > Note: I'm running this off of System Rescue 7.01, which has earlier > versions of things than what the machine in question has installed (the > latter being Linux 5.10.16, with btrfs-progs v5.10.1).
> > # uname -a > Linux sysrescue 5.4.78-1-lts #1 SMP Wed, 18 Nov 2020 19:51:49 + > x86_64 GNU/Linux > # btrfs --version > btrfs-progs v5.4.1 > # btrfs filesystem show > Label: 'vroot2020' uuid: 5214d903-783a-4d14-ac78-046da5ac1db7 > Total devices 4 FS bytes used 65.98GiB > devid0 size 457.64GiB used 39.53GiB path /dev/sdc3 > devid1 size 457.64GiB used 39.56GiB path /dev/sda3 > devid2 size 457.64GiB used 39.56GiB path /dev/sdb3 > devid4 size 457.64GiB used 39.53GiB path /dev/sdd3 This is confusing. devid 3 is claimed to be missing, but fi show isn't showing any missing devices. If none of sd[abcd] are devid 3, then what dev node is devid 3 and where is it? But yeah you're probably best off not trying to fix this file system until the memory is sorted out. -- Chris Murphy
Re: Recovering Btrfs from a freak failure of the disk controller
On Sun, Feb 14, 2021 at 4:24 PM Neal Gompa wrote: > > On Sun, Feb 14, 2021 at 5:11 PM Chris Murphy wrote: > > > > On Sun, Feb 14, 2021 at 1:29 PM Neal Gompa wrote: > > > > > > Hey all, > > > > > > So one of my main computers recently had a disk controller failure > > > that caused my machine to freeze. After rebooting, Btrfs refuses to > > > mount. I tried to do a mount and the following errors show up in the > > > journal: > > > > > > > Feb 14 15:20:49 localhost-live kernel: BTRFS info (device sda3): disk > > > > space caching is enabled > > > > Feb 14 15:20:49 localhost-live kernel: BTRFS info (device sda3): has > > > > skinny extents > > > > Feb 14 15:20:49 localhost-live kernel: BTRFS critical (device sda3): > > > > corrupt leaf: root=401 block=796082176 slot=15 ino=203657, invalid > > > > inode transid: has 96 expect [0, 95] > > > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > > > block=796082176 read time tree block corruption detected > > > > Feb 14 15:20:49 localhost-live kernel: BTRFS critical (device sda3): > > > > corrupt leaf: root=401 block=796082176 slot=15 ino=203657, invalid > > > > inode transid: has 96 expect [0, 95] > > > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > > > block=796082176 read time tree block corruption detected > > > > Feb 14 15:20:49 localhost-live kernel: BTRFS warning (device sda3): > > > > couldn't read tree root > > > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > > > open_ctree failed > > > > > > I've tried to do -o recovery,ro mount and get the same issue. I can't > > > seem to find any reasonably good information on how to do recovery in > > > this scenario, even to just recover enough to copy data off. > > > > > > I'm on Fedora 33, the system was on Linux kernel version 5.9.16 and > > > the Fedora 33 live ISO I'm using has Linux kernel version 5.10.14. I'm > > > using btrfs-progs v5.10. > > > > Oh and also that block: > > > > btrfs insp dump-t -b 796082176 /dev/sda3 > > > > So, I've attached the output of the dump-s and dump-t commands. > > As for the other commands: > > # btrfs check --readonly /dev/sda3 > > Opening filesystem to check... > > parent transid verify failed on 796082176 wanted 94 found 96 Not good. So three different transids in play. Super says generation 94 Leaf block says its generation is 96, and two inodes have transid 96 including the one the tree checker is complaining about. Somehow the super has an older generation than both what's in the leaf and what's expected. > > parent transid verify failed on 796082176 wanted 94 found 96 > > parent transid verify failed on 796082176 wanted 94 found 96 > > Ignoring transid failure > > ERROR: could not setup extent tree > > ERROR: cannot open file system > > # mount -o ro,rescue=all /dev/sda3 /mnt > > mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sda3, > > missing codepage or helper program, or other error. Do you get the same kernel messages as originally reported? Or something different? -- Chris Murphy
Re: Recovering Btrfs from a freak failure of the disk controller
On Sun, Feb 14, 2021 at 1:29 PM Neal Gompa wrote: > > Hey all, > > So one of my main computers recently had a disk controller failure > that caused my machine to freeze. After rebooting, Btrfs refuses to > mount. I tried to do a mount and the following errors show up in the > journal: > > > Feb 14 15:20:49 localhost-live kernel: BTRFS info (device sda3): disk space > > caching is enabled > > Feb 14 15:20:49 localhost-live kernel: BTRFS info (device sda3): has skinny > > extents > > Feb 14 15:20:49 localhost-live kernel: BTRFS critical (device sda3): > > corrupt leaf: root=401 block=796082176 slot=15 ino=203657, invalid inode > > transid: has 96 expect [0, 95] > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > block=796082176 read time tree block corruption detected > > Feb 14 15:20:49 localhost-live kernel: BTRFS critical (device sda3): > > corrupt leaf: root=401 block=796082176 slot=15 ino=203657, invalid inode > > transid: has 96 expect [0, 95] > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > block=796082176 read time tree block corruption detected > > Feb 14 15:20:49 localhost-live kernel: BTRFS warning (device sda3): > > couldn't read tree root > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > open_ctree failed > > I've tried to do -o recovery,ro mount and get the same issue. I can't > seem to find any reasonably good information on how to do recovery in > this scenario, even to just recover enough to copy data off. > > I'm on Fedora 33, the system was on Linux kernel version 5.9.16 and > the Fedora 33 live ISO I'm using has Linux kernel version 5.10.14. I'm > using btrfs-progs v5.10. Oh and also that block: btrfs insp dump-t -b 796082176 /dev/sda3 -- Chris Murphy
Re: Recovering Btrfs from a freak failure of the disk controller
Can you also include: btrfs insp dump-s I wonder if log replay is indicated by a non-zero value for log_root in the super block. If so, check whether ro,nologreplay or ro,nologreplay,usebackuproot works. -- Chris Murphy
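Spelling that out, with sda3 as the device from the report:

  btrfs insp dump-s /dev/sda3 | grep log_root            # a non-zero log_root means a log tree is pending replay
  mount -o ro,nologreplay /dev/sda3 /mnt
  mount -o ro,nologreplay,usebackuproot /dev/sda3 /mnt   # second attempt if the first fails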
Re: Recovering Btrfs from a freak failure of the disk controller
On Sun, Feb 14, 2021 at 1:29 PM Neal Gompa wrote: > > Feb 14 15:20:49 localhost-live kernel: BTRFS info (device sda3): disk space > > caching is enabled > > Feb 14 15:20:49 localhost-live kernel: BTRFS info (device sda3): has skinny > > extents > > Feb 14 15:20:49 localhost-live kernel: BTRFS critical (device sda3): > > corrupt leaf: root=401 block=796082176 slot=15 ino=203657, invalid inode > > transid: has 96 expect [0, 95] > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > block=796082176 read time tree block corruption detected > > Feb 14 15:20:49 localhost-live kernel: BTRFS critical (device sda3): > > corrupt leaf: root=401 block=796082176 slot=15 ino=203657, invalid inode > > transid: has 96 expect [0, 95] > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > block=796082176 read time tree block corruption detected > > Feb 14 15:20:49 localhost-live kernel: BTRFS warning (device sda3): > > couldn't read tree root > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > open_ctree failed > > I've tried to do -o recovery,ro mount and get the same issue. I can't > seem to find any reasonably good information on how to do recovery in > this scenario, even to just recover enough to copy data off. > > I'm on Fedora 33, the system was on Linux kernel version 5.9.16 and > the Fedora 33 live ISO I'm using has Linux kernel version 5.10.14. I'm > using btrfs-progs v5.10. > > Can anyone help? >has 96 expect [0, 95] Off by one error. I haven't previously seen this with 'invalid inode transid'. There's an old kernel bug (long since fixed) that can inject garbage into the inode transid but that's not what's going on here. What do you get for: btrfs check --readonly In the meantime, it might be worth trying 5.11-rc7 or rc8 with the new 'ro,rescue=all' mount option and see if it can skip over this kind of problem. The "parent transid verify failed" are pretty serious, again not the same thing here. But I'm not sure how resilient repair is for either off by one errors, or bitflips still. -- Chris Murphy
Re: is BTRFS_IOC_DEFRAG behavior optimal?
On Wed, Feb 10, 2021 at 11:12 PM Zygo Blaxell wrote: > > If we want the data compressed (and who doesn't? journal data compresses > 8:1 with btrfs zstd) then we'll always need to make a copy at close. > Because systemd used prealloc, the copy is necessarily to a new inode, > as there's no way to re-enable compression on an inode once prealloc > is used (this has deep disk-format reasons, but not as deep as the > nodatacow ones). Pretty sure sd-journald still fallocates even when the journals are datacow (done by touching /etc/tmpfiles.d/journal-nocow.conf). And I know for sure those datacow files do compress on rotation. Preallocated datacow might not be so bad if it weren't for that one damn header or indexing block, whatever the proper term is, that sd-journald hammers every time it fsyncs. I don't know if I wanna know what it means to snapshot a datacow file that's prealloc. But in theory if the same blocks weren't all being hammered, a preallocated file shouldn't fragment like hell if each prealloc block gets just one write. > If we don't care about compression or datasums, then keep the file > nodatacow and do nothing at close. The defrag isn't needed and the > FS_NOCOW_FL flag change doesn't work. Agreed. > It makes sense for SSD too. It's 4K extents, so the metadata and small-IO > overheads will be non-trivial even on SSD. Deleting or truncating datacow > journal files will put a lot of tiny free space holes into the filesystem. > It will flood the next commit with delayed refs and push up latency. I haven't seen meaningful latency on a single journal file, datacow and heavily fragmented, on ssd. But to test on more than one file at a time I need to revert the defrag commits, and build systemd, and let a bunch of journals accumulate somehow. If I dump too much data artificially to try and mimic aging, I know I will get nowhere near as many of those 4KiB extents. So I dunno. > > > In that case the fragmentation is > > quite considerable, hundreds to thousands of extents. It's > > sufficiently bad that it'd be probably be better if they were > > defragmented automatically with a trigger that tests for number of > > non-contiguous small blocks that somehow cheaply estimates latency > > reading all of them. > > Yeah it would be nice of autodefrag could be made to not suck. It triggers on inserts, not appends. So it doesn't do anything for the sd-journald case. I would think the active journals are the ones more likely to get searched for recent events than archived journals. So in the datacow case, you only get relief once it's rotated. It'd be nice to find a decent, not necessarily perfect, way for them to not get so fragmented in the first place. Or just defrag once a file has 16M of non-contiguous extents. Estimating extents though is another issue, especially with compression enabled. -- Chris Murphy
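The "touching" above refers to masking the stock tmpfiles rule; a sketch of what that setup looks like (the stock rule ships in /usr/lib/tmpfiles.d and marks /var/log/journal nodatacow, and the machine-id path is a placeholder):

  touch /etc/tmpfiles.d/journal-nocow.conf   # an empty file in /etc masks the rule of the same name in /usr/lib
  lsattr -d /var/log/journal/<machine-id>    # 'C' here means new journal files will inherit nodatacow

Existing journal files keep whatever attribute they were created with; only newly created files are affected.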
Re: is BTRFS_IOC_DEFRAG behavior optimal?
On Wed, Feb 10, 2021 at 8:13 PM Zygo Blaxell wrote: > > > At file close, the systemd should copy the data to a new file with no > > special attributes and discard or recycle the old inode. This copy > > will be mostly contiguous and have desirable properties like csums and > > compression, and will have iops equivalent to btrfs fi defrag. Or switch to a cow-friendly format that's no worse on overwriting file systems, but improves things on Btrfs and ZFS. RocksDB does well. -- Chris Murphy
Re: is BTRFS_IOC_DEFRAG behavior optimal?
On Wed, Feb 10, 2021 at 8:13 PM Zygo Blaxell wrote: > > Sorry, I busted my mail client. That was from me. :-P > > On Wed, Feb 10, 2021 at 10:08:37PM -0500, kreij...@inwind.it wrote: > > On Wed, Feb 10, 2021 at 08:14:09PM +0100, Goffredo Baroncelli wrote: > > > Hi Chris, > > > > > > it seems that systemd-journald is more smart/complex than I thought: > > > > > > 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it > > > closes the files, it mark again these as COW then defrag [1] > > > > > > 2) looking at the code, I suspect that systemd-journald closes the > > > file asynchronously [2]. This means that looking at the "live" journal > > > is not sufficient. In fact: > > > > > > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *) > > > [...] > > > - > > > user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-bd4f-0005baed61106a18.journal > > > - > > > system@3f2405cf9bcf42f0abe6de5bc702e394-bd64-0005baed659feff4.journal > > > - > > > user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-bd67-0005baed65a0901f.journal > > > ---C- > > > system@3f2405cf9bcf42f0abe6de5bc702e394-cc63-0005bafed4f12f0a.journal > > > ---C- > > > user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-cc85-0005baff0ce27e49.journal > > > ---C- > > > system@3f2405cf9bcf42f0abe6de5bc702e394-cd38-0005baffe9080b4d.journal > > > ---C- > > > user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-cd3b-0005baffe908f244.journal > > > ---C- user-1000.journal > > > ---C- system.journal > > > > > > The output above means that the last 6 files are "pending" for a > > > de-fragmentation. When these will be > > > "closed", the NOCOW flag will be removed and a defragmentation will start. > > > > Wait what? > > > > > Now my journals have few (2 or 3 extents). But I saw cases where the > > > extents > > > of the more recent files are hundreds, but after few "journalct --rotate" > > > the older files become less > > > fragmented. > > > > > > [1] > > > https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L383 > > > > That line doesn't work, and systemd ignores the error. > > > > The NOCOW flag cannot be set or cleared unless the file is empty. > > This is checked in btrfs_ioctl_setflags. > > > > This is not something that can be changed easily--if the NOCOW bit is > > cleared on a non-empty file, btrfs data read code will expect csums > > that aren't present on disk because they were written while the file was > > NODATASUM, and the reads will fail pretty badly. The entire file would > > have to have csums added or removed at the same time as the flag change > > (or all nodatacow file reads take a performance hit looking for csums > > that may or may not be present). > > > > At file close, the systemd should copy the data to a new file with no > > special attributes and discard or recycle the old inode. This copy > > will be mostly contiguous and have desirable properties like csums and > > compression, and will have iops equivalent to btrfs fi defrag. Journals implement their own checksumming. Yeah, if there's corruption, Btrfs raid can't do a transparent fixup. But the whole journal isn't lost, just the affected record. *shrug* I think if (a) nodatacow and/or (b) SSD, just leave it alone. Why add more writes? In particular the nodatacow case where I'm seeing consistently the file made from multiples of 8MB contiguous blocks, even on HDD the seek latency here can't be worth defraging the file. I think defrag makes sense (a) datacow journals, i.e. 
the default nodatacow is inhibited, and (b) HDD. In that case the fragmentation is quite considerable, hundreds to thousands of extents. It's sufficiently bad that it'd probably be better if they were defragmented automatically, with a trigger that tests for the number of non-contiguous small blocks and somehow cheaply estimates the latency of reading all of them. Since the files are interleaved, doing something like "systemctl status dbus" might actually read many blocks even if the result isn't a whole lot of visible data. But on SSD, cow or nocow, and HDD nocow - I think just leave them alone. -- Chris Murphy
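A minimal sketch of the "copy to a new file at close" idea from this reply, done from the shell rather than inside journald (paths are examples, and it assumes the enclosing directory does not itself carry the +C attribute, otherwise the new inode inherits nodatacow anyway):

src=/var/log/journal/<machine-id>/archived.journal
cp --reflink=never -- "$src" "$src.new"   # full data copy to a fresh inode, so it picks up csums/compression
sync "$src.new"                           # flush the copy before swapping the names
mv -- "$src.new" "$src"                   # replace the old nodatacow inode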
Re: is BTRFS_IOC_DEFRAG behavior optimal?
On Wed, Feb 10, 2021 at 12:14 PM Goffredo Baroncelli wrote: > > Hi Chris, > > it seems that systemd-journald is more smart/complex than I thought: > > 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it > closes the files, it mark again these as COW then defrag [1] Found that in commit 11689d2a021d95a8447d938180e0962cd9439763 from 2015. But archived journals are still all nocow for me on systemd 247. Is it because the enclosing directory has file attribute 'C' ? Another example: Active journal "system.journal" INODE_ITEM contains sequence 4515 flags 0x13(NODATASUM|NODATACOW|PREALLOC) 7 day old archived journal "systemd.journal" INODE_ITEM shows: sequence 227 flags 0x13(NODATASUM|NODATACOW|PREALLOC) So if it ever was COW, it flipped to NOCOW before the defrag. Is it expected? and also this archived file's INODE_ITEM shows generation 1748644 transid 1760983 size 16777216 nbytes 16777216 with EXTENT_ITEMs show generation 1755533 type 1 (regular) generation 1753668 type 1 (regular) generation 1755533 type 1 (regular) generation 1753989 type 1 (regular) generation 1755533 type 1 (regular) generation 1753526 type 1 (regular) generation 1755533 type 1 (regular) generation 1755531 type 1 (regular) generation 1755533 type 1 (regular) generation 1755531 type 2 (prealloc) file tree output for this file https://pastebin.com/6uDFNDdd > 2) looking at the code, I suspect that systemd-journald closes the > file asynchronously [2]. This means that looking at the "live" journal > is not sufficient. In fact: > > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *) > [...] > - > user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-bd4f-0005baed61106a18.journal > - > system@3f2405cf9bcf42f0abe6de5bc702e394-bd64-0005baed659feff4.journal > - > user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-bd67-0005baed65a0901f.journal > ---C- > system@3f2405cf9bcf42f0abe6de5bc702e394-cc63-0005bafed4f12f0a.journal > ---C- > user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-cc85-0005baff0ce27e49.journal > ---C- > system@3f2405cf9bcf42f0abe6de5bc702e394-cd38-0005baffe9080b4d.journal > ---C- > user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-cd3b-0005baffe908f244.journal > ---C- user-1000.journal > ---C- system.journal > > The output above means that the last 6 files are "pending" for a > de-fragmentation. When these will be > "closed", the NOCOW flag will be removed and a defragmentation will start. > > Now my journals have few (2 or 3 extents). But I saw cases where the extents > of the more recent files are hundreds, but after few "journalct --rotate" the > older files become less > fragmented. Josef explained to me that BTRFS_IOC_DEFRAG is pretty simple and just dirties extents it considers too small, and they end up just going through the normal write path, along with anything else pending. And also that fsync() will set the extents on disk so that the defrag ioctl know what to dirty, but that ordinarily it's not required and might have to do with the interleaving write pattern for the journals. I'm not sure what this ioctl considers big enough that it's worth just leaving alone. But in any case it sounds like the current write workload at the time of defrag could affect the allocation, unlike BTRFS_IOC_DEFRAG_RANGE which has a few knobs to control the outcome. Or maybe the knobs just influence the outcome. Not sure. If the device is HDD, it might be nice if the nodatacow journals are datacow again so they could be compressed. 
But my evaluation shows that nodatacow journals stick to an 8MB extent pattern, correlating to fallocated append as they grow. It's not significantly fragmented to start out with, whether HDD or SSD. -- Chris Murphy
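For anyone wanting to double check what their archived journals ended up with, roughly this is how I look at it (paths, device and tree id are placeholders; dump-tree output on a mounted filesystem can be slightly stale):

f=/var/log/journal/<machine-id>/system@<...>.journal
lsattr "$f"                               # a 'C' in the attribute column means No_COW
ino=$(stat -c %i "$f")
btrfs inspect-internal dump-tree -t <tree-id> /dev/<device> | grep -A3 "($ino INODE_ITEM 0)"
# the flags field is where NODATASUM|NODATACOW|PREALLOC shows up, as quoted above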
Re: is BTRFS_IOC_DEFRAG behavior optimal?
This is an active (but idle) system.journal file. That is, it's open but not being written to. I did a sync right before this: https://pastebin.com/jHh5tfpe And then: btrfs fi defrag -l 8M system.journal https://pastebin.com/Kq1GjJuh Looks like most of it was a no op. So it seems btrfs in this case is not confused by so many small extent items, it knows they are contiguous? It doesn't answer the question of what the "too small" threshold is for BTRFS_IOC_DEFRAG, which is what sd-journald is using, though. Another sync, and then, 'journalctl --rotate' and the resulting archived file is now: https://pastebin.com/aqac0dRj These are not the same results between the two ioctls for the same file, and not the same result as what you get with -l 32M (which I do get if I use the default 32M). The BTRFS_IOC_DEFRAG interleaved result is peculiar, but I don't think we can say it's ineffective; it might be an intentional no op, either because it's nodatacow or because it sees that these many extents are mostly contiguous and not worth defragmenting (which would be good for keeping write amplification down). So I don't know, maybe it's not wrong. -- Chris Murphy
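For reference, the sequence above boils down to roughly this (filenames assumed; note 'btrfs fi defrag' goes through BTRFS_IOC_DEFRAG_RANGE, while journald's rotate path uses BTRFS_IOC_DEFRAG):

cd /var/log/journal/<machine-id>
sync
filefrag -v system.journal                # baseline extent map
btrfs fi defrag -l 8M system.journal      # DEFRAG_RANGE with len=8M
filefrag -v system.journal
sync
journalctl --rotate                       # journald archives the file and defrags it via BTRFS_IOC_DEFRAG
filefrag -v system@*.journal              # compare the archived result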
Re: is BTRFS_IOC_DEFRAG behavior optimal?
On Tue, Feb 9, 2021 at 11:13 AM Goffredo Baroncelli wrote: > > On 2/9/21 1:42 AM, Chris Murphy wrote: > > Perhaps. Attach strace to journald before --rotate, and then --rotate > > > > https://pastebin.com/UGihfCG9 > > I looked to this strace. > > in line 115: it is called a ioctl() > in line 123: it is called a ioctl() > > However the two descriptors for which the defrag is invoked are never sync-ed > before. > > I was expecting is to see a sync (flush the data on the platters) and then a > ioctl(. This doesn't seems to be looking from the strace. > > I wrote a script (see below) which basically: > - create a fragmented file > - run filefrag on it > - optionally sync the file <- > - run btrfs fi defrag on it > - run filefrag on it > > If I don't perform the sync, the defrag is ineffective. But if I sync the > file BEFORE doing the defrag, I got only one extent. > Now my hypothesis is: the journal log files are bad de-fragmented because > these > are not sync-ed before. > This could be tested quite easily putting an fsync() before the > ioctl(). > > Any thought ? No idea. If it's a full sync then it could be expensive on either slower devices or heavier workloads. On the one hand, there's no point of doing an ineffective defrag so maybe the defrag ioctl should just do the sync first? On the other hand, this would effectively make the defrag ioctl a full file system sync which might be unexpected. It's a set of tradeoffs and I don't know what the expectation is. What about fdatasync() on the journal file rather than a full sync? -- Chris Murphy
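A rough shell equivalent of the test being described, for anyone who wants to repeat it (assumptions: a btrfs mount at /mnt, and that many small fsynced appends are enough to produce a fragmented file):

f=/mnt/fragtest
rm -f "$f"
for i in $(seq 200); do
    dd if=/dev/urandom of="$f" bs=4K count=1 oflag=append conv=notrunc,fsync status=none
done
filefrag "$f"            # extent count before defrag
# sync "$f"              # the variable under test: fsync the file first
btrfs fi defrag "$f"
sync
filefrag "$f"            # extent count after defrag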
Re: is BTRFS_IOC_DEFRAG behavior optimal?
On Tue, Feb 9, 2021 at 12:45 PM Goffredo Baroncelli wrote: > > On 2/9/21 8:01 PM, Chris Murphy wrote: > > On Tue, Feb 9, 2021 at 11:13 AM Goffredo Baroncelli > > wrote: > >> > >> On 2/9/21 1:42 AM, Chris Murphy wrote: > >>> Perhaps. Attach strace to journald before --rotate, and then --rotate > >>> > >>> https://pastebin.com/UGihfCG9 > >> > >> I looked to this strace. > >> > >> in line 115: it is called a ioctl() > >> in line 123: it is called a ioctl() > >> > >> However the two descriptors for which the defrag is invoked are never > >> sync-ed before. > >> > >> I was expecting is to see a sync (flush the data on the platters) and then > >> a > >> ioctl(. This doesn't seems to be looking from the strace. > >> > >> I wrote a script (see below) which basically: > >> - create a fragmented file > >> - run filefrag on it > >> - optionally sync the file <- > >> - run btrfs fi defrag on it > >> - run filefrag on it > >> > >> If I don't perform the sync, the defrag is ineffective. But if I sync the > >> file BEFORE doing the defrag, I got only one extent. > >> Now my hypothesis is: the journal log files are bad de-fragmented because > >> these > >> are not sync-ed before. > >> This could be tested quite easily putting an fsync() before the > >> ioctl(). > >> > >> Any thought ? > > > > No idea. If it's a full sync then it could be expensive on either > > slower devices or heavier workloads. On the one hand, there's no point > > of doing an ineffective defrag so maybe the defrag ioctl should just > > do the sync first? On the other hand, this would effectively make the > > defrag ioctl a full file system sync which might be unexpected. It's a > > set of tradeoffs and I don't know what the expectation is. > > > > What about fdatasync() on the journal file rather than a full sync? > > I tried a fsync(2) call, and the results is the same. > Only after reading your reply I realized that I used a sync(2), when > I meant to use fsync(2). > > I update my python test code Ok fsync should be least costly of the three. The three unique things about systemd-journald that might be factors: * nodatacow file * fallocated file in 8MB increments multiple times up to 128M * BTRFS_IOC_DEFRAG, whereas btrfs-progs uses BTRFS_IOC_DEFRAG_RANGE So maybe it's all explained by lack of fsync, I'm not sure. But the commit that added this doesn't show any form of sync. https://github.com/systemd/systemd/commit/f27a386430cc7a27ebd06899d93310fb3bd4cee7 -- Chris Murphy
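To poke at those three factors in isolation, something like this should be close (paths assumed; from the shell the last step necessarily goes through BTRFS_IOC_DEFRAG_RANGE rather than BTRFS_IOC_DEFRAG, so it's only an approximation of what journald does):

f=/var/log/journal/testfile
touch "$f"
chattr +C "$f"                            # 1) nodatacow, has to be set while the file is still empty
for i in 0 1 2; do
    fallocate -o $((i*8*1024*1024)) -l 8M "$f"    # 2) grow in fallocated 8 MiB steps
done
btrfs fi defrag "$f"                      # 3) stand-in for journald's defrag at archive time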
Re: is BTRFS_IOC_DEFRAG behavior optimal?
On Mon, Feb 8, 2021 at 3:21 PM Zygo Blaxell wrote: > defrag will put the file's contents back into delalloc, and it won't be > allocated until a flush (fsync, sync, or commit interval). Defrag is > roughly equivalent to simply copying the data to a new file in btrfs, > except the logical extents are atomically updated to point to the new > location. BTRFS_IOC_DEFRAG results: https://pastebin.com/1ufErVMs BTRFS_IOC_DEFRAG_RANGE results: https://pastebin.com/429fZmNB They're different. Questions: is this a bug? Is it intentional? Does the interleaved BTRFS_IOC_DEFRAG version improve things over the non-defragmented file, which had only 3 8MB extents for a 24MB file, plus 1 4KiB block? Should BTRFS_IOC_DEFRAG be capable of estimating fragmentation and just do a no op in that case? > FIEMAP has an option flag to sync the data before returning a map. > DEFRAG has an option to start IO immediately so it will presumably be > done by the time you look at the extents with FIEMAP. I waited for the defrag result to settle, so the results I've posted are stable. > Be very careful how you set up this test case. If you use fallocate on > a file, it has a _permanent_ effect on the inode, and alters a lot of > normal btrfs behavior downstream. You won't see these effects if you > just write some data to a file without using prealloc. OK. That might answer the idempotent question. Following BTRFS_IOC_DEFRAG most unwritten extents are no longer present. I can't figure out the pattern. Some of the archived journals have them, others have one, but none have the four or more that I see in active use journals. And then when defragged with BTRFS_IOC_DEFRAG_RANGE none of those have unwritten extents. Since the file is changing each time it goes through the ioctl it makes sense what comes out the back end is different. While BTRFS_IOC_DEFRAG_RANGE has a no op if an extent is bigger than the -l (len=) value, I can't tell that BTRFS_IOC_DEFRAG has any sort of no op unless there are no fragments at all *shrug*. Maybe they should use BTRFS_IOC_DEFRAG_RANGE and specify an 8MB extent? Because in the nodatacow case, that's what they already have and it'd be a no op. And then for the datacow case... well I don't like unconditional write amplification on SSDs just to satisfy the HDD case. But it'd be avoidable by just using default (nodatacow for the journals). -- Chris Murphy
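To see the permanent prealloc effect Zygo warns about, a quick comparison like this works (assumes a btrfs mount at /mnt):

fallocate -l 24M /mnt/prealloc-test                        # preallocated, nothing written yet
dd if=/dev/zero of=/mnt/prealloc-test bs=1M count=8 conv=notrunc,fsync status=none
filefrag -v /mnt/prealloc-test                             # the written range shows normal extents,
                                                           # the rest stays flagged "unwritten"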
Re: is BTRFS_IOC_DEFRAG behavior optimal?
On Mon, Feb 8, 2021 at 3:11 PM Goffredo Baroncelli wrote: > > On 2/7/21 11:06 PM, Chris Murphy wrote: > > systemd-journald journals on Btrfs default to nodatacow, upon log > > rotation it's submitted for defragmenting with BTRFS_IOC_DEFRAG. The > > result looks curious. I can't tell what the logic is from the results. > > > > The journal file starts out being fallocated with a size of 8MB, and > > as it grows there is an append of 8MB increments, also fallocated. > > This leads to a filefrag -v that looks like this (ext4 and btrfs > > nodatacow follow the same behavior, both are provided for reference): > > > > ext4 > > https://pastebin.com/6vuufwXt > > > > btrfs > > https://pastebin.com/Y18B2m4h > > > > Following defragment with BTRFS_IOC_DEFRAG it looks like this: > > https://pastebin.com/1ufErVMs > > > > It appears at first glance to be significantly more fragmented. Closer > > inspection shows that most of the extents weren't relocated. But > > what's up with the peculiar interleaving? Is this an improvement over > > the original allocation? > > I am not sure how read the filefrag output: I see several lines like > [...] > 5: 1691..1693: 125477..125479: 3: > 6: 1694..1694: 125480..125480: 1: > unwritten > [...] > > What means "unwritten" ? The kernel documentation [*] says: My understanding is it's an exent that's been fallocated but not yet written to. What I don't know is whether they are possibly tripping up BTRFS_IOC_DEFRAG. I'm not skilled enough to create a bunch of these journal logs quickly (I'd have to just let a system run and age its own journals, which sucks, it takes forever) and then a small program that runs the same file through BTRFS_IOC_DEFRAG twice to see if it's idempotent. The resulting file after one submission does not have unwritten extents. Another thing I'm not sure of is whether ssd vs nossd affects the defrag results. Or datacow versus nodatacow. Another thing I'm not sure of is if autodefrag is a better solution to the problem. Whereby it acts as a no op when the file is nodatacow, and does the expected thing if it's datacow. But then we'd need an autodefrag xattr to set on the enclosing directory for these journals because there's no reliable way to set autodefrag mount option globally, not knowing all the work loads. It can make some workloads worse. > My educate guess is that there is something strange in the sequence: > - write > - sync > - close log > - move log > - defrag log > > May be the defrag starts before all the data reach the platters ? Perhaps. Attach strace to journald before --rotate, and then --rotate https://pastebin.com/UGihfCG9 > > For what matters, I create a file with the same fragmentation like your one > > $ sudo filefrag -v data.txt > Filesystem type is: 9123683e > File size of data.txt is 25165824 (6144 blocks of 4096 bytes) > ext: logical_offset:physical_offset: length: expected: flags: > 0:0.. 0:1597171.. 1597171: 1: > 1:1..1599: 163433285.. 163434883: 1599:1597172: > 2: 1600..1607:1601255.. 1601262: 8: 163434884: > 3: 1608..1689:1604137.. 1604218: 82:1601263: > 4: 1690..1690:1597484.. 1597484: 1:1604219: > 5: 1691..1693:1597465.. 1597467: 3:1597485: > 6: 1694..1694:1597966.. 1597966: 1:1597468: > 7: 1695..1722:1599557.. 1599584: 28:1597967: > 8: 1723..1723:1599211.. 1599211: 1:1599585: > 9: 1724..1955:1648394.. 1648625:232:1599212: >10: 1956..1956:1599695.. 1599695: 1:1648626: >11: 1957..2047:1625881.. 1625971: 91:1599696: >12: 2048..2417:1648804.. 1649173:370:1625972: >13: 2418..2420:1597468.. 
1597470: 3:1649174: >14: 2421..2478:1624667.. 1624724: 58:1597471: >15: 2479..2479:1596416.. 1596416: 1:1624725: >16: 2480..2482:1601045.. 1601047: 3:1596417: >17: 2483..2483:1596854.. 1596854: 1:1601048: >18: 2484..2523:1602715.. 1602754: 40:1596855: >19: 2524..2527:1597471.. 1597474: 4:1602755: >20: 2528..2598:1624725.. 1624795: 71:1597475: >21: 2599..2599:1596858.. 1596858: 1:1624796: >22: 2600..2607:1601263.
is BTRFS_IOC_DEFRAG behavior optimal?
systemd-journald journals on Btrfs default to nodatacow, upon log rotation it's submitted for defragmenting with BTRFS_IOC_DEFRAG. The result looks curious. I can't tell what the logic is from the results. The journal file starts out being fallocated with a size of 8MB, and as it grows there is an append of 8MB increments, also fallocated. This leads to a filefrag -v that looks like this (ext4 and btrfs nodatacow follow the same behavior, both are provided for reference): ext4 https://pastebin.com/6vuufwXt btrfs https://pastebin.com/Y18B2m4h Following defragment with BTRFS_IOC_DEFRAG it looks like this: https://pastebin.com/1ufErVMs It appears at first glance to be significantly more fragmented. Closer inspection shows that most of the extents weren't relocated. But what's up with the peculiar interleaving? Is this an improvement over the original allocation? https://pastebin.com/1ufErVMs If I unwind the interleaving, it looks like all the extents fall into two localities and within each locality the extents aren't that far apart - so my guess is that this file is also not meaningfully fragmented, in practice. Surely the drive firmware will reorder the reads to arrive at the least amount of seeks? -- Chris Murphy
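One hedged way to "unwind the interleaving" from a filefrag listing is to sort the extents by physical start block so locality is easier to eyeball (the field handling below is illustrative and may need adjusting for a given filefrag version):

filefrag -v system.journal | tail -n +4 | sort -t: -k3 -n | less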
Re: btrfs becomes read only on removal of folders
On Thu, Feb 4, 2021 at 4:04 AM mig...@rozsas.eng.br wrote: > https://susepaste.org/51166386 It's raid1 metadata on the same physical device, so depending on the device, if the metadata writes are concurrent they may end up being deduped by the drive firmware no matter that they're supposed to go to separate partitions. Feb 02 13:43:37 kimera.rozsas.eng.br kernel: BTRFS error (device sdc2): unable to fixup (regular) error at logical 557651984384 on dev /dev/sdc1 Feb 02 13:43:37 kimera.rozsas.eng.br kernel: BTRFS error (device sdc2): unable to fixup (regular) error at logical 557651869696 on dev /dev/sdc1 This suggests both copies are bad. > So, what is going here ? > How can I fix this FS ? I would do a memory test, the longer the better. Memory defects can be evasive. Take the opportunity to freshen backups while the file system still mounts read-only. And then also provide the output from btrfs check --readonly It might be something that can be repaired, but until you've isolated memory, any repair or new writes can end up with the same problem. But if it's not just a bit flip, and both copies are bad, then it's usually a case of backup, reformat, restore. Hence the backup needs to be the top priority; and checking the memory the second priority. -- Chris Murphy
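Spelled out, the order of operations I'd suggest (device names taken from the log above; adjust as needed):

mount -o ro /dev/sdc2 /mnt                 # 1) refresh backups while it still mounts read-only
rsync -aHAX /mnt/ /path/to/backup/
umount /mnt
# 2) memory test: boot memtest86+ or similar and let it run for several passes
btrfs check --readonly /dev/sdc2           # 3) then post this output; no --repair yet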
Re: Need help for my Unraid cache drive
On Sat, Jan 30, 2021 at 1:59 AM Patrick Bihlmayer wrote: > > Hello together, > > today i had an issue with my cache drive on my Unraid Server. > I used a 500GB SSD as cache drive. > > Unfortunately i added another cache drive (wanted a separate drive for my VMs > and accidentally added into the cache device pool) > After starting the array and all the setup for the cache device pool was done > i stopped the array again. > I removed the second drive from my cache device pool again. > > I started the array again - formatted the removed drive mounted it with > unassigned devices.# > And then i realized the following error in my Unraid Cache Devices > > > > Unfortunately i cannot mount it again. > Can you please help me? I don't know anything about unraid. The attached dmesg contains: [ 3660.395013] BTRFS info (device sdb1): allowing degraded mounts [ 3660.395014] BTRFS info (device sdb1): disk space caching is enabled [ 3660.395014] BTRFS info (device sdb1): has skinny extents [ 3660.395733] BTRFS error (device sdb1): failed to read chunk root [ 3660.404212] BTRFS error (device sdb1): open_ctree failed Is that sdb1 device part of the unraid? Is there a device missing? The 'allowing degraded mounts' message along with 'open_ctree failed' suggests that there's still too many devices missing. I suggest a relatively recent btrfs-progs, 5.7 or higher, and provide the output from: btrfs insp dump-s /dev/sdb1 -- Chris Murphy
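A couple of quick checks that usually narrow this down (btrfs-progs 5.7+ assumed):

btrfs filesystem show                      # any "** missing **" devices listed for that fsid?
btrfs inspect-internal dump-super /dev/sdb1 | grep -E 'fsid|num_devices|generation'
# if num_devices is larger than the number of devices actually present, the pool is incomplete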
Re: is back and forth incremental send/receive supported/stable?
It needs testing but I think the -c option can work for this case, because the parent on both source and destination is identical, even if the new destination (the old source) has an unexpected received subvolume uuid. At least for me, it worked once and I didn't explore it further. I also don't know if it'll set received uuid, such that subsequent sends can use -p instead of -c. -c generally still confuses me... in particular multiple instances of -c -- Chris Murphy
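As a sketch, this is what I had in mind (subvolume names invented; it assumes the snapshot "base" exists, unmodified, on both machines):

# on the new source (the old destination):
btrfs subvolume snapshot -r /pool/live /pool/live.ro
btrfs send -c /pool/base /pool/live.ro | ssh old-source btrfs receive /pool
# whether the received snapshot ends up with a received_uuid, so that later
# sends can switch to -p, is exactly the part that needs testing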
Re: ENOSPC in btrfs_run_delayed_refs with 5.10.8 + zstd
> Overall: > Device size: 931.49GiB > Device allocated: 931.49GiB > Device unallocated: 1.00MiB > Device missing: 0.00B > Used: 786.39GiB > Free (estimated): 107.69GiB (min: 107.69GiB) > Data ratio: 1.00 > Metadata ratio: 1.00 > Global reserve: 512.00MiB (used: 0.00B) > Multiple profiles: no > > Data,single: Size:884.48GiB, Used:776.79GiB (87.82%) > /dev/mapper/LUKS-RC-a6414fd731ce4f878af44c3987bce533 884.48GiB > > Metadata,single: Size:47.01GiB, Used:9.59GiB (20.41%) > /dev/mapper/LUKS-RC-a6414fd731ce4f878af44c3987bce533 47.01GiB > > System,single: Size:4.00MiB, Used:144.00KiB (3.52%) > /dev/mapper/LUKS-RC-a6414fd731ce4f878af44c3987bce533 4.00MiB > > Unallocated: > /dev/mapper/LUKS-RC-a6414fd731ce4f878af44c3987bce533 1.00MiB Can you mount or remount with enospc_debug, and reproduce the problem? That'll include some debug info that might be helpful to a developer coming across this report. Also it might help: cd /sys/fs/btrfs/$UUID/allocation grep -R . And post that too. The $UUID is the file system UUID for this specific file system, as reported by blkid or lsblk -f. -- Chris Murphy
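Concretely, collecting that would look something like this (mountpoint assumed; the UUID can also come straight from findmnt):

mount -o remount,enospc_debug /mountpoint
# ...reproduce the ENOSPC and save dmesg, then:
grep -R . /sys/fs/btrfs/$(findmnt -no UUID /mountpoint)/allocation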
Re: Only one subvolume can be mounted after replace/balance
On Wed, Jan 27, 2021 at 6:10 AM Jakob Schöttl wrote: > > Thank you Chris, it's resolved now, see below. > > Am 25.01.21 um 23:47 schrieb Chris Murphy: > > On Sat, Jan 23, 2021 at 7:50 AM Jakob Schöttl wrote: > >> Hi, > >> > >> In short: > >> When mounting a second subvolume from a pool, I get this error: > >> "mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sda, > >> missing code page or helper program, or other." > >> dmesg | grep BTRFS only shows this error: > >> info (device sda): disk space caching is enabled > >> error (device sda): Remounting read-write after error is not allowed > > It went read-only before this because it's confused. You need to > > unmount it before it can be mounted rw. In some cases a reboot is > > needed. > Oh, I didn't notice that the pool was already mounted (via fstab). > The filesystem where out of space and I had to resize both disks > separately. And I had to mount with -o skip_balance for that. Now it > works again. > > >> What happened: > >> > >> In my RAID1 pool with two disk, I successfully replaced one disk with > >> > >> btrfs replace start 2 /dev/sdx > >> > >> After that, I mounted the pool and did > > I don't understand this sequence. In order to do a replace, the file > > system is already mounted. > That was, what I did before my actual problem occurred. But it's > resolved now. > > >> btrfs fi show /mnt > >> > >> which showed WARNINGs about > >> "filesystems with multiple block group profiles detected" > >> (don't remember exactly) > >> > >> I thought it is a good idea to do > >> > >> btrfs balance start /mnt > >> > >> which finished without errors. > > Balance alone does not convert block groups to a new profile. You have > > to explicitly select a conversion filter, e.g. > > > > btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt > I didn't want to convert to a new profile. I thought btrfs replace > automatically uses the same profile as the pool? Btrfs replace does not change the profile. But you reported mixed profile block groups, which means conversion is indicated to make sure they're al the same. Please post: sudo btrfs fi us /mnt Let's see what the block groups are and what you want them to be and then see what conversion command might be indicated. -- Chris Murphy
Re: btrfs becomes read-only
On Wed, Jan 27, 2021 at 6:05 AM Alexey Isaev wrote: > > I managed to run btrs check, but it didn't solve the problem: > > aleksey@host:~$ sudo btrfs check --repair /dev/sdg OK it's risky to run --repair without a developer giving a go ahead, in particular with older versions of btrfs-progs. There are warnings in the man page about it. > [sudo] password for aleksey: > enabling repair mode > Checking filesystem on /dev/sdg > UUID: 070ce9af-6511-4b89-a501-0823514320c1 > checking extents > parent transid verify failed on 52180048330752 wanted 132477 found 132432 > parent transid verify failed on 52180048330752 wanted 132477 found 132432 > parent transid verify failed on 52180048330752 wanted 132477 found 132432 > parent transid verify failed on 52180048330752 wanted 132477 found 132432 > Ignoring transid failure > leaf parent key incorrect 52180048330752 > bad block 52180048330752 > Errors found in extent allocation tree or chunk allocation > parent transid verify failed on 52180048330752 wanted 132477 found 132432 Yeah it's not finding what it's expecting to find there. Any power fail or crash in the history of the file system? What do you get for: btrfs insp dump-s -f /dev/sdg -- Chris Murphy
Re: btrfs becomes read-only
On Wed, Jan 27, 2021 at 1:57 AM Alexey Isaev wrote: > > kernel version: > > aleksey@host:~$ sudo uname --all > Linux host 4.15.0-132-generic #136~16.04.1-Ubuntu SMP Tue Jan 12 > 18:22:20 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux This is an old and EOL kernel. It could be a long fixed Btrfs bug that caused this problem, I'm not sure. I suggest 5.4.93+ if you need a longterm kernel, otherwise 5.10.11 is the current stable kernel. > > drive make/model: > > Drive is external 5 bay HDD enclosure with raid-5 connected via usb-3 > (made by Orico https://www.orico.cc/us/product/detail/3622.html) > with 5 WD Red 10 Tb. We use this drive for backups. > > When i try to run btrfs check i get error message: > > aleksey@host:~$ sudo btrfs check --readonly /dev/sdg > Couldn't open file system OK is it now on some other dev node? A relatively recent btrfs-progs is also recommended, 5.10 is current and I probably wouldn't use anything older than 5.6.1. > aleksey@host:~$ sudo smartctl -x /dev/sdg Yeah probably won't work since it's behind a raid5 controller. I think there's smartctl commands to enable passthrough and get information for each drive, so that you don't have to put it in JBOD mode. But I'm not familiar with how to do that. Anyway it's a good idea to find out if there's SMART reporting any problems about any drive, but not urgent. -- Chris Murphy
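If you do want to try the passthrough, smartctl's -d option is the knob; which type works depends entirely on the USB bridge / RAID chip, and a hardware RAID enclosure may not expose the member drives at all. These are guesses to try, not a recommendation:

smartctl -x -d sat /dev/sdg
smartctl -x -d sat,12 /dev/sdg
smartctl -x -d usbjmicron /dev/sdg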
Re: btrfs becomes read-only
On Wed, Jan 27, 2021 at 12:22 AM Alexey Isaev wrote: > > Hello! > > BTRFS volume becomes read-only with this messages in dmesg. > What can i do to repair btrfs partition? > > [Jan25 08:18] BTRFS error (device sdg): parent transid verify failed on > 52180048330752 wanted 132477 found 132432 > [ +0.007587] BTRFS error (device sdg): parent transid verify failed on > 52180048330752 wanted 132477 found 132432 > [ +0.000132] BTRFS error (device sdg): qgroup scan failed with -5 > > [Jan25 19:52] BTRFS error (device sdg): parent transid verify failed on > 52180048330752 wanted 132477 found 132432 > [ +0.009783] BTRFS error (device sdg): parent transid verify failed on > 52180048330752 wanted 132477 found 132432 > [ +0.000132] BTRFS: error (device sdg) in __btrfs_cow_block:1176: > errno=-5 IO failure > [ +0.60] BTRFS info (device sdg): forced readonly > [ +0.04] BTRFS info (device sdg): failed to delete reference to > ftrace.h, inode 2986197 parent 2989315 > [ +0.02] BTRFS: error (device sdg) in __btrfs_unlink_inode:4220: > errno=-5 IO failure > [ +0.006071] BTRFS error (device sdg): pending csums is 430080 What kernel version? What drive make/model? wanted 132477 found 132432 indicates the drive has lost ~45 transactions, that's not good and also weird. There's no crash or any other errors? A complete dmesg might be more revealing. And also smartctl -x /dev/sdg btrfs check --readonly /dev/sdg After that I suggest https://btrfs.wiki.kernel.org/index.php/Restore And try to get any important data out if it's not backed up. You can try btrfs-find-root to get a listing of roots, most recent to oldest. Start at the top, and plug that address in as 'btrfs restore -t' and see if it'll pull anything out. You likely need -i and -v options as well. -- Chris Murphy
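The find-root/restore loop looks roughly like this (the tree root bytenr is a placeholder; use whatever btrfs-find-root actually prints, newest generation first):

btrfs-find-root /dev/sdg
mkdir -p /mnt/recovered
btrfs restore -vi -t <bytenr-from-find-root> /dev/sdg /mnt/recovered
# if that root turns out to be damaged, repeat with the next older bytenr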
Re: Only one subvolume can be mounted after replace/balance
On Sat, Jan 23, 2021 at 7:50 AM Jakob Schöttl wrote: > > Hi, > > In short: > When mounting a second subvolume from a pool, I get this error: > "mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sda, > missing code page or helper program, or other." > dmesg | grep BTRFS only shows this error: > info (device sda): disk space caching is enabled > error (device sda): Remounting read-write after error is not allowed It went read-only before this because it's confused. You need to unmount it before it can be mounted rw. In some cases a reboot is needed. > > What happened: > > In my RAID1 pool with two disk, I successfully replaced one disk with > > btrfs replace start 2 /dev/sdx > > After that, I mounted the pool and did I don't understand this sequence. In order to do a replace, the file system is already mounted. > > btrfs fi show /mnt > > which showed WARNINGs about > "filesystems with multiple block group profiles detected" > (don't remember exactly) > > I thought it is a good idea to do > > btrfs balance start /mnt > > which finished without errors. Balance alone does not convert block groups to a new profile. You have to explicitly select a conversion filter, e.g. btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt > Now, I can only mount one (sub)volume of the pool at a time. Others can > only be mounted read-only. See error messages at top of this mail. > > Do you have any idea what happened or how to fix it? > > I already tried rescue zero-log and super-recovery which was successful > but didn't help. I advise anticipating the confusion will get worse, and take the opportunity to refresh the backups. That's the top priority, not fixing the file system. Next let us know the following: kernel version btrfs-progs version Output from commands: btrfs fi us /mnt btrfs check --readonly -- Chris Murphy
Re: Recover data from damage disk in "array"
On Mon, Jan 18, 2021 at 5:02 PM Hérikz Nawarro wrote: > > Hello everyone, > > I got an array of 4 disks with btrfs configured with data single and > metadata dup, one disk of this array was plugged with a bad sata cable > that broke the plastic part of the data port (the pins still intact), > i still can read the disk with an adapter, but there's a way to > "isolate" this disk, recover all data and later replace the fault disk > in the array with a new one? I'm not sure what you mean by isolate, or what's meant by recover all data. To recover all data on all four disks suggests replicating all of it to another file system - i.e. backup, rsync, snapshot(s) + send/receive. Are there any kernel messages reporting btrfs problems with this file system? That should be resolved as a priority before anything else. Also, DUP metadata for multiple device btrfs is suboptimal. It's a single point of failure. I suggest converting to raid1 metadata so the file system can correct for drive specific problems/bugs by getting a good copy from another drive. If it's the case DUP metadata is on the drive with the bad sata cable, that could easily result in loss or corruption of both copies of metadata and the whole file system can implode. -- Chris Murphy
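The conversion itself is one command and can run with the filesystem mounted and in use (mountpoint assumed):

btrfs balance start -mconvert=raid1 /mountpoint
btrfs filesystem usage /mountpoint         # check that Metadata now shows RAID1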
Re: nodatacow mount option is disregarded when mounting subvolume into same filesystem
On Sun, Jan 17, 2021 at 2:07 PM Damian Höster wrote: > > The nodatacow mount option seems to have no effect when mounting a > subvolume into the same filesystem. > > I did some testing: > > sudo mount -o compress=zstd /dev/sda /mnt -> compression enabled > sudo mount -o compress=zstd,nodatacow /dev/sda /mnt -> compression disabled > sudo mount -o nodatacow,compress=zstd /dev/sda /mnt -> compression enabled > All as I would expect, setting compress or nodatacow disables the other. > > Compression gets enabled without problems when mounting a subvolume into > the same filesystem: > sudo mount /dev/sda /mnt; sudo mount -o subvol=@test,compress=zstd > /dev/sda /mnt/test -> compression enabled > sudo mount /dev/sda /mnt; sudo mount -o subvol=@/testsub,compress=zstd > /dev/sda /mnt/testsub -> compression enabled > > But nodatacow apparently doesn't: > sudo mount -o compress=zstd /dev/sda /mnt; sudo mount -o > subvol=@test,nodatacow /dev/sda /mnt/test -> compression enabled > sudo mount -o compress=zstd /dev/sda /mnt; sudo mount -o > subvol=@/testsub,nodatacow /dev/sda /mnt/testsub -> compression enabled > > And I don't think it's because of the compress mount option, some > benchmarks I did indicate that nodatacow never gets set when mounting a > subvolume into the same filesystem. > Most btrfs mount options are file system wide, they're not per subvolume options. In case of conflict, the most recent option is what's used. i.e. the mount options have an order and are followed in order, with the latest one having precedence in a conflict: compress,nodatacow means nodatacow nodatacow,compress means compress nodatacow implies nodatasum and no compress. If you want per subvolume options then you need to use 'chattr +C' per subvolume or directory for nodatacow. And for compression you can use +c (small c) which implies zlib, or use 'btrfs property set /path/to/sub-dir-file compression zstd' -- Chris Murphy
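As examples of the per-path knobs (paths invented; the C attribute only affects files created after it's set, and it rules out compression and csums for those files):

chattr +C /mnt/@test/db                                  # new files here are nodatacow
btrfs property set /mnt/@test/logs compression zstd     # force zstd on this directory instead
btrfs property get /mnt/@test/logs compression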
Re: received uuid not set btrfs send/receive
On Sun, Jan 17, 2021 at 11:51 AM Anders Halman wrote: > > Hello, > > I try to backup my laptop over an unreliable slow internet connection to > a even slower Raspberry Pi. > > To bootstrap the backup I used the following: > > # local > btrfs send root.send.ro | pigz | split --verbose -d -b 1G > rsync -aHAXxv --numeric-ids --partial --progress -e "ssh -T -o > Compression=no -x" x* remote-host:/mnt/backup/btrfs-backup/ > > # remote > cat x* > split.gz > pigz -d split.gz > btrfs receive -f split > > worked nicely. But I don't understand why the "received uuid" on the > remote site in blank. > I tried it locally with smaller volumes and it worked. I suggest using -v or -vv on the receive side to dig into why the receive is failing. Setting the received uuid is one of the last things performed on receive, so if it's not set it suggests the receive isn't finished. -- Chris Murphy
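For example (paths assumed from your description):

btrfs receive -vv -f split /mnt/backup/btrfs-backup/
btrfs subvolume show /mnt/backup/btrfs-backup/root.send.ro | grep -i 'received uuid'
# a blank Received UUID here means the receive never reached its final step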
btrfs: shrink delalloc pages instead of full inodes, for 5.10.8?
Hi, It looks like this didn't make it to 5.10.7. I see the PR for 5.11-rc4. Is it likely it'll make it into 5.10.8? e076ab2a2ca70a0270232067cd49f76cd92efe64 btrfs: shrink delalloc pages instead of full inodes Thanks, -- Chris Murphy
Re: Reading files with bad data checksum
On Sun, Jan 10, 2021 at 4:54 AM David Woodhouse wrote: > > I filed https://bugzilla.redhat.com/show_bug.cgi?id=1914433 > > What I see is that *both* disks of the RAID-1 have data which is > consistent, and does not match the checksum that btrfs expects: Yeah either use nodatacow (chattr +C) or don't use O_DIRECT until there's a proper fix. > What's the best way to recover the data? I'd say, kernel 5.11's new "mount -o ro,rescue=ignoredatacsums" feature. You can copy it out normally, no special tools. The alternative is 'btrfs restore'. -- Chris Murphy
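Spelled out, either path looks like this (device and paths assumed):

# with a 5.11 or newer kernel:
mount -o ro,rescue=ignoredatacsums /dev/sdX /mnt
cp -a /mnt/important /somewhere/safe/
# or offline, without mounting at all:
btrfs restore -vi /dev/sdX /somewhere/safe/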
Re: btrfs receive eats CoW attributes
On Mon, Jan 4, 2021 at 7:42 PM Cerem Cem ASLAN wrote: > > I need my backups exactly same data, including the file attributes. > Apparently "btrfs receive" ignores the CoW attribute. Here is the > reproduction: > > btrfs sub create ./a > mkdir a/b > chattr +C a/b > echo "hello" > a/b/file > btrfs sub snap -r ./a ./a.ro > mkdir x > btrfs send a.ro | btrfs receive x > lsattr a.ro > lsattr x/a.ro > > Result is: > > # lsattr a.ro > ---C--- a.ro/b > # lsattr x/a.ro > --- x/a.ro/b > > Expected: x/a.ro/b folder should have CoW disabled (same as a.ro/b folder) > > How can I workaround this issue in order to have correct attributes in > my backups? It's the exact opposite issue with chattr +c (or btrfs property set compression), you can't shake it off :) I think we might need 'btrfs receive' to gain a new flag that filters some or all of these? And the filter would be something like --exclude=$1,$2,$3 and --exclude=all I have no strong opinion on what should be the default. But I think probably the default should be "do not preserve any" because these features aren't mkfs or mount time defaults, so I'd make preservation explicitly opt in like they were on the original file system. -- Chris Murphy
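Until something like that exists, the closest workaround I can think of is re-applying the attribute on the receive side (a sketch; a read-only snapshot can't be modified, so it has to go on a writable snapshot, and it only affects data written from then on):

btrfs send a.ro | btrfs receive x
btrfs subvolume snapshot x/a.ro x/a        # writable copy of the received snapshot
chattr +C x/a/b                            # existing file contents stay COW/checksummed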
Re: tldr; no BTRFS on dev, after a forced shutdown, help
On Mon, Jan 4, 2021 at 11:09 AM André Isidro da Silva wrote: > > I'm sure it used to be one, but indeed it seems that a TYPE is missing > in /dev/sda10; gparted says it's unknown. > It seems there is no trace of the fs. I'm trying to recall any other > operations I might have done, but if it was something else I can't > remember what could have been. I used cfdisk, to resize another > partition, also tried to do a 'btrfs device add' with this missing one > (to solve the no space left in another one), otherwise it was mount /, > mount /home (/dev/sda10), umount, repeat. Oh well. > > [sudo blkid] > > /dev/sda1: UUID="03ff3132-dfc5-4dce-8add-cf5a6c854313" BLOCK_SIZE="4096" > TYPE="ext4" PARTLABEL="LINUX" > PARTUUID="a6042b9f-a3fe-49e2-8dc5-98a818454b6d" > > /dev/sdb4: UUID="5c7201df-ff3e-4cb7-8691-8ef0c6c806ed" > UUID_SUB="bb677c3a-6270-420f-94ce-f5b89f2c40d2" BLOCK_SIZE="4096" > TYPE="btrfs" PARTUUID="be4190e4-8e09-4dfc-a901-463f3e162727" > > /dev/sda10: PARTLABEL="HOME" > PARTUUID="6045f3f0-47a7-4b38-a392-7bebb7f654bd" > > [sudo btrfs insp dump-s -F /dev/sda10] > > superblock: bytenr=65536, device=/dev/sda10 > - > csum_type 0 (crc32c) > csum_size 4 > csum0x [DON'T MATCH] > bytenr 0 > flags 0x0 > magic [DON'T MATCH] > fsid---- > metadata_uuid ---- > label > generation 0 > root0 > sys_array_size 0 > chunk_root_generation 0 > root_level 0 > chunk_root 0 > chunk_root_level0 > log_root0 > log_root_transid0 > log_root_level 0 > total_bytes 0 > bytes_used 0 > sectorsize 0 > nodesize0 > leafsize (deprecated) 0 > stripesize 0 > root_dir0 > num_devices 0 > compat_flags0x0 > compat_ro_flags 0x0 > incompat_flags 0x0 > cache_generation0 > uuid_tree_generation0 > dev_item.uuid ---- > dev_item.fsid ---- [match] > dev_item.type 0 > dev_item.total_bytes0 > dev_item.bytes_used 0 > dev_item.io_align 0 > dev_item.io_width 0 > dev_item.sector_size0 > dev_item.devid 0 > dev_item.dev_group 0 > dev_item.seek_speed 0 > dev_item.bandwidth 0 > dev_item.generation 0 > > This as nothing to do with btrfs anymore, but: do you think a tool like > foremost can recover the files, it'll be a mess, but better then nothing > and I've used it before in a ntfs. No idea. You could scan the entire drive for the Btrfs magic, which is inside the superblock. It will self identify its offset, the first superblock is the one you want, which is offset 65536 (64KiB) from the start of the block device/partition. And that superblock also says the device size. -- Chris Murphy
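A brute force way to do that scan (slow but read-only; the magic string and the 65536+64 byte position of the magic within the first superblock come from the on-disk format, but treat the arithmetic as something to double check before acting on it):

LC_ALL=C grep -obaF '_BHRfS_M' /dev/sda | head
# each reported byte offset minus 65600 is a candidate start of a former
# btrfs partition; compare against where /dev/sda10 begins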
Re: tldr; no BTRFS on dev, after a forced shutdown, help
On Sun, Jan 3, 2021 at 9:30 PM André Isidro da Silva wrote: > > I might be in some panic, I'm sorry for the info I'm not experienced > enough to give. > > I was in a live iso trying really hard to repair my root btrfs from > which I had used all the space avaiable.. I was trying to move a /usr > partition into the btrfs system, but I didn't check the space available > with the tool, instead used normal tools, because I didn't understand or > actually thought about how the subvolumes would change... sorry this > isn't even the issue anymore; to move /usr I had a temporary /usr copy > in another btrfs system (my /home data partition) and so mounted both > partitions. However this was done in a linux "boot fail console" from > which I didn't know how to proper shutdown.. so I eventually forced the > shutdown withou umounting stuff (...), I think that forced shutdown > might have broken the second partition that now isn't recognized with > btrfs check or mountable. It might also have happen when using the live > iso, but the forced shutdown seemed more likely, since I did almost no > operations but mount/cp. This partition was my data partition, I thought > it was safe to use for this process, since I was just copying files from > it. I do have a backup, but it's old so I'll still lose a lot.. help. First, make no changes, attempt no repairs. Next save history of what you did. A forced shutdown does not make Btrfs unreadable, although if writes are happening at the time of the shutdown and the drive firmware doesn't properly honor write order, then it might be 'btrfs restore' territory. What do you get for: btrfs filesystem show kernel messages (dmesg) that appear when you try to mount the volume but it fails. -- Chris Murphy
Re: [BUG] 500-2000% performance regression w/ 5.10
The problem is worse on SSD than on HDD. It actually makes the SSD *slower* than an HDD, on 5.10. For this workload:

HDD 5.9.16-200.fc33.x86_64 mq-deadline kyber [bfq] none
$ time tar -xf /tmp/firefox-85.0b4.source.tar.xz && time sync
real    1m27.299s
user    0m27.294s
sys     0m14.134s

real    0m8.890s
user    0m0.001s
sys     0m0.344s

HDD 5.10.4-200.fc33.x86_64 mq-deadline kyber [bfq] none
$ time tar -xf /tmp/firefox-85.0b4.source.tar.xz && time sync
real    2m14.936s
user    0m54.396s
sys     0m47.082s

real    0m7.726s
user    0m0.001s
sys     0m0.382s

SSD, compress=zstd:1 5.9.16-200.fc33.x86_64 [mq-deadline] kyber bfq none
$ time tar -xf /tmp/firefox-85.0b4.source.tar.xz && time sync
real    0m41.947s
user    0m29.359s
sys     0m18.088s

real    0m2.042s
user    0m0.000s
sys     0m0.065s

SSD, compress=zstd:1 5.10.4-200.fc33.x86_64 [mq-deadline] kyber bfq none
$ time tar -xf /tmp/firefox-85.0b4.source.tar.xz && time sync
real    2m59.581s
user    1m4.097s
sys     0m56.323s

real    0m1.492s
user    0m0.000s
sys     0m0.077s
Re: cp --reflink of inline extent results in two DATA_EXTENT entries
On Tue, Dec 22, 2020 at 11:05 PM Andrei Borzenkov wrote: > > 23.12.2020 06:48, Chris Murphy пишет: > > Hi, > > > > kernel is 5.10.2 > > > > cp --reflink hi hi2 > > > > This results in two EXTENT_DATA items with different offsets, > > therefore I think the data is duplicated in the leaf? Correct? Is it > > expected? > > > > I'd say yes. Inline data is contained in EXTEND_DATA item and > EXTENT_DATA item cannot be shared by two different inodes (it is keyed > by inode number). > > Even when cloning regular extent you will have two independent > EXTENT_DATA items pointing to the same physical extent. Thanks. I saw this commit a long time ago and sorta just figured it meant maybe inline extents would be cloned within a given leaf. 05a5a7621ce6 Btrfs: implement full reflink support for inline extents But I only just now read the commit message, and it reads like cloning now will be handled without error. It's not saying that it results in shared inline data extents. -- Chris Murphy
cp --reflink of inline extent results in two DATA_EXTENT entries
Hi, kernel is 5.10.2 cp --reflink hi hi2 This results in two EXTENT_DATA items with different offsets, therefore I think the data is duplicated in the leaf? Correct? Is it expected? item 9 key (257 EXTENT_DATA 0) itemoff 15673 itemsize 53 generation 435179 type 0 (inline) inline extent data size 32 ram_bytes 174 compression 3 (zstd) ... item 13 key (258 EXTENT_DATA 0) itemoff 15364 itemsize 53 generation 435179 type 0 (inline) inline extent data size 32 ram_bytes 174 compression 3 (zstd) The entire file tree containing only these two files follows: file tree key (394 ROOT_ITEM 0) leaf 26442252288 items 14 free space 15014 generation 435212 owner 394 leaf 26442252288 flags 0x1(WRITTEN) backref revision 1 item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 435123 transid 435212 size 10 nbytes 0 block group 0 mode 40755 links 1 uid 1000 gid 1000 rdev 0 sequence 5267 flags 0x0(none) atime 1608689569.708325037 (2020-12-22 19:12:49) ctime 1608694856.721370147 (2020-12-22 20:40:56) mtime 1608694856.721370147 (2020-12-22 20:40:56) otime 1608689569.708325037 (2020-12-22 19:12:49) item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 index 0 namelen 2 name: .. item 2 key (256 DIR_ITEM 432062026) itemoff 16079 itemsize 32 location key (257 INODE_ITEM 0) type FILE transid 435124 data_len 0 name_len 2 name: hi item 3 key (256 DIR_ITEM 4216900732) itemoff 16046 itemsize 33 location key (258 INODE_ITEM 0) type FILE transid 435196 data_len 0 name_len 3 name: hi2 item 4 key (256 DIR_INDEX 2) itemoff 16014 itemsize 32 location key (257 INODE_ITEM 0) type FILE transid 435124 data_len 0 name_len 2 name: hi item 5 key (256 DIR_INDEX 4) itemoff 15981 itemsize 33 location key (258 INODE_ITEM 0) type FILE transid 435196 data_len 0 name_len 3 name: hi2 item 6 key (257 INODE_ITEM 0) itemoff 15821 itemsize 160 generation 435124 transid 435212 size 174 nbytes 174 block group 0 mode 100644 links 1 uid 1000 gid 1000 rdev 0 sequence 19 flags 0x0(none) atime 1608689574.39809 (2020-12-22 19:12:54) ctime 1608694856.721370147 (2020-12-22 20:40:56) mtime 1608692923.231038818 (2020-12-22 20:08:43) otime 1608689574.39809 (2020-12-22 19:12:54) item 7 key (257 INODE_REF 256) itemoff 15809 itemsize 12 index 2 namelen 2 name: hi item 8 key (257 XATTR_ITEM 3817753667) itemoff 15726 itemsize 83 location key (0 UNKNOWN.0 0) type XATTR transid 435124 data_len 37 name_len 16 name: security.selinux data unconfined_u:object_r:unlabeled_t:s0 item 9 key (257 EXTENT_DATA 0) itemoff 15673 itemsize 53 generation 435179 type 0 (inline) inline extent data size 32 ram_bytes 174 compression 3 (zstd) item 10 key (258 INODE_ITEM 0) itemoff 15513 itemsize 160 generation 435196 transid 435196 size 174 nbytes 174 block group 0 mode 100644 links 1 uid 1000 gid 1000 rdev 0 sequence 34 flags 0x0(none) atime 1608693921.97510335 (2020-12-22 20:25:21) ctime 1608693921.97510335 (2020-12-22 20:25:21) mtime 1608693921.97510335 (2020-12-22 20:25:21) otime 1608693921.97510335 (2020-12-22 20:25:21) item 11 key (258 INODE_REF 256) itemoff 15500 itemsize 13 index 4 namelen 3 name: hi2 item 12 key (258 XATTR_ITEM 3817753667) itemoff 15417 itemsize 83 location key (0 UNKNOWN.0 0) type XATTR transid 435196 data_len 37 name_len 16 name: security.selinux data unconfined_u:object_r:unlabeled_t:s0 item 13 key (258 EXTENT_DATA 0) itemoff 15364 itemsize 53 generation 435179 type 0 (inline) inline extent data size 32 ram_bytes 174 compression 3 (zstd) total bytes 31005392896 bytes used 20153282560 -- Chris Murphy
memory bit flip not detected by write time tree check
Hi, mount failure, WARNING at fs/btrfs/extent-tree.c:3060 __btrfs_free_extent.isra.0+0x5fd/0x8d0 https://bugzilla.redhat.com/show_bug.cgi?id=1905618#c9 In this bug, the user reports what looks like undetected memory bit flip corruption, that makes it to disk, and then is caught at mount time, resulting in mount failure. I'm double checking with the user, but I'm pretty sure it had only seen writes with relatively recent (5.8+) kernels. -- Chris Murphy
what determines what /dev/ is mounted?
When I have a 2-device btrfs: devid 1 = /dev/vdb1 devid 2 = /dev/vdc1 Regardless of the mount command, df and /proc/mounts shows /dev/vdb1 is mounted. If I flip the backing assignments in qemu, such that: devid 2 = /dev/vdb1 devid 1 = /dev/vdc1 Now, /dev/vdc1 is shown as mounted by df and /proc/mounts. But this isn't scientific. Is there a predictable logic? Is it always the lowest devid? -- Chris Murphy
Re: feature request, explicit mount and unmount kernel messages
On Tue, Oct 22, 2019 at 1:33 PM Roman Mamedov wrote: > > On Tue, 22 Oct 2019 11:00:07 +0200 > Chris Murphy wrote: > > > Hi, > > > > So XFS has these > > > > [49621.415203] XFS (loop0): Mounting V5 Filesystem > > [49621.58] XFS (loop0): Ending clean mount > > ... > > [49621.58] XFS (loop0): Ending clean mount > > [49641.459463] XFS (loop0): Unmounting Filesystem > > > > It seems to me linguistically those last two should be reversed, but > > whatever. > > Just a minor note, there is no "last two", but only one "Unmounting" message > on unmount: you copied the "Ending" mount-time message twice for the 2nd quote > (as shown by the timestamp). That's funny, I duplicated that line by mistake. User error! -- Chris Murphy
Re: feature request, explicit mount and unmount kernel messages
On Tue, Oct 22, 2019 at 11:56 AM Anand Jain wrote: > > > I agree, I sent patches for it in 2017. > > VFS version. > https://patchwork.kernel.org/patch/9745295/ > > btrfs version: > https://patchwork.kernel.org/patch/9745295/ > > There wasn't response on btrfs-v2-patch. > > This is not the first time that I am writing patches ahead of > users asking for it, but unfortunately there is no response or > there are disagreements on those patches. I guess it could be a low priority for developers. But that's a big part of why doing this in VFS might be useful, generically, for all file systems? I have no idea what that boundary looks like between native file system and VFS. But if the mount related messages that developers don't find useful were removed from ext4, XFS, Btrfs, f2fs, FAT, and a proper plain language "(u)mount completed" message were added in VFS, that would be, I think, useful for not just regular users, but users like systemd/init users, and others who have to sort out mount hangs and failures. Just exactly where did this hang up? I can't tell, and it's different behavior for every file system. I'm not opposed to each file system having its own (u)mount completed message, indicating a boundary where the native code ends and VFS code begins. But again that's up to developers. I just want to know if the hang means we're stuck somewhere in *kernel* mount code. From the prior example, I can't tell that at all, there just isn't enough information. -- Chris Murphy
Re: feature request, explicit mount and unmount kernel messages
(resending to list, I don't know why but I messed up the reply directly to Nikolay) On Tue, Oct 22, 2019 at 11:16 AM Nikolay Borisov wrote: > > On 22.10.19 г. 12:00 ч., Chris Murphy wrote: > > Hi, > > > > So XFS has these > > > > [49621.415203] XFS (loop0): Mounting V5 Filesystem > > [49621.58] XFS (loop0): Ending clean mount > > ... > > [49621.58] XFS (loop0): Ending clean mount > > [49641.459463] XFS (loop0): Unmounting Filesystem > > > > It seems to me linguistically those last two should be reversed, but whatever. > > > > The Btrfs mount equivalent messages are: > > [49896.176646] BTRFS: device fsid f7972e8c-b58a-4b95-9f03-1a08bbcb62a7 > > devid 1 transid 5 /dev/loop0 > > [49901.739591] BTRFS info (device loop0): disk space caching is enabled > > [49901.739595] BTRFS info (device loop0): has skinny extents > > [49901.767447] BTRFS info (device loop0): enabling ssd optimizations > > [49901.767851] BTRFS info (device loop0): checking UUID tree > > > > So is it true that for sure there is nothing happening after the UUID > > tree is checked, that the file system is definitely mounted at this > > point? And always it's the UUID tree being checked that's the last > > thing that happens? Or is it actually already mounted just prior to > > disk space caching enabled message, and the subsequent messages are > > not at all related to the mount process? See? I can't tell. > > > > For umount, zero messages at all. > > You are doing it wrong. I'm doing what wrong? > Those messages are sent from the given subsys to > the console and printed whenever. You can never rely on the fact that > those messages won't race with some code. That possibility is implicit in all of the questions I asked. > For example the checking UUID tree happens _before_ > btrfs_check_uuid_tree is called and there is no guarantee when it's > finished. Are these messages useful for developers? I don't see them as being useful for users. They're kinda superfluous for them. > > The feature request is something like what XFS does, so that we know > > exactly when the file system is mounted and unmounted as far as Btrfs > > code is concerned. > > > > I don't know that it needs the start and end of the mount and > > unmounted (i.e. two messages). I'm mainly interested in having a > > notification for "mount completed successfully" and "unmount completed > > successfully". i.e. the end of each process, not the start of each. > > mount is a blocking syscall, same goes for umount your notifications are > when the respective syscalls / system utilities return. Right. Here is the example bug from 2015, that I just became aware of as the impetus for posting the request; but I've wanted this explicit notification for a while. https://bugzilla.redhat.com/show_bug.cgi?id=1206874#c7 In that example, there's one Btrfs info message at [2.727784] localhost.localdomain kernel: BTRFS info (device sda3): disk space caching is enabled And yet systemd times out on the mount unit. If it's true that only mount blocking systemd could be the cause, then this is a Btrfs, VFS, or mount related bug (however old it is by now and doesn't really matter other than conceptually). But there isn't enough granularity in the kernel messages to understand why the mount is taking so long. If there were a Btrfs mount succeeded message, we'd know whether the Btrfs portion of the mount process successfully completed or not, and perhaps have a better idea where the hang is happening.
feature request, explicit mount and unmount kernel messages
Hi, So XFS has these [49621.415203] XFS (loop0): Mounting V5 Filesystem [49621.58] XFS (loop0): Ending clean mount ... [49621.58] XFS (loop0): Ending clean mount [49641.459463] XFS (loop0): Unmounting Filesystem It seems to me linguistically those last two should be reversed, but whatever. The Btrfs mount equivalent messages are: [49896.176646] BTRFS: device fsid f7972e8c-b58a-4b95-9f03-1a08bbcb62a7 devid 1 transid 5 /dev/loop0 [49901.739591] BTRFS info (device loop0): disk space caching is enabled [49901.739595] BTRFS info (device loop0): has skinny extents [49901.767447] BTRFS info (device loop0): enabling ssd optimizations [49901.767851] BTRFS info (device loop0): checking UUID tree So is it true that for sure there is nothing happening after the UUID tree is checked, that the file system is definitely mounted at this point? And always it's the UUID tree being checked that's the last thing that happens? Or is it actually already mounted just prior to disk space caching enabled message, and the subsequent messages are not at all related to the mount process? See? I can't tell. For umount, zero messages at all. The feature request is something like what XFS does, so that we know exactly when the file system is mounted and unmounted as far as Btrfs code is concerned. I don't know that it needs the start and end of the mount and unmounted (i.e. two messages). I'm mainly interested in having a notification for "mount completed successfully" and "unmount completed successfully". i.e. the end of each process, not the start of each. In particular the unmount notice is somewhat important because as far as I know there's no Btrfs dirty flag from which to infer whether it was really unmounted cleanly. But I'm also not sure what the insertion point for these messages would be. Looking at the mount code in particular, it's a little complicated. And maybe with some of the sanity checking and debug options it could get more complicated, and wouldn't want to conflict with that - or any multiple device use case either. -- Chris Murphy
Re: MD RAID 5/6 vs BTRFS RAID 5/6
On Sat, Oct 19, 2019 at 12:18 AM Supercilious Dude wrote: > > It would be useful to have the ability to scrub only the metadata. In many > cases the data is so large that a full scrub is not feasible. In my "little" > test system of 34TB a full scrub takes many hours and the IOPS saturate the > disks to the extent that the volume is unusable due to the high latencies. > Ideally there should be a way to rate limit the scrub operation so that it > can happen in the background without impacting the normal workload. In effect a 'btrfs check' is a read only scrub of metadata, since all of the metadata has to be read for that. Of course it's more expensive than just confirming checksums are OK, because it's also doing a bunch of sanity and logical tests that take much longer. -- Chris Murphy
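The two partial answers available today, as commands (mountpoint and device assumed):

btrfs scrub start -c 3 /mountpoint         # I/O priority class 3 (idle) to limit the impact on normal work
btrfs check --readonly /dev/sdX            # unmounted: reads all metadata, touches no data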
Re: MD RAID 5/6 vs BTRFS RAID 5/6
On Thu, Oct 17, 2019 at 8:23 PM Graham Cobb wrote: > > On 17/10/2019 16:57, Chris Murphy wrote: > > On Wed, Oct 16, 2019 at 10:07 PM Jon Ander MB > > wrote: > >> > >> It would be interesting to know the pros and cons of this setup that > >> you are suggesting vs zfs. > >> +zfs detects and corrects bitrot ( > >> http://www.zfsnas.com/2015/05/24/testing-bit-rot/ ) > >> +zfs has working raid56 > >> -modules out of kernel for license incompatibilities (a big minus) > >> > >> BTRFS can detect bitrot but... are we sure it can fix it? (can't seem > >> to find any conclusive doc about it right now) > > > > Yes. Active fixups with scrub since 3.19. Passive fixups since 4.12. > > Presumably this is dependent on checksums? So neither detection nor > fixup happen for NOCOW files? Even a scrub won't notice because scrub > doesn't attempt to compare both copies unless the first copy has a bad > checksum -- is that correct? On a normal (passive) read it can't be detected for nocow files, since nocow implies nodatasum. If the problem happens in metadata, it's detected because metadata is always cow and always has a csum. I'm not sure what the scrub behavior is for nocow. There's enough information to detect a mismatch in normal (not degraded) operation, but I don't know if Btrfs scrub warns in this case. > If I understand correctly, metadata always has checksums so that is true > for filesystem structure. But for no-checksum files (such as nocow > files) the corruption will be silent, won't it? Corruption is always silent for nocow data. As with any other filesystem, it's up to the application layer to detect it. -- Chris Murphy
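For reference, a quick sketch of how a nocow (and therefore nodatasum) file is created and identified; the path is hypothetical, and +C only takes effect on an empty file or on a directory so that new files inherit it:

  # mark a new, empty file nocow, then verify the attribute
  touch /mnt/data/vm-disk.img
  chattr +C /mnt/data/vm-disk.img
  lsattr /mnt/data/vm-disk.img    # a 'C' in the flags means nocow, hence no data checksums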
Re: MD RAID 5/6 vs BTRFS RAID 5/6
On Wed, Oct 16, 2019 at 10:07 PM Jon Ander MB wrote: > > It would be interesting to know the pros and cons of this setup that > you are suggesting vs zfs. > +zfs detects and corrects bitrot ( > http://www.zfsnas.com/2015/05/24/testing-bit-rot/ ) > +zfs has working raid56 > -modules out of kernel for license incompatibilities (a big minus) > > BTRFS can detect bitrot but... are we sure it can fix it? (can't seem > to find any conclusive doc about it right now) Yes. Active fixups with scrub since 3.19. Passive fixups since 4.12. > I'm one of those that is waiting for the write hole bug to be fixed in > order to use raid5 on my home setup. It's a shame it's taking so long. For what it's worth, the write hole is considered to be rare. https://lwn.net/Articles/665299/ Further, the write hole means a) parity is corrupt or stale compared to the data stripe elements, which is caused by a crash or powerloss during writes, and b) subsequently there is a missing device or bad sector in the same stripe as the corrupt/stale parity stripe element. The effect of b) is that reconstruction from parity is necessary, and the effect of a) is that it's reconstructed incorrectly, thus corruption. But Btrfs detects this corruption, whether it's metadata or data. The corruption isn't propagated in any case. But it makes the filesystem fragile if this happens with metadata. Any parity stripe element staleness likely results in significantly bad reconstruction in this case, and it just can't be worked around; even btrfs check probably can't fix it. If the write hole problem happens with a data block group, the result is EIO. But the good news is that this isn't going to result in silent data or file system metadata corruption. For sure you'll know about it. This is why a scrub after a crash or powerloss with raid56 is important, while the array is still whole (not degraded). The two problems with that are: a) the scrub isn't initiated automatically, nor is it obvious to the user that it's necessary; and b) the scrub can take a long time, since Btrfs has no partial scrubbing. Whereas mdadm arrays offer a write intent bitmap to know what blocks to partially scrub, and to trigger it automatically following a crash or powerloss. It seems Btrfs already has enough on-disk metadata to infer a functional equivalent to the write intent bitmap, via transid. Just scrub the last ~50 generations the next time it's mounted. Either do this every time a Btrfs raid56 is mounted, or create some flag that allows Btrfs to know if the filesystem was not cleanly shut down. It's possible 50 generations could be a lot of data, but since it's an online scrub triggered after mount, it wouldn't add much to mount times. I'm also picking 50 generations arbitrarily, there's no basis for that number. The above doesn't cover the case where there is a partial stripe write (which leads to the write hole problem), plus a crash or powerloss, and at the same time one or more device failures. In that case there's no time for a partial scrub to fix the problem leading to the write hole. So even if the corruption is detected, it's too late to fix it. But at least an automatic partial scrub, even degraded, will mean the user is alerted to the uncorrectable problem before they get too far along. -- Chris Murphy
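That post-crash scrub is a manual step today; a minimal sketch, assuming the raid56 filesystem is mounted at /mnt/r5 (the mountpoint is an assumption):

  # scrub the whole, still non-degraded array after an unclean shutdown,
  # in the foreground (-B) with per-device statistics (-d)
  btrfs scrub start -Bd /mnt/r5

  # then check the per-device error counters
  btrfs device stats /mnt/r5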
Re: 5.3.0 deadlock: btrfs_sync_file / btrfs_async_reclaim_metadata_space / btrfs_page_mkwrite
On Mon, Oct 14, 2019 at 7:05 PM James Harvey wrote: > > On Sun, Oct 13, 2019 at 9:46 PM Chris Murphy wrote: > > > > On Sat, Oct 12, 2019 at 5:29 PM James Harvey > > wrote: > > > > > > Was using a temporary BTRFS volume to compile mongodb, which is quite > > > intensive and takes quite a bit of time. The volume has been > > > deadlocked for about 12 hours. > > > > > > Being a temporary volume, I just used mount without options, so it > > > used the defaults: rw,relatime,ssd,space_cache,subvolid=5,subvol=/ > > > > > > Apologies if upgrading to 5.3.5+ will fix this. I didn't see > > > discussions of a deadlock looking like this. > > > > I think it's a bug in any case, in particular because its all default > > mount options, but it'd be interesting if any of the following make a > > difference: > > > > - space_cache=v2 > > - noatime > > Interesting. > > This isn't 100% reproducible. Before my original post, after my > initial deadlock, I tried again and immediately hit another deadlock. > But, yesterday, in response to your email, I tried again still without > "space_cache=v2,noatime" to re-confirm the deadlock. I had to > re-compile mongodb about 6 times to hit another deadlock. I was > almost at the point of thinking I wouldn't see it again. > > After re-confirming it, I re-created the BTRFS volume to use > "space_cache=v2,noatime" mount options. It deadlocked during the > first mongodb compilation. w > sysrq_trigger is a little bit > different. No trace including "btrfs_sync_log" or > "btrfs_async_reclaim_metadata_space". Only traces including the > "btrfs_btrfs_async_reclaim_metadata_space". Viewable here: > http://ix.io/1YGe I think it's some kind of disk or lock contention, but I don't really know much about it. The v1 space_cache is basically data extents, so they use data chunks and I guess can conflict with heavy data writes. Whereas v2 space_cache is a dedicated metadata btree. So yeah - and I'm not sure if mongo builds use atime at all so the noatime could be a goose chase, but figured it might help reduce unnecessary metadata updates. > Also, as I'm testing some issues with the mongodb compilation process > (upstream always forces debug symbols...), as a workaround to be able > to test its issues, I've used a temporary ext4 volume for it, which I > haven't had a single issue with. Adds to the notion this is some kind of bug. -- Chris Murphy
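For reference, a sketch of applying those two options to a scratch volume like this one (the device and mountpoint are assumptions; the first writable mount with space_cache=v2 converts the filesystem to the free space tree, kernel 4.5 or newer):

  umount /mnt/build
  mount -o noatime,space_cache=v2 /dev/sdX /mnt/build
  findmnt -no OPTIONS /mnt/build    # confirm the options in effect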
Re: Massive filesystem corruption since kernel 5.2 (ARCH)
On Sun, Oct 13, 2019 at 8:07 PM Adam Bahe wrote: > > > Until the fix gets merged to 5.2 kernels (and 5.3), I don't really > > recommend running 5.2 or 5.3. > > I know fixes went in to distro specific kernels. But wanted to verify > if the fix went into the vanilla kernel.org kernel? If so, what > version should be safe? ex: > https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.3.6 > > With 180 raw TB in raid1 I just want to be explicit. Thanks! It's fixed in upstream stable since 5.2.15, and the fix is in all of the 5.3.x series. -- Chris Murphy
Re: 5.3.0 deadlock: btrfs_sync_file / btrfs_async_reclaim_metadata_space / btrfs_page_mkwrite
On Sat, Oct 12, 2019 at 5:29 PM James Harvey wrote: > > Was using a temporary BTRFS volume to compile mongodb, which is quite > intensive and takes quite a bit of time. The volume has been > deadlocked for about 12 hours. > > Being a temporary volume, I just used mount without options, so it > used the defaults: rw,relatime,ssd,space_cache,subvolid=5,subvol=/ > > Apologies if upgrading to 5.3.5+ will fix this. I didn't see > discussions of a deadlock looking like this. I think it's a bug in any case, in particular because it's all default mount options, but it'd be interesting if any of the following make a difference: - space_cache=v2 - noatime -- Chris Murphy
Re: BTRFS Raid5 error during Scrub.
On Thu, Oct 3, 2019 at 6:18 AM Robert Krig wrote: > > By the way, how serious is the error I've encountered? > I've run a second scrub in the meantime, it aborted when it came close > to the end, just like the first time. > If the files that are corrupt have been deleted is this error going to > go away? Maybe. > > > > Opening filesystem to check... > > > > Checking filesystem on /dev/sda > > > > UUID: f7573191-664f-4540-a830-71ad654d9301 > > > > [1/7] checking root items (0:01:17 elapsed, > > > > 5138533 items checked) > > > > parent transid verify failed on 48781340082176 wanted 109181 > > > > found > > > > 109008items checked) > > > > parent transid verify failed on 48781340082176 wanted 109181 > > > > found > > > > 109008 > > > > parent transid verify failed on 48781340082176 wanted 109181 > > > > found > > > > 109008 These look suspiciously like the 5.2 regression: https://lore.kernel.org/linux-btrfs/20190911145542.1125-1-fdman...@kernel.org/T/#u You should either revert to a 5.1 kernel, or use 5.2.15+. As far as I'm aware it's not possible to fix this kind of corruption, so I suggest refreshing your backups while you can still mount this file system, and preparing to recreate it from scratch. > > > > Ignoring transid failure > > > > leaf parent key incorrect 48781340082176 > > > > bad block 48781340082176 > > > > [2/7] checking extents (0:03:22 elapsed, > > > > 1143429 items checked) > > > > ERROR: errors found in extent allocation tree or chunk allocation That's usually not a good sign. > > > > [3/7] checking free space cache(0:05:10 elapsed, > > > > 7236 > > > > items checked) > > > > parent transid verify failed on 48781340082176 wanted 109181 > > > > found > > > > 109008ems checked) > > > > Ignoring transid failure > > > > root 15197 inode 81781 errors 1000, some csum missing48 elapsed, That's inode 81781 in the subvolume with ID 15197. I'm not sure what error 1000 is, but btrfs check is a bit fussy when it encounters files that are marked +C (nocow) but have been compressed. This used to be possible with older kernels when nocow files were defragmented while the file system was mounted with compression enabled. If that sounds like your use case, that might be what's going on here, and it's actually a benign message. It's normal for nocow files to be missing csums. To confirm, you can use 'find /pathtosubvol/ -inum 81781' to find the file, then lsattr it and see whether +C is set. You have a few options, but the first thing is to refresh backups and prepare to lose this file system: a. bail now, and just create a new Btrfs from scratch and restore from backup b. try 'btrfs check --repair' to see if the transid problems are fixed; if not, c. try 'btrfs check --repair --init-extent-tree'; there's a good chance this fails and makes things worse, but it's probably faster to try than restoring from backup -- Chris Murphy
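A concrete version of the inode lookup suggested above, assuming the subvolume with ID 15197 is mounted at /mnt/subvol (the mountpoint is an assumption):

  # find the path for inode 81781 inside that subvolume and show its attributes
  find /mnt/subvol -inum 81781 -exec lsattr {} +
  # a 'C' flag means the file is nocow, so missing csums are expected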
Re: BTRFS Raid5 error during Scrub.
On Mon, Sep 30, 2019 at 3:37 AM Robert Krig wrote: > > I've upgraded to btrfs-progs v5.2.1 > Here is the output from btrfs check -p --readonly /dev/sda > > > Opening filesystem to check... > Checking filesystem on /dev/sda > UUID: f7573191-664f-4540-a830-71ad654d9301 > [1/7] checking root items (0:01:17 elapsed, > 5138533 items checked) > parent transid verify failed on 48781340082176 wanted 109181 found > 109008items checked) > parent transid verify failed on 48781340082176 wanted 109181 found > 109008 > parent transid verify failed on 48781340082176 wanted 109181 found > 109008 > Ignoring transid failure > leaf parent key incorrect 48781340082176 > bad block 48781340082176 > [2/7] checking extents (0:03:22 elapsed, > 1143429 items checked) > ERROR: errors found in extent allocation tree or chunk allocation > [3/7] checking free space cache(0:05:10 elapsed, 7236 > items checked) > parent transid verify failed on 48781340082176 wanted 109181 found > 109008ems checked) > Ignoring transid failure > root 15197 inode 81781 errors 1000, some csum missing48 elapsed, 33952 > items checked) > [4/7] checking fs roots(0:42:53 elapsed, 34145 > items checked) > ERROR: errors found in fs roots > found 22975533985792 bytes used, error(s) found > total csum bytes: 16806711120 > total tree bytes: 18733842432 > total fs tree bytes: 130121728 > total extent tree bytes: 466305024 > btree space waste bytes: 1100711497 > file data blocks allocated: 3891333279744 > referenced 1669470507008 What do you get for # btrfs insp dump-t -b 48781340082176 /dev/ It's possible there will be filenames, it's OK to sanitize them by just deleting the names from the output before posting it. -- Chris Murphy
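Spelled out in full (btrfs-progs accepts abbreviated subcommands, so 'insp dump-t' expands to this; /dev/sda is taken from the check output quoted above):

  # dump the single metadata block that fails the transid check
  btrfs inspect-internal dump-tree -b 48781340082176 /dev/sda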
Re: BTRFS checksum mismatch - false positives
From the log offlist: 2019-09-08T17:27:02+02:00 MHPNAS kernel: [ 22.396165] md: invalid raid superblock magic on sda5 2019-09-08T17:27:02+02:00 MHPNAS kernel: [ 22.401816] md: sda5 does not have a valid v0.90 superblock, not importing! That doesn't sound good. It's not a Btrfs problem but an md/mdadm problem. You'll have to get support for this from Synology; only they understand the design of the storage stack layout, whether these error messages are important or not, and how to fix them. Anyone else speculating could end up causing damage to the NAS and data loss. 2019-09-08T17:27:02+02:00 MHPNAS kernel: [ 22.913298] md: sda2 has different UUID to sda1 There are several messages like this. I can't tell if they're just informational and benign or a problem. Also not related to Btrfs. 2019-09-08T22:09:33+02:00 MHPNAS kernel: [16997.419199] BTRFS warning (device dm-1): BTRFS: dm-1 checksum verify failed on 375259512832 wanted EA1A10E3 found 3080B64F level 0 2019-09-08T22:09:33+02:00 MHPNAS kernel: [16997.419199] 2019-09-08T22:09:33+02:00 MHPNAS kernel: [16997.458453] BTRFS warning (device dm-1): BTRFS: dm-1 checksum verify failed on 375259512832 wanted EA1A10E3 found 3080B64F level 0 2019-09-08T22:09:33+02:00 MHPNAS kernel: [16997.458453] 2019-09-08T22:09:33+02:00 MHPNAS kernel: [16997.528385] BTRFS: read error corrected: ino 1 off 375259512832 (dev /dev/vg1/volume_1 sector 751819488) 2019-09-08T22:09:33+02:00 MHPNAS kernel: [16997.539631] BTRFS: read error corrected: ino 1 off 375259516928 (dev /dev/vg1/volume_1 sector 751819496) 2019-09-08T22:09:33+02:00 MHPNAS kernel: [16997.550785] BTRFS: read error corrected: ino 1 off 375259521024 (dev /dev/vg1/volume_1 sector 751819504) 2019-09-08T22:09:33+02:00 MHPNAS kernel: [16997.561990] BTRFS: read error corrected: ino 1 off 375259525120 (dev /dev/vg1/volume_1 sector 751819512) There are a bunch of messages like this. Btrfs is finding metadata checksum errors, some kind of corruption has happened with one of the copies, and it's been fixed up. But why are things getting corrupted in the first place? Ordinary bad sectors maybe? There are a lot of these - like really a lot. Hundreds of affected sectors. There are too many for me to read through and see if all of them were corrected by DUP metadata. 2019-09-22T21:24:27+02:00 MHPNAS kernel: [1224856.764098] md2: syno_self_heal_is_valid_md_stat(496): md's current state is not suitable for data correction What does that mean? Also not a Btrfs problem. There are quite a few of these. 2019-09-23T11:49:20+02:00 MHPNAS kernel: [1276791.652946] BTRFS error (device dm-1): BTRFS: dm-1 failed to repair btree csum error on 1353162506240, mirror = 1 OK, and a few of these also. This means that some metadata could not be repaired, likely because both copies are corrupt. My recommendation is to freshen your backups now while you still can, and prepare to rebuild the NAS; i.e. these are not likely repairable problems. Once both copies of Btrfs metadata are bad, it's usually not fixable; you just have to recreate the file system from scratch. You'll have to move everything off the NAS (anything that's really important you will want at least two independent copies of, of course), and then you're going to obliterate the array and start from scratch. While you're at it, you might as well make sure you've got the latest supported version of the software for this product. Start with that. Then follow the Synology procedure to wipe the NAS totally and set it up again.
You'll want to make sure the procedure you use writes out all new metadata for everything: mdadm, LVM, Btrfs. Nothing stale or old should be reused. And then you'll copy your data back over to the NAS. There's nothing in the provided log that helps me understand why this is happening. I suspect hardware problems of some sort - maybe one of the drives is starting to slowly die by spitting out bad sectors. To know more about that we'd need to see 'smartctl -x /dev/' for each drive in the NAS and see if SMART gives a clue. It's somewhere around a 50/50 shot that SMART will predict a drive failure in advance. So my suggestion again, without delay, is to make sure the NAS is backed up, and keep those backups fresh. You can recreate the NAS when you have free time - but these problems likely will get worse. -- Chris Murphy
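A small sketch of gathering that SMART data, assuming four member drives sda through sdd (the actual device names on the NAS may differ):

  # capture full SMART output for each drive for later review
  for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
      smartctl -x "$d" > "/tmp/smart-${d##*/}.txt"
  done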