Re: Help recover from btrfs error
On Sat, Apr 17, 2021 at 4:03 PM Florian Franzeck wrote:
>
> Dear users,
>
> I need help to recover from a btrfs error after a power cut
>
> btrfs-progs v5.4.1
>
> Linux banana 5.4.0-72-generic #80-Ubuntu SMP Mon Apr 12 17:35:00 UTC
> 2021 x86_64 x86_64 x86_64 GNU/Linux
>
> dmesg output:
>
> [ 30.330824] BTRFS info (device md1): disk space caching is enabled
> [ 30.330826] BTRFS info (device md1): has skinny extents
> [ 30.341269] BTRFS error (device md1): parent transid verify failed on
> 201818112 wanted 147946 found 147960
> [ 30.342887] BTRFS error (device md1): parent transid verify failed on
> 201818112 wanted 147946 found 147960
> [ 30.344154] BTRFS warning (device md1): failed to read root
> (objectid=4): -5
> [ 30.375400] BTRFS error (device md1): open_ctree failed
>
> Please advise what to do next to recover data on this disk
>
> Thank a lot

https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#parent_transid_verify_failed

This might be repairable with 'btrfs check --repair --init-extent-tree' but it's really slow. It's almost always faster to just mkfs and restore from backups. If you don't have current backups, you shouldn't use this option first because there's a chance it makes things worse and then it's harder to recover the data.

These are safer if you need to first update backups:

Try 'mount -o usebackuproot'

If that doesn't work, there is a very small chance 5.11 or newer will allow you to mount the file system using 'mount -o rescue=usebackuproot,ignorebadroots' which is a lot easier to do recovery on because you can use normal tools to update your backups.

Try btrfs restore:
https://btrfs.wiki.kernel.org/index.php/Restore

This tool is quite dense with features to help isolate what you want to recover. But the most simple command that tries to recover everything that isn't a snapshot:

btrfs restore -vi -D /dev/ /path/to/save/files

It is also possible to use 'btrfs-find-root' and plug in the address for roots (try most recent first, and then go older) into the 'btrfs restore -t' option. Basically you're pointing it to an older root that hopefully doesn't have damage. The further back you go, though, the more stale the trees are and they could have been overwritten. So you pretty much have to try roots in order from most recent, one by one.

Might be easier to ask on irc.freenode.net, #btrfs.

--
Chris Murphy
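Putting the safer options above in one place, a rough sequence (the destination path is a placeholder, not from your report; drop -D once the dry run looks sane):

# read-only mounts first, to refresh backups with normal tools
mount -o ro,usebackuproot /dev/md1 /mnt
# 5.11 or newer only:
mount -o ro,rescue=usebackuproot,ignorebadroots /dev/md1 /mnt

# offline restore; -D is a dry run
btrfs restore -vi -D /dev/md1 /path/to/save/files

# if that fails, list older tree roots and hand them to restore, newest first
btrfs-find-root /dev/md1
btrfs restore -vi -t <bytenr> /dev/md1 /path/to/save/files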
5.12-rc7 occasional btrfs splat when rebooting
I'm not sure with which rc I first saw this appear. I don't recall seeing it with the 5.11 series. There's nothing unusual reported during the subsequent reboot.

[16212.441466] kernel: dnf (7568) used greatest stack depth: 10752 bytes left
[16332.569785] kernel: Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[16337.349525] kernel: rfkill: input handler enabled
[16339.203377] kernel: BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!
[16339.203439] kernel: turning off the locking correctness validator.
[16339.203491] kernel: Please attach the output of /proc/lock_stat to the bug report
[16339.203555] kernel: CPU: 2 PID: 5625 Comm: signal-desktop Not tainted 5.12.0-0.rc7.189.fc35.x86_64+debug #1
[16339.203636] kernel: Hardware name: HP HP Spectre Notebook/81A0, BIOS F.44 11/25/2019
[16339.203698] kernel: Call Trace:
[16339.203723] kernel: dump_stack+0x7f/0xa1
[16339.203762] kernel: __lock_acquire.cold+0x1a9/0x2bf
[16339.203810] kernel: lock_acquire+0xc4/0x3a0
[16339.203850] kernel: ? __delayacct_thrashing_end+0x36/0x60
[16339.203898] kernel: ? mark_held_locks+0x50/0x80
[16339.203938] kernel: _raw_spin_lock_irqsave+0x4d/0x90
[16339.203981] kernel: ? __delayacct_thrashing_end+0x36/0x60
[16339.204030] kernel: __delayacct_thrashing_end+0x36/0x60
[16339.204077] kernel: wait_on_page_bit_common+0x38e/0x490
[16339.204125] kernel: ? add_page_wait_queue+0xf0/0xf0
[16339.204170] kernel: read_extent_buffer_pages+0x55e/0x610
[16339.204222] kernel: btree_read_extent_buffer_pages+0x97/0x110
[16339.204277] kernel: read_tree_block+0x39/0x60
[16339.204314] kernel: btrfs_read_node_slot+0xe3/0x130
[16339.204358] kernel: push_leaf_left+0x98/0x190
[16339.204400] kernel: btrfs_del_items+0x2ba/0x440
[16339.204446] kernel: btrfs_truncate_inode_items+0x254/0xfc0
[16339.204499] kernel: ? _raw_spin_unlock+0x1f/0x30
[16339.204542] kernel: ? btrfs_block_rsv_migrate+0x6d/0xb0
[16339.204589] kernel: btrfs_evict_inode+0x3fe/0x4e0
[16339.204631] kernel: evict+0xcf/0x1d0
[16339.204662] kernel: __dentry_kill+0xe8/0x190
[16339.204697] kernel: ? dput+0x20/0x480
[16339.204729] kernel: dput+0x2b8/0x480
[16339.204758] kernel: __fput+0x102/0x260
[16339.204792] kernel: task_work_run+0x5c/0xa0
[16339.204830] kernel: do_exit+0x3e1/0xc20
[16339.204864] kernel: ? find_held_lock+0x32/0x90
[16339.204903] kernel: ? sched_clock+0x5/0x10
[16339.204938] kernel: ? sched_clock_cpu+0xc/0xb0
[16339.204977] kernel: do_group_exit+0x39/0xb0
[16339.205008] kernel: get_signal+0x16f/0xb00
[16339.205037] kernel: arch_do_signal_or_restart+0xfc/0x750
[16339.205075] kernel: ? finish_task_switch.isra.0+0xa0/0x2c0
[16339.205120] kernel: ? finish_task_switch.isra.0+0x6a/0x2c0
[16339.205165] kernel: ? do_user_addr_fault+0x1ea/0x6b0
[16339.205208] kernel: exit_to_user_mode_prepare+0x15d/0x240
[16339.205253] kernel: ? asm_exc_page_fault+0x8/0x30
[16339.205296] kernel: irqentry_exit_to_user_mode+0x5/0x40
[16339.205343] kernel: asm_exc_page_fault+0x1e/0x30
[16339.205383] kernel: RIP: 0033:0x7f49d11b6674
[16339.205421] kernel: Code: Unable to access opcode bytes at RIP 0x7f49d11b664a.
[16339.205481] kernel: RSP: 002b:7f49ce07f250 EFLAGS: 00010206
[16339.205530] kernel: RAX: 55593f9bc088 RBX: 7f49d11d9140 RCX: 084e
[16339.205602] kernel: RDX: 0c4e RSI: 0099c84e RDI: 267213a2
[16339.205664] kernel: RBP: R08: 7f49ce07f390 R09: 7f49d11d9400
[16339.205720] kernel: R10: 7f49d11aa540 R11: 005a R12: 005a
[16339.205781] kernel: R13: 7f49ce1c5688 R14: 0001 R15:
[16339.626109] kernel: wlp108s0: deauthenticating from f8:a0:97:6e:c7:e8 by local choice (Reason: 3=DEAUTH_LEAVING)
[16340.238863] kernel: kauditd_printk_skb: 93 callbacks suppressed

--
Chris Murphy
Re: Design strangeness of incremental btrfs send/receive
On Fri, Apr 16, 2021 at 9:03 PM Alexandru Stan wrote:
>
> # sending back incrementally (eg: without sending back file-0) fails
> alex@alex-desktop:/mnt% sudo btrfs send bigfs/myvolume-1 -p
> bigfs/myvolume-3|sudo btrfs receive ssdfs/
> At subvol bigfs/myvolume-1
> At snapshot myvolume-1
> ERROR: cannot find parent subvolume

What about using -c instead of -p?

--
Chris Murphy
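If it helps, here's what that would look like against the command quoted above (untested, just swapping -p for -c, with the snapshot that already exists on the destination as the clone source):

sudo btrfs send -c bigfs/myvolume-3 bigfs/myvolume-1 | sudo btrfs receive ssdfs/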
Re: Dead fs on 2 Fedora systems: block=57084067840 write time tree block corruption detected
On Thu, Apr 15, 2021 at 2:04 AM Niccolò Belli wrote:
>
> Full dmesg: https://pastebin.com/pNBhAPS5

This is at initial ro mount time during boot:

[ 4.035226] BTRFS info (device nvme0n1p8): bdev /dev/nvme0n1p8 errs: wr 0, rd 0, flush 0, corrupt 41, gen 0

There are previously detected corruption events. This is just a simple counter. It could be the same corruption encountered 41 times, or it could be 41 separate corrupt blocks. In other words, older logs might have a clue about what first started going wrong.

> I have another laptop with Arch Linux and btrfs, should I be worried
> about it? Maybe it's a Fedora thing?

Both are using upstream stable Btrfs code. I think the focus at this point is on tracking down a hardware cause for the two problems, however unusual that bad luck is; but also there could be a bug (e.g. repair shouldn't crash).

The correct reaction to corruption on Btrfs is to update backups while you still can, while it's still mounted or can be mounted. Then try repair once the underlying problem has been rectified.

--
Chris Murphy
Re: Dead fs on 2 Fedora systems: block=57084067840 write time tree block corruption detected
First computer/file system (from the photo):

[ 136.259984] BTRFS critical (device nvme0n1p8): corrupt leaf: root=257 block=31259951104 slot=9 ino=3244515, name hash mismatch with key, have 0xF22F547D expect 0x92294C62

This is not obviously a bit flip. I'm not sure what's going on here.

Second computer/file system:

[30177.298027] BTRFS critical (device nvme0n1p8): corrupt leaf: root=791 block=57084067840 slot=64 ino=1537855, name hash mismatch with key, have 0xa461adfd expect 0xa461adf5

This is clearly a bit flip. It's likely some kind of hardware-related problem; despite the memory checking already done, it's just rare enough to evade detection with a typical memory tester like memtest86(+). You could try 'memtester' or '7z b 100' and see if you can trigger it.

It's a catch-22 with such a straightforward problem like a bit flip, that it's risky to attempt a repair which can end up causing worse corruption.

What about the mount options for both file systems? (cat /proc/mounts or /etc/fstab)

--
Chris Murphy
Re: Parent transid verify failed (and more): BTRFS for data storage in Xen VM setup
On Sat, Apr 10, 2021 at 8:49 AM Roman Mamedov wrote:
>
> On Sat, 10 Apr 2021 13:38:57 +
> Paul Leiber wrote:
>
> > d) Perhaps the complete BTRFS setup (Xen, VMs, pass through the partition,
> > Samba share) is flawed?
>
> I kept reading and reading to find where you say you unmounted in on the host,
> and then... :)
>
> > e) Perhaps it is wrong to mount the BTRFS root first in the Dom0 and then
> > accessing the subvolumes in the DomU?
>
> Absolutely O.o
>
> Subvolumes are very much like directories, not any kind of subpartitions.

Right. The block device (partition containing the Btrfs file system) must be exclusively used by one kernel, host or guest. Dom0 or DomU. Can't be both.

The only exception I'm aware of is virtiofs or virtio-9p, but I haven't messed with that stuff yet.

--
Chris Murphy
Re: btrfs+overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
Keeping everything else the same, and only reverting to kernel 5.9.16-200.fc33.x86_64, this kernel message

> overlayfs: upper fs does not support xattr, falling back to index=off and
> metacopy=off

no longer appears when I 'podman system reset' or when 'podman build' bolt, using the overlay driver.

However, I do still get

Bail out! ERROR:../tests/test-common.c:1413:test_io_dir_is_empty: 'empty' should be FALSE

--
Chris Murphy
Re: btrfs+overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
On Sat, Apr 10, 2021 at 1:43 PM Chris Murphy wrote:
>
> On Sat, Apr 10, 2021 at 1:42 PM Chris Murphy wrote:
> >
> > On Sat, Apr 10, 2021 at 1:36 PM Chris Murphy
> > wrote:
> > >
> > > $ sudo mount -o remount,userxattr /home
> > > mount: /home: mount point not mounted or bad option.
> > >
> > > [ 92.573364] BTRFS error (device sda6): unrecognized mount option
> > > 'userxattr'
> > >
> >
> > [ 63.320831] BTRFS error (device sda6): unrecognized mount option
> > 'user_xattr'
> >
> > And if I try it with rootflags at boot, boot fails due to mount
> > failure due to unrecognized mount option.
>
> These are all with kernel 5.12-rc6

Ohhh to tmpfs. Hmmm. I have no idea how to do that with this test suite. I'll ask bolt folks.

I'm just good at bumping into walls, obviously.

--
Chris Murphy
Re: btrfs+overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
On Sat, Apr 10, 2021 at 1:42 PM Chris Murphy wrote:
>
> On Sat, Apr 10, 2021 at 1:36 PM Chris Murphy wrote:
> >
> > $ sudo mount -o remount,userxattr /home
> > mount: /home: mount point not mounted or bad option.
> >
> > [ 92.573364] BTRFS error (device sda6): unrecognized mount option
> > 'userxattr'
> >
>
> [ 63.320831] BTRFS error (device sda6): unrecognized mount option
> 'user_xattr'
>
> And if I try it with rootflags at boot, boot fails due to mount
> failure due to unrecognized mount option.

These are all with kernel 5.12-rc6

--
Chris Murphy
Re: btrfs+overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
On Sat, Apr 10, 2021 at 1:36 PM Chris Murphy wrote:
>
> $ sudo mount -o remount,userxattr /home
> mount: /home: mount point not mounted or bad option.
>
> [ 92.573364] BTRFS error (device sda6): unrecognized mount option
> 'userxattr'

[ 63.320831] BTRFS error (device sda6): unrecognized mount option 'user_xattr'

And if I try it with rootflags at boot, boot fails due to mount failure due to unrecognized mount option.

--
Chris Murphy
Re: btrfs+overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
On Sat, Apr 10, 2021 at 11:55 AM Amir Goldstein wrote:
>
> On Sat, Apr 10, 2021 at 8:36 PM Chris Murphy wrote:
> >
> > I can reproduce the bolt testcase problem in a podman container, with
> > overlay driver, using ext4, xfs, and btrfs. So I think I can drop
> > linux-btrfs@ from this thread.
> >
> > Also I can reproduce the title of this thread simply by 'podman system
> > reset' and see the kernel messages before doing the actual reset. I
> > have a strace here of what it's doing:
> >
> > https://drive.google.com/file/d/1L9lEm5n4-d9qemgCq3ijqoBstM-PP1By/view?usp=sharing
> >
>
> I'm confused. The error in the title of the page is from overlayfs mount().
> I see no mount in the strace.
> I feel that I am missing some info.
> Can you provide the overlayfs mount arguments
> and more information about the underlying layers?

Not really? There are none if a container isn't running, and in this case no containers are running, in fact there are no upper or lower dirs because I had already reset podman before doing 'strace podman system reset' - I get the kernel message twice every time I merely do 'podman system reset'

overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off
overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off

This part of the issue might be something of a goose chase. I don't know if it's relevant or distracting.

> > Yep. I think tmpfs supports xattr but not user xattr? And this example
> > is rootless podman, so it's all unprivileged.
> >
>
> OK, so unprivileged overlayfs mount support was added in v5.11
> and it requires opt-in with mount option "userxattr", which could
> explain the problem if tmpfs is used as upper layer.
>
> Do you know if that is the case?
> I sounds to me like it may not be a kernel regression per-se,
> but a regression in the container runtime that started to use
> a new kernel feature?
> Need more context to understand.
>
> Perhaps the solution will be to add user xattr support to tmpfs..

$ sudo mount -o remount,userxattr /home
mount: /home: mount point not mounted or bad option.

[ 92.573364] BTRFS error (device sda6): unrecognized mount option 'userxattr'

/home is effectively a bind mount because it is backed by a btrfs subvolume...

/dev/sda6 on /home type btrfs (rw,noatime,seclabel,compress=zstd:1,ssd,space_cache=v2,subvolid=586,subvol=/home)

...which is mounted via fstab using -o subvol=home

Is it supported to remount,userxattr? If not then maybe this is needed: rootflags=subvol=root,userxattr

--
Chris Murphy
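For what it's worth, a sketch of where userxattr normally applies - it is an overlayfs mount option (added in 5.11 for unprivileged mounts), not a btrfs one, which would be consistent with btrfs rejecting it above; the paths here are made up:

mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work,userxattr /merged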
Re: btrfs+overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
I can reproduce the bolt testcase problem in a podman container, with overlay driver, using ext4, xfs, and btrfs. So I think I can drop linux-btrfs@ from this thread.

Also I can reproduce the title of this thread simply by 'podman system reset' and see the kernel messages before doing the actual reset. I have a strace here of what it's doing:

https://drive.google.com/file/d/1L9lEm5n4-d9qemgCq3ijqoBstM-PP1By/view?usp=sharing

It may be something intentional. The failing testcase, :../tests/test-common.c:1413:test_io_dir_is_empty, also has more instances of this line, but I don't know if they are related. So I'll keep looking into that.

On Sat, Apr 10, 2021 at 2:04 AM Amir Goldstein wrote:

> As the first step, can you try the suggested fix to ovl_dentry_version_inc()
> and/or adding the missing pr_debug() and including those prints in
> your report?

I'll work with bolt upstream and try to further narrow down when it is and isn't happening.

> > I can reproduce this with 5.12.0-0.rc6.184.fc35.x86_64+debug and at
> > approximately the same time I see one, sometimes more, kernel
> > messages:
> >
> > [ 6295.379283] overlayfs: upper fs does not support xattr, falling
> > back to index=off and metacopy=off.
> >
>
> Can you say why there is no xattr support?

I'm not sure. It could be podman specific or fuse-overlayfs related. Maybe something is using /tmp in one case and not another for some reason?

> Is the overlayfs mount executed without privileges to create trusted.* xattrs?
> The answer to that may be the key to understanding the bug.

Yep. I think tmpfs supports xattr but not user xattr? And this example is rootless podman, so it's all unprivileged.

> My guess is it has to do with changes related to mounting overlayfs
> inside userns, but I couldn't find any immediate suspects.
>
> Do you have any idea since when the regression appeared?
> A bisect would have been helpful here.

Yep. All good ideas. Thanks for the fast reply. I'll report back once this has been narrowed down further.

--
Chris Murphy
btrfs+overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
Hi,

The primary problem is Bolt (Thunderbolt 3) tests that are experiencing a regression when run in a container using overlayfs, failing at:

Bail out! ERROR:../tests/test-common.c:1413:test_io_dir_is_empty: 'empty' should be FALSE

https://gitlab.freedesktop.org/bolt/bolt/-/issues/171#note_872119

I can reproduce this with 5.12.0-0.rc6.184.fc35.x86_64+debug and at approximately the same time I see one, sometimes more, kernel messages:

[ 6295.379283] overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.

But I don't know if that kernel message relates to the bolt test failure.

If I run the test outside of a container, it doesn't fail. If I run the test in a podman container using the btrfs driver instead of the overlay driver, it doesn't fail. So it seems like this is an overlayfs bug, but could be some kind of overlayfs+btrfs interaction.

Could this be related and just not yet merged?
https://lore.kernel.org/linux-unionfs/20210309162654.243184-1-amir7...@gmail.com/

Thanks,

--
Chris Murphy
5.12-rc6 splat, MAX_LOCKDEP_CHAIN_HLOCKS too low, Workqueue: btrfs-delalloc btrfs_work_helper
Got this while building bolt in a podman container. I've got reproduce steps and test files here:

https://bugzilla.redhat.com/show_bug.cgi?id=1948054

[ 3229.119497] overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
[ 3229.155339] overlayfs: upper fs does not support xattr, falling back to index=off and metacopy=off.
[ 3238.380647] BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!
[ 3238.380654] turning off the locking correctness validator.
[ 3238.380656] Please attach the output of /proc/lock_stat to the bug report
[ 3238.380657] CPU: 4 PID: 9115 Comm: kworker/u16:20 Not tainted 5.12.0-0.rc6.184.fc35.x86_64+debug #1
[ 3238.380660] Hardware name: Apple Inc. MacBookPro8,2/Mac-94245A3940C91C80, BIOS MBP81.88Z.0050.B00.1804101331 04/10/18
[ 3238.380663] Workqueue: btrfs-delalloc btrfs_work_helper
[ 3238.380670] Call Trace:
[ 3238.380674] dump_stack+0x7f/0xa1
[ 3238.380680] __lock_acquire.cold+0x1a9/0x2bf
[ 3238.380686] ? __lock_acquire+0x3ac/0x1e10
[ 3238.380691] lock_acquire+0xc4/0x3a0
[ 3238.380695] ? percpu_counter_add_batch+0x45/0x60
[ 3238.380699] ? lock_acquire+0xc4/0x3a0
[ 3238.380702] ? lock_is_held_type+0xa7/0x120
[ 3238.380706] ? __set_page_dirty_nobuffers+0x6b/0x1e0
[ 3238.380711] _raw_spin_lock_irqsave+0x4d/0x90
[ 3238.380715] ? percpu_counter_add_batch+0x45/0x60
[ 3238.380718] percpu_counter_add_batch+0x45/0x60
[ 3238.380721] account_page_dirtied+0x102/0x320
[ 3238.380724] __set_page_dirty_nobuffers+0xa2/0x1e0
[ 3238.380727] set_extent_buffer_dirty+0x63/0x80
[ 3238.380732] btrfs_mark_buffer_dirty+0x60/0x80
[ 3238.380737] copy_for_split+0x29e/0x360
[ 3238.380741] split_leaf+0x1c2/0x5e0
[ 3238.380746] btrfs_search_slot+0x99a/0x9f0
[ 3238.380751] btrfs_insert_empty_items+0x58/0xa0
[ 3238.380754] cow_file_range_inline.constprop.0+0x1cf/0x760
[ 3238.380758] ? __local_bh_enable_ip+0x82/0xd0
[ 3238.380762] ? zstd_put_workspace+0x82/0x160
[ 3238.380765] ? __local_bh_enable_ip+0x82/0xd0
[ 3238.380769] compress_file_range+0x471/0x830
[ 3238.380774] async_cow_start+0x12/0x30
[ 3238.380777] ? submit_compressed_extents+0x410/0x410
[ 3238.380779] btrfs_work_helper+0x105/0x400
[ 3238.380782] ? lock_is_held_type+0xa7/0x120
[ 3238.380786] process_one_work+0x2b0/0x5e0
[ 3238.380791] worker_thread+0x55/0x3c0
[ 3238.380793] ? process_one_work+0x5e0/0x5e0
[ 3238.380796] kthread+0x13a/0x150
[ 3238.380799] ? __kthread_bind_mask+0x60/0x60
[ 3238.380801] ret_from_fork+0x1f/0x30

The /proc/lock_stat is in the downstream bug as an attachment.

There's possibly three things going on here, the bogus overlayfs warning, the lockdep bug, and the call trace with btrfs bits in it. No idea if they are related.
Re: Any ideas what these warnings are about?
> >> knlGS:
> >> CS: 0010 DS: ES: CR0: 80050033
> >> CR2: 7f654cf39010 CR3: 03884000 CR4: 003506f0
> >> Call Trace:
> >> btrfs_commit_transaction+0x448/0xbc0 [btrfs]
> >> ? btrfs_wait_ordered_range+0x1b8/0x210 [btrfs]
> >> ? btrfs_sync_file+0x2b8/0x4e0 [btrfs]
> >> btrfs_sync_file+0x343/0x4e0 [btrfs]
> >> __x64_sys_fsync+0x34/0x60
> >> do_syscall_64+0x33/0x40
> >
> > Normally you need to mount -o flushoncommit to trigger this warning.
> > Maybe sync is triggering it too?
>
> I've looked again and yes, this "special" filesystem is mounted
> flushoncommit and discard=async. Would it be better to not set these
> options, for now?

Flushoncommit is safe but noisy in dmesg, and can make things slow; it just depends on the workload. And discard=async is also considered safe, though relatively new. The only way to know for sure is to disable it, and only it, run for some time period to establish "normative" behavior, and then enable only this option and see if behavior changes from the baseline.

If you don't have a heavy write and delete workload, you may not really need discard=async anyway, and a weekly fstrim is generally sufficient for the vast majority of workloads. Conversely, a heavy write and delete workload translates into a backlog of trim that gets issued all at once, once a week, and can make an SSD bog down after it's issued. So you just have to test it with your particular workload to know.

Discard=async exists because a weekly fstrim, and discard=sync, can supply way too much hinting all at once to the drive about what blocks are no longer needed and are ready for garbage collection. But again, it's workload specific, and even hardware specific. Some hardware is sufficiently overprovisioned that there's no benefit to issuing discards at all, and normal usage gives the drive firmware all it needs to know about what blocks are ready for garbage collection (and erasing blocks to prepare them for future writes).

--
Chris Murphy
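If it helps, a hedged example of dropping only the discard option for such a test, then reintroducing just that one change once you have a baseline (the mountpoint is a placeholder):

mount -o remount,nodiscard /path/to/filesystem
# later, to reintroduce only this one option:
mount -o remount,discard=async /path/to/filesystem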
Re: Re[4]: Filesystem sometimes Hangs
On Wed, Mar 31, 2021 at 8:03 AM Hendrik Friedel wrote:
>
> >>[Mo Mär 29 09:29:21 2021] BTRFS info (device sdc2): turning on sync discard
> >
> >Remove the discard mount option for this file system and see if that
> >fixes the problem. Run it for a week or two, or until you're certain
> >the problem is still happening (or certain it's gone). Some drives
> >just can't handle sync discards, they become really slow and hang,
> >just like you're reporting.
>
> In fstab, this option is not set:
> /dev/disk/by-label/DataPool1 /srv/dev-disk-by-label-DataPool1
> btrfs noatime,defaults,nofail 0 2

You have more than one btrfs file system. I'm suggesting not using discard on any of them to try and narrow down the problem. Something is turning on discards for sdc2, find it and don't use it for a while.

> How do I deactivate discard then?
> These drives are spinning disks. I thought that discard is only relevant
> for SSDs?

It's relevant for thin provisioning and sparse files too. But if sdc2 is a HDD then the sync discard message isn't related to the problem, but it also makes me wonder why something is enabling sync discards on a HDD?

Anyway, I think you're on the right track to try 5.11.11, and if you experience a hang again, use sysrq+w and that will dump the blocked task trace into dmesg. Also include a description of the workload at the time of the hang, and recent commands issued.

--
Chris Murphy
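For reference, one way to capture that trace when the hang happens (run as root; sysrq may need to be enabled first on your distro):

echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
# the blocked task traces then show up in dmesg
dmesg > blocked-tasks.txt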
Re: Re[2]: Filesystem sometimes Hangs
On Tue, Mar 30, 2021 at 6:50 AM Hendrik Friedel wrote:
>
> Next
> >'btrfs check --readonly' (must be done offline ie booted from usb
> >stick). And if it all comes up without errors or problems, you can
> >zero the statistics with 'btrfs dev stats -z'.
> No error found. Neither in btrfs check, nor in scrub.
> So, shall I reset the stats then?

Up to you. It's probably better to zero them because it's obvious if the numbers change from 0, there's a problem.

> 5.10.0-0.bpo.3-amd64

It's probably OK. I'm not sure what upstream stable version this translates into, but current stable are 5.10.27 and 5.11.11. There have been multiple btrfs bug fixes since 5.10.0 was released.

I missed in your first email this line:

> [Mo Mär 29 09:29:21 2021] BTRFS info (device sdc2): turning on sync discard

Remove the discard mount option for this file system and see if that fixes the problem. Run it for a week or two, or until you're certain the problem is still happening (or certain it's gone). Some drives just can't handle sync discards, they become really slow and hang, just like you're reporting.

It's probably adequate to just enable the fstrim.timer, part of util-linux, which runs once per week. If you have really heavy write and delete workloads, you might benefit from the discard=async mount option (async instead of sync). But first you should just not do any discards at all for a while to see if that's the problem, and then deliberately re-introduce just that one single change so you can monitor it for problems.

--
Chris Murphy
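A minimal sketch of those two steps (the mount point is an example, substitute yours):

# zero the per-device error counters so any new errors stand out
btrfs device stats -z /srv/dev-disk-by-label-DataPool1
# weekly trim instead of a discard mount option
systemctl enable --now fstrim.timer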
Re: Support demand on Btrfs crashed fs.
I'm going to fill in some details from the multiday conversation with IRC regulars. We couldn't figure out a way forward.

* WDC Red with Firmware Version: 80.00A80, which is highly suspected to deal with power fail and write caching incorrectly, and at least on Btrfs apparently pretty much always drops writes for critical metadata.
* A power fail / reset happened
* No snapshots
* --repair and --init-extent-tree may not have done anything because they didn't complete
* Less than 10% needs to be recovered and it's accepted that it can't be repaired. The focus is just on a limited restore, but we can't get past the transid failures.

zapan@UBUNTU-SERVER:~$ sudo btrfs check --readonly /dev/md0
Opening filesystem to check...
parent transid verify failed on 23079040831488 wanted 524940 found 524941
parent transid verify failed on 23079040831488 wanted 524940 found 524941
Ignoring transid failure
parent transid verify failed on 23079040319488 wanted 524931 found 524939
Ignoring transid failure
Checking filesystem on /dev/md0
UUID: f4f04e16-ce38-4a57-8434-67562a0790bd
[1/7] checking root items
parent transid verify failed on 23079042863104 wanted 423153 found 524931
parent transid verify failed on 23079042863104 wanted 423153 found 524931
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=23079040999424 item=11 parent level=2 child bytenr=23079042863104 child level=0
ERROR: failed to repair root items: Input/output error
[2/7] checking extents
parent transid verify failed on 23079042863104 wanted 423153 found 524931
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=23079040999424 item=11 parent level=2 child bytenr=23079042863104 child level=0
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
root 5 root dir 256 not found
parent transid verify failed on 23079042863104 wanted 423153 found 524931
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=23079040999424 item=11 parent level=2 child bytenr=23079042863104 child level=0
ERROR: errors found in fs roots
found 0 bytes used, error(s) found
total csum bytes: 0
total tree bytes: 0
total fs tree bytes: 0
total extent tree bytes: 0
btree space waste bytes: 0
file data blocks allocated: 0
 referenced 0

btrfs-find-root doesn't find many options to work with, and all of them fail with 'btrfs restore -t'

zapan@UBUNTU-SERVER:~$ sudo btrfs-find-root /dev/md0
parent transid verify failed on 23079040831488 wanted 524940 found 524941
parent transid verify failed on 23079040831488 wanted 524940 found 524941
Ignoring transid failure
parent transid verify failed on 23079040319488 wanted 524931 found 524939
Ignoring transid failure
Superblock thinks the generation is 524941
Superblock thinks the level is 2
Found tree root at 23079040999424 gen 524941 level 2
Well block 23079040327680(gen: 524940 level: 2) seems good, but generation/level doesn't match, want gen: 524941 level: 2
Well block 23079040389120(gen: 524939 level: 2) seems good, but generation/level doesn't match, want gen: 524941 level: 2

zapan@UBUNTU-SERVER:~$ sudo btrfs restore -viD -t 23079040389120 /dev/md0 /mnt/raid1/restore/
parent transid verify failed on 23079040389120 wanted 524941 found 524939
parent transid verify failed on 23079040389120 wanted 524941 found 524939
Ignoring transid failure
parent transid verify failed on 23079040323584 wanted 524939 found 524941
parent transid verify failed on 23079040323584 wanted 524939 found 524941
Ignoring transid failure
parent transid verify failed on 23079040319488 wanted 524931 found 524939
Ignoring transid failure
This is a dry-run, no files are going to be restored
Reached the end of the tree searching the directory

zapan@UBUNTU-SERVER:~$ sudo btrfs restore -viD -t 23079040327680 /dev/md0 /mnt/raid1/restore/
parent transid verify failed on 23079040327680 wanted 524941 found 524940
parent transid verify failed on 23079040327680 wanted 524941 found 524940
Ignoring transid failure
parent transid verify failed on 23079040831488 wanted 524940 found 524941
parent transid verify failed on 23079040831488 wanted 524940 found 524941
Ignoring transid failure
parent transid verify failed on 23079040319488 wanted 524931 found 524939
Ignoring transid failure
This is a dry-run, no files are going to be restored
Reached the end of the tree searching the directory

--
Chris Murphy
Re: Re: Help needed with filesystem errors: parent transid verify failed
On Tue, Mar 30, 2021 at 2:44 AM B A wrote:
>
> > Gesendet: Dienstag, 30. März 2021 um 00:07 Uhr
> > Von: "Chris Murphy"
> > An: "B A"
> > Cc: "Btrfs BTRFS"
> > Betreff: Re: Help needed with filesystem errors: parent transid verify
> > failed
> >
> > On Sun, Mar 28, 2021 at 9:41 AM B A wrote:
> > >
> > > * Samsung 840 series SSD (SMART data looks fine)
> >
> > EVO or PRO? And what does its /proc/mounts line look like?
>
> Model is MZ-7TD500, which seems to be an EVO. Firmware is DXT08B0Q.

For me smartctl reports:

Device Model: Samsung SSD 840 EVO 250GB
Firmware Version: EXT0DB6Q

Yours might be a PRO or it could just be a different era EVO. Last I checked, Samsung had no firmware updates on their website for the 840 EVO. While I'm aware of some minor firmware bugs related to smartctl testing, so far I've done well over 100 pull-the-power-cord tests while doing heavy writes (with Btrfs), and have never had a problem. So I'd say there's probably not a "per se" problem with this model.

Best guess is that since the leaves pass checksum, it's not corruption, but some SSD equivalent of a misdirected write (?) if that's possible. It just looks like these two leaves are in the wrong place.

> > Total_LBAs_Written?
>
> Raw value: 92857573119

OK, I'm at 33063832698.

Well hopefully --repair will fix it (let us know either way) and if not, then we'll see what Josef can come up with, or alternatively you can just mkfs and restore from backups which will surely be faster.

--
Chris Murphy
Re: Help needed with filesystem errors: parent transid verify failed
On Sun, Mar 28, 2021 at 9:41 AM B A wrote:
>
> * Samsung 840 series SSD (SMART data looks fine)

EVO or PRO? And what does its /proc/mounts line look like?

Total_LBAs_Written?

--
Chris Murphy
Re: help needed with raid 6 filesystem with errors
On Mon, Mar 29, 2021 at 4:22 AM Bas Hulsken wrote:
>
> Dear list,
>
> due to a disk intermittently failing in my 4 disk array, I'm getting
> "transid verify failed" errors on my btrfs filesystem (see attached
> dmesg | grep -i btrfs dump in btrfs_dmesg.txt). When I run a scrub, the
> bad disk (/dev/sdd) becomes unresponsive, so I'm hesitant to try that
> again (happened 3 times now, and was the root cause of the transid
> verify failed errors possibly, at least they did not show up earlier
> than the failed scrub).

Is the dmesg filtered? An unfiltered dmesg might help understand what might be going on with the drive being unresponsive, if it's spitting out any kind of errors itself or if there are kernel link reset messages.

Check if the drive supports SCT ERC:

smartctl -l scterc /dev/sdX

If it does but it isn't enabled, enable it. This is true for all the drives.

smartctl -l scterc,70,70 /dev/sdX

That will result in the drive giving up on errors much sooner rather than doing the very slow "deep recovery" on reads. If this goes beyond 30 seconds, the kernel's command timer will think the device is unresponsive and issue a link reset which is ... bad for this use case. You really want the drive to error out quickly and allow Btrfs to do the fixups.

If you can't configure the SCT ERC on the drives, you'll need to increase the kernel command timeout, which is a per-device value in /sys/block/sdX/device/timeout - default is 30 and chances are 180 is enough (which sounds terribly high, and it is, but reportedly some consumer drives can have such high timeouts). Basically you want the drive timeout to be shorter than the kernel's.

> A new disk is on it's way to use btrfs replace,
> but I'm not sure whehter that will be a wise choice for a filesystem
> with errors. There was never a crash/power failure, so the filesystem
> was unmounted at every reboot, but as said on 3 occasions (after a
> scrub), that unmount was with on of the four drives unresponsive.

The least amount of risk is to not change anything. When you do the replace, make sure you use recent btrfs-progs and use 'btrfs replace' instead of 'btrfs device add/remove'.

https://lore.kernel.org/linux-btrfs/20200627032414.gx10...@hungrycats.org/

If metadata is raid5 too, or if it's not already using space_cache v2, I'd probably leave it alone until after the flakey device is replaced.

> Funnily enough, after a reboot every time the filesystem gets mounted
> without issues (the unresponsive drive is back online), and btrfs check
> --readonly claims the filesystem has no errors (see attached
> btrfs_sdd_check.txt).

I'd take advantage of its cooperative moment by making sure backups are fresh in case things get worse.

> Not sure what to do next, so seeking your advice! The important data on
> the drive is backed up, and I'll be running a verify to see if there
> are any corruptions overnight. Would still like to try to save the
> filesystem if possible though.

--
Chris Murphy
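Putting those two knobs in one place, a sketch (the device name is an example; neither setting survives a reboot, so a udev rule or boot script is needed to keep them applied):

# set SCT ERC to 7 seconds (values are in tenths of a second); repeat per drive
smartctl -l scterc,70,70 /dev/sdd
# fallback if a drive doesn't support SCT ERC: raise the kernel command timeout
echo 180 > /sys/block/sdd/device/timeout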
Re: Filesystem sometimes Hangs
> Mar 28 20:26:20 homeserver kernel: [1298220.030331]
> start_transaction+0x46d/0x540 [btrfs]
> Mar 28 20:26:20 homeserver kernel: [1298220.030361]
> btrfs_create+0x58/0x1f0 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.854109] task:btrfs-cleaner
> state:D stack:0 pid:20078 ppid: 2 flags:0x4000
> Mar 28 20:28:21 homeserver kernel: [1298340.854151]
> wait_current_trans+0xc2/0x120 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.854169]
> start_transaction+0x46d/0x540 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.854183]
> btrfs_drop_snapshot+0x90/0x7f0 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.854202] ?
> btrfs_delete_unused_bgs+0x3e/0x850 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.854218]
> btrfs_clean_one_deleted_snapshot+0xd7/0x130 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.854232]
> cleaner_kthread+0xfa/0x120 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.854247] ?
> btrfs_alloc_root+0x3d0/0x3d0 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.857610]
> wait_current_trans+0xc2/0x120 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.857627]
> start_transaction+0x46d/0x540 [btrfs]
> Mar 28 20:28:21 homeserver kernel: [1298340.857643]
> btrfs_create+0x58/0x1f0 [btrfs]
> Mar 28 20:58:34 homeserver kernel: [1300153.336160] task:btrfs-transacti
> state:D stack:0 pid:20080 ppid: 2 flags:0x4000
> Mar 28 20:58:34 homeserver kernel: [1300153.336215]
> btrfs_commit_transaction+0x92b/0xa50 [btrfs]
> Mar 28 20:58:34 homeserver kernel: [1300153.336246]
> transaction_kthread+0x15d/0x180 [btrfs]
> Mar 28 20:58:34 homeserver kernel: [1300153.336273] ?
> btrfs_cleanup_transaction+0x590/0x590 [btrfs]
>
>
> What could I do to find the cause?

What kernel version?

--
Chris Murphy
Re: Re: Help needed with filesystem errors: parent transid verify failed
On Mon, Mar 29, 2021 at 1:34 AM B A wrote:
>
> This is a very old BTRFS filesystem created with Fedora *23* i.e. a linux
> kernel and btrfs-progs around version 4.2. It was probably created 2015-10-31
> with Fedora 23 beta and kernel 4.2.4 or 4.2.5.
>
> I ran `btrfs scrub` about a month ago without issues. I ran `btrfs check`
> maybe a year ago without issues. I also run `btrfs filesystem balance` from
> time to time (~once a year). None of these have shown the issue before. Does
> that mean that the issue has not been present for a long time (>1 year)?

Maybe. The generation on these two leaves looks recent. But kernels since ~5.3 have a write time tree checker designed to catch metadata errors before they are written.

What do you get for:

btrfs insp dump-s -f /dev/dm-0

Hopefully Qu or Josef will have an idea.

--
Chris Murphy
Re: Help needed with filesystem errors: parent transid verify failed
On Sun, Mar 28, 2021 at 7:02 PM Chris Murphy wrote:
>
> Can you post the output from both:
>
> btrfs insp dump-t -b 1144783093760 /dev/dm-0
> btrfs insp dump-t -b 1144881201152 /dev/dm-0

I'm not sure if those dumps will contain filenames, so check them. It's ok to remove filenames before posting the output. You can also use the option --hide-names.

btrfs insp dump-t --hide-names -b 1144783093760 /dev/dm-0

It may be a good idea to do a memory test as well.

--
Chris Murphy
Re: Help needed with filesystem errors: parent transid verify failed
On Sun, Mar 28, 2021 at 9:41 AM B A wrote:
>
> Dear btrfs experts,
>
> On my desktop PC, I have 1 btrfs partition on a single SSD device with 3
> subvolumes (/, /home, /var). Whenever I boot my PC, after logging in to
> GNOME, the btrfs partition is being remounted as ro due to errors. This is
> the dmesg output at that time:
>
> > [ 616.155392] BTRFS error (device dm-0): parent transid verify failed on
> > 1144783093760 wanted 2734307 found 2734305
> > [ 616.155650] BTRFS error (device dm-0): parent transid verify failed on
> > 1144783093760 wanted 2734307 found 2734305
> > [ 616.155657] BTRFS: error (device dm-0) in __btrfs_free_extent:3054:
> > errno=-5 IO failure
> > [ 616.155662] BTRFS info (device dm-0): forced readonly
> > [ 616.155665] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2124:
> > errno=-5 IO failure

transid error usually means something below Btrfs got the write ordering wrong and one or more writes dropped, but the problem isn't detected until later, which means it's an older problem.

What's the oldest kernel this file system has been written with? That is, is it a new Fedora 33 file system? Or older? Fedora 33 came with 5.8.15.

ERROR: child eb corrupted: parent bytenr=1144783093760 item=14 parent level=1 child level=2
ERROR: child eb corrupted: parent bytenr=1144881201152 item=14 parent level=1 child level=2

Can you post the output from both:

btrfs insp dump-t -b 1144783093760 /dev/dm-0
btrfs insp dump-t -b 1144881201152 /dev/dm-0

> What shall I do now? Do I need any of the invasive methods (`btrfs rescue` or
> `btrfs check --repair`) and if yes, which method do I choose?

No repairs yet until we know what's wrong and if it's safe to try to repair it.

In the meantime I highly recommend refreshing backups of /home in case this can't be repaired. It might be easier to do this with a Live USB boot of Fedora 33, and use 'mount -o ro,subvol=home /dev/dm-0 /mnt/home' to mount your home read-only to get a backup. The live environment will be more cooperative.

--
Chris Murphy
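A rough sketch of that backup step from the live environment (the destination path is made up, use whatever external disk you have):

mkdir -p /mnt/home
mount -o ro,subvol=home /dev/dm-0 /mnt/home
rsync -aHAX /mnt/home/ /run/media/liveuser/backupdisk/home/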
Re: 5.12-rc4: rm directory hangs for > 1m on an idle system
Fresh boot, this time no compression, everything else the same. Time to delete both directories takes as long as it takes to copy one of them ~1m17s.

This time I took an early and late sysrq t pair, and maybe caught some extra stuff.

[ 1190.094618] kernel: Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space
[ 1190.094633] kernel: Call Trace:
[ 1190.094641] kernel: ? find_extent_buffer+0x5/0x200
[ 1190.094656] kernel: ? find_held_lock+0x32/0x90
[ 1190.094683] kernel: ? __lock_acquire+0x172/0x1e10
[ 1190.094694] kernel: ? lock_is_held_type+0xa7/0x120
[ 1190.094714] kernel: ? btrfs_search_slot+0x6d2/0x9f0
[ 1190.094729] kernel: ? btrfs_get_64+0x5e/0x100
[ 1190.094751] kernel: ? lock_acquire+0xc2/0x3a0
[ 1190.094768] kernel: ? _raw_spin_unlock+0x1f/0x30
[ 1190.094779] kernel: ? rcu_read_lock_sched_held+0x3f/0x80
[ 1190.094798] kernel: ? __lock_acquire+0x172/0x1e10
[ 1190.094811] kernel: ? lookup_extent_backref+0x43/0xd0
[ 1190.094829] kernel: ? release_extent_buffer+0xa3/0xe0
[ 1190.094846] kernel: ? __btrfs_free_extent+0x49c/0x8f0
[ 1190.094878] kernel: ? __btrfs_run_delayed_refs+0x29a/0x1270
[ 1190.094912] kernel: ? _raw_spin_unlock+0x1f/0x30
[ 1190.094934] kernel: ? btrfs_run_delayed_refs+0x86/0x210
[ 1190.094954] kernel: ? flush_space+0x570/0x6d0
[ 1190.094966] kernel: ? lock_release+0x280/0x410
[ 1190.094987] kernel: ? btrfs_preempt_reclaim_metadata_space+0x170/0x2f0
[ 1190.095007] kernel: ? process_one_work+0x2b0/0x5e0
[ 1190.095035] kernel: ? worker_thread+0x55/0x3c0
[ 1190.095045] kernel: ? process_one_work+0x5e0/0x5e0
[ 1190.095060] kernel: ? kthread+0x13a/0x150
[ 1190.095070] kernel: ? __kthread_bind_mask+0x60/0x60
[ 1190.095085] kernel: ? ret_from_fork+0x1f/0x30

dmesg:
https://drive.google.com/file/d/1VQNAVynVTJo6VqsRX9K5-Z0dMsLmb-vH/view?usp=sharing
5.12-rc4: rm directory hangs for > 1m on an idle system
5.12.0-0.rc4.175.fc35.x86_64+debug

/dev/sdb1 on /srv/extra type btrfs (rw,relatime,seclabel,compress=zstd:1,space_cache=v2,subvolid=5,subvol=/)

The directories being deleted are on a separate drive (HDD) from / (SSD). It's an unpacked Firefox source tarball, ~2.7G. I had two separate copies, so the rm command was merely:

rm -rf firefox1 firefox2

And that command did not return to a prompt for over a minute, with no disk activity at all, on an otherwise idle laptop. sysrq+w shows nothing, sysrq+t shows some things.

[ 9638.375968] kernel: task:rm state:R running task stack:13176 pid: 2275 ppid: 1892 flags:0x
[ 9638.375986] kernel: Call Trace:
[ 9638.375998] kernel: ? __lock_acquire+0x3ac/0x1e10
[ 9638.376014] kernel: ? __lock_acquire+0x3ac/0x1e10
[ 9638.376036] kernel: ? lock_acquire+0xc2/0x3a0
[ 9638.376051] kernel: ? lock_acquire+0xc2/0x3a0
[ 9638.376069] kernel: ? lock_acquire+0xc2/0x3a0
[ 9638.376081] kernel: ? lock_is_held_type+0xa7/0x120
[ 9638.376090] kernel: ? rcu_read_lock_sched_held+0x3f/0x80
[ 9638.376099] kernel: ? __btrfs_tree_lock+0x27/0x120
[ 9638.376111] kernel: ? __clear_extent_bit+0x274/0x560
[ 9638.376120] kernel: ? _raw_spin_lock_irqsave+0x67/0x90
[ 9638.376139] kernel: ? __lock_acquire+0x3ac/0x1e10
[ 9638.376153] kernel: ? lock_acquire+0xc2/0x3a0
[ 9638.376161] kernel: ? __lock_acquire+0x3ac/0x1e10
[ 9638.376189] kernel: ? lock_is_held_type+0xa7/0x120
[ 9638.376208] kernel: ? release_extent_buffer+0xa3/0xe0
[ 9638.376224] kernel: ? btrfs_update_root_times+0x2a/0x60
[ 9638.376237] kernel: ? btrfs_insert_orphan_item+0x62/0x80
[ 9638.376246] kernel: ? _atomic_dec_and_lock+0x31/0x50
[ 9638.376264] kernel: ? btrfs_evict_inode+0x16b/0x4e0
[ 9638.376273] kernel: ? btrfs_evict_inode+0x370/0x4e0
[ 9638.376293] kernel: ? evict+0xcf/0x1d0
[ 9638.376305] kernel: ? do_unlinkat+0x1b2/0x2c0
[ 9638.376329] kernel: ? do_syscall_64+0x33/0x40
[ 9638.376338] kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xae

The entire dmesg is here:
https://drive.google.com/file/d/1gyyp59Ju1aRIz3FCZU-kmu05-W1NN89A/view?usp=sharing

It isn't nearly as bad deleting one directory at once, ~15s.

--
Chris Murphy
Re: parent transid verify failed / ERROR: could not setup extent tree
On Tue, Mar 23, 2021 at 12:50 AM Dave T wrote:
>
> > d. Just skip the testing and try usebackuproot with a read-write
> > mount. It might make things worse, but at least it's fast to test. If
> > it messes things up, you'll have to recreate this backup from scratch.
>
> I took this approach. My command was simply:
>
> mount -o usebackuproot /dev/mapper/xzy /backup
>
> It appears to have succeeded because it mounted without errors. I
> completed a new incremental backup (with btrbk) and it finished
> without errors.
> I'll be pleased if my backup history is preserved, as appears to be the case.
>
> I will run some checks on those backup subvolumes tomorrow. Are there
> specific checks you would recommend?

It will have replaced all the root nodes and super blocks within a minute, or immediately upon umount. So you can just do a 'btrfs check' and see if that comes up clean now. It's basically a kind of rollback and if it worked, there will be no inconsistencies found by btrfs check.

--
Chris Murphy
Re: parent transid verify failed / ERROR: could not setup extent tree
On Mon, Mar 22, 2021 at 12:32 AM Dave T wrote:
>
> On Sun, Mar 21, 2021 at 2:03 PM Chris Murphy wrote:
> >
> > On Sat, Mar 20, 2021 at 11:54 PM Dave T wrote:
> > >
> > > # btrfs check -r 2853787942912 /dev/mapper/xyz
> > > Opening filesystem to check...
> > > parent transid verify failed on 2853787942912 wanted 29436 found 29433
> > > parent transid verify failed on 2853787942912 wanted 29436 found 29433
> > > parent transid verify failed on 2853787942912 wanted 29436 found 29433
> > > Ignoring transid failure
> > > parent transid verify failed on 2853827723264 wanted 29433 found 29435
> > > parent transid verify failed on 2853827723264 wanted 29433 found 29435
> > > parent transid verify failed on 2853827723264 wanted 29433 found 29435
> > > Ignoring transid failure
> > > leaf parent key incorrect 2853827723264
> > > ERROR: could not setup extent tree
> > > ERROR: cannot open file system
> >
> > btrfs insp dump-t -t 2853827723264 /dev/
>
> # btrfs insp dump-t -t 2853827723264 /dev/mapper/xzy
> btrfs-progs v5.11
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> Ignoring transid failure
> leaf parent key incorrect 2853827608576
> WARNING: could not setup extent tree, skipping it
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> Ignoring transid failure
> leaf parent key incorrect 2853827608576
> Couldn't setup device tree
> ERROR: unable to open /dev/mapper/xzy
>
> # btrfs insp dump-t -t 2853787942912 /dev/mapper/xzy
> btrfs-progs v5.11
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> Ignoring transid failure
> leaf parent key incorrect 2853827608576
> WARNING: could not setup extent tree, skipping it
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> Ignoring transid failure
> leaf parent key incorrect 2853827608576
> Couldn't setup device tree
> ERROR: unable to open /dev/mapper/xzy
>
> # btrfs insp dump-t -t 2853827608576 /dev/mapper/xzy
> btrfs-progs v5.11
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> Ignoring transid failure
> leaf parent key incorrect 2853827608576
> WARNING: could not setup extent tree, skipping it
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> Ignoring transid failure
> leaf parent key incorrect 2853827608576
> Couldn't setup device tree
> ERROR: unable to open /dev/mapper/xzy

That does not look promising. I don't know whether a read-write mount with usebackuproot will recover, or end up with problems.

Options:

a. btrfs check --repair

This probably fails on the same problem, it can't setup the extent tree.

b. btrfs check --init-extent-tree

This is a heavy hammer, it might succeed, but takes a long time. On 5T it might take double digit hours or even single digit days. It's generally faster to just wipe the drive and restore from backups than use init-extent-tree (I understand this *is* your backup).

c. Setup an overlay file on device mapper, to redirect the writes from a read-write mount with usebackuproot. I think it's sufficient to just mount, optionally write some files (empty or not), and umount. Then do a btrfs check to see if the current tree is healthy.

https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file

That guide is a bit complex because it deals with many drives in an mdadm raid, so you can simplify it for just one drive (a rough single-drive sketch follows at the end of this message). The gist is no writes go to the drive itself, it's treated as read-only by device-mapper (in fact you can optionally add a pre-step with the blockdev command and --setro to make sure the entire drive is read-only; just make sure to make it rw once you're done testing). All the writes with this overlay go into a loop mounted file which you intentionally just throw away after testing.

d. Just skip the testing and try usebackuproot with a read-write mount. It might make things worse, but at least it's fast to test. If it messes things up, you'll have to recreate this backup from scratch.

As for how to prevent this? I'm not sure. About the best we can do is disable the drive write cache with a udev rule, and/or raid1 with another make/model drive, and let Btrfs de
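As referenced in option c above, a rough single-drive version of the overlay approach (device name and overlay size are examples; all writes during testing land in the throwaway overlay file, not on the drive):

blockdev --setro /dev/sdX
truncate -s 10G /tmp/overlay.img
loop=$(losetup -f --show /tmp/overlay.img)
size=$(blockdev --getsz /dev/sdX)
dmsetup create overlay-test --table "0 $size snapshot /dev/sdX $loop P 8"
# run the test mounts/repairs against /dev/mapper/overlay-test, then tear it down:
dmsetup remove overlay-test
losetup -d "$loop"
blockdev --setrw /dev/sdX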
Re: parent transid verify failed / ERROR: could not setup extent tree
On Sat, Mar 20, 2021 at 11:54 PM Dave T wrote:
>
> # btrfs check -r 2853787942912 /dev/mapper/xyz
> Opening filesystem to check...
> parent transid verify failed on 2853787942912 wanted 29436 found 29433
> parent transid verify failed on 2853787942912 wanted 29436 found 29433
> parent transid verify failed on 2853787942912 wanted 29436 found 29433
> Ignoring transid failure
> parent transid verify failed on 2853827723264 wanted 29433 found 29435
> parent transid verify failed on 2853827723264 wanted 29433 found 29435
> parent transid verify failed on 2853827723264 wanted 29433 found 29435
> Ignoring transid failure
> leaf parent key incorrect 2853827723264
> ERROR: could not setup extent tree
> ERROR: cannot open file system

btrfs insp dump-t -t 2853827723264 /dev/

> It appears the backup root is already stale.

I'm not sure. If you can post the contents of that leaf (I don't think it will contain filenames but double check), Qu might have an idea if it's safe to try a read-write mount with -o usebackuproot without causing problems later.

> > What you eventually need to look at is what precipitated the transid
> > failures, and avoid it.
>
> The USB drive was disconnected by the user (an accident). I have other
> devices with the same hardware that have never experienced this issue.
>
> Do you have further ideas or suggestions I can try? Thank you for your
> time and for sharing your expertise.

The drive could be getting write ordering wrong all the time, and it only turns into a problem with a crash, power fail, or accidental disconnect. More common is the write ordering is only sometimes wrong, and a crash or powerfail is usually survivable, but leads to a false sense of security about the drive.

The simple theory of write order is data->metadata->sync->super->sync. It shouldn't ever be the case that a newer superblock generation is on stable media before the metadata it points to.

--
Chris Murphy
Re: parent transid verify failed / ERROR: could not setup extent tree
On Sat, Mar 20, 2021 at 5:15 AM Dave T wrote:
>
> I hope to get some expert advice before I proceed. I don't want to
> make things worse. Here's my situation now:
>
> This problem is with an external USB drive and it is encrypted.
> cryptsetup open succeeds. But mount fails.
>
> mount /backup
> mount: /backup: wrong fs type, bad option, bad superblock on
> /dev/mapper/xusbluks, missing codepage or helper program, or other
> error.
>
> Next the following command succeeds:
>
> mount -o ro,recovery /dev/mapper/xusbluks /backup
>
> This is my backup disk (5TB), and I don't have another 5TB disk to
> copy all the data to. I hope I can fix the issue without losing my
> backups.
>
> Next step I did:
>
> # btrfs check /dev/mapper/xyz
> Opening filesystem to check...
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> parent transid verify failed on 2853827608576 wanted 29436 found 29433
> Ignoring transid failure
> leaf parent key incorrect 2853827608576
> ERROR: could not setup extent tree
> ERROR: cannot open file system

From your superblock:

backup 2:
backup_tree_root: 2853787942912 gen: 29433 level: 1

Do this:

btrfs check -r 2853787942912 /dev/xyz

If it comes up clean it's safe to do: mount -o usebackuproot, without needing to use ro. And in that case it'll self recover. You will lose some data, between the commits. It is possible there's partial loss, so it's not enough to just do a scrub, you'll want to freshen the backups as well - if that's what was happening at the time that the trouble happened (the trouble causing the subsequent transid failures).

Sometimes backup roots are already stale and inconsistent due to overwrites, so the btrfs check might find problems with that older root.

What you eventually need to look at is what precipitated the transid failures, and avoid it. Typical is a drive firmware bug where it gets write ordering wrong and then there's a crash or power fail. Possibly one way to work around the bug is disabling the drive's write cache (use a udev rule to make sure it's always applied). Another way is to add a different make/model drive to it, and convert to raid1 profile. And hopefully they won't have overlapping firmware bugs.

--
Chris Murphy
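For the write cache idea, a hedged sketch of such a udev rule (the serial match is a placeholder for your drive's ID_SERIAL; some USB-SATA bridges ignore hdparm and need sdparm or the sysfs cache_type knob instead):

# /etc/udev/rules.d/99-disable-write-cache.rules
ACTION=="add|change", KERNEL=="sd?", ENV{ID_SERIAL}=="Vendor_Model_SERIAL123", RUN+="/usr/sbin/hdparm -W 0 /dev/%k"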
Re: All files are damaged after btrfs restore
On Tue, Mar 16, 2021 at 7:39 PM Qu Wenruo wrote:
>
> > Using that restore I was able to restore approx. 7 TB of the
> > originally stored 22 TB under that directory.
> > Unfortunately nearly all the files are damaged. Small text files are
> > still OK. But every larger binary file is useless.
> > Is there any possibility to fix the filesystem in a way, that I get
> > the data less damaged?
>
> From the result, it looks like the on-disk data get (partially) wiped out.
> I doubt if it's just simple controller failure, but more likely
> something not really reaching disk or something more weird.

Hey Qu, thanks for the reply. So it's not clear until further downthread that it's bcache in writeback mode with an SSD that failed. And I've probably underestimated the significance of how much data (in this case both Btrfs metadata and user data) and for how long it can stay *only* on the SSD with this policy.

https://bcache.evilpiepirate.org/ says it straight up: if using writeback, it is not at all safe for the cache and backing devices to be separated. If the cache device fails, everything on it is gone.

By my reading, for example, if the writeback percent is 50%, and the cache device is 128G, at any given time 64G is *only* on the SSD. There's no idle time flushing to the backing device that eventually makes the backing device possibly a self sufficient storage device on its own; it always needs the cache device.

--
Chris Murphy
Re: All files are damaged after btrfs restore
Hi,

The problem exceeds my knowledge of both Btrfs and bcache/ssd failure modes. I'm not sure what professional data recovery can really do, other than throw a bunch of people at stitching things back together again without any help from the file system. I know that the state of the repair tools is not great, and it is confusing what to use in what order.

I don't know if a support contract from one of the distros supporting Btrfs (most likely SUSE) is a better way to get assistance with this kind of recovery while also supporting development. But that's a question for SUSE sales :)

Most of the emphasis of upstream development has been on preventing problems from happening to critical Btrfs metadata in the first place. Its ability to self-heal really depends on it having independent block devices to write to, e.g. metadata raid 1. Metadata DUP might normally help with only spinning drives, but with a cache device, it's going to cache all of these concurrent metadata writes.

If critical metadata is seriously damaged or missing, it's probably impossible to fix or even skip over with the current state of the tools. Current code needs an entry point into the chunk tree in order to make the logical to physical mapping; and then needs an entry point to the root tree to get to the proper snapshot file tree. If all the recent and critical metadata is lost on the failed bcache caching device, then a totally different strategy is needed. The file btree for the snapshot you want should be on the backing device, as well as its data chunks, and the mapping in the ~94% of the chunk tree that's on disk.

I won't be surprised if the file system is broken beyond repair, but I'd be a little surprised if someone more knowledgeable can't figure out a way to get the data out of a week old snapshot. But that's speculation on my part. I really have no idea how long it could take for bcache in writeback mode to flush to the backing device.

--
Chris Murphy

On Tue, Mar 16, 2021 at 3:35 AM Sebastian Roller wrote:
>
> Hi again.
>
> > Looks like the answer is no. The chunk tree really has to be correct
> > first before anything else because it's central to doing all the
> > logical to physical address translation. And if it's busted and can't
> > be repaired then nothing else is likely to work or be repairable. It's
> > that critical.
> >
> > > I already ran chunk-recover. It needs two days to finish. But I used
> > > btrfs-tools version 4.14 and it failed.
> >
> > I'd have to go dig in git history to even know if there's been
> > improvements in chunk recover since then. But I pretty much consider
> > any file system's tool obsolete within a year. I think it's total
> > nonsense that distributions are intentionally using old tools.
> >
> >
> > > root@hikitty:/mnt$ btrfs rescue chunk-recover /dev/sdf1
> > > Scanning: DONE in dev0
> > > checksum verify failed on 99593231630336 found E4E3BDB6 wanted
> > > checksum verify failed on 99593231630336 found E4E3BDB6 wanted
> > > checksum verify failed on 124762809384960 found E4E3BDB6 wanted
> > > checksum verify failed on 124762809384960 found E4E3BDB6 wanted
> > > checksum verify failed on 124762809384960 found E4E3BDB6 wanted
> > > checksum verify failed on 124762809384960 found E4E3BDB6 wanted
> > > bytenr mismatch, want=124762809384960, have=0
> > > open with broken chunk error
> > > Chunk tree recovery failed
> > >
> > > I could try again with a newer version. (?) Because with version 4.14
> > > also btrfs restore failed.
> >
> > It is entirely possible that 5.11 fails exactly the same way because
> > it's just too badly damaged for the current state of the recovery
> > tools to deal with damage of this kind. But it's also possible it'll
> > work. It's a coin toss unless someone else a lot more familiar with
> > the restore code speaks up. But looking at just the summary change
> > log, it looks like no work has happened in chunk recover for a while.
> >
> > https://btrfs.wiki.kernel.org/index.php/Changelog
>
> So I ran another chunk-recover with btrfs-progs version 5.11. This is
> part of the output. (The list doesn't allow me attach the whole output
> to this mail (5 mb zipped). But if you let me know what's important I
> can send that.)
>
> root@hikitty:~$ nohup /root/install/btrfs-progs-5.11/btrfs -v rescue
> chunk-recover /dev/sdi1 >
> /transfer/sebroll/btrfs-rescue-chunk-recover.out.txt 2>&1 &
> nohup: ignoring input
> All Devices:
> Device: id = 2, name = /dev/sdi1
>
Re: BTRFS error (device sda1): bdev /dev/sdb1 errs: wr 2702175, rd 2719033, flush 0, corrupt 6, gen 0
On Sat, Mar 13, 2021 at 5:22 AM Thomas <74cmo...@gmail.com> wrote: > Gerät Boot Anfang Ende Sektoren Größe Kn Typ > /dev/sdb1 2048 496093750 496091703 236,6G 83 Linux > However the output of btrfs insp dump-s is different: > thomas@pc1-desktop:~ > $ sudo btrfs insp dump-s /dev/sdb1 | grep dev_item.total_bytes > dev_item.total_bytes 256059465728 sdb1 has 253998951936 bytes which is *less* than the btrfs super block is saying it should be. 1.919 GiB less. I'm going to guess that the sdb1 partition was reduced without first shrinking the file system. The most common way this happens is not realizing that each member device of a btrfs file system must be separately shrunk. If you do not specify a devid, then devid 1 is assumed. man btrfs filesystem "The devid can be found in the output of btrfs filesystem show and defaults to 1 if not specified." I bet that the file system was shrunk one time; this shrunk only devid 1, which is also /dev/sda1. But then both partitions were shrunk, thereby truncating sdb1, resulting in these errors. If that's correct, you need to change the sdb1 partition back to its original size (matching the size in the sdb1 btrfs superblock). Scrub the file system so sdb1 can be repaired from any prior damage from the mistake. Then shrink this devid to match the size of the other devid, and then change the partition. > Gerät BootAnfang Ende Sektoren Größe Kn Typ > /dev/sda1 * 2048 496093750 496091703 236,6G 83 Linux > > thomas@pc1-desktop:~ > $ sudo btrfs insp dump-s /dev/sda1 | grep dev_item.total_bytes > dev_item.total_bytes 253998948352 This is fine. The file system is 3584 bytes less than the partition. I'm not sure why it doesn't end on a 4KiB block boundary or why there's a gap before the start of sda2...but at least it's benign. -- Chris Murphy
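For reference, the correct multi-device shrink is one resize per devid, before any partitions are touched. A rough sketch, with a made-up mount point and size (not the values from this report):

  btrfs filesystem show /mnt            # note each devid and its current size
  btrfs filesystem resize 1:230G /mnt   # shrink devid 1
  btrfs filesystem resize 2:230G /mnt   # shrink devid 2; without the "2:" prefix devid 1 is assumed

Only after both resizes succeed should the corresponding partitions be shrunk.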
Re: BTRFS error (device sda1): bdev /dev/sdb1 errs: wr 2702175, rd 2719033, flush 0, corrupt 6, gen 0
[4.365859] usb 8-1: device not accepting address 5, error -71 [4.365920] usb usb8-port1: unable to enumerate USB device [4.433539] BTRFS info (device sda1): bdev /dev/sdb1 errs: wr 2701995, rd 2718862, flush 0, corrupt 6, gen 0 /dev/sdb is dropping a lot of reads and writes. Is /dev/sdb in a SATA-USB enclosure of some kind? [ 16.914959] blk_update_request: I/O error, dev fd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 [ 16.914963] floppy: error 10 while reading block 0 Curious but I don't think it's related. [ 20.685589] attempt to access beyond end of device sdb1: rw=524288, want=496544128, limit=496091703 [ 20.685798] attempt to access beyond end of device sdb1: rw=2049, want=496544128, limit=496091703 [ 20.685804] BTRFS error (device sda1): bdev /dev/sdb1 errs: wr 2701996, rd 2718862, flush 0, corrupt 6, gen 0 Something is definitely confused but I'm not sure what or why. $ sudo btrfs insp dump-s /dev/sdb1 | grep dev_item.total_bytes Compare that value with (Sectors * 512) from: $ sudo fdisk -l /dev/sdb The fdisk number of bytes should be the same as or more than the btrfs bytes. $ sudo smartctl -x /dev/sdb That might require installing the smartmontools package. -- Chris Murphy
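An equivalent way to make that comparison, using the device from this report (blockdev is simply another way to get the partition size in bytes):

  sudo btrfs insp dump-s /dev/sdb1 | grep dev_item.total_bytes
  sudo blockdev --getsize64 /dev/sdb1   # partition size in bytes; should be equal to or larger than the value above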
Re: All files are damaged after btrfs restore
On Tue, Mar 9, 2021 at 10:03 AM Sebastian Roller wrote: > I found 12 of these 'tree roots' on the volume. All the snapshots are > under the same tree root. This seems to be the subvolume where I put > the snapshots. Snapshots are subvolumes. All of them will appear in the root tree, even if they're organized as being in a directory or in some other subvolume. >So for the snapshots there is only one option to use > with btrfs restore -r. It can be done by its own root node address using -f or by subvolid using -r. The latter needs to be looked up in a reliable root tree. But I think the distinction may not matter here because really it's the chunk tree that's messed up, and that's what's used to find everything. The addresses in the file tree (the subvolume/snapshot tree that contains file listings, inodes, metadata, and the address of the file) are all logical addresses in btrfs linear space. That means nothing without the translation to physical device and blocks, which is the job of the chunk tree. >But I also found the data I'm looking for under > some other of these tree roots. One of them is clearly the subvolume > the backup went to (the source of the snapshots). But there is also a > very old snapshot (4 years old) that has a tree root on its own. The > files I restored from there are different -- regarding checksums. > They are also corrupted, but different. I have to do some more > hexdumps to figure out, if it's better. Unfortunately when things are messed up badly, the recovery tools may be looking at a wrong or partial checksum tree and it just spits out checksum complaints as a matter of course. You'd have to inspect the file contents themselves, the checksum warnings might be real or bogus. > > OK this is interesting. There's two chunk trees to choose from. So is > > the restore problem because older roots point to the older chunk tree > > which is already going stale, and just isn't assembling blocks > > correctly anymore? Or is it because the new chunk tree is bad? > > Is there a way to choose the chunk tree I'm using for operations like > btrfs restore? Looks like the answer is no. The chunk tree really has to be correct first before anything else because it's central to doing all the logical to physical address translation. And if it's busted and can't be repaired then nothing else is likely to work or be repairable. It's that critical. > I already ran chunk-recover. It needs two days to finish. But I used > btrfs-tools version 4.14 and it failed. I'd have to go dig in git history to even know if there's been improvements in chunk recover since then. But I pretty much consider any file system's tool obsolete within a year. I think it's total nonsense that distributions are intentionally using old tools. > > root@hikitty:/mnt$ btrfs rescue chunk-recover /dev/sdf1 > Scanning: DONE in dev0 > checksum verify failed on 99593231630336 found E4E3BDB6 wanted > checksum verify failed on 99593231630336 found E4E3BDB6 wanted > checksum verify failed on 124762809384960 found E4E3BDB6 wanted > checksum verify failed on 124762809384960 found E4E3BDB6 wanted > checksum verify failed on 124762809384960 found E4E3BDB6 wanted > checksum verify failed on 124762809384960 found E4E3BDB6 wanted > bytenr mismatch, want=124762809384960, have=0 > open with broken chunk error > Chunk tree recovery failed > > I could try again with a newer version. (?) Because with version 4.14 > also btrfs restore failed. 
It is entirely possible that 5.11 fails exactly the same way because it's just too badly damaged for the current state of the recovery tools to deal with damage of this kind. But it's also possible it'll work. It's a coin toss unless someone else a lot more familiar with the restore code speaks up. But looking at just the summary change log, it looks like no work has happened in chunk recover for a while. https://btrfs.wiki.kernel.org/index.php/Changelog > > btrfs insp dump-t -t 1 /dev/sdi1 > > > > And you'll need to look for a snapshot name in there, find its bytenr, > > and let's first see if just using that works. If it doesn't then maybe > > combining it with the next most recent root tree will work. > > I am working backwards right now using btrfs restore -f in combination > with -t. So far no success. Yep. I think it comes down to the chunk tree needing to be reasonable first, before anything else is possible. -- Chris Murphy
Re: btrfs fails to mount on kernel 5.11.4 but works on 5.10.19
On Sun, Mar 7, 2021 at 7:18 PM Norbert Preining wrote: > > Hi Chris, > > once more .. > > > > Does the initrd on this system contain? > > > /usr/lib/udev/rules.d/64-btrfs.rules > > No, it didn't. > > Now I added it, and with 64-btrfs.rules available in the initrd I still > get the same error (see previous screenshot) :-( I suspect something is wrong with devid 9 in that case. If it's a dracut system, then it waits indefinitely for sysroot. You'll need to boot with something like rd.break=pre-mount and see first if you can mount normally to /sysroot, but if devid 9 is still missing then mount degraded and replace that device. Or otherwise find out why it's missing. I don't think the scrub helps right now, the issue is the device is missing. Where scrub does help is if the device reappears for normal mount following previous degraded mount - the scrub is needed to get the missing device caught up with the rest. -- Chris Murphy
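If devid 9 really is gone, the recovery from that pre-mount shell would go roughly like this; the replacement device path is a placeholder, and any surviving member device (e.g. /dev/sdb3 from the fi show output) can be named in the mount command:

  mount -o degraded /dev/sdb3 /sysroot            # degraded mount using a surviving member
  btrfs replace start 9 /dev/nvme3n1p1 /sysroot   # replace missing devid 9 with the new device
  btrfs replace status /sysroot                   # watch progress

This is a sketch of the degraded-mount-and-replace path, not a guarantee it applies here.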
Re: btrfs fails to mount on kernel 5.11.4 but works on 5.10.19
On Sun, Mar 7, 2021 at 5:25 PM Norbert Preining wrote: > > Hi > > (please cc) > > thanks for your email. First some additional information. Since this > happened I searched and realized that there seem to have been a problem > with 5.12-rc1, which I tried for short time (checking whether AMD-GPU > hangs are fixed). Now I read that -rc1 is a btrfs-killer. I have swap > partition, not swap file, and 64G or RAM, so normally swap is not used, > though. That bug should not have affected the dedicated swap partition case. -- Chris Murphy
Re: btrfs fails to mount on kernel 5.11.4 but works on 5.10.19
On Sun, Mar 7, 2021 at 4:28 PM Norbert Preining wrote: > > Dear all > > (please cc) > > not sure this is the right mailing list, but I cannot boot into 5.11.4 > it gives me > devid 9 uui . > failed to read the system array: -2 > open_ctree failed > (only partial, typed in from photo) Post the photo? This is a generic message and we need to see more information. Is devid 9 missing? Does the initrd on this system contain /usr/lib/udev/rules.d/64-btrfs.rules? That will wait until all devices are available before attempting to mount. If it's not in the initrd, it won't wait and it's prone to races, and you can often get mount failures because not all devices are ready to be mounted. > > OTOH, 5.10.19 boots without a hinch > $ btrfs fi show / > Label: none uuid: 911600cb-bd76-4299-9445-666382e8ad20 > Total devices 8 FS bytes used 3.28TiB > devid1 size 899.01GiB used 670.00GiB path /dev/sdb3 > devid2 size 489.05GiB used 271.00GiB path /dev/sdd > devid3 size 1.82TiB used 1.58TiB path /dev/sde1 > devid4 size 931.51GiB used 708.00GiB path /dev/sdf1 > devid5 size 1.82TiB used 1.58TiB path /dev/sdc1 > devid7 size 931.51GiB used 675.00GiB path /dev/nvme2n1p1 > devid8 size 931.51GiB used 680.03GiB path /dev/nvme1n1p1 > devid9 size 931.51GiB used 678.03GiB path /dev/nvme0n1p1 This seems to be a somewhat risky setup, or at least highly variable in performance. Any single device that fails will result in boot failure. -- Chris Murphy
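If this is a dracut-built initramfs, the check and the fix are roughly (a sketch, not a diagnosis; if the rule is already present the problem is elsewhere, e.g. the device really is missing):

  lsinitrd | grep 64-btrfs.rules   # is the rule inside the current initramfs?
  dracut -f                        # if not, rebuild the initramfs for the running kernel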
Re: All files are damaged after btrfs restore
On Sun, Mar 7, 2021 at 6:58 AM Sebastian Roller wrote: > > Would it make sense to just try restore -t on any root I got with > btrfs-find-root with all of the snapshots? Yes but I think you've tried this and you only got corrupt files or files with holes, so that suggests very recent roots are just bad due to the corruption, and older ones are pointing to a mix of valid and stale blocks and it just ends up in confusion. I think what you're after is 'btrfs restore -f' -f only restore files that are under specified subvolume root pointed by You can get this value from each 'tree root' a.k.a. the root of roots tree, what the super calls simply 'root'. That contains references for all the other trees' roots. For example: item 12 key (257 ROOT_ITEM 0) itemoff 12936 itemsize 439 generation 97406 root_dirid 256 bytenr 30752768 level 1 refs 1 lastsnap 93151 byte_limit 0 bytes_used 2818048 flags 0x0(none) uuid 4a0fa0d3-783c-bc42-bee1-ffcbe7325753 ctransid 97406 otransid 7 stransid 0 rtransid 0 ctime 1615103595.233916841 (2021-03-07 00:53:15) otime 1603562604.21506964 (2020-10-24 12:03:24) drop key (0 UNKNOWN.0 0) level 0 item 13 key (257 ROOT_BACKREF 5) itemoff 12911 itemsize 25 root backref key dirid 256 sequence 2 name newpool The name of this subvolume is newpool, the subvolid is 257, and its address is bytenr 30752768. That's the value to plug into btrfs restore -f The thing is, it needs an intact chunk tree, i.e. not damaged and not too old, in order to translate that logical address into a physical device and physical address. > > > OK so you said there's an original and backup file system, are they > > both in equally bad shape, having been on the same controller? Are > > they both btrfs? > > The original / live file system was not btrfs but xfs. It is in a > different but equally bad state than the backup. We used bcache with a > write-back cache on a ssd which is now completely dead (does not get > recognized by any server anymore). To get the file system mounted I > ran xfs-repair. After that only 6% of the data was left and this is > nearly completely in lost+found. I'm now trying to sort these files by > type, since the data itself looks OK. Unfortunately the surviving > files seem to be the oldest ones. Yeah writeback means the bcache device must survive and be healthy before any repair attempts should be made, even restore attempts. It also means you need hardware isolation, one SSD per HDD. Otherwise one SSD failing means the whole thing falls apart. The mode to use for read caching is writethrough. > backup 0: > backup_tree_root: 122583415865344 gen: 825256 > level: 2 > backup_chunk_root: 141944043454464 gen: 825256 > level: 2 > backup 1: > backup_tree_root: 122343302234112 gen: 825253 > level: 2 > backup_chunk_root: 141944034426880 gen: 825251 > level: 2 > backup 2: > backup_tree_root: 122343762804736 gen: 825254 > level: 2 > backup_chunk_root: 141944034426880 gen: 825251 > level: 2 > backup 3: > backup_tree_root: 122574011269120 gen: 825255 > level: 2 > backup_chunk_root: 141944034426880 gen: 825251 > level: 2 OK this is interesting. There's two chunk trees to choose from. So is the restore problem because older roots point to the older chunk tree which is already going stale, and just isn't assembling blocks correctly anymore? Or is it because the new chunk tree is bad? On 72 TB, the last thing I want to recommend is chunk-recover. That'll take forever but it'd be interesting to know which of these chunk trees is good. The chunk tree is in the system block group. 
It's pretty tiny so it's a small target for being overwritten...and it's cow. So there isn't a reason to immediately start overwriting it. I'm thinking maybe the new one got interrupted by the failure and the old one is intact. Ok so the next step is to find a snapshot you want to restore. btrfs insp dump-t -t 1 /dev/sdi1 And you'll need to look for a snapshot name in there, find its bytenr, and let's first see if just using that works. If it doesn't then maybe combining it with the next most recent root tree will work. -- Chris Murphy
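Putting those two steps together, it would look roughly like this; 30752768 is the bytenr from the example ROOT_ITEM shown earlier in this message, not a value from this file system, and the destination path is a placeholder:

  btrfs insp dump-t -t 1 /dev/sdi1 | less                # find the snapshot's ROOT_ITEM and note its bytenr
  btrfs restore -f 30752768 -v /dev/sdi1 /mnt/recovery   # restore only the subvolume rooted at that bytenr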
convert and scrub: spanning stripes, attempt to access beyond end of device
Hi, Downstream user is running into this bug: https://github.com/kdave/btrfs-progs/issues/349 But additionally the scrub of this converted file system, which still has ext2_saved/image, produces this message: [36365.549230] BTRFS error (device sda8): scrub: tree block 1777055424512 spanning stripes, ignored. logical=1777055367168 [36365.549262] attempt to access beyond end of device sda8: rw=0, want=3470811376, limit=3470811312 Is this a known artifact of the conversion process? Will it go away once the ext2_saved/image is removed? Should I ask the user to create an e2image -Q from the loop mounted rollback image file for inspection? Thanks -- Chris Murphy
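If it helps, what I'd be asking the user for is roughly this, assuming the saved image sits at the usual <mountpoint>/ext2_saved/image location that btrfs-convert leaves behind:

  losetup -f --show /mnt/ext2_saved/image     # attach the saved ext2 image; prints the loop device, e.g. /dev/loop0
  e2image -Q /dev/loop0 ext2-metadata.qcow2   # metadata-only QCOW2 image suitable for inspection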
Re: All files are damaged after btrfs restore
On Thu, Mar 4, 2021 at 8:35 AM Sebastian Roller wrote: > > > I don't know. The exact nature of the damage of a failing controller > > is adding a significant unknown component to it. If it was just a > > matter of not writing anything at all, then there'd be no problem. But > > it sounds like it wrote spurious or corrupt data, possibly into > > locations that weren't even supposed to be written to. > > Unfortunately I cannot figure out exactly what happened. Logs end > Friday night while the backup script was running -- which also > includes a finalizing balancing of the device. Monday morning after > some exchange of hardware the machine came up being unable to mount > the device. It's probably not discernible with logs anyway. What hardware does when it goes berserk? It's chaos. And all file systems have write order requirements. It's fine if at a certain point writes just abruptly stop going to stable media. But if things are written out of order, or if the hardware acknowledges critical metadata writes are written but were actually dropped, it's bad. For all file systems. > OK -- I now had the chance to temporarily switch to 5.11.2. Output > looks cleaner, but the error stays the same. > > root@hikitty:/mnt$ mount -o ro,rescue=all /dev/sdi1 hist/ > > [ 3937.815083] BTRFS info (device sdi1): enabling all of the rescue options > [ 3937.815090] BTRFS info (device sdi1): ignoring data csums > [ 3937.815093] BTRFS info (device sdi1): ignoring bad roots > [ 3937.815095] BTRFS info (device sdi1): disabling log replay at mount time > [ 3937.815098] BTRFS info (device sdi1): disk space caching is enabled > [ 3937.815100] BTRFS info (device sdi1): has skinny extents > [ 3938.903454] BTRFS error (device sdi1): bad tree block start, want > 122583416078336 have 0 > [ 3938.994662] BTRFS error (device sdi1): bad tree block start, want > 99593231630336 have 0 > [ 3939.201321] BTRFS error (device sdi1): bad tree block start, want > 124762809384960 have 0 > [ 3939.221395] BTRFS error (device sdi1): bad tree block start, want > 124762809384960 have 0 > [ 3939.221476] BTRFS error (device sdi1): failed to read block groups: -5 > [ 3939.268928] BTRFS error (device sdi1): open_ctree failed This looks like a super is expecting something that just isn't there at all. If spurious behavior lasted only briefly during the hardware failure, there's a chance of recovery. But this diminishes greatly if the chaotic behavior was on-going for a while, many seconds or a few minutes. > I still hope that there might be some error in the fs created by the > crash, which can be resolved instead of real damage to all the data in > the FS trees. I used a lot of snapshots and deduplication on that > device, so that I expect some damage by a hardware error. But I find > it hard to believe that every file got damaged. Correct. They aren't actually damaged. However, there's maybe 5-15 MiB of critical metadata on Btrfs, and if it gets corrupt, the keys to the maze are lost. And it becomes difficult, sometimes impossible, to "bootstrap" the file system. There are backup entry points, but depending on the workload, they go stale in seconds to a few minutes, and can be subject to being overwritten. 
When 'btrfs restore' does a partial recovery that ends up with a lot of damage and holes, that tells me it's found stale parts of the file system - it's on old rails, so to speak; there's nothing available to tell it that this portion of the tree is just old and not valid anymore (or only partially valid), but also the restore code is designed to be more tolerant of errors because otherwise it would just do nothing at all. I think if you're able to find the most recent root node for a snapshot you want to restore, along with an intact chunk tree, it should be possible to get data out of that snapshot. The difficulty is finding it, because it could be almost anywhere. OK, so you said there's an original and a backup file system; are they both in equally bad shape, having been on the same controller? Are they both btrfs? What do you get for btrfs insp dump-s -f /dev/sdXY? There might be a backup tree root in there that can be used with btrfs restore -t Also, sometimes easier to do this on IRC on freenode.net in the channel #btrfs -- Chris Murphy
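Concretely, the backup roots are visible in the full super dump, and any of those bytenrs can be fed to restore; the device and destination paths below are placeholders:

  btrfs insp dump-s -f /dev/sdXY | grep backup_tree_root
  btrfs restore -t <backup_tree_root bytenr> -v -i /dev/sdXY /path/to/save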
Re: [report] lockdep warning when mounting seed device
On Wed, Feb 24, 2021 at 9:40 PM Su Yue wrote: > > > While playing with seed device(misc/next and v5.11), lockdep > complains the following: > > To reproduce: > > dev1=/dev/sdb1 > dev2=/dev/sdb2 > > umount /mnt > > mkfs.btrfs -f $dev1 > > btrfstune -S 1 $dev1 No mount or copying data to the file system after mkfs and before setting the seed flag? I wonder if that's related to the splat, even though it shouldn't happen. -- Chris Murphy
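For clarity, the variation being asked about would look something like this; same device variable as in the quoted reproducer, and the populate step in the middle is the part that differs:

  mkfs.btrfs -f $dev1
  mount $dev1 /mnt
  cp -a /usr/share/doc /mnt    # put some data on it before sealing it (any data works; this path is just an example)
  umount /mnt
  btrfstune -S 1 $dev1         # now flag it as a seed device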
Re: All files are damaged after btrfs restore
On Fri, Feb 26, 2021 at 9:01 AM Sebastian Roller wrote: > > > > I think you best chance is to start out trying to restore from a > > > recent snapshot. As long as the failed controller wasn't writing > > > totally spurious data in random locations, that snapshot should be > > > intact. > > > > i.e. the strategy for this is btrfs restore -r option > > > > That only takes subvolid. You can get a subvolid listing with -l > > option but this doesn't show the subvolume names yet (patch is > > pending) > > https://github.com/kdave/btrfs-progs/issues/289 > > > > As an alternative to applying that and building yourself, you can > > approximate it with: > > > > sudo btrfs insp dump-t -t 1 /dev/sda6 | grep -A 1 ROOT_REF > > > > e.g. > > item 9 key (FS_TREE ROOT_REF 631) itemoff 14799 itemsize 26 > > root ref key dirid 256 sequence 54 name varlog34 > > > > Using this command I got a complete list of all the snapshots back to > 2016 with full name. > I tried to restore from different snapshots and using btrfs restore -t > from some other older roots. > Unfortunately no matter which root I restore from, the files are > always the same. I selected a list of some larger files, namely ppts > and sgmls from one of our own tools, and restored them from different > roots. Then I compared the files by checksums. They are the same from > all roots I could find the files. > The output of btrfs restore gives me some errors for checksums and > deflate, but most of the files are just listed as restored. > > Errors look like this: > > Restoring > /mnt/dumpo/recover/transfer/Hardware_Software/ABAQUS/AWI/AWI_6.14-2_2015.zip > Restoring > /mnt/dumpo/recover/transfer/Hardware_Software/ABAQUS/AWI/installInstructions.txt > Done searching /Hardware_Software/ABAQUS/AWI > checksum verify failed on 57937054842880 found 00B6 wanted > ERROR: lzo decompress failed: -4 > Error copying data for > /mnt/dumpo/recover/transfer/Hardware_Software/ABAQUS/CM/CMA_win86_32_2012.0928.3/setup.exe > Error searching > /mnt/dumpo/recover/transfer/Hardware_Software/ABAQUS/CM/CMA_win86_32_2012.0928.3/setup.exe > ERROR: lzo decompress failed: -4 > Error copying data for > /mnt/dumpo/recover/transfer/Hardware_Software/ABAQUS/CM/CMAInstaller.msi > ERROR: lzo decompress failed: -4 > Error copying data for > /mnt/dumpo/recover/transfer/Hardware_Software/ABAQUS/CM/setup.exe > Error searching > /mnt/dumpo/recover/transfer/Hardware_Software/ABAQUS/CM/setup.exe > > Most of the files are just listed as "Restoring ...". Still they are > severely damaged afterwards. They seem to contain "holes" filled with > 0x00 (this is from some rudimentary hexdump examination of the files.) > > Any chance to recover/restore from that? Thanks. I don't know. The exact nature of the damage of a failing controller is adding a significant unknown component to it. If it was just a matter of not writing anything at all, then there'd be no problem. But it sounds like it wrote spurious or corrupt data, possibly into locations that weren't even supposed to be written to. I think if the snapshot b-tree is ok, and the chunk b-tree is ok, then it should be possible to recover the data correctly without needing any other tree. I'm not sure if that's how btrfs restore already works. Kernel 5.11 has a new feature, mount -o ro,rescue=all that is more tolerant of mounting when there are various kinds of problems. But there's another thread where a failed controller is thwarting recovery, and that code is being looked at for further enhancement. 
https://lore.kernel.org/linux-btrfs/CAEg-Je-DJW3saYKA2OBLwgyLU6j0JOF7NzXzECi0HJ5hft_5=a...@mail.gmail.com/ -- Chris Murphy
Re: All files are damaged after btrfs restore
On Wed, Feb 24, 2021 at 10:40 PM Chris Murphy wrote: > > I think you best chance is to start out trying to restore from a > recent snapshot. As long as the failed controller wasn't writing > totally spurious data in random locations, that snapshot should be > intact. i.e. the strategy for this is btrfs restore -r option That only takes subvolid. You can get a subvolid listing with -l option but this doesn't show the subvolume names yet (patch is pending) https://github.com/kdave/btrfs-progs/issues/289 As an alternative to applying that and building yourself, you can approximate it with: sudo btrfs insp dump-t -t 1 /dev/sda6 | grep -A 1 ROOT_REF e.g. item 9 key (FS_TREE ROOT_REF 631) itemoff 14799 itemsize 26 root ref key dirid 256 sequence 54 name varlog34 The subvolume varlog34 is subvolid 631. It's the same for snapshots. So the restore command will use -r 631 to restore only from that subvolume. -- Chris Murphy
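Written out in full for that example, the command would be something like this; the destination directory is a placeholder:

  btrfs restore -r 631 -v /dev/sda6 /path/to/save/files   # restore only subvolid 631 (varlog34 in the example above)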
Re: All files are damaged after btrfs restore
108864 > > device name = /dev/sdh1 > superblock bytenr = 274877906944 > > [All bad supers]: > > All supers are valid, no need to recover > > > root@hikitty:/mnt$ btrfs rescue chunk-recover /dev/sdf1 > Scanning: DONE in dev0 > checksum verify failed on 99593231630336 found E4E3BDB6 wanted > checksum verify failed on 99593231630336 found E4E3BDB6 wanted > checksum verify failed on 124762809384960 found E4E3BDB6 wanted > checksum verify failed on 124762809384960 found E4E3BDB6 wanted > checksum verify failed on 124762809384960 found E4E3BDB6 wanted > checksum verify failed on 124762809384960 found E4E3BDB6 wanted > bytenr mismatch, want=124762809384960, have=0 > open with broken chunk error > Chunk tree recovery failed > > ^^ This was btrfs v4.14 > > > root@hikitty:~$ install/btrfs-progs-5.9/btrfs check --readonly /dev/sdi1 > Opening filesystem to check... > checksum verify failed on 99593231630336 found 00B6 wanted > checksum verify failed on 124762809384960 found 00B6 wanted > checksum verify failed on 124762809384960 found 00B6 wanted > checksum verify failed on 124762809384960 found 00B6 wanted > bad tree block 124762809384960, bytenr mismatch, want=124762809384960, have=0 > ERROR: failed to read block groups: Input/output error > ERROR: cannot open file system > > > FIRST MOUNT AT BOOT TIME AFTER DESASTER > Feb 15 08:05:11 hikitty kernel: BTRFS info (device sdf1): disk space > caching is enabled > Feb 15 08:05:11 hikitty kernel: BTRFS info (device sdf1): has skinny extents > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944039161856 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944039161856 (dev /dev/sdf1 sector 3974114336) > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944039165952 (dev /dev/sdf1 sector 3974114344) > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944039170048 (dev /dev/sdf1 sector 3974114352) > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944039174144 (dev /dev/sdf1 sector 3974114360) > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944037851136 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944037851136 (dev /dev/sdf1 sector 3974111776) > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944037855232 (dev /dev/sdf1 sector 3974111784) > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944037859328 (dev /dev/sdf1 sector 3974111792) > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944037863424 (dev /dev/sdf1 sector 3974111800) > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944040767488 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944040767488 (dev /dev/sdf1 sector 3974117472) > Feb 15 08:05:12 hikitty kernel: BTRFS info (device sdf1): read error > corrected: ino 0 off 141944040771584 (dev /dev/sdf1 sector 3974117480) > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944035147776 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944035115008 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS error 
(device sdf1): bad tree > block start, want 141944035131392 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944036327424 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944036278272 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944035164160 have 0 > Feb 15 08:05:12 hikitty kernel: BTRFS error (device sdf1): bad tree > block start, want 141944036294656 have 0 > Feb 15 08:05:16 hikitty kernel: BTRFS error (device sdf1): failed to > verify dev extents against chunks: -5 > Feb 15 08:05:16 hikitty kernel: BTRFS error (device sdf1): open_ctree failed I think your best chance is to start out trying to restore from a recent snapshot. As long as the failed controller wasn't writing totally spurious data in random locations, that snapshot should be intact. If there are no recent snapshots, and it's unknown what the controller was doing while it was failing or how long it was failing for, recovery can be difficult. Try using btrfs-find-root to find older roots, and use that value with the btrfs restore -t option. These are not as tidy as snapshots though; the older they are, the more they dead-end into more recent overwrites. So you want to start out with the most recent roots you can and work backwards in time. -- Chris Murphy
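Mechanically, working backwards through roots looks roughly like this; the bytenr and destination are placeholders, and sdf1 is the device from this thread:

  btrfs-find-root /dev/sdf1                                 # lists candidate tree roots; try the newest generation first
  btrfs restore -t <bytenr> -v -i /dev/sdf1 /path/to/save   # repeat with progressively older bytenrs until the output looks sane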
Re: 5.11 free space tree remount warning
On Sat, Feb 20, 2021 at 5:26 PM Wang Yugui wrote: > 1, this warning [*1] is not loged in /var/log/messages > because it happened after the ro remount of / >my server is a dell PowerEdge T640, this log can be confirmed by > iDRAC console. This is a fair point. The systemd journal is also not logging this for the same reason. I see it on the console on reboots when there's enough of a delay to notice it, and "warning" pretty much always catches my eye. -- Chris Murphy
5.11 free space tree remount warning
Hi, systemd does remount ro at reboot/shutdown time, and if the free space tree exists, this is always logged: [ 27.476941] systemd-shutdown[1]: Unmounting file systems. [ 27.479756] [1601]: Remounting '/' read-only in with options 'seclabel,compress=zstd:1,space_cache=v2,subvolid=258,subvol=/root'. [ 27.489196] BTRFS info (device vda3): using free space tree [ 27.492009] BTRFS warning (device vda3): remount supports changing free space tree only from ro to rw Is there a way to better detect that this isn't an attempt to change to v2? If there's no v1 present, it's not a change. -- Chris Murphy
Re: ERROR: failed to read block groups: Input/output error
(421 ROOT_ITEM 0) 21060222500864 level 2 > > tree key (427 ROOT_ITEM 0) 21061262114816 level 2 > > tree key (428 ROOT_ITEM 0) 21061278040064 level 2 > > tree key (440 ROOT_ITEM 0) 21061362417664 level 2 > > tree key (451 ROOT_ITEM 0) 21061017174016 level 2 > > tree key (454 ROOT_ITEM 0) 21559581114368 level 1 > > tree key (455 ROOT_ITEM 0) 21079314776064 level 1 > > tree key (456 ROOT_ITEM 0) 21058026831872 level 2 > > tree key (457 ROOT_ITEM 0) 21060907909120 level 3 > > tree key (497 ROOT_ITEM 0) 21058120990720 level 2 > > tree key (571 ROOT_ITEM 0) 21058195668992 level 2 > > tree key (599 ROOT_ITEM 0) 21058818015232 level 2 > > tree key (635 ROOT_ITEM 0) 21056973766656 level 2 > > tree key (638 ROOT_ITEM 0) 21061023072256 level 0 > > tree key (676 ROOT_ITEM 0) 21061314330624 level 2 > > tree key (3937 ROOT_ITEM 0) 21061408686080 level 0 > > tree key (3938 ROOT_ITEM 0) 21079315841024 level 1 > > tree key (3957 ROOT_ITEM 0) 21061419139072 level 2 > > tree key (6128 ROOT_ITEM 0) 21061400018944 level 1 > > tree key (8575 ROOT_ITEM 0) 21061023055872 level 0 > > tree key (18949 ROOT_ITEM 1728623) 21080421875712 level 1 > > tree key (18950 ROOT_ITEM 1728624) 21080424726528 level 2 > > tree key (18951 ROOT_ITEM 1728625) 21080424824832 level 2 > > tree key (18952 ROOT_ITEM 1728626) 21080426004480 level 3 > > tree key (18953 ROOT_ITEM 1728627) 21080422105088 level 2 > > tree key (18954 ROOT_ITEM 1728628) 21080424497152 level 2 > > tree key (18955 ROOT_ITEM 1728629) 21080426332160 level 2 > > tree key (18956 ROOT_ITEM 1728631) 21080423645184 level 2 > > tree key (18957 ROOT_ITEM 1728632) 21080425316352 level 2 > > tree key (18958 ROOT_ITEM 1728633) 21080423972864 level 2 > > tree key (18959 ROOT_ITEM 1728634) 2108042240 level 2 > > tree key (18960 ROOT_ITEM 1728635) 21080422662144 level 2 > > tree key (18961 ROOT_ITEM 1728636) 21080423153664 level 2 > > tree key (18962 ROOT_ITEM 1728637) 21080425414656 level 2 > > tree key (18963 ROOT_ITEM 1728638) 21080421171200 level 1 > > tree key (18964 ROOT_ITEM 1728639) 21080423481344 level 2 > > tree key (19721 ROOT_ITEM 0) 21076937326592 level 2 > > checksum verify failed on 21057125580800 found 0026 wanted 0035 > > checksum verify failed on 21057108082688 found 0074 wanted FFC5 > > checksum verify failed on 21057108082688 found 00ED wanted FFC5 > > checksum verify failed on 21057108082688 found 0074 wanted FFC5 > > Csum didn't match > > From what I understand it seems that some EXTENT_ITEM is corrupted and > when mount tries to read block groups it encounters csum mismatch for > it and immediatly aborts. > Is there some tool I could use to check this EXTENT_ITEM and see if it > can be fixed or maybe just removed? > Basically I guess I need to find physical location on disk from this > block number. > Also I think ignoring csum for btrfs inspect would be useful. > > $ btrfs inspect dump-tree -b 21057050689536 /dev/sda > btrfs-progs v5.10.1 > node 21057050689536 level 1 items 281 free space 212 generation > 2262739 owner EXTENT_TREE > node 21057050689536 flags 0x1(WRITTEN) backref revision 1 > fs uuid 8aef11a9-beb6-49ea-9b2d-7876611a39e5 > chunk uuid 4ffec48c-28ed-419d-ba87-229c0adb2ab9 > [...] > key (19264654909440 EXTENT_ITEM 524288) block 21057101103104 gen 2262739 > [...] 
> > > > $ btrfs inspect dump-tree -b 21057101103104 /dev/sda > btrfs-progs v5.10.1 > checksum verify failed on 21057101103104 found 00B9 wanted 0075 > checksum verify failed on 21057101103104 found 009C wanted 0075 > checksum verify failed on 21057101103104 found 00B9 wanted 0075 > Csum didn't match > ERROR: failed to read tree block 21057101103104 > > > Thanks! What do you get for: btrfs rescue super-recover -v /dev/ and btrfs check -b /dev/ You might try kernel 5.11 which has a new mount option that will skip bad roots and csums. It's 'mount -o ro,rescue=all' and while it won't let you fix it, on the off chance it mounts, it'll let you get data out before trying to repair the file system (repair sometimes makes things worse). -- Chris Murphy
Re: corrupt leaf, unexpected item end, unmountable
On Thu, Feb 18, 2021 at 6:12 PM Daniel Dawson wrote: > > On 2/18/21 3:57 PM, Chris Murphy wrote: > > metadata raid6 as well? > > Yes. Once everything else is figured out, you should consider converting metadata to raid1c3. https://lore.kernel.org/linux-btrfs/20200627032414.gx10...@hungrycats.org/ > > What replacement command(s) are you using? > > For this drive, it was "btrfs replace start -r 3 /dev/sda3 /" OK replace is good. > > Do a RAM test for as long as you can tolerate it, or it finds the > > defect. Sometimes they show up quickly, other times days. > I didn't think of a flipped bit. Thanks. > >> devid0 size 457.64GiB used 39.53GiB path /dev/sdc3 > >> devid1 size 457.64GiB used 39.56GiB path /dev/sda3 > >> devid2 size 457.64GiB used 39.56GiB path /dev/sdb3 > >> devid4 size 457.64GiB used 39.53GiB path /dev/sdd3 > > > > This is confusing. devid 3 is claimed to be missing, but fi show isn't > > showing any missing devices. If none of sd[abcd] are devid 3, then > > what dev node is devid 3 and where is it? > It looks to me like btrfs is temporarily assigning devid 0 to the new > device being used as a replacement. That is what I observed before; once > the replace operation was complete, it went back to the normal number. > Since the replacement didn't finish this time, sdc3 is still devid 0. The new replacement is devid 0 during the replacement. The drive being replaced keeps its devid until the end, and then there's a switch: that device is removed, and the signature on the old drive is wiped. Sooo something is still wrong with the above, because there's no devid 3 in that listing, yet there are kernel and btrfs check messages saying devid 3 is missing. It doesn't seem likely that /dev/sdc3 is devid 3 because it can't be both missing and be the mounted dev node. >[ 202.676601] BTRFS warning (device sdc3): devid 3 uuid >911a642e-0a4c-4483-9a1f-cde7b87c5519 is missing Try a reboot, and use blkid to check you've got all devices + 1 (the new one that failed replacement). Verify all supers with 'btrfs rescue super-recover -v', and check that it all correlates with 'btrfs filesystem show' as well. What should be true is the replace will resume upon being normally mounted. But for that to happen, all the drives + 1 must be available. If a tree log is damaged and prevents mount, then you need to make a calculation. You can try to mount with ro,nologreplay and freshen backups for anything you'd rather not lose - just in case things get worse. And then you can zero the log and see if that'll let you normally mount the device (i.e. rw and not degraded). But some of it will depend on what's wrong. -- Chris Murphy
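For the record, the raid1c3 conversion mentioned at the top is a single balance with a convert filter, to be run only once the array is healthy again; the mount point is a placeholder, and raid1c3 needs kernel 5.5 or newer:

  btrfs balance start -mconvert=raid1c3 /mnt   # convert metadata chunks to raid1c3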
Re: corrupt leaf, unexpected item end, unmountable
On Wed, Feb 17, 2021 at 7:43 PM Daniel Dawson wrote: > > I was attempting to replace the drives in an array with RAID6 profile. metadata raid6 as well? What replacement command(s) are you using? > The first replacement was seemingly successful (and there was a scrub > afterward, with no errors). However, about 0.6% into the second > replacement (sdc), something went wrong, and it went read-only (I should > have copied the log of that somehow). Now it refuses to mount, and a > (readonly) check cannot get started. > > > # mount -o ro,degraded /dev/sda3 /mnt > mount: /mnt: can't read superblock on /dev/sda3. > # btrfs rescue super-recover /dev/sda3 > All supers are valid, no need to recover > > > For this, dmesg shows: > > [ 202.675384] BTRFS info (device sdc3): allowing degraded mounts > [ 202.675387] BTRFS info (device sdc3): disk space caching is enabled > [ 202.675389] BTRFS info (device sdc3): has skinny extents > [ 202.676302] BTRFS warning (device sdc3): devid 3 uuid > 911a642e-0a4c-4483-9a1f-cde7b87c5519 is missing > [ 202.676601] BTRFS warning (device sdc3): devid 3 uuid > 911a642e-0a4c-4483-9a1f-cde7b87c5519 is missing What device is devid 3? > [ 202.985528] BTRFS info (device sdc3): bdev /dev/sdb3 errs: wr 0, rd > 0, flush 0, corrupt 26, gen 0 > [ 202.985533] BTRFS info (device sdc3): bdev /dev/sdd3 errs: wr 0, rd > 0, flush 0, corrupt 98, gen 0 > [ 203.278131] BTRFS info (device sdc3): start tree-log replay > [ 203.454496] BTRFS critical (device sdc3): corrupt leaf: root=7 > block=371567214592 slot=0, unexpected item end, have 16315 expect 16283 > [ 203.454499] BTRFS error (device sdc3): block=371567214592 read time > tree block corruption detected > [ 203.454634] BTRFS critical (device sdc3): corrupt leaf: root=7 > block=371567214592 slot=0, unexpected item end, have 16315 expect 16283 > [ 203.454636] BTRFS error (device sdc3): block=371567214592 read time > tree block corruption detected > [ 203.455794] BTRFS critical (device sdc3): corrupt leaf: root=7 > block=371567214592 slot=0, unexpected item end, have 16315 expect 16283 16315 = 0x3fbb, 16283 = 0x3f9b, and 16315 ^ 16283 = 32 (0x20): the two values differ by a single bit (...1110111011 vs ...1110011011). Do a RAM test for as long as you can tolerate it, or until it finds the defect. Sometimes they show up quickly, other times days. > [ 203.455796] BTRFS error (device sdc3): block=371567214592 read time > tree block corruption detected > [ 203.455820] BTRFS: error (device sdc3) in __btrfs_free_extent:3105: > errno=-5 IO failure > [ 203.455823] BTRFS: error (device sdc3) in > btrfs_run_delayed_refs:2208: errno=-5 IO failure > [ 203.455833] BTRFS: error (device sdc3) in btrfs_replay_log:2287: > errno=-5 IO failure (Failed to recover log tree) > [ 203.747758] BTRFS error (device sdc3): open_ctree failed > > > I've looked for, but can't find, any bad blocks on the devices. Also, if > it adds any info... > > # btrfs check --readonly /dev/sda3 > Opening filesystem to check... > warning, device 3 is missing > checksum verify failed on 371587727360 found 00FF wanted 0049 > checksum verify failed on 371587727360 found 0005 wanted 0010 > checksum verify failed on 371587727360 found 0005 wanted 0010 > bad tree block 371587727360, bytenr mismatch, want=371587727360, > have=1076190010624 > ERROR: could not setup extent tree > ERROR: cannot open file system > > > Note: I'm running this off of System Rescue 7.01, which has earlier > versions of things than what the machine in question has installed (the > latter being Linux 5.10.16, with btrfs-progs v5.10.1).
> > # uname -a > Linux sysrescue 5.4.78-1-lts #1 SMP Wed, 18 Nov 2020 19:51:49 + > x86_64 GNU/Linux > # btrfs --version > btrfs-progs v5.4.1 > # btrfs filesystem show > Label: 'vroot2020' uuid: 5214d903-783a-4d14-ac78-046da5ac1db7 > Total devices 4 FS bytes used 65.98GiB > devid0 size 457.64GiB used 39.53GiB path /dev/sdc3 > devid1 size 457.64GiB used 39.56GiB path /dev/sda3 > devid2 size 457.64GiB used 39.56GiB path /dev/sdb3 > devid4 size 457.64GiB used 39.53GiB path /dev/sdd3 This is confusing. devid 3 is claimed to be missing, but fi show isn't showing any missing devices. If none of sd[abcd] are devid 3, then what dev node is devid 3 and where is it? But yeah you're probably best off not trying to fix this file system until the memory is sorted out. -- Chris Murphy
Re: Recovering Btrfs from a freak failure of the disk controller
On Sun, Feb 14, 2021 at 4:24 PM Neal Gompa wrote: > > On Sun, Feb 14, 2021 at 5:11 PM Chris Murphy wrote: > > > > On Sun, Feb 14, 2021 at 1:29 PM Neal Gompa wrote: > > > > > > Hey all, > > > > > > So one of my main computers recently had a disk controller failure > > > that caused my machine to freeze. After rebooting, Btrfs refuses to > > > mount. I tried to do a mount and the following errors show up in the > > > journal: > > > > > > > Feb 14 15:20:49 localhost-live kernel: BTRFS info (device sda3): disk > > > > space caching is enabled > > > > Feb 14 15:20:49 localhost-live kernel: BTRFS info (device sda3): has > > > > skinny extents > > > > Feb 14 15:20:49 localhost-live kernel: BTRFS critical (device sda3): > > > > corrupt leaf: root=401 block=796082176 slot=15 ino=203657, invalid > > > > inode transid: has 96 expect [0, 95] > > > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > > > block=796082176 read time tree block corruption detected > > > > Feb 14 15:20:49 localhost-live kernel: BTRFS critical (device sda3): > > > > corrupt leaf: root=401 block=796082176 slot=15 ino=203657, invalid > > > > inode transid: has 96 expect [0, 95] > > > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > > > block=796082176 read time tree block corruption detected > > > > Feb 14 15:20:49 localhost-live kernel: BTRFS warning (device sda3): > > > > couldn't read tree root > > > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > > > open_ctree failed > > > > > > I've tried to do -o recovery,ro mount and get the same issue. I can't > > > seem to find any reasonably good information on how to do recovery in > > > this scenario, even to just recover enough to copy data off. > > > > > > I'm on Fedora 33, the system was on Linux kernel version 5.9.16 and > > > the Fedora 33 live ISO I'm using has Linux kernel version 5.10.14. I'm > > > using btrfs-progs v5.10. > > > > Oh and also that block: > > > > btrfs insp dump-t -b 796082176 /dev/sda3 > > > > So, I've attached the output of the dump-s and dump-t commands. > > As for the other commands: > > # btrfs check --readonly /dev/sda3 > > Opening filesystem to check... > > parent transid verify failed on 796082176 wanted 94 found 96 Not good. So three different transids in play. Super says generation 94 Leaf block says its generation is 96, and two inodes have transid 96 including the one the tree checker is complaining about. Somehow the super has an older generation than both what's in the leaf and what's expected. > > parent transid verify failed on 796082176 wanted 94 found 96 > > parent transid verify failed on 796082176 wanted 94 found 96 > > Ignoring transid failure > > ERROR: could not setup extent tree > > ERROR: cannot open file system > > # mount -o ro,rescue=all /dev/sda3 /mnt > > mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sda3, > > missing codepage or helper program, or other error. Do you get the same kernel messages as originally reported? Or something different? -- Chris Murphy
Re: Recovering Btrfs from a freak failure of the disk controller
On Sun, Feb 14, 2021 at 1:29 PM Neal Gompa wrote: > > Hey all, > > So one of my main computers recently had a disk controller failure > that caused my machine to freeze. After rebooting, Btrfs refuses to > mount. I tried to do a mount and the following errors show up in the > journal: > > > Feb 14 15:20:49 localhost-live kernel: BTRFS info (device sda3): disk space > > caching is enabled > > Feb 14 15:20:49 localhost-live kernel: BTRFS info (device sda3): has skinny > > extents > > Feb 14 15:20:49 localhost-live kernel: BTRFS critical (device sda3): > > corrupt leaf: root=401 block=796082176 slot=15 ino=203657, invalid inode > > transid: has 96 expect [0, 95] > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > block=796082176 read time tree block corruption detected > > Feb 14 15:20:49 localhost-live kernel: BTRFS critical (device sda3): > > corrupt leaf: root=401 block=796082176 slot=15 ino=203657, invalid inode > > transid: has 96 expect [0, 95] > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > block=796082176 read time tree block corruption detected > > Feb 14 15:20:49 localhost-live kernel: BTRFS warning (device sda3): > > couldn't read tree root > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > open_ctree failed > > I've tried to do -o recovery,ro mount and get the same issue. I can't > seem to find any reasonably good information on how to do recovery in > this scenario, even to just recover enough to copy data off. > > I'm on Fedora 33, the system was on Linux kernel version 5.9.16 and > the Fedora 33 live ISO I'm using has Linux kernel version 5.10.14. I'm > using btrfs-progs v5.10. Oh and also that block: btrfs insp dump-t -b 796082176 /dev/sda3 -- Chris Murphy
Re: Recovering Btrfs from a freak failure of the disk controller
Can you also include: btrfs insp dump-s I wonder if log replay is indicated by a non-zero value for log_root in the super block. If so, check whether ro,nologreplay or ro,nologreplay,usebackuproot works. -- Chris Murphy
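Spelling that out, with sda3 as the device from the report:

  btrfs insp dump-s /dev/sda3 | grep log_root            # a non-zero log_root means a log tree is pending replay
  mount -o ro,nologreplay /dev/sda3 /mnt
  mount -o ro,nologreplay,usebackuproot /dev/sda3 /mnt   # second attempt if the first fails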
Re: Recovering Btrfs from a freak failure of the disk controller
On Sun, Feb 14, 2021 at 1:29 PM Neal Gompa wrote: > > Feb 14 15:20:49 localhost-live kernel: BTRFS info (device sda3): disk space > > caching is enabled > > Feb 14 15:20:49 localhost-live kernel: BTRFS info (device sda3): has skinny > > extents > > Feb 14 15:20:49 localhost-live kernel: BTRFS critical (device sda3): > > corrupt leaf: root=401 block=796082176 slot=15 ino=203657, invalid inode > > transid: has 96 expect [0, 95] > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > block=796082176 read time tree block corruption detected > > Feb 14 15:20:49 localhost-live kernel: BTRFS critical (device sda3): > > corrupt leaf: root=401 block=796082176 slot=15 ino=203657, invalid inode > > transid: has 96 expect [0, 95] > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > block=796082176 read time tree block corruption detected > > Feb 14 15:20:49 localhost-live kernel: BTRFS warning (device sda3): > > couldn't read tree root > > Feb 14 15:20:49 localhost-live kernel: BTRFS error (device sda3): > > open_ctree failed > > I've tried to do -o recovery,ro mount and get the same issue. I can't > seem to find any reasonably good information on how to do recovery in > this scenario, even to just recover enough to copy data off. > > I'm on Fedora 33, the system was on Linux kernel version 5.9.16 and > the Fedora 33 live ISO I'm using has Linux kernel version 5.10.14. I'm > using btrfs-progs v5.10. > > Can anyone help? >has 96 expect [0, 95] Off by one error. I haven't previously seen this with 'invalid inode transid'. There's an old kernel bug (long since fixed) that can inject garbage into the inode transid but that's not what's going on here. What do you get for: btrfs check --readonly In the meantime, it might be worth trying 5.11-rc7 or rc8 with the new 'ro,rescue=all' mount option and see if it can skip over this kind of problem. The "parent transid verify failed" are pretty serious, again not the same thing here. But I'm not sure how resilient repair is for either off by one errors, or bitflips still. -- Chris Murphy
Re: is BTRFS_IOC_DEFRAG behavior optimal?
On Wed, Feb 10, 2021 at 11:12 PM Zygo Blaxell wrote: > > If we want the data compressed (and who doesn't? journal data compresses > 8:1 with btrfs zstd) then we'll always need to make a copy at close. > Because systemd used prealloc, the copy is necessarily to a new inode, > as there's no way to re-enable compression on an inode once prealloc > is used (this has deep disk-format reasons, but not as deep as the > nodatacow ones). Pretty sure sd-journald still fallocates even when the journals are datacow (done by touching /etc/tmpfiles.d/journal-nocow.conf). And I know for sure those datacow files do compress on rotation. Preallocated datacow might not be so bad if it weren't for that one damn header or indexing block, whatever the proper term is, that sd-journald hammers every time it fsyncs. I don't know if I wanna know what it means to snapshot a datacow file that's prealloc. But in theory if the same blocks weren't all being hammered, a preallocated file shouldn't fragment like hell if each prealloc block gets just one write. > If we don't care about compression or datasums, then keep the file > nodatacow and do nothing at close. The defrag isn't needed and the > FS_NOCOW_FL flag change doesn't work. Agreed. > It makes sense for SSD too. It's 4K extents, so the metadata and small-IO > overheads will be non-trivial even on SSD. Deleting or truncating datacow > journal files will put a lot of tiny free space holes into the filesystem. > It will flood the next commit with delayed refs and push up latency. I haven't seen meaningful latency on a single journal file, datacow and heavily fragmented, on ssd. But to test on more than one file at a time I need to revert the defrag commits, and build systemd, and let a bunch of journals accumulate somehow. If I dump too much data artificially to try and mimic aging, I know I will get nowhere near as many of those 4KiB extents. So I dunno. > > > In that case the fragmentation is > > quite considerable, hundreds to thousands of extents. It's > > sufficiently bad that it'd be probably be better if they were > > defragmented automatically with a trigger that tests for number of > > non-contiguous small blocks that somehow cheaply estimates latency > > reading all of them. > > Yeah it would be nice of autodefrag could be made to not suck. It triggers on inserts, not appends. So it doesn't do anything for the sd-journald case. I would think the active journals are the ones more likely to get searched for recent events than archived journals. So in the datacow case, you only get relief once it's rotated. It'd be nice to find a decent, not necessarily perfect, way for them to not get so fragmented in the first place. Or just defrag once a file has 16M of non-contiguous extents. Estimating extents though is another issue, especially with compression enabled. -- Chris Murphy
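The "touching" above refers to masking the stock tmpfiles rule; a sketch of what that setup looks like (the stock rule ships in /usr/lib/tmpfiles.d and marks /var/log/journal nodatacow, and the machine-id path is a placeholder):

  touch /etc/tmpfiles.d/journal-nocow.conf   # an empty file in /etc masks the rule of the same name in /usr/lib
  lsattr -d /var/log/journal/<machine-id>    # 'C' here means new journal files will inherit nodatacow

Existing journal files keep whatever attribute they were created with; only newly created files are affected.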
Re: is BTRFS_IOC_DEFRAG behavior optimal?
On Wed, Feb 10, 2021 at 8:13 PM Zygo Blaxell wrote: > > > At file close, the systemd should copy the data to a new file with no > > special attributes and discard or recycle the old inode. This copy > > will be mostly contiguous and have desirable properties like csums and > > compression, and will have iops equivalent to btrfs fi defrag. Or switch to a cow-friendly format that's no worse on overwriting file systems, but improves things on Btrfs and ZFS. RocksDB does well. -- Chris Murphy
Re: is BTRFS_IOC_DEFRAG behavior optimal?
On Wed, Feb 10, 2021 at 8:13 PM Zygo Blaxell wrote: > > Sorry, I busted my mail client. That was from me. :-P > > On Wed, Feb 10, 2021 at 10:08:37PM -0500, kreij...@inwind.it wrote: > > On Wed, Feb 10, 2021 at 08:14:09PM +0100, Goffredo Baroncelli wrote: > > > Hi Chris, > > > > > > it seems that systemd-journald is more smart/complex than I thought: > > > > > > 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it > > > closes the files, it mark again these as COW then defrag [1] > > > > > > 2) looking at the code, I suspect that systemd-journald closes the > > > file asynchronously [2]. This means that looking at the "live" journal > > > is not sufficient. In fact: > > > > > > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *) > > > [...] > > > - > > > user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-bd4f-0005baed61106a18.journal > > > - > > > system@3f2405cf9bcf42f0abe6de5bc702e394-bd64-0005baed659feff4.journal > > > - > > > user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-bd67-0005baed65a0901f.journal > > > ---C- > > > system@3f2405cf9bcf42f0abe6de5bc702e394-cc63-0005bafed4f12f0a.journal > > > ---C- > > > user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-cc85-0005baff0ce27e49.journal > > > ---C- > > > system@3f2405cf9bcf42f0abe6de5bc702e394-cd38-0005baffe9080b4d.journal > > > ---C- > > > user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-cd3b-0005baffe908f244.journal > > > ---C- user-1000.journal > > > ---C- system.journal > > > > > > The output above means that the last 6 files are "pending" for a > > > de-fragmentation. When these will be > > > "closed", the NOCOW flag will be removed and a defragmentation will start. > > > > Wait what? > > > > > Now my journals have few (2 or 3 extents). But I saw cases where the > > > extents > > > of the more recent files are hundreds, but after few "journalct --rotate" > > > the older files become less > > > fragmented. > > > > > > [1] > > > https://github.com/systemd/systemd/blob/fee6441601c979165ebcbb35472036439f8dad5f/src/libsystemd/sd-journal/journal-file.c#L383 > > > > That line doesn't work, and systemd ignores the error. > > > > The NOCOW flag cannot be set or cleared unless the file is empty. > > This is checked in btrfs_ioctl_setflags. > > > > This is not something that can be changed easily--if the NOCOW bit is > > cleared on a non-empty file, btrfs data read code will expect csums > > that aren't present on disk because they were written while the file was > > NODATASUM, and the reads will fail pretty badly. The entire file would > > have to have csums added or removed at the same time as the flag change > > (or all nodatacow file reads take a performance hit looking for csums > > that may or may not be present). > > > > At file close, the systemd should copy the data to a new file with no > > special attributes and discard or recycle the old inode. This copy > > will be mostly contiguous and have desirable properties like csums and > > compression, and will have iops equivalent to btrfs fi defrag. Journals implement their own checksumming. Yeah, if there's corruption, Btrfs raid can't do a transparent fixup. But the whole journal isn't lost, just the affected record. *shrug* I think if (a) nodatacow and/or (b) SSD, just leave it alone. Why add more writes? In particular the nodatacow case where I'm seeing consistently the file made from multiples of 8MB contiguous blocks, even on HDD the seek latency here can't be worth defraging the file. I think defrag makes sense (a) datacow journals, i.e. 
the default nodatacow is inhibited, and (b) HDD. In that case the fragmentation is quite considerable, hundreds to thousands of extents. It's sufficiently bad that it'd probably be better if they were defragmented automatically, with a trigger that tests for the number of non-contiguous small blocks and somehow cheaply estimates the latency of reading all of them. Since the files are interleaved, doing something like "systemctl status dbus" might actually read many blocks even if the result isn't a whole lot of visible data. But on SSD, cow or nocow, and HDD nocow - I think just leave them alone. -- Chris Murphy
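A minimal sketch of the "copy to a new file at close" idea from this reply, done from the shell rather than inside journald (paths are examples, and it assumes the enclosing directory does not itself carry the +C attribute, otherwise the new inode inherits nodatacow anyway):

src=/var/log/journal/<machine-id>/archived.journal
cp --reflink=never -- "$src" "$src.new"   # full data copy to a fresh inode, so it picks up csums/compression
sync "$src.new"                           # flush the copy before swapping the names
mv -- "$src.new" "$src"                   # replace the old nodatacow inode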
Re: is BTRFS_IOC_DEFRAG behavior optimal?
On Wed, Feb 10, 2021 at 12:14 PM Goffredo Baroncelli wrote: > > Hi Chris, > > it seems that systemd-journald is more smart/complex than I thought: > > 1) systemd-journald set the "live" journal as NOCOW; *when* (see below) it > closes the files, it mark again these as COW then defrag [1] Found that in commit 11689d2a021d95a8447d938180e0962cd9439763 from 2015. But archived journals are still all nocow for me on systemd 247. Is it because the enclosing directory has file attribute 'C' ? Another example: Active journal "system.journal" INODE_ITEM contains sequence 4515 flags 0x13(NODATASUM|NODATACOW|PREALLOC) 7 day old archived journal "systemd.journal" INODE_ITEM shows: sequence 227 flags 0x13(NODATASUM|NODATACOW|PREALLOC) So if it ever was COW, it flipped to NOCOW before the defrag. Is it expected? and also this archived file's INODE_ITEM shows generation 1748644 transid 1760983 size 16777216 nbytes 16777216 with EXTENT_ITEMs show generation 1755533 type 1 (regular) generation 1753668 type 1 (regular) generation 1755533 type 1 (regular) generation 1753989 type 1 (regular) generation 1755533 type 1 (regular) generation 1753526 type 1 (regular) generation 1755533 type 1 (regular) generation 1755531 type 1 (regular) generation 1755533 type 1 (regular) generation 1755531 type 2 (prealloc) file tree output for this file https://pastebin.com/6uDFNDdd > 2) looking at the code, I suspect that systemd-journald closes the > file asynchronously [2]. This means that looking at the "live" journal > is not sufficient. In fact: > > /var/log/journal/e84907d099904117b355a99c98378dca$ sudo lsattr $(ls -rt *) > [...] > - > user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-bd4f-0005baed61106a18.journal > - > system@3f2405cf9bcf42f0abe6de5bc702e394-bd64-0005baed659feff4.journal > - > user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-bd67-0005baed65a0901f.journal > ---C- > system@3f2405cf9bcf42f0abe6de5bc702e394-cc63-0005bafed4f12f0a.journal > ---C- > user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-cc85-0005baff0ce27e49.journal > ---C- > system@3f2405cf9bcf42f0abe6de5bc702e394-cd38-0005baffe9080b4d.journal > ---C- > user-1000@97aaac476dfc404f9f2a7f6744bbf2ac-cd3b-0005baffe908f244.journal > ---C- user-1000.journal > ---C- system.journal > > The output above means that the last 6 files are "pending" for a > de-fragmentation. When these will be > "closed", the NOCOW flag will be removed and a defragmentation will start. > > Now my journals have few (2 or 3 extents). But I saw cases where the extents > of the more recent files are hundreds, but after few "journalct --rotate" the > older files become less > fragmented. Josef explained to me that BTRFS_IOC_DEFRAG is pretty simple and just dirties extents it considers too small, and they end up just going through the normal write path, along with anything else pending. And also that fsync() will set the extents on disk so that the defrag ioctl know what to dirty, but that ordinarily it's not required and might have to do with the interleaving write pattern for the journals. I'm not sure what this ioctl considers big enough that it's worth just leaving alone. But in any case it sounds like the current write workload at the time of defrag could affect the allocation, unlike BTRFS_IOC_DEFRAG_RANGE which has a few knobs to control the outcome. Or maybe the knobs just influence the outcome. Not sure. If the device is HDD, it might be nice if the nodatacow journals are datacow again so they could be compressed. 
But my evaluation shows that nodatacow journals stick to an 8MB extent pattern, correlating to fallocated append as they grow. It's not significantly fragmented to start out with, whether HDD or SSD. -- Chris Murphy
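For anyone wanting to double check what their archived journals ended up with, roughly this is how I look at it (paths, device and tree id are placeholders; dump-tree output on a mounted filesystem can be slightly stale):

f=/var/log/journal/<machine-id>/system@<...>.journal
lsattr "$f"                               # a 'C' in the attribute column means No_COW
ino=$(stat -c %i "$f")
btrfs inspect-internal dump-tree -t <tree-id> /dev/<device> | grep -A3 "($ino INODE_ITEM 0)"
# the flags field is where NODATASUM|NODATACOW|PREALLOC shows up, as quoted above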
Re: is BTRFS_IOC_DEFRAG behavior optimal?
This is an active (but idle) system.journal file. That is, it's open but not being written to. I did a sync right before this: https://pastebin.com/jHh5tfpe And then: btrfs fi defrag -l 8M system.journal https://pastebin.com/Kq1GjJuh Looks like most of it was a no op. So it seems btrfs in this case is not confused by so many small extent items, it knows they are contiguous? It doesn't answer the question of what the "too small" threshold is for BTRFS_IOC_DEFRAG, which is what sd-journald is using, though. Another sync, and then, 'journalctl --rotate' and the resulting archived file is now: https://pastebin.com/aqac0dRj These are not the same results between the two ioctls for the same file, and not the same result as what you get with -l 32M (which I do get if I use the default 32M). The BTRFS_IOC_DEFRAG interleaved result is peculiar, but I don't think we can say it's ineffective; it might be an intentional no op, either because it's nodatacow or because it sees that these many extents are mostly contiguous and not worth defragmenting (which would be good for keeping write amplification down). So I don't know, maybe it's not wrong. -- Chris Murphy
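For reference, the sequence above boils down to roughly this (filenames assumed; note 'btrfs fi defrag' goes through BTRFS_IOC_DEFRAG_RANGE, while journald's rotate path uses BTRFS_IOC_DEFRAG):

cd /var/log/journal/<machine-id>
sync
filefrag -v system.journal                # baseline extent map
btrfs fi defrag -l 8M system.journal      # DEFRAG_RANGE with len=8M
filefrag -v system.journal
sync
journalctl --rotate                       # journald archives the file and defrags it via BTRFS_IOC_DEFRAG
filefrag -v system@*.journal              # compare the archived result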
Re: is BTRFS_IOC_DEFRAG behavior optimal?
On Tue, Feb 9, 2021 at 11:13 AM Goffredo Baroncelli wrote: > > On 2/9/21 1:42 AM, Chris Murphy wrote: > > Perhaps. Attach strace to journald before --rotate, and then --rotate > > > > https://pastebin.com/UGihfCG9 > > I looked to this strace. > > in line 115: it is called a ioctl() > in line 123: it is called a ioctl() > > However the two descriptors for which the defrag is invoked are never sync-ed > before. > > I was expecting is to see a sync (flush the data on the platters) and then a > ioctl(. This doesn't seems to be looking from the strace. > > I wrote a script (see below) which basically: > - create a fragmented file > - run filefrag on it > - optionally sync the file <- > - run btrfs fi defrag on it > - run filefrag on it > > If I don't perform the sync, the defrag is ineffective. But if I sync the > file BEFORE doing the defrag, I got only one extent. > Now my hypothesis is: the journal log files are bad de-fragmented because > these > are not sync-ed before. > This could be tested quite easily putting an fsync() before the > ioctl(). > > Any thought ? No idea. If it's a full sync then it could be expensive on either slower devices or heavier workloads. On the one hand, there's no point of doing an ineffective defrag so maybe the defrag ioctl should just do the sync first? On the other hand, this would effectively make the defrag ioctl a full file system sync which might be unexpected. It's a set of tradeoffs and I don't know what the expectation is. What about fdatasync() on the journal file rather than a full sync? -- Chris Murphy
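A rough shell equivalent of the test being described, for anyone who wants to repeat it (assumptions: a btrfs mount at /mnt, and that many small fsynced appends are enough to produce a fragmented file):

f=/mnt/fragtest
rm -f "$f"
for i in $(seq 200); do
    dd if=/dev/urandom of="$f" bs=4K count=1 oflag=append conv=notrunc,fsync status=none
done
filefrag "$f"            # extent count before defrag
# sync "$f"              # the variable under test: fsync the file first
btrfs fi defrag "$f"
sync
filefrag "$f"            # extent count after defrag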
Re: is BTRFS_IOC_DEFRAG behavior optimal?
On Tue, Feb 9, 2021 at 12:45 PM Goffredo Baroncelli wrote: > > On 2/9/21 8:01 PM, Chris Murphy wrote: > > On Tue, Feb 9, 2021 at 11:13 AM Goffredo Baroncelli > > wrote: > >> > >> On 2/9/21 1:42 AM, Chris Murphy wrote: > >>> Perhaps. Attach strace to journald before --rotate, and then --rotate > >>> > >>> https://pastebin.com/UGihfCG9 > >> > >> I looked to this strace. > >> > >> in line 115: it is called a ioctl() > >> in line 123: it is called a ioctl() > >> > >> However the two descriptors for which the defrag is invoked are never > >> sync-ed before. > >> > >> I was expecting is to see a sync (flush the data on the platters) and then > >> a > >> ioctl(. This doesn't seems to be looking from the strace. > >> > >> I wrote a script (see below) which basically: > >> - create a fragmented file > >> - run filefrag on it > >> - optionally sync the file <- > >> - run btrfs fi defrag on it > >> - run filefrag on it > >> > >> If I don't perform the sync, the defrag is ineffective. But if I sync the > >> file BEFORE doing the defrag, I got only one extent. > >> Now my hypothesis is: the journal log files are bad de-fragmented because > >> these > >> are not sync-ed before. > >> This could be tested quite easily putting an fsync() before the > >> ioctl(). > >> > >> Any thought ? > > > > No idea. If it's a full sync then it could be expensive on either > > slower devices or heavier workloads. On the one hand, there's no point > > of doing an ineffective defrag so maybe the defrag ioctl should just > > do the sync first? On the other hand, this would effectively make the > > defrag ioctl a full file system sync which might be unexpected. It's a > > set of tradeoffs and I don't know what the expectation is. > > > > What about fdatasync() on the journal file rather than a full sync? > > I tried a fsync(2) call, and the results is the same. > Only after reading your reply I realized that I used a sync(2), when > I meant to use fsync(2). > > I update my python test code Ok fsync should be least costly of the three. The three unique things about systemd-journald that might be factors: * nodatacow file * fallocated file in 8MB increments multiple times up to 128M * BTRFS_IOC_DEFRAG, whereas btrfs-progs uses BTRFS_IOC_DEFRAG_RANGE So maybe it's all explained by lack of fsync, I'm not sure. But the commit that added this doesn't show any form of sync. https://github.com/systemd/systemd/commit/f27a386430cc7a27ebd06899d93310fb3bd4cee7 -- Chris Murphy
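To poke at those three factors in isolation, something like this should be close (paths assumed; from the shell the last step necessarily goes through BTRFS_IOC_DEFRAG_RANGE rather than BTRFS_IOC_DEFRAG, so it's only an approximation of what journald does):

f=/var/log/journal/testfile
touch "$f"
chattr +C "$f"                            # 1) nodatacow, has to be set while the file is still empty
for i in 0 1 2; do
    fallocate -o $((i*8*1024*1024)) -l 8M "$f"    # 2) grow in fallocated 8 MiB steps
done
btrfs fi defrag "$f"                      # 3) stand-in for journald's defrag at archive time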
Re: is BTRFS_IOC_DEFRAG behavior optimal?
On Mon, Feb 8, 2021 at 3:21 PM Zygo Blaxell wrote: > defrag will put the file's contents back into delalloc, and it won't be > allocated until a flush (fsync, sync, or commit interval). Defrag is > roughly equivalent to simply copying the data to a new file in btrfs, > except the logical extents are atomically updated to point to the new > location. BTRFS_IOC_DEFRAG results: https://pastebin.com/1ufErVMs BTRFS_IOC_DEFRAG_RANGE results: https://pastebin.com/429fZmNB They're different. Questions: is this a bug? Is it intentional? Does the interleaved BTRFS_IOC_DEFRAG version improve things over the non-defragmented file, which had only 3 8MB extents for a 24MB file, plus 1 4KiB block? Should BTRFS_IOC_DEFRAG be capable of estimating fragmentation and just do a no op in that case? > FIEMAP has an option flag to sync the data before returning a map. > DEFRAG has an option to start IO immediately so it will presumably be > done by the time you look at the extents with FIEMAP. I waited for the defrag result to settle, so the results I've posted are stable. > Be very careful how you set up this test case. If you use fallocate on > a file, it has a _permanent_ effect on the inode, and alters a lot of > normal btrfs behavior downstream. You won't see these effects if you > just write some data to a file without using prealloc. OK. That might answer the idempotent question. Following BTRFS_IOC_DEFRAG most unwritten extents are no longer present. I can't figure out the pattern. Some of the archived journals have them, others have one, but none have the four or more that I see in active use journals. And then when defragged with BTRFS_IOC_DEFRAG_RANGE none of those have unwritten extents. Since the file is changing each time it goes through the ioctl it makes sense what comes out the back end is different. While BTRFS_IOC_DEFRAG_RANGE has a no op if an extent is bigger than the -l (len=) value, I can't tell that BTRFS_IOC_DEFRAG has any sort of no op unless there are no fragments at all *shrug*. Maybe they should use BTRFS_IOC_DEFRAG_RANGE and specify an 8MB extent? Because in the nodatacow case, that's what they already have and it'd be a no op. And then for the datacow case... well I don't like unconditional write amplification on SSDs just to satisfy the HDD case. But it'd be avoidable by just using default (nodatacow for the journals). -- Chris Murphy
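To see the permanent prealloc effect Zygo warns about, a quick comparison like this works (assumes a btrfs mount at /mnt):

fallocate -l 24M /mnt/prealloc-test                        # preallocated, nothing written yet
dd if=/dev/zero of=/mnt/prealloc-test bs=1M count=8 conv=notrunc,fsync status=none
filefrag -v /mnt/prealloc-test                             # the written range shows normal extents,
                                                           # the rest stays flagged "unwritten"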
Re: is BTRFS_IOC_DEFRAG behavior optimal?
On Mon, Feb 8, 2021 at 3:11 PM Goffredo Baroncelli wrote: > > On 2/7/21 11:06 PM, Chris Murphy wrote: > > systemd-journald journals on Btrfs default to nodatacow, upon log > > rotation it's submitted for defragmenting with BTRFS_IOC_DEFRAG. The > > result looks curious. I can't tell what the logic is from the results. > > > > The journal file starts out being fallocated with a size of 8MB, and > > as it grows there is an append of 8MB increments, also fallocated. > > This leads to a filefrag -v that looks like this (ext4 and btrfs > > nodatacow follow the same behavior, both are provided for reference): > > > > ext4 > > https://pastebin.com/6vuufwXt > > > > btrfs > > https://pastebin.com/Y18B2m4h > > > > Following defragment with BTRFS_IOC_DEFRAG it looks like this: > > https://pastebin.com/1ufErVMs > > > > It appears at first glance to be significantly more fragmented. Closer > > inspection shows that most of the extents weren't relocated. But > > what's up with the peculiar interleaving? Is this an improvement over > > the original allocation? > > I am not sure how read the filefrag output: I see several lines like > [...] > 5: 1691..1693: 125477..125479: 3: > 6: 1694..1694: 125480..125480: 1: > unwritten > [...] > > What means "unwritten" ? The kernel documentation [*] says: My understanding is it's an exent that's been fallocated but not yet written to. What I don't know is whether they are possibly tripping up BTRFS_IOC_DEFRAG. I'm not skilled enough to create a bunch of these journal logs quickly (I'd have to just let a system run and age its own journals, which sucks, it takes forever) and then a small program that runs the same file through BTRFS_IOC_DEFRAG twice to see if it's idempotent. The resulting file after one submission does not have unwritten extents. Another thing I'm not sure of is whether ssd vs nossd affects the defrag results. Or datacow versus nodatacow. Another thing I'm not sure of is if autodefrag is a better solution to the problem. Whereby it acts as a no op when the file is nodatacow, and does the expected thing if it's datacow. But then we'd need an autodefrag xattr to set on the enclosing directory for these journals because there's no reliable way to set autodefrag mount option globally, not knowing all the work loads. It can make some workloads worse. > My educate guess is that there is something strange in the sequence: > - write > - sync > - close log > - move log > - defrag log > > May be the defrag starts before all the data reach the platters ? Perhaps. Attach strace to journald before --rotate, and then --rotate https://pastebin.com/UGihfCG9 > > For what matters, I create a file with the same fragmentation like your one > > $ sudo filefrag -v data.txt > Filesystem type is: 9123683e > File size of data.txt is 25165824 (6144 blocks of 4096 bytes) > ext: logical_offset:physical_offset: length: expected: flags: > 0:0.. 0:1597171.. 1597171: 1: > 1:1..1599: 163433285.. 163434883: 1599:1597172: > 2: 1600..1607:1601255.. 1601262: 8: 163434884: > 3: 1608..1689:1604137.. 1604218: 82:1601263: > 4: 1690..1690:1597484.. 1597484: 1:1604219: > 5: 1691..1693:1597465.. 1597467: 3:1597485: > 6: 1694..1694:1597966.. 1597966: 1:1597468: > 7: 1695..1722:1599557.. 1599584: 28:1597967: > 8: 1723..1723:1599211.. 1599211: 1:1599585: > 9: 1724..1955:1648394.. 1648625:232:1599212: >10: 1956..1956:1599695.. 1599695: 1:1648626: >11: 1957..2047:1625881.. 1625971: 91:1599696: >12: 2048..2417:1648804.. 1649173:370:1625972: >13: 2418..2420:1597468.. 
1597470: 3:1649174: >14: 2421..2478:1624667.. 1624724: 58:1597471: >15: 2479..2479:1596416.. 1596416: 1:1624725: >16: 2480..2482:1601045.. 1601047: 3:1596417: >17: 2483..2483:1596854.. 1596854: 1:1601048: >18: 2484..2523:1602715.. 1602754: 40:1596855: >19: 2524..2527:1597471.. 1597474: 4:1602755: >20: 2528..2598:1624725.. 1624795: 71:1597475: >21: 2599..2599:1596858.. 1596858: 1:1624796: >22: 2600..2607:1601263.
is BTRFS_IOC_DEFRAG behavior optimal?
systemd-journald journals on Btrfs default to nodatacow, upon log rotation it's submitted for defragmenting with BTRFS_IOC_DEFRAG. The result looks curious. I can't tell what the logic is from the results. The journal file starts out being fallocated with a size of 8MB, and as it grows there is an append of 8MB increments, also fallocated. This leads to a filefrag -v that looks like this (ext4 and btrfs nodatacow follow the same behavior, both are provided for reference): ext4 https://pastebin.com/6vuufwXt btrfs https://pastebin.com/Y18B2m4h Following defragment with BTRFS_IOC_DEFRAG it looks like this: https://pastebin.com/1ufErVMs It appears at first glance to be significantly more fragmented. Closer inspection shows that most of the extents weren't relocated. But what's up with the peculiar interleaving? Is this an improvement over the original allocation? https://pastebin.com/1ufErVMs If I unwind the interleaving, it looks like all the extents fall into two localities and within each locality the extents aren't that far apart - so my guess is that this file is also not meaningfully fragmented, in practice. Surely the drive firmware will reorder the reads to arrive at the least amount of seeks? -- Chris Murphy
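One hedged way to "unwind the interleaving" from a filefrag listing is to sort the extents by physical start block so locality is easier to eyeball (the field handling below is illustrative and may need adjusting for a given filefrag version):

filefrag -v system.journal | tail -n +4 | sort -t: -k3 -n | less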
Re: btrfs becomes read only on removal of folders
On Thu, Feb 4, 2021 at 4:04 AM mig...@rozsas.eng.br wrote: > https://susepaste.org/51166386 It's raid1 metadata on the same physical device, so depending on the device, if the metadata writes are concurrent they may end up being deduped by the drive firmware no matter that they're supposed to go to separate partitions. Feb 02 13:43:37 kimera.rozsas.eng.br kernel: BTRFS error (device sdc2): unable to fixup (regular) error at logical 557651984384 on dev /dev/sdc1 Feb 02 13:43:37 kimera.rozsas.eng.br kernel: BTRFS error (device sdc2): unable to fixup (regular) error at logical 557651869696 on dev /dev/sdc1 This suggests both copies are bad. > So, what is going here ? > How can I fix this FS ? I would do a memory test, the longer the better. Memory defects can be evasive. Take the opportunity to freshen backups while the file system still mounts read-only. And then also provide the output from btrfs check --readonly It might be something that can be repaired, but until you've isolated memory, any repair or new writes can end up with the same problem. But if it's not just a bit flip, and both copies are bad, then it's usually a case of backup, reformat, restore. Hence the backup needs to be the top priority; and checking the memory the second priority. -- Chris Murphy
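Spelled out, the order of operations I'd suggest (device names taken from the log above; adjust as needed):

mount -o ro /dev/sdc2 /mnt                 # 1) refresh backups while it still mounts read-only
rsync -aHAX /mnt/ /path/to/backup/
umount /mnt
# 2) memory test: boot memtest86+ or similar and let it run for several passes
btrfs check --readonly /dev/sdc2           # 3) then post this output; no --repair yet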
Re: Need help for my Unraid cache drive
On Sat, Jan 30, 2021 at 1:59 AM Patrick Bihlmayer wrote: > > Hello together, > > today i had an issue with my cache drive on my Unraid Server. > I used a 500GB SSD as cache drive. > > Unfortunately i added another cache drive (wanted a separate drive for my VMs > and accidentally added into the cache device pool) > After starting the array and all the setup for the cache device pool was done > i stopped the array again. > I removed the second drive from my cache device pool again. > > I started the array again - formatted the removed drive mounted it with > unassigned devices.# > And then i realized the following error in my Unraid Cache Devices > > > > Unfortunately i cannot mount it again. > Can you please help me? I don't know anything about unraid. The attached dmesg contains: [ 3660.395013] BTRFS info (device sdb1): allowing degraded mounts [ 3660.395014] BTRFS info (device sdb1): disk space caching is enabled [ 3660.395014] BTRFS info (device sdb1): has skinny extents [ 3660.395733] BTRFS error (device sdb1): failed to read chunk root [ 3660.404212] BTRFS error (device sdb1): open_ctree failed Is that sdb1 device part of the unraid? Is there a device missing? The 'allowing degraded mounts' message along with 'open_ctree failed' suggests that there's still too many devices missing. I suggest a relatively recent btrfs-progs, 5.7 or higher, and provide the output from: btrfs insp dump-s /dev/sdb1 -- Chris Murphy
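A couple of quick checks that usually narrow this down (btrfs-progs 5.7+ assumed):

btrfs filesystem show                      # any "** missing **" devices listed for that fsid?
btrfs inspect-internal dump-super /dev/sdb1 | grep -E 'fsid|num_devices|generation'
# if num_devices is larger than the number of devices actually present, the pool is incomplete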
Re: is back and forth incremental send/receive supported/stable?
It needs testing but I think the -c option can work for this case, because the parent on both source and destination is identical, even if the new destination (the old source) has an unexpected received subvolume uuid. At least for me, it worked once and I didn't explore it further. I also don't know if it'll set received uuid, such that subsequent sends can use -p instead of -c. -c generally still confuses me... in particular multiple instances of -c -- Chris Murphy
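As a sketch, this is what I had in mind (subvolume names invented; it assumes the snapshot "base" exists, unmodified, on both machines):

# on the new source (the old destination):
btrfs subvolume snapshot -r /pool/live /pool/live.ro
btrfs send -c /pool/base /pool/live.ro | ssh old-source btrfs receive /pool
# whether the received snapshot ends up with a received_uuid, so that later
# sends can switch to -p, is exactly the part that needs testing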
Re: ENOSPC in btrfs_run_delayed_refs with 5.10.8 + zstd
> Overall: > Device size: 931.49GiB > Device allocated: 931.49GiB > Device unallocated: 1.00MiB > Device missing: 0.00B > Used: 786.39GiB > Free (estimated): 107.69GiB (min: 107.69GiB) > Data ratio: 1.00 > Metadata ratio: 1.00 > Global reserve: 512.00MiB (used: 0.00B) > Multiple profiles: no > > Data,single: Size:884.48GiB, Used:776.79GiB (87.82%) > /dev/mapper/LUKS-RC-a6414fd731ce4f878af44c3987bce533 884.48GiB > > Metadata,single: Size:47.01GiB, Used:9.59GiB (20.41%) > /dev/mapper/LUKS-RC-a6414fd731ce4f878af44c3987bce533 47.01GiB > > System,single: Size:4.00MiB, Used:144.00KiB (3.52%) > /dev/mapper/LUKS-RC-a6414fd731ce4f878af44c3987bce533 4.00MiB > > Unallocated: > /dev/mapper/LUKS-RC-a6414fd731ce4f878af44c3987bce533 1.00MiB Can you mount or remount with enospc_debug, and reproduce the problem? That'll include some debug info that might be helpful to a developer coming across this report. Also it might help: cd /sys/fs/btrfs/$UUID/allocation grep -R . And post that too. The $UUID is the file system UUID for this specific file system, as reported by blkid or lsblk -f. -- Chris Murphy
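Concretely, collecting that would look something like this (mountpoint assumed; the UUID can also come straight from findmnt):

mount -o remount,enospc_debug /mountpoint
# ...reproduce the ENOSPC and save dmesg, then:
grep -R . /sys/fs/btrfs/$(findmnt -no UUID /mountpoint)/allocation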
Re: Only one subvolume can be mounted after replace/balance
On Wed, Jan 27, 2021 at 6:10 AM Jakob Schöttl wrote: > > Thank you Chris, it's resolved now, see below. > > Am 25.01.21 um 23:47 schrieb Chris Murphy: > > On Sat, Jan 23, 2021 at 7:50 AM Jakob Schöttl wrote: > >> Hi, > >> > >> In short: > >> When mounting a second subvolume from a pool, I get this error: > >> "mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sda, > >> missing code page or helper program, or other." > >> dmesg | grep BTRFS only shows this error: > >> info (device sda): disk space caching is enabled > >> error (device sda): Remounting read-write after error is not allowed > > It went read-only before this because it's confused. You need to > > unmount it before it can be mounted rw. In some cases a reboot is > > needed. > Oh, I didn't notice that the pool was already mounted (via fstab). > The filesystem where out of space and I had to resize both disks > separately. And I had to mount with -o skip_balance for that. Now it > works again. > > >> What happened: > >> > >> In my RAID1 pool with two disk, I successfully replaced one disk with > >> > >> btrfs replace start 2 /dev/sdx > >> > >> After that, I mounted the pool and did > > I don't understand this sequence. In order to do a replace, the file > > system is already mounted. > That was, what I did before my actual problem occurred. But it's > resolved now. > > >> btrfs fi show /mnt > >> > >> which showed WARNINGs about > >> "filesystems with multiple block group profiles detected" > >> (don't remember exactly) > >> > >> I thought it is a good idea to do > >> > >> btrfs balance start /mnt > >> > >> which finished without errors. > > Balance alone does not convert block groups to a new profile. You have > > to explicitly select a conversion filter, e.g. > > > > btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt > I didn't want to convert to a new profile. I thought btrfs replace > automatically uses the same profile as the pool? Btrfs replace does not change the profile. But you reported mixed profile block groups, which means conversion is indicated to make sure they're al the same. Please post: sudo btrfs fi us /mnt Let's see what the block groups are and what you want them to be and then see what conversion command might be indicated. -- Chris Murphy
Re: btrfs becomes read-only
On Wed, Jan 27, 2021 at 6:05 AM Alexey Isaev wrote: > > I managed to run btrs check, but it didn't solve the problem: > > aleksey@host:~$ sudo btrfs check --repair /dev/sdg OK it's risky to run --repair without a developer giving a go ahead, in particular with older versions of btrfs-progs. There are warnings in the man page about it. > [sudo] password for aleksey: > enabling repair mode > Checking filesystem on /dev/sdg > UUID: 070ce9af-6511-4b89-a501-0823514320c1 > checking extents > parent transid verify failed on 52180048330752 wanted 132477 found 132432 > parent transid verify failed on 52180048330752 wanted 132477 found 132432 > parent transid verify failed on 52180048330752 wanted 132477 found 132432 > parent transid verify failed on 52180048330752 wanted 132477 found 132432 > Ignoring transid failure > leaf parent key incorrect 52180048330752 > bad block 52180048330752 > Errors found in extent allocation tree or chunk allocation > parent transid verify failed on 52180048330752 wanted 132477 found 132432 Yeah it's not finding what it's expecting to find there. Any power fail or crash in the history of the file system? What do you get for: btrfs insp dump-s -f /dev/sdg -- Chris Murphy
Re: btrfs becomes read-only
On Wed, Jan 27, 2021 at 1:57 AM Alexey Isaev wrote: > > kernel version: > > aleksey@host:~$ sudo uname --all > Linux host 4.15.0-132-generic #136~16.04.1-Ubuntu SMP Tue Jan 12 > 18:22:20 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux This is an old and EOL kernel. It could be a long fixed Btrfs bug that caused this problem, I'm not sure. I suggest 5.4.93+ if you need a longterm kernel, otherwise 5.10.11 is the current stable kernel. > > drive make/model: > > Drive is external 5 bay HDD enclosure with raid-5 connected via usb-3 > (made by Orico https://www.orico.cc/us/product/detail/3622.html) > with 5 WD Red 10 Tb. We use this drive for backups. > > When i try to run btrfs check i get error message: > > aleksey@host:~$ sudo btrfs check --readonly /dev/sdg > Couldn't open file system OK is it now on some other dev node? A relatively recent btrfs-progs is also recommended, 5.10 is current and I probably wouldn't use anything older than 5.6.1. > aleksey@host:~$ sudo smartctl -x /dev/sdg Yeah probably won't work since it's behind a raid5 controller. I think there's smartctl commands to enable passthrough and get information for each drive, so that you don't have to put it in JBOD mode. But I'm not familiar with how to do that. Anyway it's a good idea to find out if there's SMART reporting any problems about any drive, but not urgent. -- Chris Murphy
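If you do want to try the passthrough, smartctl's -d option is the knob; which type works depends entirely on the USB bridge / RAID chip, and a hardware RAID enclosure may not expose the member drives at all. These are guesses to try, not a recommendation:

smartctl -x -d sat /dev/sdg
smartctl -x -d sat,12 /dev/sdg
smartctl -x -d usbjmicron /dev/sdg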
Re: btrfs becomes read-only
On Wed, Jan 27, 2021 at 12:22 AM Alexey Isaev wrote: > > Hello! > > BTRFS volume becomes read-only with this messages in dmesg. > What can i do to repair btrfs partition? > > [Jan25 08:18] BTRFS error (device sdg): parent transid verify failed on > 52180048330752 wanted 132477 found 132432 > [ +0.007587] BTRFS error (device sdg): parent transid verify failed on > 52180048330752 wanted 132477 found 132432 > [ +0.000132] BTRFS error (device sdg): qgroup scan failed with -5 > > [Jan25 19:52] BTRFS error (device sdg): parent transid verify failed on > 52180048330752 wanted 132477 found 132432 > [ +0.009783] BTRFS error (device sdg): parent transid verify failed on > 52180048330752 wanted 132477 found 132432 > [ +0.000132] BTRFS: error (device sdg) in __btrfs_cow_block:1176: > errno=-5 IO failure > [ +0.60] BTRFS info (device sdg): forced readonly > [ +0.04] BTRFS info (device sdg): failed to delete reference to > ftrace.h, inode 2986197 parent 2989315 > [ +0.02] BTRFS: error (device sdg) in __btrfs_unlink_inode:4220: > errno=-5 IO failure > [ +0.006071] BTRFS error (device sdg): pending csums is 430080 What kernel version? What drive make/model? wanted 132477 found 132432 indicates the drive has lost ~45 transactions, that's not good and also weird. There's no crash or any other errors? A complete dmesg might be more revealing. And also smartctl -x /dev/sdg btrfs check --readonly /dev/sdg After that I suggest https://btrfs.wiki.kernel.org/index.php/Restore And try to get any important data out if it's not backed up. You can try btrfs-find-root to get a listing of roots, most recent to oldest. Start at the top, and plug that address in as 'btrfs restore -t' and see if it'll pull anything out. You likely need -i and -v options as well. -- Chris Murphy
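The find-root/restore loop looks roughly like this (the tree root bytenr is a placeholder; use whatever btrfs-find-root actually prints, newest generation first):

btrfs-find-root /dev/sdg
mkdir -p /mnt/recovered
btrfs restore -vi -t <bytenr-from-find-root> /dev/sdg /mnt/recovered
# if that root turns out to be damaged, repeat with the next older bytenr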
Re: Only one subvolume can be mounted after replace/balance
On Sat, Jan 23, 2021 at 7:50 AM Jakob Schöttl wrote: > > Hi, > > In short: > When mounting a second subvolume from a pool, I get this error: > "mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sda, > missing code page or helper program, or other." > dmesg | grep BTRFS only shows this error: > info (device sda): disk space caching is enabled > error (device sda): Remounting read-write after error is not allowed It went read-only before this because it's confused. You need to unmount it before it can be mounted rw. In some cases a reboot is needed. > > What happened: > > In my RAID1 pool with two disk, I successfully replaced one disk with > > btrfs replace start 2 /dev/sdx > > After that, I mounted the pool and did I don't understand this sequence. In order to do a replace, the file system is already mounted. > > btrfs fi show /mnt > > which showed WARNINGs about > "filesystems with multiple block group profiles detected" > (don't remember exactly) > > I thought it is a good idea to do > > btrfs balance start /mnt > > which finished without errors. Balance alone does not convert block groups to a new profile. You have to explicitly select a conversion filter, e.g. btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt > Now, I can only mount one (sub)volume of the pool at a time. Others can > only be mounted read-only. See error messages at top of this mail. > > Do you have any idea what happened or how to fix it? > > I already tried rescue zero-log and super-recovery which was successful > but didn't help. I advise anticipating the confusion will get worse, and take the opportunity to refresh the backups. That's the top priority, not fixing the file system. Next let us know the following: kernel version btrfs-progs version Output from commands: btrfs fi us /mnt btrfs check --readonly -- Chris Murphy
Re: Recover data from damage disk in "array"
On Mon, Jan 18, 2021 at 5:02 PM Hérikz Nawarro wrote: > > Hello everyone, > > I got an array of 4 disks with btrfs configured with data single and > metadata dup, one disk of this array was plugged with a bad sata cable > that broke the plastic part of the data port (the pins still intact), > i still can read the disk with an adapter, but there's a way to > "isolate" this disk, recover all data and later replace the fault disk > in the array with a new one? I'm not sure what you mean by isolate, or what's meant by recover all data. To recover all data on all four disks suggests replicating all of it to another file system - i.e. backup, rsync, snapshot(s) + send/receive. Are there any kernel messages reporting btrfs problems with this file system? That should be resolved as a priority before anything else. Also, DUP metadata for multiple device btrfs is suboptimal. It's a single point of failure. I suggest converting to raid1 metadata so the file system can correct for drive specific problems/bugs by getting a good copy from another drive. If it's the case DUP metadata is on the drive with the bad sata cable, that could easily result in loss or corruption of both copies of metadata and the whole file system can implode. -- Chris Murphy
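The conversion itself is one command and can run with the filesystem mounted and in use (mountpoint assumed):

btrfs balance start -mconvert=raid1 /mountpoint
btrfs filesystem usage /mountpoint         # check that Metadata now shows RAID1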
Re: nodatacow mount option is disregarded when mounting subvolume into same filesystem
On Sun, Jan 17, 2021 at 2:07 PM Damian Höster wrote: > > The nodatacow mount option seems to have no effect when mounting a > subvolume into the same filesystem. > > I did some testing: > > sudo mount -o compress=zstd /dev/sda /mnt -> compression enabled > sudo mount -o compress=zstd,nodatacow /dev/sda /mnt -> compression disabled > sudo mount -o nodatacow,compress=zstd /dev/sda /mnt -> compression enabled > All as I would expect, setting compress or nodatacow disables the other. > > Compression gets enabled without problems when mounting a subvolume into > the same filesystem: > sudo mount /dev/sda /mnt; sudo mount -o subvol=@test,compress=zstd > /dev/sda /mnt/test -> compression enabled > sudo mount /dev/sda /mnt; sudo mount -o subvol=@/testsub,compress=zstd > /dev/sda /mnt/testsub -> compression enabled > > But nodatacow apparently doesn't: > sudo mount -o compress=zstd /dev/sda /mnt; sudo mount -o > subvol=@test,nodatacow /dev/sda /mnt/test -> compression enabled > sudo mount -o compress=zstd /dev/sda /mnt; sudo mount -o > subvol=@/testsub,nodatacow /dev/sda /mnt/testsub -> compression enabled > > And I don't think it's because of the compress mount option, some > benchmarks I did indicate that nodatacow never gets set when mounting a > subvolume into the same filesystem. > Most btrfs mount options are file system wide, they're not per subvolume options. In case of conflict, the most recent option is what's used. i.e. the mount options have an order and are followed in order, with the latest one having precedence in a conflict: compress,nodatacow means nodatacow nodatacow,compress means compress nodatacow implies nodatasum and no compress. If you want per subvolume options then you need to use 'chattr +C' per subvolume or directory for nodatacow. And for compression you can use +c (small c) which implies zlib, or use 'btrfs property set /path/to/sub-dir-file compression zstd' -- Chris Murphy
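As examples of the per-path knobs (paths invented; the C attribute only affects files created after it's set, and it rules out compression and csums for those files):

chattr +C /mnt/@test/db                                  # new files here are nodatacow
btrfs property set /mnt/@test/logs compression zstd     # force zstd on this directory instead
btrfs property get /mnt/@test/logs compression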
Re: received uuid not set btrfs send/receive
On Sun, Jan 17, 2021 at 11:51 AM Anders Halman wrote: > > Hello, > > I try to backup my laptop over an unreliable slow internet connection to > a even slower Raspberry Pi. > > To bootstrap the backup I used the following: > > # local > btrfs send root.send.ro | pigz | split --verbose -d -b 1G > rsync -aHAXxv --numeric-ids --partial --progress -e "ssh -T -o > Compression=no -x" x* remote-host:/mnt/backup/btrfs-backup/ > > # remote > cat x* > split.gz > pigz -d split.gz > btrfs receive -f split > > worked nicely. But I don't understand why the "received uuid" on the > remote site in blank. > I tried it locally with smaller volumes and it worked. I suggest using -v or -vv on the receive side to dig into why the receive is failing. Setting the received uuid is one of the last things performed on receive, so if it's not set it suggests the receive isn't finished. -- Chris Murphy
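For example (paths assumed from your description):

btrfs receive -vv -f split /mnt/backup/btrfs-backup/
btrfs subvolume show /mnt/backup/btrfs-backup/root.send.ro | grep -i 'received uuid'
# a blank Received UUID here means the receive never reached its final step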
btrfs: shrink delalloc pages instead of full inodes, for 5.10.8?
Hi, It looks like this didn't make it to 5.10.7. I see the PR for 5.11-rc4. Is it likely it'll make it into 5.10.8? e076ab2a2ca70a0270232067cd49f76cd92efe64 btrfs: shrink delalloc pages instead of full inodes Thanks, -- Chris Murphy
Re: Reading files with bad data checksum
On Sun, Jan 10, 2021 at 4:54 AM David Woodhouse wrote: > > I filed https://bugzilla.redhat.com/show_bug.cgi?id=1914433 > > What I see is that *both* disks of the RAID-1 have data which is > consistent, and does not match the checksum that btrfs expects: Yeah either use nodatacow (chattr +C) or don't use O_DIRECT until there's a proper fix. > What's the best way to recover the data? I'd say, kernel 5.11's new "mount -o ro,rescue=ignoredatacsums" feature. You can copy it out normally, no special tools. The alternative is 'btrfs restore'. -- Chris Murphy
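Spelled out, either path looks like this (device and paths assumed):

# with a 5.11 or newer kernel:
mount -o ro,rescue=ignoredatacsums /dev/sdX /mnt
cp -a /mnt/important /somewhere/safe/
# or offline, without mounting at all:
btrfs restore -vi /dev/sdX /somewhere/safe/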
Re: btrfs receive eats CoW attributes
On Mon, Jan 4, 2021 at 7:42 PM Cerem Cem ASLAN wrote: > > I need my backups exactly same data, including the file attributes. > Apparently "btrfs receive" ignores the CoW attribute. Here is the > reproduction: > > btrfs sub create ./a > mkdir a/b > chattr +C a/b > echo "hello" > a/b/file > btrfs sub snap -r ./a ./a.ro > mkdir x > btrfs send a.ro | btrfs receive x > lsattr a.ro > lsattr x/a.ro > > Result is: > > # lsattr a.ro > ---C--- a.ro/b > # lsattr x/a.ro > --- x/a.ro/b > > Expected: x/a.ro/b folder should have CoW disabled (same as a.ro/b folder) > > How can I workaround this issue in order to have correct attributes in > my backups? It's the exact opposite issue with chattr +c (or btrfs property set compression), you can't shake it off :) I think we might need 'btrfs receive' to gain a new flag that filters some or all of these? And the filter would be something like --exclude=$1,$2,$3 and --exclude=all I have no strong opinion on what should be the default. But I think probably the default should be "do not preserve any" because these features aren't mkfs or mount time defaults, so I'd make preservation explicitly opt in like they were on the original file system. -- Chris Murphy
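Until something like that exists, the closest workaround I can think of is re-applying the attribute on the receive side (a sketch; a read-only snapshot can't be modified, so it has to go on a writable snapshot, and it only affects data written from then on):

btrfs send a.ro | btrfs receive x
btrfs subvolume snapshot x/a.ro x/a        # writable copy of the received snapshot
chattr +C x/a/b                            # existing file contents stay COW/checksummed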
Re: tldr; no BTRFS on dev, after a forced shutdown, help
On Mon, Jan 4, 2021 at 11:09 AM André Isidro da Silva wrote: > > I'm sure it used to be one, but indeed it seems that a TYPE is missing > in /dev/sda10; gparted says it's unknown. > It seems there is no trace of the fs. I'm trying to recall any other > operations I might have done, but if it was something else I can't > remember what could have been. I used cfdisk, to resize another > partition, also tried to do a 'btrfs device add' with this missing one > (to solve the no space left in another one), otherwise it was mount /, > mount /home (/dev/sda10), umount, repeat. Oh well. > > [sudo blkid] > > /dev/sda1: UUID="03ff3132-dfc5-4dce-8add-cf5a6c854313" BLOCK_SIZE="4096" > TYPE="ext4" PARTLABEL="LINUX" > PARTUUID="a6042b9f-a3fe-49e2-8dc5-98a818454b6d" > > /dev/sdb4: UUID="5c7201df-ff3e-4cb7-8691-8ef0c6c806ed" > UUID_SUB="bb677c3a-6270-420f-94ce-f5b89f2c40d2" BLOCK_SIZE="4096" > TYPE="btrfs" PARTUUID="be4190e4-8e09-4dfc-a901-463f3e162727" > > /dev/sda10: PARTLABEL="HOME" > PARTUUID="6045f3f0-47a7-4b38-a392-7bebb7f654bd" > > [sudo btrfs insp dump-s -F /dev/sda10] > > superblock: bytenr=65536, device=/dev/sda10 > - > csum_type 0 (crc32c) > csum_size 4 > csum0x [DON'T MATCH] > bytenr 0 > flags 0x0 > magic [DON'T MATCH] > fsid---- > metadata_uuid ---- > label > generation 0 > root0 > sys_array_size 0 > chunk_root_generation 0 > root_level 0 > chunk_root 0 > chunk_root_level0 > log_root0 > log_root_transid0 > log_root_level 0 > total_bytes 0 > bytes_used 0 > sectorsize 0 > nodesize0 > leafsize (deprecated) 0 > stripesize 0 > root_dir0 > num_devices 0 > compat_flags0x0 > compat_ro_flags 0x0 > incompat_flags 0x0 > cache_generation0 > uuid_tree_generation0 > dev_item.uuid ---- > dev_item.fsid ---- [match] > dev_item.type 0 > dev_item.total_bytes0 > dev_item.bytes_used 0 > dev_item.io_align 0 > dev_item.io_width 0 > dev_item.sector_size0 > dev_item.devid 0 > dev_item.dev_group 0 > dev_item.seek_speed 0 > dev_item.bandwidth 0 > dev_item.generation 0 > > This as nothing to do with btrfs anymore, but: do you think a tool like > foremost can recover the files, it'll be a mess, but better then nothing > and I've used it before in a ntfs. No idea. You could scan the entire drive for the Btrfs magic, which is inside the superblock. It will self identify its offset, the first superblock is the one you want, which is offset 65536 (64KiB) from the start of the block device/partition. And that superblock also says the device size. -- Chris Murphy
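A brute force way to do that scan (slow but read-only; the magic string and the 65536+64 byte position of the magic within the first superblock come from the on-disk format, but treat the arithmetic as something to double check before acting on it):

LC_ALL=C grep -obaF '_BHRfS_M' /dev/sda | head
# each reported byte offset minus 65600 is a candidate start of a former
# btrfs partition; compare against where /dev/sda10 begins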
Re: tldr; no BTRFS on dev, after a forced shutdown, help
On Sun, Jan 3, 2021 at 9:30 PM André Isidro da Silva wrote: > > I might be in some panic, I'm sorry for the info I'm not experienced > enough to give. > > I was in a live iso trying really hard to repair my root btrfs from > which I had used all the space avaiable.. I was trying to move a /usr > partition into the btrfs system, but I didn't check the space available > with the tool, instead used normal tools, because I didn't understand or > actually thought about how the subvolumes would change... sorry this > isn't even the issue anymore; to move /usr I had a temporary /usr copy > in another btrfs system (my /home data partition) and so mounted both > partitions. However this was done in a linux "boot fail console" from > which I didn't know how to proper shutdown.. so I eventually forced the > shutdown withou umounting stuff (...), I think that forced shutdown > might have broken the second partition that now isn't recognized with > btrfs check or mountable. It might also have happen when using the live > iso, but the forced shutdown seemed more likely, since I did almost no > operations but mount/cp. This partition was my data partition, I thought > it was safe to use for this process, since I was just copying files from > it. I do have a backup, but it's old so I'll still lose a lot.. help. First, make no changes, attempt no repairs. Next save history of what you did. A forced shutdown does not make Btrfs unreadable, although if writes are happening at the time of the shutdown and the drive firmware doesn't properly honor write order, then it might be 'btrfs restore' territory. What do you get for: btrfs filesystem show kernel messages (dmesg) that appear when you try to mount the volume but it fails. -- Chris Murphy
Re: [BUG] 500-2000% performance regression w/ 5.10
The problem is worse on SSD than on HDD. It actually makes the SSD *slower* than an HDD, on 5.10. For this workload:

HDD 5.9.16-200.fc33.x86_64 mq-deadline kyber [bfq] none
$ time tar -xf /tmp/firefox-85.0b4.source.tar.xz && time sync
real    1m27.299s
user    0m27.294s
sys     0m14.134s

real    0m8.890s
user    0m0.001s
sys     0m0.344s

HDD 5.10.4-200.fc33.x86_64 mq-deadline kyber [bfq] none
$ time tar -xf /tmp/firefox-85.0b4.source.tar.xz && time sync
real    2m14.936s
user    0m54.396s
sys     0m47.082s

real    0m7.726s
user    0m0.001s
sys     0m0.382s

SSD, compress=zstd:1 5.9.16-200.fc33.x86_64 [mq-deadline] kyber bfq none
$ time tar -xf /tmp/firefox-85.0b4.source.tar.xz && time sync
real    0m41.947s
user    0m29.359s
sys     0m18.088s

real    0m2.042s
user    0m0.000s
sys     0m0.065s

SSD, compress=zstd:1 5.10.4-200.fc33.x86_64 [mq-deadline] kyber bfq none
$ time tar -xf /tmp/firefox-85.0b4.source.tar.xz && time sync
real    2m59.581s
user    1m4.097s
sys     0m56.323s

real    0m1.492s
user    0m0.000s
sys     0m0.077s
Re: cp --reflink of inline extent results in two DATA_EXTENT entries
On Tue, Dec 22, 2020 at 11:05 PM Andrei Borzenkov wrote: > > 23.12.2020 06:48, Chris Murphy пишет: > > Hi, > > > > kernel is 5.10.2 > > > > cp --reflink hi hi2 > > > > This results in two EXTENT_DATA items with different offsets, > > therefore I think the data is duplicated in the leaf? Correct? Is it > > expected? > > > > I'd say yes. Inline data is contained in EXTEND_DATA item and > EXTENT_DATA item cannot be shared by two different inodes (it is keyed > by inode number). > > Even when cloning regular extent you will have two independent > EXTENT_DATA items pointing to the same physical extent. Thanks. I saw this commit a long time ago and sorta just figured it meant maybe inline extents would be cloned within a given leaf. 05a5a7621ce6 Btrfs: implement full reflink support for inline extents But I only just now read the commit message, and it reads like cloning now will be handled without error. It's not saying that it results in shared inline data extents. -- Chris Murphy
cp --reflink of inline extent results in two DATA_EXTENT entries
Hi, kernel is 5.10.2 cp --reflink hi hi2 This results in two EXTENT_DATA items with different offsets, therefore I think the data is duplicated in the leaf? Correct? Is it expected? item 9 key (257 EXTENT_DATA 0) itemoff 15673 itemsize 53 generation 435179 type 0 (inline) inline extent data size 32 ram_bytes 174 compression 3 (zstd) ... item 13 key (258 EXTENT_DATA 0) itemoff 15364 itemsize 53 generation 435179 type 0 (inline) inline extent data size 32 ram_bytes 174 compression 3 (zstd) The entire file tree containing only these two files follows: file tree key (394 ROOT_ITEM 0) leaf 26442252288 items 14 free space 15014 generation 435212 owner 394 leaf 26442252288 flags 0x1(WRITTEN) backref revision 1 item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 435123 transid 435212 size 10 nbytes 0 block group 0 mode 40755 links 1 uid 1000 gid 1000 rdev 0 sequence 5267 flags 0x0(none) atime 1608689569.708325037 (2020-12-22 19:12:49) ctime 1608694856.721370147 (2020-12-22 20:40:56) mtime 1608694856.721370147 (2020-12-22 20:40:56) otime 1608689569.708325037 (2020-12-22 19:12:49) item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 index 0 namelen 2 name: .. item 2 key (256 DIR_ITEM 432062026) itemoff 16079 itemsize 32 location key (257 INODE_ITEM 0) type FILE transid 435124 data_len 0 name_len 2 name: hi item 3 key (256 DIR_ITEM 4216900732) itemoff 16046 itemsize 33 location key (258 INODE_ITEM 0) type FILE transid 435196 data_len 0 name_len 3 name: hi2 item 4 key (256 DIR_INDEX 2) itemoff 16014 itemsize 32 location key (257 INODE_ITEM 0) type FILE transid 435124 data_len 0 name_len 2 name: hi item 5 key (256 DIR_INDEX 4) itemoff 15981 itemsize 33 location key (258 INODE_ITEM 0) type FILE transid 435196 data_len 0 name_len 3 name: hi2 item 6 key (257 INODE_ITEM 0) itemoff 15821 itemsize 160 generation 435124 transid 435212 size 174 nbytes 174 block group 0 mode 100644 links 1 uid 1000 gid 1000 rdev 0 sequence 19 flags 0x0(none) atime 1608689574.39809 (2020-12-22 19:12:54) ctime 1608694856.721370147 (2020-12-22 20:40:56) mtime 1608692923.231038818 (2020-12-22 20:08:43) otime 1608689574.39809 (2020-12-22 19:12:54) item 7 key (257 INODE_REF 256) itemoff 15809 itemsize 12 index 2 namelen 2 name: hi item 8 key (257 XATTR_ITEM 3817753667) itemoff 15726 itemsize 83 location key (0 UNKNOWN.0 0) type XATTR transid 435124 data_len 37 name_len 16 name: security.selinux data unconfined_u:object_r:unlabeled_t:s0 item 9 key (257 EXTENT_DATA 0) itemoff 15673 itemsize 53 generation 435179 type 0 (inline) inline extent data size 32 ram_bytes 174 compression 3 (zstd) item 10 key (258 INODE_ITEM 0) itemoff 15513 itemsize 160 generation 435196 transid 435196 size 174 nbytes 174 block group 0 mode 100644 links 1 uid 1000 gid 1000 rdev 0 sequence 34 flags 0x0(none) atime 1608693921.97510335 (2020-12-22 20:25:21) ctime 1608693921.97510335 (2020-12-22 20:25:21) mtime 1608693921.97510335 (2020-12-22 20:25:21) otime 1608693921.97510335 (2020-12-22 20:25:21) item 11 key (258 INODE_REF 256) itemoff 15500 itemsize 13 index 4 namelen 3 name: hi2 item 12 key (258 XATTR_ITEM 3817753667) itemoff 15417 itemsize 83 location key (0 UNKNOWN.0 0) type XATTR transid 435196 data_len 37 name_len 16 name: security.selinux data unconfined_u:object_r:unlabeled_t:s0 item 13 key (258 EXTENT_DATA 0) itemoff 15364 itemsize 53 generation 435179 type 0 (inline) inline extent data size 32 ram_bytes 174 compression 3 (zstd) total bytes 31005392896 bytes used 20153282560 -- Chris Murphy
memory bit flip not detected by write time tree check
Hi, mount failure, WARNING at fs/btrfs/extent-tree.c:3060 __btrfs_free_extent.isra.0+0x5fd/0x8d0 https://bugzilla.redhat.com/show_bug.cgi?id=1905618#c9 In this bug, the user reports what looks like undetected memory bit flip corruption, that makes it to disk, and then is caught at mount time, resulting in mount failure. I'm double checking with the user, but I'm pretty sure it had only seen writes with relatively recent (5.8+) kernels. -- Chris Murphy
what determines what /dev/ is mounted?
When I have a 2-device btrfs: devid 1 = /dev/vdb1 devid 2 = /dev/vdc1 Regardless of the mount command, df and /proc/mounts shows /dev/vdb1 is mounted. If I flip the backing assignments in qemu, such that: devid 2 = /dev/vdb1 devid 1 = /dev/vdc1 Now, /dev/vdc1 is shown as mounted by df and /proc/mounts. But this isn't scientific. Is there a predictable logic? Is it always the lowest devid? -- Chris Murphy
Re: feature request, explicit mount and unmount kernel messages
On Tue, Oct 22, 2019 at 1:33 PM Roman Mamedov wrote: > > On Tue, 22 Oct 2019 11:00:07 +0200 > Chris Murphy wrote: > > > Hi, > > > > So XFS has these > > > > [49621.415203] XFS (loop0): Mounting V5 Filesystem > > [49621.58] XFS (loop0): Ending clean mount > > ... > > [49621.58] XFS (loop0): Ending clean mount > > [49641.459463] XFS (loop0): Unmounting Filesystem > > > > It seems to me linguistically those last two should be reversed, but > > whatever. > > Just a minor note, there is no "last two", but only one "Unmounting" message > on unmount: you copied the "Ending" mount-time message twice for the 2nd quote > (as shown by the timestamp). That's funny, I duplicated that line by mistake. User error! -- Chris Murphy
Re: feature request, explicit mount and unmount kernel messages
On Tue, Oct 22, 2019 at 11:56 AM Anand Jain wrote: > > > I agree, I sent patches for it in 2017. > > VFS version. > https://patchwork.kernel.org/patch/9745295/ > > btrfs version: > https://patchwork.kernel.org/patch/9745295/ > > There wasn't response on btrfs-v2-patch. > > This is not the first time that I am writing patches ahead of > users asking for it, but unfortunately there is no response or > there are disagreements on those patches. I guess it could be a low priority for developers. But that's a big part of why doing this in VFS might be useful, generically, for all file systems? I have no idea what that boundary looks like between native file system and VFS. But if the mount related messages that developers don't find useful were removed from ext4, XFS, Btrfs, f2fs, FAT, and a proper plain language "(u)mount completed" message were added in VFS, that would be, I think, useful for not just regular users, but users like systemd/init users, and others who have to sort out mount hangs and failures. Just exactly where did this hang up? I can't tell, and it's different behavior for every file system. I'm not opposed to each file system having its own (u)mount completed message, indicating a boundary where the native code ends and VFS code begins. But again that's up to developers. I just want to know if the hang means we're stuck somewhere in *kernel* mount code. From the prior example, I can't tell that at all, there just isn't enough information. -- Chris Murphy
Re: feature request, explicit mount and unmount kernel messages
(resending to list, I don't know why but I messed up the reply directly to Nikolay) On Tue, Oct 22, 2019 at 11:16 AM Nikolay Borisov wrote: > > On 22.10.19 г. 12:00 ч., Chris Murphy wrote: > > Hi, > > > > So XFS has these > > > > [49621.415203] XFS (loop0): Mounting V5 Filesystem > > [49621.58] XFS (loop0): Ending clean mount > > ... > > [49621.58] XFS (loop0): Ending clean mount > > [49641.459463] XFS (loop0): Unmounting Filesystem > > > > It seems to me linguistically those last two should be reversed, but whatever. > > > > The Btrfs mount equivalent messages are: > > [49896.176646] BTRFS: device fsid f7972e8c-b58a-4b95-9f03-1a08bbcb62a7 > > devid 1 transid 5 /dev/loop0 > > [49901.739591] BTRFS info (device loop0): disk space caching is enabled > > [49901.739595] BTRFS info (device loop0): has skinny extents > > [49901.767447] BTRFS info (device loop0): enabling ssd optimizations > > [49901.767851] BTRFS info (device loop0): checking UUID tree > > > > So is it true that for sure there is nothing happening after the UUID > > tree is checked, that the file system is definitely mounted at this > > point? And always it's the UUID tree being checked that's the last > > thing that happens? Or is it actually already mounted just prior to > > disk space caching enabled message, and the subsequent messages are > > not at all related to the mount process? See? I can't tell. > > > > For umount, zero messages at all. > > You are doing it wrong. I'm doing what wrong? > Those messages are sent from the given subsys to > the console and printed whenever. You can never rely on the fact that > those messages won't race with some code. That possibility is implicit in all of the questions I asked. > For example the checking UUID tree happens _before_ > btrfs_check_uuid_tree is called and there is no guarantee when it's > finished. Are these messages useful for developers? I don't see them as being useful for users. They're kinda superfluous for them. > > The feature request is something like what XFS does, so that we know > > exactly when the file system is mounted and unmounted as far as Btrfs > > code is concerned. > > > > I don't know that it needs the start and end of the mount and > > unmounted (i.e. two messages). I'm mainly interested in having a > > notification for "mount completed successfully" and "unmount completed > > successfully". i.e. the end of each process, not the start of each. > > mount is a blocking syscall, same goes for umount your notifications are > when the respective syscalls / system utilities return. Right. Here is the example bug from 2015, that I just became aware of as the impetus for posting the request; but I've wanted this explicit notification for a while. https://bugzilla.redhat.com/show_bug.cgi?id=1206874#c7 In that example, there's one Btrfs info message at [2.727784] localhost.localdomain kernel: BTRFS info (device sda3): disk space caching is enabled And yet systemd times out on the mount unit. If it's true that only mount blocking systemd could be the cause, then this is a Btrfs, VFS, or mount related bug (however old it is by now and doesn't really matter other than conceptually). But there isn't enough granularity in the kernel messages to understand why the mount is taking so long. If there were a Btrfs mount succeeded message, we'd know whether the Btrfs portion of the mount process successfully completed or not, and perhaps have a better idea where the hang is happening.
feature request, explicit mount and unmount kernel messages
Hi, So XFS has these [49621.415203] XFS (loop0): Mounting V5 Filesystem [49621.58] XFS (loop0): Ending clean mount ... [49621.58] XFS (loop0): Ending clean mount [49641.459463] XFS (loop0): Unmounting Filesystem It seems to me linguistically those last two should be reversed, but whatever. The Btrfs mount equivalent messages are: [49896.176646] BTRFS: device fsid f7972e8c-b58a-4b95-9f03-1a08bbcb62a7 devid 1 transid 5 /dev/loop0 [49901.739591] BTRFS info (device loop0): disk space caching is enabled [49901.739595] BTRFS info (device loop0): has skinny extents [49901.767447] BTRFS info (device loop0): enabling ssd optimizations [49901.767851] BTRFS info (device loop0): checking UUID tree So is it true that for sure there is nothing happening after the UUID tree is checked, that the file system is definitely mounted at this point? And always it's the UUID tree being checked that's the last thing that happens? Or is it actually already mounted just prior to disk space caching enabled message, and the subsequent messages are not at all related to the mount process? See? I can't tell. For umount, zero messages at all. The feature request is something like what XFS does, so that we know exactly when the file system is mounted and unmounted as far as Btrfs code is concerned. I don't know that it needs the start and end of the mount and unmounted (i.e. two messages). I'm mainly interested in having a notification for "mount completed successfully" and "unmount completed successfully". i.e. the end of each process, not the start of each. In particular the unmount notice is somewhat important because as far as I know there's no Btrfs dirty flag from which to infer whether it was really unmounted cleanly. But I'm also not sure what the insertion point for these messages would be. Looking at the mount code in particular, it's a little complicated. And maybe with some of the sanity checking and debug options it could get more complicated, and wouldn't want to conflict with that - or any multiple device use case either. -- Chris Murphy
Re: MD RAID 5/6 vs BTRFS RAID 5/6
On Sat, Oct 19, 2019 at 12:18 AM Supercilious Dude wrote: > > It would be useful to have the ability to scrub only the metadata. In many > cases the data is so large that a full scrub is not feasible. In my "little" > test system of 34TB a full scrub takes many hours and the IOPS saturate the > disks to the extent that the volume is unusable due to the high latencies. > Ideally there should be a way to rate limit the scrub operation so that it > can happen in the background without impacting the normal workload. In effect a 'btrfs check' is a read only scrub of metadata, since all of the metadata has to be read for that. Of course it's more expensive than just confirming checksums are OK, because it's also doing a bunch of sanity and logical tests that take much longer. -- Chris Murphy
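The two partial answers available today, as commands (mountpoint and device assumed):

btrfs scrub start -c 3 /mountpoint         # I/O priority class 3 (idle) to limit the impact on normal work
btrfs check --readonly /dev/sdX            # unmounted: reads all metadata, touches no data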
Re: MD RAID 5/6 vs BTRFS RAID 5/6
On Thu, Oct 17, 2019 at 8:23 PM Graham Cobb wrote: > > On 17/10/2019 16:57, Chris Murphy wrote: > > On Wed, Oct 16, 2019 at 10:07 PM Jon Ander MB > > wrote: > >> > >> It would be interesting to know the pros and cons of this setup that > >> you are suggesting vs zfs. > >> +zfs detects and corrects bitrot ( > >> http://www.zfsnas.com/2015/05/24/testing-bit-rot/ ) > >> +zfs has working raid56 > >> -modules out of kernel for license incompatibilities (a big minus) > >> > >> BTRFS can detect bitrot but... are we sure it can fix it? (can't seem > >> to find any conclusive doc about it right now) > > > > Yes. Active fixups with scrub since 3.19. Passive fixups since 4.12. > > Presumably this is dependent on checksums? So neither detection nor > fixup happen for NOCOW files? Even a scrub won't notice because scrub > doesn't attempt to compare both copies unless the first copy has a bad > checksum -- is that correct? On a normal (passive) read it can't be detected for nocow files, since nocow implies nodatasum. If the problem happens in metadata, it's detected because metadata is always cow and always has a csum. I'm not sure what the scrub behavior is for nocow. There's enough information to detect a mismatch in normal (not degraded) operation, but I don't know if Btrfs scrub warns in this case. > If I understand correctly, metadata always has checksums so that is true > for filesystem structure. But for no-checksum files (such as nocow > files) the corruption will be silent, won't it? Corruption is always silent for nocow data. As with any other filesystem, it's up to the application layer to detect it. -- Chris Murphy
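For reference, a quick sketch of how a nocow (and therefore nodatasum) file is created and identified; the path is hypothetical, and +C only takes effect on an empty file or on a directory so that new files inherit it:

  # mark a new, empty file nocow, then verify the attribute
  touch /mnt/data/vm-disk.img
  chattr +C /mnt/data/vm-disk.img
  lsattr /mnt/data/vm-disk.img    # a 'C' in the flags means nocow, hence no data checksums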
Re: MD RAID 5/6 vs BTRFS RAID 5/6
On Wed, Oct 16, 2019 at 10:07 PM Jon Ander MB wrote: > > It would be interesting to know the pros and cons of this setup that > you are suggesting vs zfs. > +zfs detects and corrects bitrot ( > http://www.zfsnas.com/2015/05/24/testing-bit-rot/ ) > +zfs has working raid56 > -modules out of kernel for license incompatibilities (a big minus) > > BTRFS can detect bitrot but... are we sure it can fix it? (can't seem > to find any conclusive doc about it right now) Yes. Active fixups with scrub since 3.19. Passive fixups since 4.12. > I'm one of those that is waiting for the write hole bug to be fixed in > order to use raid5 on my home setup. It's a shame it's taking so long. For what it's worth, the write hole is considered to be rare. https://lwn.net/Articles/665299/ Further, the write hole means a) parity is corrupt or stale compared to the data stripe elements, which is caused by a crash or powerloss during writes, and b) subsequently there is a missing device or bad sector in the same stripe as the corrupt/stale parity stripe element. The effect of b) is that reconstruction from parity is necessary, and the effect of a) is that it's reconstructed incorrectly, thus corruption. But Btrfs detects this corruption, whether it's metadata or data. The corruption isn't propagated in any case. But it makes the filesystem fragile if this happens with metadata. Any parity stripe element staleness likely results in significantly bad reconstruction in this case, and it just can't be worked around; even btrfs check probably can't fix it. If the write hole problem happens with a data block group, the result is EIO. But the good news is that this isn't going to result in silent data or file system metadata corruption. For sure you'll know about it. This is why a scrub after a crash or powerloss with raid56 is important, while the array is still whole (not degraded). The two problems with that are: a) the scrub isn't initiated automatically, nor is it obvious to the user that it's necessary; and b) the scrub can take a long time, since Btrfs has no partial scrubbing. Whereas mdadm arrays offer a write intent bitmap to know what blocks to partially scrub, and to trigger it automatically following a crash or powerloss. It seems Btrfs already has enough on-disk metadata to infer a functional equivalent to the write intent bitmap, via transid. Just scrub the last ~50 generations the next time it's mounted. Either do this every time a Btrfs raid56 is mounted, or create some flag that allows Btrfs to know if the filesystem was not cleanly shut down. It's possible 50 generations could be a lot of data, but since it's an online scrub triggered after mount, it wouldn't add much to mount times. I'm also picking 50 generations arbitrarily, there's no basis for that number. The above doesn't cover the case where there is a partial stripe write (which leads to the write hole problem), plus a crash or powerloss, and at the same time one or more device failures. In that case there's no time for a partial scrub to fix the problem leading to the write hole. So even if the corruption is detected, it's too late to fix it. But at least an automatic partial scrub, even degraded, will mean the user is alerted to the uncorrectable problem before they get too far along. -- Chris Murphy
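That post-crash scrub is a manual step today; a minimal sketch, assuming the raid56 filesystem is mounted at /mnt/r5 (the mountpoint is an assumption):

  # scrub the whole, still non-degraded array after an unclean shutdown,
  # in the foreground (-B) with per-device statistics (-d)
  btrfs scrub start -Bd /mnt/r5

  # then check the per-device error counters
  btrfs device stats /mnt/r5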
Re: 5.3.0 deadlock: btrfs_sync_file / btrfs_async_reclaim_metadata_space / btrfs_page_mkwrite
On Mon, Oct 14, 2019 at 7:05 PM James Harvey wrote: > > On Sun, Oct 13, 2019 at 9:46 PM Chris Murphy wrote: > > > > On Sat, Oct 12, 2019 at 5:29 PM James Harvey > > wrote: > > > > > > Was using a temporary BTRFS volume to compile mongodb, which is quite > > > intensive and takes quite a bit of time. The volume has been > > > deadlocked for about 12 hours. > > > > > > Being a temporary volume, I just used mount without options, so it > > > used the defaults: rw,relatime,ssd,space_cache,subvolid=5,subvol=/ > > > > > > Apologies if upgrading to 5.3.5+ will fix this. I didn't see > > > discussions of a deadlock looking like this. > > > > I think it's a bug in any case, in particular because its all default > > mount options, but it'd be interesting if any of the following make a > > difference: > > > > - space_cache=v2 > > - noatime > > Interesting. > > This isn't 100% reproducible. Before my original post, after my > initial deadlock, I tried again and immediately hit another deadlock. > But, yesterday, in response to your email, I tried again still without > "space_cache=v2,noatime" to re-confirm the deadlock. I had to > re-compile mongodb about 6 times to hit another deadlock. I was > almost at the point of thinking I wouldn't see it again. > > After re-confirming it, I re-created the BTRFS volume to use > "space_cache=v2,noatime" mount options. It deadlocked during the > first mongodb compilation. w > sysrq_trigger is a little bit > different. No trace including "btrfs_sync_log" or > "btrfs_async_reclaim_metadata_space". Only traces including the > "btrfs_btrfs_async_reclaim_metadata_space". Viewable here: > http://ix.io/1YGe I think it's some kind of disk or lock contention, but I don't really know much about it. The v1 space_cache is basically data extents, so they use data chunks and I guess can conflict with heavy data writes. Whereas v2 space_cache is a dedicated metadata btree. So yeah - and I'm not sure if mongo builds use atime at all so the noatime could be a goose chase, but figured it might help reduce unnecessary metadata updates. > Also, as I'm testing some issues with the mongodb compilation process > (upstream always forces debug symbols...), as a workaround to be able > to test its issues, I've used a temporary ext4 volume for it, which I > haven't had a single issue with. Adds to the notion this is some kind of bug. -- Chris Murphy
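For reference, a sketch of applying those two options to a scratch volume like this one (the device and mountpoint are assumptions; the first writable mount with space_cache=v2 converts the filesystem to the free space tree, kernel 4.5 or newer):

  umount /mnt/build
  mount -o noatime,space_cache=v2 /dev/sdX /mnt/build
  findmnt -no OPTIONS /mnt/build    # confirm the options in effect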
Re: Massive filesystem corruption since kernel 5.2 (ARCH)
On Sun, Oct 13, 2019 at 8:07 PM Adam Bahe wrote: > > > Until the fix gets merged to 5.2 kernels (and 5.3), I don't really > > recommend running 5.2 or 5.3. > > I know fixes went in to distro specific kernels. But wanted to verify > if the fix went into the vanilla kernel.org kernel? If so, what > version should be safe? ex: > https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.3.6 > > With 180 raw TB in raid1 I just want to be explicit. Thanks! It's fixed in upstream stable since 5.2.15, and the fix is in all of the 5.3.x series. -- Chris Murphy
Re: 5.3.0 deadlock: btrfs_sync_file / btrfs_async_reclaim_metadata_space / btrfs_page_mkwrite
On Sat, Oct 12, 2019 at 5:29 PM James Harvey wrote: > > Was using a temporary BTRFS volume to compile mongodb, which is quite > intensive and takes quite a bit of time. The volume has been > deadlocked for about 12 hours. > > Being a temporary volume, I just used mount without options, so it > used the defaults: rw,relatime,ssd,space_cache,subvolid=5,subvol=/ > > Apologies if upgrading to 5.3.5+ will fix this. I didn't see > discussions of a deadlock looking like this. I think it's a bug in any case, in particular because it's all default mount options, but it'd be interesting if any of the following make a difference: - space_cache=v2 - noatime -- Chris Murphy
Re: BTRFS Raid5 error during Scrub.
On Thu, Oct 3, 2019 at 6:18 AM Robert Krig wrote: > > By the way, how serious is the error I've encountered? > I've run a second scrub in the meantime, it aborted when it came close > to the end, just like the first time. > If the files that are corrupt have been deleted is this error going to > go away? Maybe. > > > > Opening filesystem to check... > > > > Checking filesystem on /dev/sda > > > > UUID: f7573191-664f-4540-a830-71ad654d9301 > > > > [1/7] checking root items (0:01:17 elapsed, > > > > 5138533 items checked) > > > > parent transid verify failed on 48781340082176 wanted 109181 > > > > found > > > > 109008items checked) > > > > parent transid verify failed on 48781340082176 wanted 109181 > > > > found > > > > 109008 > > > > parent transid verify failed on 48781340082176 wanted 109181 > > > > found > > > > 109008 These look suspiciously like the 5.2 regression: https://lore.kernel.org/linux-btrfs/20190911145542.1125-1-fdman...@kernel.org/T/#u You should either revert to a 5.1 kernel, or use 5.2.15+. As far as I'm aware it's not possible to fix this kind of corruption, so I suggest refreshing your backups while you can still mount this file system, and preparing to recreate it from scratch. > > > > Ignoring transid failure > > > > leaf parent key incorrect 48781340082176 > > > > bad block 48781340082176 > > > > [2/7] checking extents (0:03:22 elapsed, > > > > 1143429 items checked) > > > > ERROR: errors found in extent allocation tree or chunk allocation That's usually not a good sign. > > > > [3/7] checking free space cache(0:05:10 elapsed, > > > > 7236 > > > > items checked) > > > > parent transid verify failed on 48781340082176 wanted 109181 > > > > found > > > > 109008ems checked) > > > > Ignoring transid failure > > > > root 15197 inode 81781 errors 1000, some csum missing48 elapsed, That's inode 81781 in the subvolume with ID 15197. I'm not sure what error 1000 is, but btrfs check is a bit fussy when it encounters files that are marked +C (nocow) but have been compressed. This used to be possible with older kernels when nocow files were defragmented while the file system was mounted with compression enabled. If that sounds like your use case, that might be what's going on here, and it's actually a benign message. It's normal for nocow files to be missing csums. To confirm, you can use 'find /pathtosubvol/ -inum 81781' to find the file, then lsattr it and see whether +C is set. You have a few options, but the first thing is to refresh backups and prepare to lose this file system: a. bail now, and just create a new Btrfs from scratch and restore from backup b. try 'btrfs check --repair' to see if the transid problems are fixed; if not, c. try 'btrfs check --repair --init-extent-tree'; there's a good chance this fails and makes things worse, but it's probably faster to try than restoring from backup -- Chris Murphy
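A concrete version of the inode lookup suggested above, assuming the subvolume with ID 15197 is mounted at /mnt/subvol (the mountpoint is an assumption):

  # find the path for inode 81781 inside that subvolume and show its attributes
  find /mnt/subvol -inum 81781 -exec lsattr {} +
  # a 'C' flag means the file is nocow, so missing csums are expected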
Re: BTRFS Raid5 error during Scrub.
On Mon, Sep 30, 2019 at 3:37 AM Robert Krig wrote: > > I've upgraded to btrfs-progs v5.2.1 > Here is the output from btrfs check -p --readonly /dev/sda > > > Opening filesystem to check... > Checking filesystem on /dev/sda > UUID: f7573191-664f-4540-a830-71ad654d9301 > [1/7] checking root items (0:01:17 elapsed, > 5138533 items checked) > parent transid verify failed on 48781340082176 wanted 109181 found > 109008items checked) > parent transid verify failed on 48781340082176 wanted 109181 found > 109008 > parent transid verify failed on 48781340082176 wanted 109181 found > 109008 > Ignoring transid failure > leaf parent key incorrect 48781340082176 > bad block 48781340082176 > [2/7] checking extents (0:03:22 elapsed, > 1143429 items checked) > ERROR: errors found in extent allocation tree or chunk allocation > [3/7] checking free space cache(0:05:10 elapsed, 7236 > items checked) > parent transid verify failed on 48781340082176 wanted 109181 found > 109008ems checked) > Ignoring transid failure > root 15197 inode 81781 errors 1000, some csum missing48 elapsed, 33952 > items checked) > [4/7] checking fs roots(0:42:53 elapsed, 34145 > items checked) > ERROR: errors found in fs roots > found 22975533985792 bytes used, error(s) found > total csum bytes: 16806711120 > total tree bytes: 18733842432 > total fs tree bytes: 130121728 > total extent tree bytes: 466305024 > btree space waste bytes: 1100711497 > file data blocks allocated: 3891333279744 > referenced 1669470507008 What do you get for # btrfs insp dump-t -b 48781340082176 /dev/ It's possible there will be filenames, it's OK to sanitize them by just deleting the names from the output before posting it. -- Chris Murphy
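Spelled out in full (btrfs-progs accepts abbreviated subcommands, so 'insp dump-t' expands to this; /dev/sda is taken from the check output quoted above):

  # dump the single metadata block that fails the transid check
  btrfs inspect-internal dump-tree -b 48781340082176 /dev/sda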
Re: BTRFS checksum mismatch - false positives
From the log offlist: 2019-09-08T17:27:02+02:00 MHPNAS kernel: [ 22.396165] md: invalid raid superblock magic on sda5 2019-09-08T17:27:02+02:00 MHPNAS kernel: [ 22.401816] md: sda5 does not have a valid v0.90 superblock, not importing! That doesn't sound good. It's not a Btrfs problem but an md/mdadm problem. You'll have to get support for this from Synology; only they understand the design of the storage stack layout, whether these error messages are important or not, and how to fix them. Anyone else speculating could end up causing damage to the NAS and data loss. 2019-09-08T17:27:02+02:00 MHPNAS kernel: [ 22.913298] md: sda2 has different UUID to sda1 There are several messages like this. I can't tell if they're just informational and benign or a problem. Also not related to Btrfs. 2019-09-08T22:09:33+02:00 MHPNAS kernel: [16997.419199] BTRFS warning (device dm-1): BTRFS: dm-1 checksum verify failed on 375259512832 wanted EA1A10E3 found 3080B64F level 0 2019-09-08T22:09:33+02:00 MHPNAS kernel: [16997.419199] 2019-09-08T22:09:33+02:00 MHPNAS kernel: [16997.458453] BTRFS warning (device dm-1): BTRFS: dm-1 checksum verify failed on 375259512832 wanted EA1A10E3 found 3080B64F level 0 2019-09-08T22:09:33+02:00 MHPNAS kernel: [16997.458453] 2019-09-08T22:09:33+02:00 MHPNAS kernel: [16997.528385] BTRFS: read error corrected: ino 1 off 375259512832 (dev /dev/vg1/volume_1 sector 751819488) 2019-09-08T22:09:33+02:00 MHPNAS kernel: [16997.539631] BTRFS: read error corrected: ino 1 off 375259516928 (dev /dev/vg1/volume_1 sector 751819496) 2019-09-08T22:09:33+02:00 MHPNAS kernel: [16997.550785] BTRFS: read error corrected: ino 1 off 375259521024 (dev /dev/vg1/volume_1 sector 751819504) 2019-09-08T22:09:33+02:00 MHPNAS kernel: [16997.561990] BTRFS: read error corrected: ino 1 off 375259525120 (dev /dev/vg1/volume_1 sector 751819512) There are a bunch of messages like this. Btrfs is finding metadata checksum errors, some kind of corruption has happened with one of the copies, and it's been fixed up. But why are things getting corrupted in the first place? Ordinary bad sectors maybe? There are a lot of these - like really a lot. Hundreds of affected sectors. There are too many for me to read through and see if all of them were corrected by DUP metadata. 2019-09-22T21:24:27+02:00 MHPNAS kernel: [1224856.764098] md2: syno_self_heal_is_valid_md_stat(496): md's current state is not suitable for data correction What does that mean? Also not a Btrfs problem. There are quite a few of these. 2019-09-23T11:49:20+02:00 MHPNAS kernel: [1276791.652946] BTRFS error (device dm-1): BTRFS: dm-1 failed to repair btree csum error on 1353162506240, mirror = 1 OK, and a few of these also. This means that some metadata could not be repaired, likely because both copies are corrupt. My recommendation is to freshen your backups now while you still can, and prepare to rebuild the NAS; i.e. these are not likely repairable problems. Once both copies of Btrfs metadata are bad, it's usually not fixable; you just have to recreate the file system from scratch. You'll have to move everything off the NAS (anything that's really important you will want at least two independent copies of, of course), and then you're going to obliterate the array and start from scratch. While you're at it, you might as well make sure you've got the latest supported version of the software for this product. Start with that. Then follow the Synology procedure to wipe the NAS totally and set it up again.
You'll want to make sure the procedure you use writes out all new metadata for everything: mdadm, LVM, Btrfs. Nothing stale or old should be reused. And then you'll copy your data back over to the NAS. There's nothing in the provided log that helps me understand why this is happening. I suspect hardware problems of some sort - maybe one of the drives is starting to slowly die by spitting out bad sectors. To know more about that we'd need to see 'smartctl -x /dev/' for each drive in the NAS and see if SMART gives a clue. It's somewhere around a 50/50 shot that SMART will predict a drive failure in advance. So my suggestion again, without delay, is to make sure the NAS is backed up, and keep those backups fresh. You can recreate the NAS when you have free time - but these problems likely will get worse. -- Chris Murphy
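A small sketch of gathering that SMART data, assuming four member drives sda through sdd (the actual device names on the NAS may differ):

  # capture full SMART output for each drive for later review
  for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
      smartctl -x "$d" > "/tmp/smart-${d##*/}.txt"
  done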