Re: xfstests generic/476 failed on btrfs(errno=-12 Out of memory, kernel 5.11.10)
On 30.03.2021 09:16 Wang Yugui wrote:

Hi,

On 30.03.21 9:24, Wang Yugui wrote:

Hi, Nikolay Borisov

With a lot of dump_stack()/printk inserted around ENOMEM in the btrfs code, we found the call stack for the ENOMEM. See the file -btrfs-dump_stack-when-ENOMEM.patch.

# cat /usr/hpc-bio/xfstests/results//generic/476.dmesg
...
[ 5759.102929] ENOMEM btrfs_drew_lock_init
[ 5759.102943] ENOMEM btrfs_init_fs_root
[ 5759.102947] [ cut here ]
[ 5759.102950] BTRFS: Transaction aborted (error -12)
[ 5759.103052] WARNING: CPU: 14 PID: 2741468 at /ssd/hpc-bio/linux-5.10.27/fs/btrfs/transaction.c:1705 create_pending_snapshot+0xb8c/0xd50 [btrfs]
...

btrfs_drew_lock_init() returned -ENOMEM. This is the instrumented source:

	/*
	 * We might be called under a transaction (e.g. indirect backref
	 * resolution) which could deadlock if it triggers memory reclaim
	 */
	nofs_flag = memalloc_nofs_save();
	ret = btrfs_drew_lock_init(&root->snapshot_lock);
	memalloc_nofs_restore(nofs_flag);
	if (ret == -ENOMEM)
		printk("ENOMEM btrfs_drew_lock_init\n");
	if (ret)
		goto fail;

And that source comes from:

commit dcc3eb9638c3c927f1597075e851d0a16300a876
Author: Nikolay Borisov
Date: Thu Jan 30 14:59:45 2020 +0200

    btrfs: convert snapshot/nocow exlcusion to drew lock

Any advice on how to fix this ENOMEM problem?

This is likely coming from changed behavior in MM; it doesn't seem related to btrfs. We have multiple places where memalloc_nofs_save() is called. By the same token the failure might have occurred in any other place, in any other piece of code which uses memalloc_nofs_save; there is no indication that this is directly related to btrfs.

top shows that this server has enough memory. The hardware of this server:

CPU: Xeon(R) CPU E5-2660 v2 (10 cores) * 2
memory: 192G, no swap

You are showing that the server has 192G of installed memory; you have not shown any stats which prove what the state of the MM subsystem was at the time of failure. At the very least, at the time of failure inspect the output of "cat /proc/meminfo" and "free -m".
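A minimal sketch of that suggestion: sample the MM state alongside the xfstests run, so a later ENOMEM can be correlated with actual memory pressure. The log path, sampled fields, and interval are illustrative, not part of any existing tooling.

```shell
#!/bin/sh
# Append a timestamped memory snapshot to a log file; run this
# periodically in the background while the test job executes.
LOG=${LOG:-/tmp/meminfo.log}
: > "$LOG"

sample() {
    date >> "$LOG"
    # the fields most relevant to an unexplained ENOMEM
    grep -E 'MemFree|MemAvailable|Slab|SUnreclaim' /proc/meminfo >> "$LOG"
    free -m >> "$LOG" 2>/dev/null || true
}

sample
# during the test run, e.g.: while sleep 5; do sample; done &
```

After a failure, the last few snapshots before the WARNING timestamp show whether free memory or unreclaimable slab was actually exhausted.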
Only one xfstests job is running on this server.

Had what looks like the same issue happening on a server:

[19146.391015] [ cut here ]
[19146.391017] BTRFS: Transaction aborted (error -12)
[19146.391035] WARNING: CPU: 13 PID: 1825871 at fs/btrfs/transaction.c:1684 create_pending_snapshot+0x912/0xd10
[19146.391036] Modules linked in: bcache crc64 loop dm_crypt bfq xfs dm_mod st sr_mod cdrom intel_powerclamp coretemp dcdbas kvm_intel snd_pcm snd_timer kvm snd irqbypass soundcore mgag200 serio_raw pcspkr drm_kms_helper evdev joydev iTCO_wdt iTCO_vendor_support i2c_algo_bit i7core_edac sg ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter button ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm configfs ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod sd_mod hid_generic usbhid hid crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd ahci cryptd glue_helper mpt3sas libahci uhci_hcd ehci_pci psmouse ehci_hcd lpc_ich raid_class libata nvme scsi_transport_sas mfd_core usbcore nvme_core scsi_mod t10_pi bnx2
[19146.391092] CPU: 13 PID: 1825871 Comm: btrfs Tainted: G W I 5.10.26 #1
[19146.391093] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.14.0 05/30/2018
[19146.391095] RIP: 0010:create_pending_snapshot+0x912/0xd10
[19146.391097] Code: 48 0f ba aa 40 0a 00 00 02 72 28 83 f8 fb 74 48 83 f8 e2 74 43 89 c6 48 c7 c7 70 2d 10 82 48 89 85 78 ff ff ff e8 d5 65 55 00 <0f> 0b 48 8b 85 78 ff ff ff 89 c1 ba 94 06 00 00 48 c7 c6 70 46 e4
[19146.391098] RSP: 0018:c900201c3b00 EFLAGS: 00010286
[19146.391099] RAX: RBX: 8881ba393200 RCX: 0fb98b88
[19146.391100] RDX: ffd8 RSI: 0027 RDI: 0fb98b80
[19146.391101] RBP: c900201c3bd0 R08: 825e2148 R09: 00027ffb
[19146.391101] R10: 8000 R11: 3fff R12: 888119dd39c0
[19146.391102] R13: 888248c36800 R14: 888a1bf69800 R15: fff4
[19146.391103] FS: 7f1d7c9488c0() GS:0fb8() knlGS:
[19146.391104] CS: 0010 DS: ES: CR0: 80050033
[19146.391105] CR2: 7fffef58d000 CR3: 00028c988004 CR4: 000206e0
[19146.391106] Call Trace:
[19146.39] ? create_pending_snapshots+0xa2/0xc0
[19146.391112] create_pending_snapshots+0xa2/0xc0
[19146.391114] btrfs_commit_transaction+0x4b9/0xb40
[19146.391116] ? start_transaction+0xd2/0x580
[19146.391119] btrfs_mksubvol+0x29e/0x450
[19146.391122] btrfs_mksnapshot+0x7b/0xb0
[19146.391124] __btrfs_ioctl_snap_create+0x16f/0x180
[19146.391126] btrfs_ioctl_snap_create_v2+0xb3/0x130
[19146.391128] btrfs_ioctl+0x15f/0x3040
[19146.391131] ? __x64_sy
Re: ENOSPC in btrfs_run_delayed_refs with 5.10.8
On 11.03.2021 18:58 Martin Raiber wrote:
On 01.02.2021 23:08 Martin Raiber wrote:
On 27.01.2021 22:03 Chris Murphy wrote:
On Wed, Jan 27, 2021 at 10:27 AM Martin Raiber wrote:

Hi,

seems 5.10.8 still has the ENOSPC issue when compression is used (compress-force=zstd,space_cache=v2):

Jan 27 11:02:14 kernel: [248571.569840] [ cut here ]
Jan 27 11:02:14 kernel: [248571.569843] BTRFS: Transaction aborted (error -28)
Jan 27 11:02:14 kernel: [248571.569845] BTRFS: error (device dm-0) in add_to_free_space_tree:1039: errno=-28 No space left
Jan 27 11:02:14 kernel: [248571.569848] BTRFS info (device dm-0): forced readonly
Jan 27 11:02:14 kernel: [248571.569851] BTRFS: error (device dm-0) in add_to_free_space_tree:1039: errno=-28 No space left
Jan 27 11:02:14 kernel: [248571.569852] BTRFS: error (device dm-0) in __btrfs_free_extent:3270: errno=-28 No space left
Jan 27 11:02:14 kernel: [248571.569854] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2191: errno=-28 No space left
Jan 27 11:02:14 kernel: [248571.569898] WARNING: CPU: 3 PID: 21255 at fs/btrfs/free-space-tree.c:1039 add_to_free_space_tree+0xe8/0x130
Jan 27 11:02:14 kernel: [248571.569913] BTRFS: error (device dm-0) in __btrfs_free_extent:3270: errno=-28 No space left
Jan 27 11:02:14 kernel: [248571.569939] Modules linked in:
Jan 27 11:02:14 kernel: [248571.569966] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2191: errno=-28 No space left
Jan 27 11:02:14 kernel: [248571.569992] bfq zram bcache crc64 loop dm_crypt xfs dm_mod st sr_mod cdrom nf_tables nfnetlink iptable_filter bridge stp llc intel_powerclamp coretemp k$
Jan 27 11:02:14 kernel: [248571.570075] CPU: 3 PID: 21255 Comm: kworker/u50:22 Tainted: G I 5.10.8 #1
Jan 27 11:02:14 kernel: [248571.570076] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.13.0 03/02/2018
Jan 27 11:02:14 kernel: [248571.570079] Workqueue: events_unbound btrfs_async_reclaim_metadata_space
Jan 27 11:02:14 kernel: [248571.570081] RIP: 0010:add_to_free_space_tree+0xe8/0x130
Jan 27 11:02:14 kernel: [248571.570082] Code: 55 50 f0 48 0f ba aa 40 0a 00 00 02 72 22 83 f8 fb 74 4c 83 f8 e2 74 47 89 c6 48 c7 c7 b8 39 49 82 89 44 24 04 e8 8a 99 4a 00 <0f> 0b 8$
Jan 27 11:02:14 kernel: [248571.570083] RSP: 0018:c90009c57b88 EFLAGS: 00010282
Jan 27 11:02:14 kernel: [248571.570084] RAX: RBX: 4000 RCX: 0027
Jan 27 11:02:14 kernel: [248571.570085] RDX: 0027 RSI: 0004 RDI: 888617a58b88
Jan 27 11:02:14 kernel: [248571.570086] RBP: 8889ecb874e0 R08: 888617a58b80 R09:
Jan 27 11:02:14 kernel: [248571.570087] R10: 0001 R11: 822372e0 R12: 00574151
Jan 27 11:02:14 kernel: [248571.570087] R13: 8884e05727e0 R14: 88815ae4fc00 R15: 88815ae4fdd8
Jan 27 11:02:14 kernel: [248571.570088] FS: () GS:888617a4() knlGS:
Jan 27 11:02:14 kernel: [248571.570089] CS: 0010 DS: ES: CR0: 80050033
Jan 27 11:02:14 kernel: [248571.570090] CR2: 7eb4a3a4f00a CR3: 0260a005 CR4: 000206e0
Jan 27 11:02:14 kernel: [248571.570091] Call Trace:
Jan 27 11:02:14 kernel: [248571.570097] __btrfs_free_extent.isra.0+0x56a/0xa10
Jan 27 11:02:14 kernel: [248571.570100] __btrfs_run_delayed_refs+0x659/0xf20
Jan 27 11:02:14 kernel: [248571.570102] btrfs_run_delayed_refs+0x73/0x200
Jan 27 11:02:14 kernel: [248571.570103] flush_space+0x4e8/0x5e0
Jan 27 11:02:14 kernel: [248571.570105] ? btrfs_get_alloc_profile+0x66/0x1b0
Jan 27 11:02:14 kernel: [248571.570106] ? btrfs_get_alloc_profile+0x66/0x1b0
Jan 27 11:02:14 kernel: [248571.570107] btrfs_async_reclaim_metadata_space+0x107/0x3a0
Jan 27 11:02:14 kernel: [248571.570111] process_one_work+0x1b6/0x350
Jan 27 11:02:14 kernel: [248571.570112] worker_thread+0x50/0x3b0
Jan 27 11:02:14 kernel: [248571.570114] ? process_one_work+0x350/0x350
Jan 27 11:02:14 kernel: [248571.570116] kthread+0xfe/0x140
Jan 27 11:02:14 kernel: [248571.570117] ? kthread_park+0x90/0x90
Jan 27 11:02:14 kernel: [248571.570120] ret_from_fork+0x22/0x30
Jan 27 11:02:14 kernel: [248571.570122] ---[ end trace 568d2f30de65b1c0 ]---
Jan 27 11:02:14 kernel: [248571.570123] BTRFS: error (device dm-0) in add_to_free_space_tree:1039: errno=-28 No space left
Jan 27 11:02:14 kernel: [248571.570151] BTRFS: error (device dm-0) in __btrfs_free_extent:3270: errno=-28 No space left
Jan 27 11:02:14 kernel: [248571.570178] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2191: errno=-28 No space left

btrfs fi usage:

Overall:
    Device size:          931.49GiB
    Device allocated:     931.49GiB
    Device unallocated:     1.00MiB
    Device missing:           0.00B
    Used:                 786.39GiB
    Free (estimated):     107.69GiB  (min: 107.69GiB)
Re: btrfs-send format that contains binary diffs
On 29.03.2021 19:25 Henning Schild wrote:
> On Mon, 29 Mar 2021 19:30:34 +0300 Andrei Borzenkov wrote:
>
>> On 29.03.2021 16:16, Claudius Heine wrote:
>>> Hi,
>>>
>>> I am currently investigating the possibility of using `btrfs-stream`
>>> files (generated by `btrfs send`) for deploying an image based
>>> update to systems (probably embedded ones).
>>>
>>> One of the issues I encountered here is that btrfs-send does not
>>> use any diff algorithm on files that have changed from one snapshot
>>> to the next.
>>
>> btrfs send works on the block level. It sends the blocks that differ
>> between two snapshots.
>>
>>> One way to implement this would be to add some sort of 'patch'
>>> command to the `btrfs-stream` format.
>>
>> This would require reading the complete content of both snapshots instead
>> of just computing a block diff using metadata. Unless I misunderstand
>> what you mean.
>
> On embedded systems it is common to update complete "firmware" images
> as opposed to package based partial updates. You often have two root
> filesystems to be able to always fall back to a working state in case
> of any sort of error.
>
> Take the picture from
> https://sbabic.github.io/swupdate/overview.html#double-copy
> and assume that "Application software" is a full blown OS with
> everything that makes your device.
>
> That approach offers great "control" but unfortunately can also lead to
> large downloads being required for an update. The basic idea is to
> download only the binary diff between the future and the current rootfs.
> Given a filesystem supports snapshots, it would be great to
> "send/receive" them as diffs.
>
> Today most people that do such things with other filesystems script
> around with xdelta etc. But btrfs is more "integrated", so when
> considering it for such embedded use cases, native support would most
> likely be better than hacks on top.
>
> We have several use cases in mind with btrfs.
> - ro-base with rw overlays
> - binary diff updates against such a ro-base
> - backup/restore with snapshots of certain subvolumes
> - factory reset with wiping certain submodules
>
> regards,
> Henning

I think I know what you want to accomplish and I've been doing it for a while now. But I don't know what the problem with btrfs send is. Do you want a non-block-based diff to make updates smaller? Have you overwritten files completely and need to dedupe or reflink them before sending them?

Theoretically the btrfs send format would be able to support something like bsdiff (a non-block-based diff -- the stream is just a set of e.g. write commands with offset and binary data, or reflink commands to copy data from one file to another), but there currently isn't a tool to create this.

How I've done it is:

- Create a btrfs image with a rw sys_root_current subvol
- E.g. debootstrap a Linux system into it
- Create sys_root_v1 as a ro snapshot of sys_root_current

Use that system image on different systems. On update, on the original image:

- Modify sys_root_current
- Create a ro snapshot sys_root_v2 of sys_root_current
- Create a btrfs send update that modifies sys_root_v1 into sys_root_v2:
  btrfs send -p sys_root_v1 sys_root_v2 | xz -c > update_v1.btrfs.xz
- Publish update_v1.btrfs.xz

On the systems:

- Download update_v1.btrfs.xz (verify signature)
- Create sys_root_v2 by applying the differences to sys_root_v1:
  cat update_v1.btrfs.xz | xz -d -c | btrfs receive /rootfs
- Rename (exchange) sys_root_current to sys_root_last
- Create a rw snapshot of sys_root_v2 as sys_root_current
- Reboot into the new system

>>> Is this something upstream would be interested in?
>>>
>>> Let's say we introduce a new `btrfs-send` format, let's call it
>>> `btrfs-delta-stream`, which can be created from a `btrfs-stream`:
>>>
>>> 1. For all `write` commands, check the requirements:
>>>    - Does the file already exist in the old snapshot?
>>>    - Is the file smaller than x MiB (this depends on the diff algorithm
>>>      and the available resources)?
>>> 2. If the file fulfills those requirements, replace the 'write' command
>>>    with a 'patch' command, and calculate the binary delta. Also check
>>>    that the delta is actually smaller than the data of the new file.
>>>    Possibly add the used binary diff algorithm as well as a checksum of
>>>    the 'old' file to the command.
>>>
>>> This file format can of course be converted back to `btrfs-stream`
>>> and then applied with `btrfs-receive`.
>>>
>>> I would probably start with `bsdiff` for the diff algorithm, but
>>> maybe we want to be flexible here.
>>>
>>> Of course if `btrfs-delta-stream` is implemented in `btrfs-progs`
>>> then we can create and apply this format directly.
>>>
>>> regards,
>>> Claudius
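Stripped down to commands, the A/B update flow described above looks like the following sketch. The subvolume names and paths mirror the mail and are illustrative; the script only prints the two command sequences rather than executing them, since running them needs root and a real btrfs filesystem.

```shell
#!/bin/sh
# Build host: freeze the new state as a ro snapshot and publish a
# delta stream against the previous version.
publish='
btrfs subvolume snapshot -r sys_root_current sys_root_v2
btrfs send -p sys_root_v1 sys_root_v2 | xz -c > update_v1.btrfs.xz
'

# Target device: apply the delta, swap the writable subvolume, reboot.
apply='
xz -d -c update_v1.btrfs.xz | btrfs receive /rootfs
mv /rootfs/sys_root_current /rootfs/sys_root_last
btrfs subvolume snapshot /rootfs/sys_root_v2 /rootfs/sys_root_current
reboot
'

printf 'publish:%s\napply:%s\n' "$publish" "$apply"
```

Because `btrfs send -p` only emits the block-level differences between the two ro snapshots, the published file already acts as a binary delta, just not a content-aware one like bsdiff.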
Re: Multiple files with the same name in one directory
On 11.03.2021 15:43 Filipe Manana wrote:
> On Wed, Mar 10, 2021 at 5:18 PM Martin Raiber wrote:
>> Hi,
>>
>> I have this in a btrfs directory. Linux kernel 5.10.16, no errors in
>> dmesg, no scrub errors:
>>
>> ls -lh
>> total 19G
>> -rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
>> ...
>>
>> disk_config.dat gets written to using fsync rename (write the new
>> version to disk_config.dat.new, fsync disk_config.dat.new, then rename
>> to disk_config.dat -- it is missing the parent directory fsync).
>
> That's interesting.
>
> I've just tried something like the following on 5.10.15 (and 5.12-rc2):
>
> create disk_config.dat
> sync
> for ((i = 0; i < 10; i++)); do
>     create disk_config.dat.new
>     write to disk_config.dat.new
>     fsync disk_config.dat.new
>     mv -f disk_config.dat.new disk_config.dat
> done
>
> mount fs
> list directory
>
> I only get one file with the name disk_config.dat and one file with
> the name disk_config.dat.new.
> File disk_config.dat has the data written at iteration 9 and
> disk_config.dat.new has the data written at iteration 10 (expected).
>
> You haven't mentioned it, but I suppose you had a power failure /
> unclean shutdown somewhere after an fsync, right?
> Is this something you can reproduce at will?

I think I rebooted via "echo b > /proc/sysrq-trigger". But at that point it probably hadn't written to disk_config.dat anymore (for more than the commit interval). I'm also not sure how long it took me to notice the multiple files (since they don't cause any problems) -- so I can't reproduce it.

This is the same machine and file system as in the "ENOSPC in btrfs_async_reclaim_metadata_space -> flush_space -> btrfs_run_delayed_refs" report. Could it be that something went wrong with the error handling/remount-ro w.r.t. the tree log?

>> So far no negative consequences... (except that programs might get
>> confused).
>>
>> echo 3 > /proc/sys/vm/drop_caches doesn't help.
>>
>> Regards,
>> Martin Raiber
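For reference, the writer's update pattern from this thread can be reproduced with a small script. This is a sketch: `dd conv=fsync` (GNU dd assumed) stands in for the application's write+fsync, `TESTDIR` is an illustrative path, and, like the application, it deliberately does not fsync the parent directory.

```shell
#!/bin/sh
# write-new / fsync / rename-over, the same sequence the report
# describes for disk_config.dat
set -e
cd "${TESTDIR:-/tmp}"
for i in 1 2 3; do
    # conv=fsync makes dd call fsync(2) on the output before exiting
    printf 'config v%s\n' "$i" | dd of=disk_config.dat.new conv=fsync status=none
    mv -f disk_config.dat.new disk_config.dat
done
ls -l disk_config.dat
```

On a healthy filesystem the directory ends up with exactly one `disk_config.dat` holding the last version; the report's point is that after a crash the directory somehow acquired many same-named entries, which rename semantics should make impossible.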
Re: ENOSPC in btrfs_run_delayed_refs with 5.10.8
On 01.02.2021 23:08 Martin Raiber wrote: > On 27.01.2021 22:03 Chris Murphy wrote: >> On Wed, Jan 27, 2021 at 10:27 AM Martin Raiber wrote: >>> Hi, >>> >>> seems 5.10.8 still has the ENOSPC issue when compression is used >>> (compress-force=zstd,space_cache=v2): >>> >>> Jan 27 11:02:14 kernel: [248571.569840] [ cut here >>> ] >>> Jan 27 11:02:14 kernel: [248571.569843] BTRFS: Transaction aborted (error >>> -28) >>> Jan 27 11:02:14 kernel: [248571.569845] BTRFS: error (device dm-0) in >>> add_to_free_space_tree:1039: errno=-28 No space left >>> Jan 27 11:02:14 kernel: [248571.569848] BTRFS info (device dm-0): forced >>> readonly >>> Jan 27 11:02:14 kernel: [248571.569851] BTRFS: error (device dm-0) in >>> add_to_free_space_tree:1039: errno=-28 No space left >>> Jan 27 11:02:14 kernel: [248571.569852] BTRFS: error (device dm-0) in >>> __btrfs_free_extent:3270: errno=-28 No space left >>> Jan 27 11:02:14 kernel: [248571.569854] BTRFS: error (device dm-0) in >>> btrfs_run_delayed_refs:2191: errno=-28 No space left >>> Jan 27 11:02:14 kernel: [248571.569898] WARNING: CPU: 3 PID: 21255 at >>> fs/btrfs/free-space-tree.c:1039 add_to_free_space_tree+0xe8/0x130 >>> Jan 27 11:02:14 kernel: [248571.569913] BTRFS: error (device dm-0) in >>> __btrfs_free_extent:3270: errno=-28 No space left >>> Jan 27 11:02:14 kernel: [248571.569939] Modules linked in: >>> Jan 27 11:02:14 kernel: [248571.569966] BTRFS: error (device dm-0) in >>> btrfs_run_delayed_refs:2191: errno=-28 No space left >>> Jan 27 11:02:14 kernel: [248571.569992] bfq zram bcache crc64 loop >>> dm_crypt xfs dm_mod st sr_mod cdrom nf_tables nfnetlink iptable_filter >>> bridge stp llc intel_powerclamp coretemp k$ >>> Jan 27 11:02:14 kernel: [248571.570075] CPU: 3 PID: 21255 Comm: >>> kworker/u50:22 Tainted: G I 5.10.8 #1 >>> Jan 27 11:02:14 kernel: [248571.570076] Hardware name: Dell Inc. 
PowerEdge >>> R510/0DPRKF, BIOS 1.13.0 03/02/2018 >>> Jan 27 11:02:14 kernel: [248571.570079] Workqueue: events_unbound >>> btrfs_async_reclaim_metadata_space >>> Jan 27 11:02:14 kernel: [248571.570081] RIP: >>> 0010:add_to_free_space_tree+0xe8/0x130 >>> Jan 27 11:02:14 kernel: [248571.570082] Code: 55 50 f0 48 0f ba aa 40 0a >>> 00 00 02 72 22 83 f8 fb 74 4c 83 f8 e2 74 47 89 c6 48 c7 c7 b8 39 49 82 89 >>> 44 24 04 e8 8a 99 4a 00 <0f> 0b 8$ >>> Jan 27 11:02:14 kernel: [248571.570083] RSP: 0018:c90009c57b88 EFLAGS: >>> 00010282 >>> Jan 27 11:02:14 kernel: [248571.570084] RAX: RBX: >>> 4000 RCX: 0027 >>> Jan 27 11:02:14 kernel: [248571.570085] RDX: 0027 RSI: >>> 0004 RDI: 888617a58b88 >>> Jan 27 11:02:14 kernel: [248571.570086] RBP: 8889ecb874e0 R08: >>> 888617a58b80 R09: >>> Jan 27 11:02:14 kernel: [248571.570087] R10: 0001 R11: >>> 822372e0 R12: 00574151 >>> Jan 27 11:02:14 kernel: [248571.570087] R13: 8884e05727e0 R14: >>> 88815ae4fc00 R15: 88815ae4fdd8 >>> Jan 27 11:02:14 kernel: [248571.570088] FS: () >>> GS:888617a4() knlGS: >>> Jan 27 11:02:14 kernel: [248571.570089] CS: 0010 DS: ES: CR0: >>> 80050033 >>> Jan 27 11:02:14 kernel: [248571.570090] CR2: 7eb4a3a4f00a CR3: >>> 0260a005 CR4: 000206e0 >>> Jan 27 11:02:14 kernel: [248571.570091] Call Trace: >>> Jan 27 11:02:14 kernel: [248571.570097] >>> __btrfs_free_extent.isra.0+0x56a/0xa10 >>> Jan 27 11:02:14 kernel: [248571.570100] >>> __btrfs_run_delayed_refs+0x659/0xf20 >>> Jan 27 11:02:14 kernel: [248571.570102] btrfs_run_delayed_refs+0x73/0x200 >>> Jan 27 11:02:14 kernel: [248571.570103] flush_space+0x4e8/0x5e0 >>> Jan 27 11:02:14 kernel: [248571.570105] ? >>> btrfs_get_alloc_profile+0x66/0x1b0 >>> Jan 27 11:02:14 kernel: [248571.570106] ? 
>>> btrfs_get_alloc_profile+0x66/0x1b0 >>> Jan 27 11:02:14 kernel: [248571.570107] >>> btrfs_async_reclaim_metadata_space+0x107/0x3a0 >>> Jan 27 11:02:14 kernel: [248571.570111] process_one_work+0x1b6/0x350 >>> Jan 27 11:02:14 kernel: [248571.570112] worker_thread+0x50/0x3b0 >>> Jan 27 11:02:14 kernel: [248571.570114] ? process_one_work+0x350/0x350 >>> Jan 27 11:02:14
Multiple files with the same name in one directory
Hi,

I have this in a btrfs directory. Linux kernel 5.10.16, no errors in dmesg, no scrub errors:

ls -lh
total 19G
-rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root 783 Mar 10 14:56 disk_config.dat
...

disk_config.dat gets written to using fsync rename (write the new version to disk_config.dat.new, fsync disk_config.dat.new, then rename to disk_config.dat -- it is missing the parent directory fsync).

So far no negative consequences... (except that programs might get confused).

echo 3 > /proc/sys/vm/drop_caches doesn't help.

Regards,
Martin Raiber
Re: [PATCH] btrfs: Prevent nowait or async read from doing sync IO
On 26.02.2021 18:00 David Sterba wrote:
> On Fri, Jan 08, 2021 at 12:02:48AM +0000, Martin Raiber wrote:
>> When reading from a btrfs file via io_uring I get the following
>> call traces:
>>
>> [<0>] wait_on_page_bit+0x12b/0x270
>> [<0>] read_extent_buffer_pages+0x2ad/0x360
>> [<0>] btree_read_extent_buffer_pages+0x97/0x110
>> [<0>] read_tree_block+0x36/0x60
>> [<0>] read_block_for_search.isra.0+0x1a9/0x360
>> [<0>] btrfs_search_slot+0x23d/0x9f0
>> [<0>] btrfs_lookup_csum+0x75/0x170
>> [<0>] btrfs_lookup_bio_sums+0x23d/0x630
>> [<0>] btrfs_submit_data_bio+0x109/0x180
>> [<0>] submit_one_bio+0x44/0x70
>> [<0>] extent_readahead+0x37a/0x3a0
>> [<0>] read_pages+0x8e/0x1f0
>> [<0>] page_cache_ra_unbounded+0x1aa/0x1f0
>> [<0>] generic_file_buffered_read+0x3eb/0x830
>> [<0>] io_iter_do_read+0x1a/0x40
>> [<0>] io_read+0xde/0x350
>> [<0>] io_issue_sqe+0x5cd/0xed0
>> [<0>] __io_queue_sqe+0xf9/0x370
>> [<0>] io_submit_sqes+0x637/0x910
>> [<0>] __x64_sys_io_uring_enter+0x22e/0x390
>> [<0>] do_syscall_64+0x33/0x80
>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> Prevent those by setting IOCB_NOIO before calling
>> generic_file_buffered_read.
>>
>> Async read has the same problem. So disable that by removing
>> FMODE_BUF_RASYNC. This was added with commit
>> 8730f12b7962b21ea9ad2756abce1e205d22db84 ("btrfs: flag files as
>> supporting buffered async reads") in 5.9. io_uring will read
>> the data via worker threads if it can't be read without sync IO
>> this way.
>>
>> Signed-off-by: Martin Raiber
>> ---
>>  fs/btrfs/file.c | 15 +--
>>  1 file changed, 13 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 0e41459b8..8bb561f6d 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -3589,7 +3589,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
>>
>>  static int btrfs_file_open(struct inode *inode, struct file *filp)
>>  {
>> -	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
>> +	filp->f_mode |= FMODE_NOWAIT;
>>  	return generic_file_open(inode, filp);
>>  }
>>
>> @@ -3639,7 +3639,18 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>>  		return ret;
>>  	}
>>
>> -	return generic_file_buffered_read(iocb, to, ret);
>> +	if (iocb->ki_flags & IOCB_NOWAIT)
>> +		iocb->ki_flags |= IOCB_NOIO;
>> +
>> +	ret = generic_file_buffered_read(iocb, to, ret);
>> +
>> +	if (iocb->ki_flags & IOCB_NOWAIT) {
>> +		iocb->ki_flags &= ~IOCB_NOIO;
>> +		if (ret == 0)
>> +			ret = -EAGAIN;
>> +	}
>
> Christoph has some doubts about the code,
> https://lore.kernel.org/lkml/20210226051626.ga2...@lst.de/
>
> The patch has been in for-next but as I'm not sure it's correct and
> don't have a reproducer, I'll remove it again. We do want to fix the
> warning, maybe there's only something trivial missing but we need to be
> sure, I don't have enough expertise here.

The general gist of the criticism is correct. It is generic_file_buffered_read/filemap_read that handles IOCB_NOIO, however. It has only been used from gfs2 since 5.8, and IOCB_NOIO was added in 5.8 with commit 41da51bce36f44eefc1e3d0f47d18841cbd065ba.

However, I cannot see how to find out whether readahead was called with IOCB_NOWAIT from extent_readahead/btrfs_readahead/readahead_control. So add an additional parameter to address_space_operations.readahead?

As mentioned, this is not too relevant to btrfs (because of the CRC calculation), but making readahead async in all cases (incl.
IOCB_WAITQ) would be the proper solution.

W.r.t. testing: the lowest-effort way I can think of is to add an io_uring switch to xfs_io, so that xfstests can be run using io_uring (where possible). Then check via tracing/perf that there aren't any call stacks with both io_uring_enter and wait_on_page_bit (or any other blocking call) in them.
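Short of patching xfs_io, one way to exercise this path today is a fio job driving buffered random reads through io_uring on a btrfs mount. This is a hypothetical job file; the mount point, file size, and queue depth are placeholders, and the io_uring ioengine requires fio >= 3.13.

```ini
[global]
ioengine=io_uring
direct=0          ; buffered, so btrfs_lookup_bio_sums is on the read path
rw=randread
bs=4k
runtime=60
time_based

[randread]
directory=/mnt/btrfs   ; assumed btrfs mount point
size=4g
iodepth=32
```

While this runs, sampling stacks with perf and looking for io_uring_enter together with wait_on_page_bit would show whether submission still blocks.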
Space cache
Hi,

I've been looking a bit into the btrfs space cache and came to the following conclusions. Please correct me if I'm wrong:

1. The space cache mount option only modifies how the space cache is persisted, not the in-memory structures (hence why I have a 2.3 GiB btrfs_free_space_bitmap slab with a file system mounted with space_cache=v2).
2. In memory it is mostly kept as bitmaps. space_cache=v1 persists those bitmaps directly to disk.
3. If it's mounted with nospace_cache, it still gets all the benefits of the "space cache" _after_ those in-memory bitmaps have been filled; it just isn't persisted.
4. The in-memory space cache doesn't react to memory pressure/is unevictable.

This leads me to: if one can live with slow startup/initial performance, mounting with nospace_cache has the highest performance. Especially with a 1TB NVMe in a long-running server, I don't really care if it has to iterate over all block group metadata after mount for a few seconds, if that means it has fewer write IOs for every write. The calculus obviously changes for a hard disk, where reading this metadata would take forever due to low IOPS.

Regards,
Martin Raiber
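A quick way to check which space-cache variant a given mount is actually using (a sketch: `TARGET` is illustrative, `findmnt` from util-linux is assumed, and reading /proc/slabinfo needs root):

```shell
#!/bin/sh
# Print the space_cache-related mount options of TARGET, if any.
TARGET=${TARGET:-/}
opts=$(findmnt -no OPTIONS "$TARGET" | tr ',' '\n' \
       | grep -E '^(no)?space_cache' || true)
if [ -n "$opts" ]; then
    echo "space cache option on $TARGET: $opts"
else
    echo "no explicit space_cache option on $TARGET"
fi
# the in-memory bitmaps discussed above live in this slab (root required):
# grep btrfs_free_space_bitmap /proc/slabinfo
```

The slab line is what lets you verify point 1: its size should be roughly the same regardless of whether space_cache=v1, v2, or nospace_cache is used, once the bitmaps have been populated.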
Re: ENOSPC in btrfs_run_delayed_refs with 5.10.8
On 27.01.2021 22:03 Chris Murphy wrote: > On Wed, Jan 27, 2021 at 10:27 AM Martin Raiber wrote: >> Hi, >> >> seems 5.10.8 still has the ENOSPC issue when compression is used >> (compress-force=zstd,space_cache=v2): >> >> Jan 27 11:02:14 kernel: [248571.569840] [ cut here ] >> Jan 27 11:02:14 kernel: [248571.569843] BTRFS: Transaction aborted (error >> -28) >> Jan 27 11:02:14 kernel: [248571.569845] BTRFS: error (device dm-0) in >> add_to_free_space_tree:1039: errno=-28 No space left >> Jan 27 11:02:14 kernel: [248571.569848] BTRFS info (device dm-0): forced >> readonly >> Jan 27 11:02:14 kernel: [248571.569851] BTRFS: error (device dm-0) in >> add_to_free_space_tree:1039: errno=-28 No space left >> Jan 27 11:02:14 kernel: [248571.569852] BTRFS: error (device dm-0) in >> __btrfs_free_extent:3270: errno=-28 No space left >> Jan 27 11:02:14 kernel: [248571.569854] BTRFS: error (device dm-0) in >> btrfs_run_delayed_refs:2191: errno=-28 No space left >> Jan 27 11:02:14 kernel: [248571.569898] WARNING: CPU: 3 PID: 21255 at >> fs/btrfs/free-space-tree.c:1039 add_to_free_space_tree+0xe8/0x130 >> Jan 27 11:02:14 kernel: [248571.569913] BTRFS: error (device dm-0) in >> __btrfs_free_extent:3270: errno=-28 No space left >> Jan 27 11:02:14 kernel: [248571.569939] Modules linked in: >> Jan 27 11:02:14 kernel: [248571.569966] BTRFS: error (device dm-0) in >> btrfs_run_delayed_refs:2191: errno=-28 No space left >> Jan 27 11:02:14 kernel: [248571.569992] bfq zram bcache crc64 loop >> dm_crypt xfs dm_mod st sr_mod cdrom nf_tables nfnetlink iptable_filter >> bridge stp llc intel_powerclamp coretemp k$ >> Jan 27 11:02:14 kernel: [248571.570075] CPU: 3 PID: 21255 Comm: >> kworker/u50:22 Tainted: G I 5.10.8 #1 >> Jan 27 11:02:14 kernel: [248571.570076] Hardware name: Dell Inc. 
PowerEdge >> R510/0DPRKF, BIOS 1.13.0 03/02/2018 >> Jan 27 11:02:14 kernel: [248571.570079] Workqueue: events_unbound >> btrfs_async_reclaim_metadata_space >> Jan 27 11:02:14 kernel: [248571.570081] RIP: >> 0010:add_to_free_space_tree+0xe8/0x130 >> Jan 27 11:02:14 kernel: [248571.570082] Code: 55 50 f0 48 0f ba aa 40 0a 00 >> 00 02 72 22 83 f8 fb 74 4c 83 f8 e2 74 47 89 c6 48 c7 c7 b8 39 49 82 89 44 >> 24 04 e8 8a 99 4a 00 <0f> 0b 8$ >> Jan 27 11:02:14 kernel: [248571.570083] RSP: 0018:c90009c57b88 EFLAGS: >> 00010282 >> Jan 27 11:02:14 kernel: [248571.570084] RAX: RBX: >> 4000 RCX: 0027 >> Jan 27 11:02:14 kernel: [248571.570085] RDX: 0027 RSI: >> 0004 RDI: 888617a58b88 >> Jan 27 11:02:14 kernel: [248571.570086] RBP: 8889ecb874e0 R08: >> 888617a58b80 R09: >> Jan 27 11:02:14 kernel: [248571.570087] R10: 0001 R11: >> 822372e0 R12: 00574151 >> Jan 27 11:02:14 kernel: [248571.570087] R13: 8884e05727e0 R14: >> 88815ae4fc00 R15: 88815ae4fdd8 >> Jan 27 11:02:14 kernel: [248571.570088] FS: () >> GS:888617a4() knlGS: >> Jan 27 11:02:14 kernel: [248571.570089] CS: 0010 DS: ES: CR0: >> 80050033 >> Jan 27 11:02:14 kernel: [248571.570090] CR2: 7eb4a3a4f00a CR3: >> 0260a005 CR4: 000206e0 >> Jan 27 11:02:14 kernel: [248571.570091] Call Trace: >> Jan 27 11:02:14 kernel: [248571.570097] >> __btrfs_free_extent.isra.0+0x56a/0xa10 >> Jan 27 11:02:14 kernel: [248571.570100] >> __btrfs_run_delayed_refs+0x659/0xf20 >> Jan 27 11:02:14 kernel: [248571.570102] btrfs_run_delayed_refs+0x73/0x200 >> Jan 27 11:02:14 kernel: [248571.570103] flush_space+0x4e8/0x5e0 >> Jan 27 11:02:14 kernel: [248571.570105] ? >> btrfs_get_alloc_profile+0x66/0x1b0 >> Jan 27 11:02:14 kernel: [248571.570106] ? 
>> btrfs_get_alloc_profile+0x66/0x1b0 >> Jan 27 11:02:14 kernel: [248571.570107] >> btrfs_async_reclaim_metadata_space+0x107/0x3a0 >> Jan 27 11:02:14 kernel: [248571.570111] process_one_work+0x1b6/0x350 >> Jan 27 11:02:14 kernel: [248571.570112] worker_thread+0x50/0x3b0 >> Jan 27 11:02:14 kernel: [248571.570114] ? process_one_work+0x350/0x350 >> Jan 27 11:02:14 kernel: [248571.570116] kthread+0xfe/0x140 >> Jan 27 11:02:14 kernel: [248571.570117] ? kthread_park+0x90/0x90 >> Jan 27 11:02:14 kernel: [248571.570120] ret_from_fork+0x22/0x30 >> Jan 27 11:02:14 kernel: [248571.570122] ---[ end trace 568d2f30de65b1c0 ]--- >> Jan 27 11:02:14 kernel: [248571.570123] BTRFS: error (device dm-0)
ENOSPC in btrfs_run_delayed_refs with 5.10.8 + zstd
Data,single: Size:884.48GiB, Used:776.79GiB (87.82%)
   /dev/mapper/LUKS-RC-a6414fd731ce4f878af44c3987bce533   884.48GiB

Metadata,single: Size:47.01GiB, Used:9.59GiB (20.41%)
   /dev/mapper/LUKS-RC-a6414fd731ce4f878af44c3987bce533    47.01GiB

System,single: Size:4.00MiB, Used:144.00KiB (3.52%)
   /dev/mapper/LUKS-RC-a6414fd731ce4f878af44c3987bce533     4.00MiB

Unallocated:
   /dev/mapper/LUKS-RC-a6414fd731ce4f878af44c3987bce533     1.00MiB

Regards,
Martin Raiber
Re: [PATCH] btrfs: Prevent nowait or async read from doing sync IO
On 12.01.2021 18:01 Pavel Begunkov wrote:

On 12/01/2021 15:36, David Sterba wrote:

On Fri, Jan 08, 2021 at 12:02:48AM +, Martin Raiber wrote:

When reading from a btrfs file via io_uring I get the following call traces:

Is there a way to reproduce with common tools (fio), or is a specialized one needed?

I'm not familiar with this particular issue, but it should _probably_ be reproducible with fio with the io_uring engine, or with the fio/t/io_uring tool.

[<0>] wait_on_page_bit+0x12b/0x270
[<0>] read_extent_buffer_pages+0x2ad/0x360
[<0>] btree_read_extent_buffer_pages+0x97/0x110
[<0>] read_tree_block+0x36/0x60
[<0>] read_block_for_search.isra.0+0x1a9/0x360
[<0>] btrfs_search_slot+0x23d/0x9f0
[<0>] btrfs_lookup_csum+0x75/0x170
[<0>] btrfs_lookup_bio_sums+0x23d/0x630
[<0>] btrfs_submit_data_bio+0x109/0x180
[<0>] submit_one_bio+0x44/0x70
[<0>] extent_readahead+0x37a/0x3a0
[<0>] read_pages+0x8e/0x1f0
[<0>] page_cache_ra_unbounded+0x1aa/0x1f0
[<0>] generic_file_buffered_read+0x3eb/0x830
[<0>] io_iter_do_read+0x1a/0x40
[<0>] io_read+0xde/0x350
[<0>] io_issue_sqe+0x5cd/0xed0
[<0>] __io_queue_sqe+0xf9/0x370
[<0>] io_submit_sqes+0x637/0x910
[<0>] __x64_sys_io_uring_enter+0x22e/0x390
[<0>] do_syscall_64+0x33/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Prevent those by setting IOCB_NOIO before calling generic_file_buffered_read.

Async read has the same problem. So disable that by removing FMODE_BUF_RASYNC. This was added with commit 8730f12b7962b21ea9ad2756abce1e205d22db84 ("btrfs: flag files as

Oh yeah, that's the commit that went into the btrfs code out-of-band. I am not familiar with the io_uring support and have no good idea what the new flag was supposed to do.

iirc, Jens did make buffered IO asynchronous by waiting on a page with wait_page_queue, but I don't remember well enough.

supporting buffered async reads") in 5.9. io_uring will read the data via worker threads if it can't be read without sync IO this way.

What are the implications of that?
Like more context switching (due to the worker threads) or other potential performance-related problems?

io_uring splits the submission and completion steps and usually expects submissions to happen quickly and not block (at least not for long), otherwise it can't submit other requests; that reduces QD and so forth. In the worst case it can serialise everything to QD1.

I guess the same applies to AIO. io_submit historically had the problem that it is truly async only for certain operations. That's why everyone uses it only for async direct I/O with preallocated files (and even then, e.g. MySQL has innodb_use_native_aio as a tuning option that replaces io_submit with a userspace thread pool). io_uring fixes that by making everything async, so the thread calling io_uring_enter should never do any IO (only read from the page cache etc.). The idea is that one can build e.g. a web server that uses only one thread, does all (formerly blocking) syscalls via io_uring, and handles a large number of connections. If btrfs does blocking IO in this one thread, such a web server wouldn't work well with btrfs, since the blocking call would e.g. delay accepting new connections.

Specifically w.r.t. read(), io_uring has the following logic:

* Try read_iter with RWF_NOWAIT/IOCB_NOWAIT.
* If read_iter returns -EAGAIN, look at the FMODE_BUF_RASYNC flag. If it is set, do the read with IOCB_WAITQ and a callback set (AIO).
* If FMODE_BUF_RASYNC is not set, do a sync read in an io_uring worker thread.

My guess is that since btrfs needs to do the checksum calculations in a worker anyway, it's best/simplest not to support the AIO submission (so not set FMODE_BUF_RASYNC).

W.r.t. RWF_NOWAIT the problem is that btrfs synchronously reads the csum before async-submitting the page reads. When reading randomly from a (large) file this means at least one synchronous read per io_uring submission.
I guess the same happens for preadv2 with RWF_NOWAIT (and io_submit), man page:

RWF_NOWAIT (since Linux 4.14)
Do not wait for data which is not immediately available. If this flag is specified, the preadv2() system call will return instantly if it would have to read data from the backing storage or wait for a lock. If some data was successfully read, it will return the number of bytes read. If no bytes were read, it will return -1 and set errno to EAGAIN. Currently, this flag is meaningful only for preadv2().

I haven't tested this, but the same would probably happen if it doesn't have the extents in cache, though that might happen seldom enough that it's not worth fixing (for now). I did also look at how ext4 with fs-ve
Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata
Hi all,

On 20.01.2021 0:12, Zygo Blaxell wrote:
> With the 4 device types we can trivially specify this arrangement.
>
> The sorted device lists would be:
>
> Metadata sort order          Data sort order
> metadata only (3)            data only (2)
> metadata preferred (1)       data preferred (0)
> data preferred (0)           metadata preferred (1)
> other devices (2 or other)   other devices (3 or other)
>
> We keep 3 device counts for the first 3 sort orders. If the number of all
> preferred devices (type != 0) is zero, we just return ndevs; otherwise,
> we pick the first device count that is >= mindevs. If all the device
> counts are < mindevs then we return the 3rd count (metadata only +
> metadata preferred + data preferred) and the caller will find ENOSPC.
>
> More sophisticated future implementations can alter the sort order, or
> operate in entirely separate parts of btrfs, without conflicting with
> this scheme. If there is no mount option, then future implementations
> can't conflict with it.

I agree with Zygo and Josef that the mount option is ugly and needless. This should be a _per-device_ setting as suggested by Zygo (metadata only, metadata preferred, data preferred, data only, unspecified). Maybe in the future it might be useful to generalize this setting to something like a 0..255 priority, but the 4 device types look like a sufficient solution for now.

I would personally prefer a read-write sysfs option to change the device preference, but the btrfs-progs approach is fine for me too.

Anyway, I'm REALLY happy that there's finally a patchset being actively discussed. I have maintained a naive patch implementing a "preferred_metadata=metadata" option for years, and the impact e.g. for rsync backups is huge.

Martin
[PATCH] btrfs: Prevent nowait or async read from doing sync IO
When reading from a btrfs file via io_uring I get the following call traces:

[<0>] wait_on_page_bit+0x12b/0x270
[<0>] read_extent_buffer_pages+0x2ad/0x360
[<0>] btree_read_extent_buffer_pages+0x97/0x110
[<0>] read_tree_block+0x36/0x60
[<0>] read_block_for_search.isra.0+0x1a9/0x360
[<0>] btrfs_search_slot+0x23d/0x9f0
[<0>] btrfs_lookup_csum+0x75/0x170
[<0>] btrfs_lookup_bio_sums+0x23d/0x630
[<0>] btrfs_submit_data_bio+0x109/0x180
[<0>] submit_one_bio+0x44/0x70
[<0>] extent_readahead+0x37a/0x3a0
[<0>] read_pages+0x8e/0x1f0
[<0>] page_cache_ra_unbounded+0x1aa/0x1f0
[<0>] generic_file_buffered_read+0x3eb/0x830
[<0>] io_iter_do_read+0x1a/0x40
[<0>] io_read+0xde/0x350
[<0>] io_issue_sqe+0x5cd/0xed0
[<0>] __io_queue_sqe+0xf9/0x370
[<0>] io_submit_sqes+0x637/0x910
[<0>] __x64_sys_io_uring_enter+0x22e/0x390
[<0>] do_syscall_64+0x33/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Prevent those by setting IOCB_NOIO before calling generic_file_buffered_read. Async read has the same problem, so disable that by removing FMODE_BUF_RASYNC. This was added with commit 8730f12b7962b21ea9ad2756abce1e205d22db84 ("btrfs: flag files as supporting buffered async reads") with 5.9. io_uring will then read the data via worker threads if it can't be read without sync IO this way.
Signed-off-by: Martin Raiber
---
 fs/btrfs/file.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0e41459b8..8bb561f6d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3589,7 +3589,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
 
 static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
-	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
+	filp->f_mode |= FMODE_NOWAIT;
 	return generic_file_open(inode, filp);
 }
 
@@ -3639,7 +3639,18 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return ret;
 	}
 
-	return generic_file_buffered_read(iocb, to, ret);
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		iocb->ki_flags |= IOCB_NOIO;
+
+	ret = generic_file_buffered_read(iocb, to, ret);
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		iocb->ki_flags &= ~IOCB_NOIO;
+		if (ret == 0)
+			ret = -EAGAIN;
+	}
+
+	return ret;
 }
 
 const struct file_operations btrfs_file_operations = {
-- 
2.30.0
Re: 5.6-5.10 balance regression?
Qu Wenruo - 29.12.20, 01:44:07 CET:
> So what I can do is only to add a warning message to the problem.
>
> To solve your problem, I also submitted a patch to btrfs-progs, to
> force v1 space cache cleaning even if the fs has v2 space cache
> enabled.
>
> Or, you can disable v2 space cache first, using "btrfs check
> --clear-space-cache v2" first, then "btrfs check --clear-space-cache
> v1", and finally mount the fs with "space_cache=v2" again.
>
> To verify there is no space cache v1 left, you can run the following
> command:
>
> # btrfs ins dump-tree -t root | grep EXTENT_DATA
>
> It should output nothing.

I have v1 space_cache stuff on filesystems which use v2 space_cache as well, so… is what you wrote above the fully working way to completely switch any BTRFS filesystem with space cache v1 over to space_cache v2? Or would it be more straightforward than that with a newer kernel?

Best,
-- 
Martin
Re: syncfs() returns no error on fs failure
On 07.07.2019 14:15 Qu Wenruo wrote: > > On 2019/7/6 上午4:28, Martin Raiber wrote: >> More research on this. Seems a generic error reporting mechanism for >> this is in the works https://lkml.org/lkml/2018/6/1/640 . > sync() system call is defined as void sync(void); thus it has no error > reporting. > > syncfs() could report error. > >> Wrt. to btrfs one should always use BTRFS_IOC_SYNC because only this one >> seems to wait for delalloc work to finish: >> https://patchwork.kernel.org/patch/2927491/ (five year old patch where >> Filipe Manana added this to BTRFS_IOC_SYNC and with v2->v3 not to >> syncfs() ). >> >> I was smart enough to check if the filesystem is still writable after a >> syncfs() (so the missing error return doesn't matter that much) but I >> guess the missing wait for delalloc can cause the application to think >> data is on disk even though it isn't. > Isn't syncfs() enough to return error for your use case? > > Another solution is fsync(). It's ensured to return error if data > writeback or metadata update path has something wrong. > IIRC there are quite some fstests test cases using this way to detect fs > misbehavior. > > Testing if the fs can be written after sync() is not enough in fact. > If you're doing buffer write, it only covers the buffered write part, > which normally just includes space preallocation and copy data to page > cache, doesn't include the data write back nor metadata update. > > So I'd recommend to stick to fsync() if you want to make sure your data > reach disk. This does not only apply to btrfs, but all Linux filesystems. > > Thanks, > Qu This is for UrBackup (Open Source backup software). What it does is, create btrfs snapshot of last backup of a client, sync the current client fs to the btrfs snapshot, then call syncfs(btrfs snapshot), then check if the snapshot is still writable, then set the backup to complete in its internal database. Calling fsync() on every file would kill performance (especially on btrfs). 
The problem I had was that there was a (complete in database) backup that had files with wrong checksums (UrBackup does its own checksums, the btrfs ones were okay), and missing files. On the day the corrupted backup completed the btrfs went read-only a few hours after the backup completed and the syncfs() was called with: [253018.670661] BTRFS: error (device md1) in btrfs_run_delayed_refs:2950: errno=-5 IO failure So my guess is using BTRFS_IOC_SYNC instead of syncfs() fixes the problem in my case, while it would probably be nice if syncfs() returns an error if the fs fails (it doesn't -- I tested it) and waits for everything to be written to disk (as expected, and the man-page somewhat confirms). > >> On 05.07.2019 16:22 Martin Raiber wrote: >>> Hi, >>> >>> I realize this isn't a btrfs specific problem but syncfs() returns no >>> error even on complete fs failure. The problem is (I think) that the >>> return value of sb->s_op->sync_fs is being ignored in fs/sync.c. I kind >>> of assumed it would return an error if it fails to write the file system >>> changes to disk. >>> >>> For btrfs there is a work-around of using BTRFS_IOC_SYNC (which I am >>> going to use now) but that is obviously less user friendly than syncfs(). >>> >>> Regards, >>> Martin Raiber >>
Re: syncfs() returns no error on fs failure
More research on this. Seems a generic error reporting mechanism for this is in the works https://lkml.org/lkml/2018/6/1/640 . Wrt. to btrfs one should always use BTRFS_IOC_SYNC because only this one seems to wait for delalloc work to finish: https://patchwork.kernel.org/patch/2927491/ (five year old patch where Filipe Manana added this to BTRFS_IOC_SYNC and with v2->v3 not to syncfs() ). I was smart enough to check if the filesystem is still writable after a syncfs() (so the missing error return doesn't matter that much) but I guess the missing wait for delalloc can cause the application to think data is on disk even though it isn't. On 05.07.2019 16:22 Martin Raiber wrote: > Hi, > > I realize this isn't a btrfs specific problem but syncfs() returns no > error even on complete fs failure. The problem is (I think) that the > return value of sb->s_op->sync_fs is being ignored in fs/sync.c. I kind > of assumed it would return an error if it fails to write the file system > changes to disk. > > For btrfs there is a work-around of using BTRFS_IOC_SYNC (which I am > going to use now) but that is obviously less user friendly than syncfs(). > > Regards, > Martin Raiber
syncfs() returns no error on fs failure
Hi, I realize this isn't a btrfs specific problem but syncfs() returns no error even on complete fs failure. The problem is (I think) that the return value of sb->s_op->sync_fs is being ignored in fs/sync.c. I kind of assumed it would return an error if it fails to write the file system changes to disk. For btrfs there is a work-around of using BTRFS_IOC_SYNC (which I am going to use now) but that is obviously less user friendly than syncfs(). Regards, Martin Raiber
Re: Global reserve and ENOSPC while deleting snapshots on 5.0.9 - still happens on 5.1.11
I've fixed the same problem(s) by increasing the global metadata reserve size as well, though I haven't encountered them since Josef Bacik's block rsv rework in 5.0.

One problem with increasing the global metadata reserve is that, as far as I can tell, it is the only way dirty metadata is throttled. If it is increased too much (as a percentage of RAM) the system goes OOM, depending on the work load. As far as I can see, dirty metadata isn't included in the dirty_ratio calculation either, which causes issues on that front as well.

Another thing that I think helps is to run with "nodatasum" -- probably because then less metadata needs to be changed when deleting.

On 23.06.2019 16:14 Zygo Blaxell wrote:
> On Tue, Apr 23, 2019 at 07:06:51PM -0400, Zygo Blaxell wrote:
>> I had a test filesystem that ran out of unallocated space, then ran
>> out of metadata space during a snapshot delete, and forced readonly.
>> The workload before the failure was a lot of rsync and bees dedupe
>> combined with random snapshot creates and deletes.
> Had this happen again on a production filesystem, this time on 5.1.11,
> and it happened during orphan inode cleanup instead of snapshot delete:
>
> [14303.076134][T20882] BTRFS: error (device dm-21) in add_to_free_space_tree:1037: errno=-28 No space left
> [14303.076144][T20882] BTRFS: error (device dm-21) in __btrfs_free_extent:7196: errno=-28 No space left
> [14303.076157][T20882] BTRFS: error (device dm-21) in btrfs_run_delayed_refs:3008: errno=-28 No space left
> [14303.076203][T20882] BTRFS error (device dm-21): Error removing orphan entry, stopping orphan cleanup
> [14303.076210][T20882] BTRFS error (device dm-21): could not do orphan cleanup -22
> [14303.076281][T20882] BTRFS error (device dm-21): commit super ret -30
> [14303.357337][T20882] BTRFS error (device dm-21): open_ctree failed
>
> Same fix: I bumped the reserved size limit from 512M to 2G and mounted
> normally.
> (OK, technically, I booted my old 5.0.21 kernel--but my 5.0.21
> kernel has the 2G reserved space patch below in it.)
>
> I've not been able to repeat this ENOSPC behavior under test conditions
> in the last two months of trying, but it's now happened twice in different
> places, so it has non-zero repeatability.
>
>> I tried the usual fix strategies:
>>
>> 1. Immediately after mount, try to balance to free space for metadata
>>
>> 2. Immediately after mount, add additional disks to provide
>>    unallocated space for metadata
>>
>> 3. Mount -o nossd to increase metadata density
>>
>> #3 had no effect. #1 failed consistently.
>>
>> #2 was successful, but the additional space was not used because
>> btrfs couldn't allocate chunks for metadata because it ran out of
>> metadata space for new metadata chunks.
>>
>> When btrfs-cleaner tried to remove the first pending deleted snapshot,
>> it started a transaction that failed due to lack of metadata space.
>> Since the transaction failed, the filesystem reverts to its earlier state,
>> and exactly the same thing happens on the next mount. The 'btrfs dev
>> add' in #2 is successful only if it is executed immediately after mount,
>> before the btrfs-cleaner thread wakes up.
>>
>> Here's what the kernel said during one of the attempts:
>>
>> [41263.822252] BTRFS info (device dm-3): use zstd compression, level 0
>> [41263.825135] BTRFS info (device dm-3): using free space tree
>> [41263.827319] BTRFS info (device dm-3): has skinny extents
>> [42046.463356] [ cut here ]
>> [42046.463387] BTRFS: error (device dm-3) in __btrfs_free_extent:7056: errno=-28 No space left
>> [42046.463404] BTRFS: error (device dm-3) in __btrfs_free_extent:7056: errno=-28 No space left
>> [42046.463407] BTRFS info (device dm-3): forced readonly
>> [42046.463414] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:3011: errno=-28 No space left
>> [42046.463429] BTRFS: error (device dm-3) in btrfs_create_pending_block_groups:10517: errno=-28 No space left
>> [42046.463548] BTRFS: error (device dm-3) in btrfs_create_pending_block_groups:10520: errno=-28 No space left
>> [42046.471363] BTRFS: error (device dm-3) in btrfs_run_delayed_refs:3011: errno=-28 No space left
>> [42046.471475] BTRFS: error (device dm-3) in btrfs_create_pending_block_groups:10517: errno=-28 No space left
>> [42046.471506] BTRFS: error (device dm-3) in btrfs_create_pending_block_groups:10520: errno=-28 No space left
>> [42046.473672] BTRFS: error (device dm-3) in btrfs_drop_snapshot:9489: errno=-28 No space left
>> [42046.475643] WARNING: CPU: 0 PID: 10187 at fs/btrfs/extent-tree.c:7056 __btrfs_free_extent+0x364/0xf60
>> [42046.475645] Modules linked in: mq_deadline bfq dm_cache_smq dm_cache dm_persistent_data dm_bio_prison dm_bufio joydev ppdev crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel dm_mod snd_p
Re: Citation Needed: BTRFS Failure Resistance
On 23.05.2019 19:41 Austin S. Hemmelgarn wrote: > On 2019-05-23 13:31, Martin Raiber wrote: >> On 23.05.2019 19:13 Austin S. Hemmelgarn wrote: >>> On 2019-05-23 12:24, Chris Murphy wrote: >>>> On Thu, May 23, 2019 at 5:19 AM Austin S. Hemmelgarn >>>> wrote: >>>>> >>>>> On 2019-05-22 14:46, Cerem Cem ASLAN wrote: >>>>>> Could you confirm or disclaim the following explanation: >>>>>> https://unix.stackexchange.com/a/520063/65781 >>>>>> >>>>> Aside from what Hugo mentioned (which is correct), it's worth >>>>> mentioning >>>>> that the example listed in the answer of how hardware issues could >>>>> screw >>>>> things up assumes that for some reason write barriers aren't honored. >>>>> BTRFS explicitly requests write barriers to prevent that type of >>>>> reordering of writes from happening, and it's actually pretty >>>>> unusual on >>>>> modern hardware for those write barriers to not be honored unless the >>>>> user is doing something stupid (like mounting with 'nobarrier' or >>>>> using >>>>> LVM with write barrier support disabled). >>>> >>>> 'man xfs' >>>> >>>> barrier|nobarrier >>>> Note: This option has been deprecated as of kernel >>>> v4.10; in that version, integrity operations are always performed and >>>> the mount option is ignored. These mount options will be removed no >>>> earlier than kernel v4.15. >>>> >>>> Since they're getting rid of it, I wonder if it's sane for most any >>>> sane file system use case. >>>> >>> As Adam mentioned, it's mostly volatile storage that benefits from >>> this. For example, on the systems where I have /var/cache configured >>> as a separate filesystem, I mount it with barriers disabled because >>> the data there just doesn't matter (all of it can be regenerated >>> easily) and it gives me a few percent better performance. In essence, >>> it's the mostly same type of stuff where you might consider running >>> ext4 without a journal for performance reasons. 
>>>
>>> In the case of XFS, it probably got removed to keep people who fancy
>>> themselves to be power users but really have no clue what they're
>>> doing from shooting themselves in the foot to try and get some more
>>> performance.
>>>
>>> IIRC, the option originally got added to both XFS and ext* because
>>> early write barrier support was a bigger performance hit than it is
>>> today, and BTRFS just kind of inherited it.
>>
>> When I google for it I find that flushing the device can also be
>> disabled via
>>
>> echo "write through" > /sys/block/$device/queue/write_cache
> Disabling write caching (which is what that does) is not really the
> same as mounting with 'nobarrier'. Write caching actually improves
> performance in most cases, it just makes things a bit riskier because
> of the possibility of write reordering (which barriers prevent).

According to the documentation it doesn't change any caching. It changes how the kernel sees what kind of caching the device does. If the device claims it does "write through" caching (e.g. a battery-backed RAID card), the kernel doesn't need to send device cache flushes; otherwise it does. If you set a device that does "write back" caching to "write through" there, the kernel will think it does not require flushes and not send any, thus causing data loss on power loss (because the device obviously still does write-back caching).

>>
>> I actually used nobarrier recently (albeit with ext4), because a steam
>> download was taking forever (hours); when remounting with nobarrier it
>> went down to minutes (next time I started it with eatmydata). But ext4
>> fsck is probably able to recover nobarrier file systems with unfortunate
>> power losses and btrfs fsck... isn't. So combined with the above I'd
>> remove nobarrier.
>>
> Yeah, Steam is another pathological case actually, though that's
> mostly because their distribution format is generously described as
> 'excessively segmented' and they fsync after _every single file_.
If > you ever use Steam's game backup feature, you'll see similar results > because it actually serializes the data to the same format that is > used when downloading the game in the first place.
Re: Citation Needed: BTRFS Failure Resistance
On 23.05.2019 19:13 Austin S. Hemmelgarn wrote: > On 2019-05-23 12:24, Chris Murphy wrote: >> On Thu, May 23, 2019 at 5:19 AM Austin S. Hemmelgarn >> wrote: >>> >>> On 2019-05-22 14:46, Cerem Cem ASLAN wrote: Could you confirm or disclaim the following explanation: https://unix.stackexchange.com/a/520063/65781 >>> Aside from what Hugo mentioned (which is correct), it's worth >>> mentioning >>> that the example listed in the answer of how hardware issues could >>> screw >>> things up assumes that for some reason write barriers aren't honored. >>> BTRFS explicitly requests write barriers to prevent that type of >>> reordering of writes from happening, and it's actually pretty >>> unusual on >>> modern hardware for those write barriers to not be honored unless the >>> user is doing something stupid (like mounting with 'nobarrier' or using >>> LVM with write barrier support disabled). >> >> 'man xfs' >> >> barrier|nobarrier >> Note: This option has been deprecated as of kernel >> v4.10; in that version, integrity operations are always performed and >> the mount option is ignored. These mount options will be removed no >> earlier than kernel v4.15. >> >> Since they're getting rid of it, I wonder if it's sane for most any >> sane file system use case. >> > As Adam mentioned, it's mostly volatile storage that benefits from > this. For example, on the systems where I have /var/cache configured > as a separate filesystem, I mount it with barriers disabled because > the data there just doesn't matter (all of it can be regenerated > easily) and it gives me a few percent better performance. In essence, > it's the mostly same type of stuff where you might consider running > ext4 without a journal for performance reasons. > > In the case of XFS, it probably got removed to keep people who fancy > themselves to be power users but really have no clue what they're > doing from shooting themselves in the foot to try and get some more > performance. 
> > IIRC, the option originally got added to both XFS and ext* because > early write barrier support was a bigger performance hit than it is > today, and BTRFS just kind of inherited it. When I google for it I find that flushing the device can also be disabled via echo "write through" > /sys/block/$device/queue/write_cache I actually used nobarrier recently (albeit with ext4), because a steam download was taking forever (hours), when remounting with nobarrier it went down to minutes (next time I started it with eatmydata). But ext4 fsck is probably able to recover nobarrier file systems with unfortunate powerlosses and btrfs fsck... isn't. So combined with the above I'd remove nobarrier.
Re: backup uuid_tree generation not consistent across multi device (raid0) btrfs - won´t mount
On 26.03.2019 14:37 Qu Wenruo wrote: > On 2019/3/26 下午6:24, berodual_xyz wrote: >> Mount messages below. >> >> Thanks for your input, Qu! >> >> ## >> [42763.884134] BTRFS info (device sdd): disabling free space tree >> [42763.884138] BTRFS info (device sdd): force clearing of disk cache >> [42763.884140] BTRFS info (device sdd): has skinny extents >> [42763.885207] BTRFS error (device sdd): parent transid verify failed on >> 1048576 wanted 60234 found 60230 > So btrfs is using the latest superblock while the good one should be the > old superblock. > > Btrfs-progs is able to just ignore the transid mismatch, but kernel > doesn't and shouldn't. > > In fact we should allow btrfs rescue super to use super blocks from > other device to replace the old one. > > So my patch won't help at all, the failure happens at the very beginning > of the devices list initialization. > > BTW, if btrfs restore can't recover certain files, I don't believe any > rescue kernel mount option can do more. > > Thanks, > Qu I have made btrfs limp along (till a rebuild) in the past by commenting out/removing the transid checks. Obviously you should still mount it read-only (and with no log replay) and it might crash, but there is a small chance this would work. > >> [42763.885263] BTRFS error (device sdd): failed to read chunk root >> [42763.900922] BTRFS error (device sdd): open_ctree failed >> ## >> >> >> >> >> Sent with ProtonMail Secure Email. >> >> ‐‐‐ Original Message ‐‐‐ >> On Tuesday, 26. March 2019 10:21, Qu Wenruo wrote: >> >>> On 2019/3/26 下午4:52, berodual_xyz wrote: >>> Thank you both for your input. see below. >> You sda and sdb are at gen 60233 while sdd and sde are at gen 60234. >> It's possible to allow kernel to manually assemble its device list using >> "device=" mount option. >> Since you're using RAID6, it's possible to recover using 2 devices only, >> but in that case you need "degraded" mount option. > He has btrfs raid0 profile on top of hardware RAID6 devices. 
Correct, my FS is a "raid0" across four hardware-raid based raid6 devices. The underlying devices of the raid controller are fine, same as the volumes themselves. >>> Then there is not much we can do. >>> >>> The super blocks shows all your 4 devices are in 2 different states. >>> (older generation with dirt log, newer generation without log). >>> >>> This means some writes didn't reach all devices. >>> Only corruption seems to be on the btrfs side. >>> Please provide the kernel message when trying to mount the fs. >>> Does your tip regarding mounting by explicitly specifying the devices still make sense? >>> Not really. For RAID0 case, it doesn't make much sense. >>> Will this figure out automatically which generation to use? >>> You could try, as all the mount option is making btrfs completely RO (no >>> log replay), so it should be pretty safe. >>> I am at the moment in the process of using "btrfs restore" to pull more data from the filesystem without making any further changes. After that I am happy to continue testing, and will happily test your mentioned "skip_bg" patch - but if you think that there is some other way to mount (just for recovery purpose - read only is fine!) while having different gens on the devices, I highly appreciate it. >>> With mounting failure dmesg, it should be pretty easy to determine >>> whether my skip_bg will work. >>> >>> Thanks, >>> Qu >>> Thanks Qu and Andrei!
Re: psa, wiki needs updating now that Btrfs supports swapfiles in 5.0
On 14.03.2019 23:20 Chris Murphy wrote: > If you install btrfs-progs 4.20+ you'll see the documentation for > supporting swapfiles on Btrfs, supported in kernel 5.0+. `man 5 btrfs` > > Anyone with access to the wiki should update the FAQ > https://btrfs.wiki.kernel.org/index.php/FAQ#Does_btrfs_support_swap_files.3F Yeah, and remove that tip about swap file via loop device. That will only cause memory allocation lock-ups and is not advisable.
Allow sending of rw-subvols if file system is mounted ro
Hi,

I know there are corner cases that probably make this difficult (such as remounting the file system rw while a send is in progress), but it would be nice if one could send all subvolumes as long as a file system is mounted read-only (i.e. pretend every subvol is read-only if the file system is mounted read-only).

Background/use case: Through no fault of btrfs, metadata got damaged, which makes a file system go read-only after a while, and I'd like to btrfs send/receive the subvolumes and snapshots that are still readable to another btrfs file system (btrfs send/receive being the only option that does this somewhat efficiently). But I cannot send the subvolumes that were not set to read-only before the file system went read-only.

I patched the kernel and btrfs-tools (just commenting the checks out) to support this in my case, but it would be great if this were possible without patching.

Regards,
Martin Raiber
Re: [LSF/MM TOPIC] More async operations for file systems - async discard?
Roman,

>> Consequently, many of the modern devices that claim to support
>> discard to make us software folks happy (or to satisfy a purchase
>> order requirement) complete the commands without doing anything at
>> all. We're simply wasting queue slots.
>
> Any example of such devices? Let alone "many"? Where you would issue a
> full-device blkdiscard, but then just read back old data.

I obviously can't mention names or go into implementation details. But there are many drives out there that return old data. And that's perfectly within spec.

At least some of the pain in the industry in this department can be attributed to us Linux folks and RAID device vendors. We all wanted deterministic zeroes on completion of DSM TRIM, UNMAP, or DEALLOCATE. The device vendors weren't happy about that and we ended up with weasel language in the specs. This led to the current libata whitelist mess for SATA SSDs and ongoing vendor implementation confusion in SCSI and NVMe devices.

On the Linux side the problem was that we originally used discard for two distinct purposes: clearing block ranges and deallocating block ranges. We cleaned that up a while back and now have BLKZEROOUT and BLKDISCARD. Those operations get translated to different operations depending on the device. We also cleaned up several of the inconsistencies in the SCSI and NVMe specs to facilitate making this distinction possible in the kernel.

In the meantime the SSD vendors made great strides in refining their flash management, to the point where pretty much all enterprise device vendors will ask you not to issue discards. The benefits simply do not outweigh the costs. If you have special workloads where write amplification is a major concern, it may still be advantageous to do the discards to reduce WA and prolong drive life. However, these workloads are increasingly moving away from the classic LBA read/write model. Open Channel originally targeted this space.
Right now work is underway on Zoned Namespaces and Key-Value command sets in NVMe. These curated application workload protocols are fundamental departures from the traditional way of accessing storage. And my postulate is that where tail latency and drive lifetime management is important, those new command sets offer much better bang for the buck. And they make the notion of discard completely moot. That's why I don't think it's going to be terribly important in the long term. This leaves consumer devices and enterprise devices using the traditional LBA I/O model. For consumer devices I still think fstrim is a good compromise. Lack of queuing for DSM hurt us for a long time. And when it was finally added to the ATA command set, many device vendors got their implementations wrong. So it sucked for a lot longer than it should have. And of course FTL implementations differ. For enterprise devices we're still in the situation where vendors generally prefer for us not to use discard. I would love for the DEALLOCATE/WRITE ZEROES mess to be sorted out in their FTLs, but I have fairly low confidence that it's going to happen. Case in point: Despite a lot of leverage and purchasing power, the cloud industry has not been terribly successful in compelling the drive manufacturers to make DEALLOCATE perform well for typical application workloads. So I'm not holding my breath... -- Martin K. Petersen Oracle Linux Engineering
Re: [LSF/MM TOPIC] More async operations for file systems - async discard?
Jeff, > We've always been told "don't worry about what the internal block size > is, that only matters to the FTL." That's obviously not true, but > when devices only report a 512 byte granularity, we believe them and > will issue discard for the smallest size that makes sense for the file > system regardless of whether it makes sense (internally) for the SSD. > That means 4k for pretty much anything except btrfs metadata nodes, > which are 16k. The devices are free to report a bigger discard granularity. We already support and honor that (for SCSI, anyway). It's completely orthogonal to the reported logical block size, although it obviously needs to be a multiple of it. The real problem is that vendors have zero interest in optimizing for discard. They are so confident in their FTL and overprovisioning that they don't view it as an important feature. At all. Consequently, many of the modern devices that claim to support discard to make us software folks happy (or to satisfy purchase order requirements) complete the commands without doing anything at all. We're simply wasting queue slots. Personally, I think discard is dead on anything but the cheapest devices. And on those it is probably going to be performance-prohibitive to use it in any other way than a weekly fstrim. -- Martin K. Petersen Oracle Linux Engineering
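[Editorial note] The discard granularity a device advertises can be inspected from sysfs. A hedged sketch — "sda" is a placeholder device name, the `queue/` files only exist for real block devices, and `discard_granularity` reads 0 when discard is unsupported:

```shell
# Print the discard limits a block device advertises. Values are in
# bytes; discard_granularity must be a multiple of the logical block
# size, as noted above.
show_discard_limits() {
    local q="/sys/block/$1/queue"
    for f in logical_block_size discard_granularity discard_max_bytes; do
        printf '%s: %s\n' "$f" "$(cat "$q/$f" 2>/dev/null || echo n/a)"
    done
}

# The multiple-of relationship itself is plain arithmetic:
is_multiple_of() { [ $(( $1 % $2 )) -eq 0 ]; }
is_multiple_of 131072 512 && echo "128KiB granularity fits a 512B LBS"
```

Usage would be `show_discard_limits sda` on a machine with such a device.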
Re: [LSF/MM TOPIC] More async operations for file systems - async discard?
Keith, > With respect to fs block sizes, one thing making discards suck is that > many high capacity SSDs' physical page sizes are larger than the fs > block size, and a sub-page discard is worse than doing nothing. That ties into the whole zeroing as a side-effect thing. The devices really need to distinguish between discard-as-a-hint where it is free to ignore anything that's not a whole multiple of whatever the internal granularity is, and the WRITE ZEROES use case where the end result needs to be deterministic. -- Martin K. Petersen Oracle Linux Engineering
Re: Btrfs corruption: Cannot mount partition
DS: ES: CR0: 80050033 [ 51.357362] CR2: 7f252cd16af0 CR3: 00020140a006 CR4: 003606e0 [ 51.357362] Call Trace: [ 51.357364] ? _raw_spin_lock+0x13/0x30 [ 51.357365] ? _raw_spin_unlock+0x16/0x30 [ 51.357376] ? btrfs_merge_delayed_refs+0x315/0x350 [btrfs] [ 51.357401] __btrfs_run_delayed_refs+0x6f2/0x10e0 [btrfs] [ 51.357403] ? preempt_count_add+0x79/0xb0 [ 51.357411] btrfs_run_delayed_refs+0x64/0x180 [btrfs] [ 51.357418] delayed_ref_async_start+0x81/0x90 [btrfs] [ 51.357428] normal_work_helper+0xbd/0x350 [btrfs] [ 51.357430] process_one_work+0x1eb/0x410 [ 51.357432] worker_thread+0x2d/0x3d0 [ 51.357433] ? process_one_work+0x410/0x410 [ 51.357434] kthread+0x112/0x130 [ 51.357435] ? kthread_park+0x80/0x80 [ 51.357437] ret_from_fork+0x35/0x40 [ 51.357438] ---[ end trace 0be7e900e0369796 ]--- [ 51.357439] BTRFS: error (device dm-0) in __btrfs_free_extent:6828: errno=-2 No such entry [ 51.357441] BTRFS info (device dm-0): forced readonly [ 51.357442] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2978: errno=-2 No such entry On Sun, Feb 17, 2019 at 5:27 PM Martin Pöhlmann wrote: > > Tried zero-log. After reboot the system booted again. But all > sub-volumes are mounted read-only. 
> > This should be the relevant dmesg excerpt (note to last lines, there > it mentions forced to ro mode) > > [ 51.356769] WARNING: CPU: 3 PID: 54 at fs/btrfs/extent-tree.c:6822 > __btrfs_free_extent.isra.25+0x61e/0x940 [btrfs] > [ 51.356770] Modules linked in: isofs thunderbolt ccm rfcomm fuse > cmac snd_hda_codec_hdmi bnep snd_hda_codec_realtek > snd_hda_codec_generic hid_multitouch joydev arc4 iTCO_wdt > iTCO_vendor_support nls_iso8859_1 nls_cp437 vfat fat uvcvideo btusb > btrtl videobuf2_vmalloc btbcm videobuf2_memops videobuf2_v4l2 btintel > videobuf2_common ath10k_pci bluetooth ath10k_core videodev mousedev > i915 intel_rapl ath snd_soc_skl snd_soc_hdac_hda ecdh_generic > x86_pkg_temp_thermal intel_powerclamp snd_hda_ext_core media crc16 > coretemp snd_soc_skl_ipc mac80211 uas snd_soc_sst_ipc kvm_intel > snd_soc_sst_dsp snd_soc_acpi_intel_match kvmgt snd_soc_acpi vfio_mdev > mdev mei_wdt dell_laptop vfio_iommu_type1 dell_wmi wmi_bmof > snd_soc_core vfio intel_wmi_thunderbolt dell_smbios > dell_wmi_descriptor snd_compress i2c_algo_bit dcdbas kvm ac97_bus > snd_pcm_dmaengine drm_kms_helper snd_hda_intel snd_hda_codec cfg80211 > irqbypass intel_cstate intel_uncore snd_hda_core snd_hwdep input_leds > snd_pcm intel_rapl_perf drm snd_timer > [ 51.356792] psmouse rtsx_pci_ms pcspkr memstick idma64 rfkill snd > intel_gtt mei_me processor_thermal_device agpgart intel_soc_dts_iosf > soundcore mei syscopyarea sysfillrect i2c_i801 sysimgblt fb_sys_fops > intel_lpss_pci intel_lpss intel_pch_thermal ucsi_acpi tpm_crb > typec_ucsi wmi typec i2c_hid battery soc_button_array intel_vbtn > tpm_tis tpm_tis_core tpm int3403_thermal evdev int340x_thermal_zone > mac_hid intel_hid rng_core ac int3400_thermal acpi_thermal_rel > sparse_keymap pcc_cpufreq vboxnetflt(OE) vboxnetadp(OE) vboxpci(OE) > vboxdrv(OE) sg crypto_user ip_tables x_tables btrfs libcrc32c > crc32c_generic xor raid6_pq algif_skcipher af_alg sd_mod usb_storage > scsi_mod hid_generic usbhid hid dm_crypt dm_mod 
crct10dif_pclmul > crc32_pclmul crc32c_intel ghash_clmulni_intel rtsx_pci_sdmmc mmc_core > serio_raw atkbd libps2 aesni_intel aes_x86_64 xhci_pci crypto_simd > cryptd glue_helper xhci_hcd rtsx_pci i8042 serio > [ 51.356817] CPU: 3 PID: 54 Comm: kworker/u8:1 Tainted: G U > OE 4.20.6-arch1-1-ARCH #1 > [ 51.356817] Hardware name: Dell Inc. XPS 13 9360/0PF86Y, BIOS 2.1.0 > 08/02/2017 > [ 51.356830] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] > [ 51.356839] RIP: 0010:__btrfs_free_extent.isra.25+0x61e/0x940 [btrfs] > [ 51.356840] Code: b8 00 00 00 48 8b 7c 24 08 e8 ae 1a ff ff 41 89 > c5 58 c6 44 24 2c 00 45 85 ed 0f 84 f2 fa ff ff 41 83 fd fe 0f 85 e1 > fb ff ff <0f> 0b 49 8b 3c 24 e8 87 32 00 00 49 89 d9 4d 89 f8 4c 89 f1 > ff b4 > [ 51.356841] RSP: 0018:acf5c1b37c38 EFLAGS: 00010246 > [ 51.356842] RAX: fffe RBX: RCX: > > [ 51.356842] RDX: fffe RSI: RDI: > 9b03dbb6f3b0 > [ 51.356843] RBP: 005ae1ce8000 R08: R09: > 009b > [ 51.356844] R10: 003c R11: R12: > 9b03da498ee0 > [ 51.356844] R13: fffe R14: R15: > 0002 > [ 51.356845] FS: () GS:9b04ae38() > knlGS: > [ 51.356846] CS: 0010 DS: ES: CR0: 80050033 > [ 51.356847] CR2: 7f252cd16af0 CR3: 00020140a006 CR4: > 003606e0 > [
Re: Btrfs corruption: Cannot mount partition
ror -2) [ 51.357304] WARNING: CPU: 3 PID: 54 at fs/btrfs/extent-tree.c:6828 __btrfs_free_extent.isra.25+0x67b/0x940 [btrfs] [ 51.357304] Modules linked in: isofs thunderbolt ccm rfcomm fuse cmac snd_hda_codec_hdmi bnep snd_hda_codec_realtek snd_hda_codec_generic hid_multitouch joydev arc4 iTCO_wdt iTCO_vendor_support nls_iso8859_1 nls_cp437 vfat fat uvcvideo btusb btrtl videobuf2_vmalloc btbcm videobuf2_memops videobuf2_v4l2 btintel videobuf2_common ath10k_pci bluetooth ath10k_core videodev mousedev i915 intel_rapl ath snd_soc_skl snd_soc_hdac_hda ecdh_generic x86_pkg_temp_thermal intel_powerclamp snd_hda_ext_core media crc16 coretemp snd_soc_skl_ipc mac80211 uas snd_soc_sst_ipc kvm_intel snd_soc_sst_dsp snd_soc_acpi_intel_match kvmgt snd_soc_acpi vfio_mdev mdev mei_wdt dell_laptop vfio_iommu_type1 dell_wmi wmi_bmof snd_soc_core vfio intel_wmi_thunderbolt dell_smbios dell_wmi_descriptor snd_compress i2c_algo_bit dcdbas kvm ac97_bus snd_pcm_dmaengine drm_kms_helper snd_hda_intel snd_hda_codec cfg80211 irqbypass intel_cstate intel_uncore snd_hda_core snd_hwdep input_leds snd_pcm intel_rapl_perf drm snd_timer [ 51.357319] psmouse rtsx_pci_ms pcspkr memstick idma64 rfkill snd intel_gtt mei_me processor_thermal_device agpgart intel_soc_dts_iosf soundcore mei syscopyarea sysfillrect i2c_i801 sysimgblt fb_sys_fops intel_lpss_pci intel_lpss intel_pch_thermal ucsi_acpi tpm_crb typec_ucsi wmi typec i2c_hid battery soc_button_array intel_vbtn tpm_tis tpm_tis_core tpm int3403_thermal evdev int340x_thermal_zone mac_hid intel_hid rng_core ac int3400_thermal acpi_thermal_rel sparse_keymap pcc_cpufreq vboxnetflt(OE) vboxnetadp(OE) vboxpci(OE) vboxdrv(OE) sg crypto_user ip_tables x_tables btrfs libcrc32c crc32c_generic xor raid6_pq algif_skcipher af_alg sd_mod usb_storage scsi_mod hid_generic usbhid hid dm_crypt dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel rtsx_pci_sdmmc mmc_core serio_raw atkbd libps2 aesni_intel aes_x86_64 xhci_pci crypto_simd cryptd 
glue_helper xhci_hcd rtsx_pci i8042 serio [ 51.357335] CPU: 3 PID: 54 Comm: kworker/u8:1 Tainted: G U W OE 4.20.6-arch1-1-ARCH #1 [ 51.357335] Hardware name: Dell Inc. XPS 13 9360/0PF86Y, BIOS 2.1.0 08/02/2017 [ 51.357347] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] [ 51.357355] RIP: 0010:__btrfs_free_extent.isra.25+0x67b/0x940 [btrfs] [ 51.357356] Code: 08 48 8b 40 50 f0 48 0f ba a8 90 12 00 00 02 0f 92 c0 5f 84 c0 0f 85 cc 0f 09 00 44 89 ee 48 c7 c7 70 82 35 c0 e8 af 7e 5d df <0f> 0b e9 b6 0f 09 00 4c 89 e7 e8 a6 7d fe ff 48 8b 3c 24 4d 89 f8 [ 51.357356] RSP: 0018:acf5c1b37c38 EFLAGS: 00010282 [ 51.357357] RAX: RBX: RCX: [ 51.357358] RDX: 0007 RSI: a08a427e RDI: [ 51.357358] RBP: 005ae1ce8000 R08: 0001 R09: 06b2 [ 51.357359] R10: 0004 R11: R12: 9b03da498ee0 [ 51.357359] R13: fffe R14: R15: 0002 [ 51.357360] FS: () GS:9b04ae38() knlGS: [ 51.357361] CS: 0010 DS: ES: CR0: 80050033 [ 51.357362] CR2: 7f252cd16af0 CR3: 00020140a006 CR4: 003606e0 [ 51.357362] Call Trace: [ 51.357364] ? _raw_spin_lock+0x13/0x30 [ 51.357365] ? _raw_spin_unlock+0x16/0x30 [ 51.357376] ? btrfs_merge_delayed_refs+0x315/0x350 [btrfs] [ 51.357401] __btrfs_run_delayed_refs+0x6f2/0x10e0 [btrfs] [ 51.357403] ? preempt_count_add+0x79/0xb0 [ 51.357411] btrfs_run_delayed_refs+0x64/0x180 [btrfs] [ 51.357418] delayed_ref_async_start+0x81/0x90 [btrfs] [ 51.357428] normal_work_helper+0xbd/0x350 [btrfs] [ 51.357430] process_one_work+0x1eb/0x410 [ 51.357432] worker_thread+0x2d/0x3d0 [ 51.357433] ? process_one_work+0x410/0x410 [ 51.357434] kthread+0x112/0x130 [ 51.357435] ? 
kthread_park+0x80/0x80 [ 51.357437] ret_from_fork+0x35/0x40 [ 51.357438] ---[ end trace 0be7e900e0369796 ]--- [ 51.357439] BTRFS: error (device dm-0) in __btrfs_free_extent:6828: errno=-2 No such entry [ 51.357441] BTRFS info (device dm-0): forced readonly [ 51.357442] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2978: errno=-2 No such entry On Sat, Feb 16, 2019 at 9:46 PM Martin Pöhlmann wrote: > > Thanks a lot for your help. > > @Qu Wenruo: WIll zero log after completing the backup > @Chris Murphy: First of all, mount -ro,nologreplay works. > > dump-tree displays two items: > > # btrfs insp dump-tree -b 88560877568 --follow /dev/mapper/cryptroot > btrfs-progs v4.19.1 > leaf 88560877568 items 2 free space 15355 generation 554510 owner TREE_LOG > leaf 88560877568 flags 0x1(WRITTEN) backref revision 1 > fs uuid bbd941a4-5525-4ba6-a4d8-3ead02b8aae1 > chunk uuid 25cacaa1-59ec-4c71-92e0-4b31f7937521 > item 0 key (TREE_LOG ROOT_ITEM 258) itemoff 15844 itemsize 439 > generation 554510
Re: Btrfs corruption: Cannot mount partition
Thanks a lot for your help. @Qu Wenruo: Will zero-log after completing the backup @Chris Murphy: First of all, mount -o ro,nologreplay works. dump-tree displays two items: # btrfs insp dump-tree -b 88560877568 --follow /dev/mapper/cryptroot btrfs-progs v4.19.1 leaf 88560877568 items 2 free space 15355 generation 554510 owner TREE_LOG leaf 88560877568 flags 0x1(WRITTEN) backref revision 1 fs uuid bbd941a4-5525-4ba6-a4d8-3ead02b8aae1 chunk uuid 25cacaa1-59ec-4c71-92e0-4b31f7937521 item 0 key (TREE_LOG ROOT_ITEM 258) itemoff 15844 itemsize 439 generation 554510 root_dirid 0 bytenr 88560812032 level 1 refs 0 lastsnap 0 byte_limit 0 bytes_used 376832 flags 0x0(none) uuid ---- drop key (0 UNKNOWN.0 0) level 0 item 1 key (TREE_LOG ROOT_ITEM 259) itemoff 15405 itemsize 439 generation 554510 root_dirid 0 bytenr 917389312 level 0 refs 0 lastsnap 0 byte_limit 0 bytes_used 0 flags 0x0(none) uuid ---- drop key (0 UNKNOWN.0 0) level 0 Regards 2nd mail: 1. as mentioned, mount with nologreplay works. Will update backups with that. 2. Used btrfs restore already for initial backup. Did a good job. 3. Have to figure out how to get a usb-bootable recovery system w/ 5.0rc6 first. On Sat, Feb 16, 2019 at 1:54 AM Qu Wenruo wrote: > > > > On 2019/2/16 5:31 AM, Martin Pöhlmann wrote: > > Hello, > > > > After a reboot I am lost with an unmountable BTRFS partition. Before > > reboot I had first compile problems with freezing IntelliJ. 
These > > persisted after a first reboot, after a second reboot I am faced with > > the following error after entering the dm-crypt password (also after > > manual mount with -o ro,recovery, see attached dmesg): > > [Move check result here] > > # btrfs check --readonly /dev/mapper/cryptroot > > [1/7] checking root items > > [2/7] checking extents > > [3/7] checking free space cache > > [4/7] checking fs roots > > root 258 inode 776 errors 200, dir isize wrong > > root 258 inode 1131031 errors 1, no inode item > > unresolved ref dir 776 index 87215 namelen 17 name > > TransportSecurity filetype 1 errors 5, no dir item, no inode ref > > root 258 inode 2911226 errors 1, no inode item > > unresolved ref dir 776 index 160611 namelen 17 name > > TransportSecurity filetype 1 errors 5, no dir item, no inode ref > > ERROR: errors found in fs roots > > Opening filesystem to check... > > Checking filesystem on /dev/mapper/cryptroot > > UUID: bbd941a4-5525-4ba6-a4d8-3ead02b8aae1 > > found 409699909636 bytes used, error(s) found > > total csum bytes: 390595732 > > total tree bytes: 5061541888 > > total fs tree bytes: 4224024576 > > total extent tree bytes: 339312640 > > btree space waste bytes: 892618468 > > file data blocks allocated: 529336496128 > > referenced 490479570944 > > > So there is just some minor problem in fs trees, not a big problem, and > your extent tree passes the check, so it's not on-disk data corruption. > > > > > [ 6098.921985] BTRFS error (device dm-0): unable to find ref byte nr > > 390335463424 parent 0 root 2 > > [ 6098.922473] BTRFS: error (device dm-0) in __btrfs_free_extent:6828: > > errno=-2 No such entry > > [ 6098.922526] BTRFS: error (device dm-0) in > > btrfs_run_delayed_refs:2978: errno=-2 No such entry > > [ 6098.922601] BTRFS: error (device dm-0) in btrfs_replay_log:2267: > > errno=-2 No such entry (Failed to recover log tree) > > [ 6098.972326] BTRFS error (device dm-0): open_ctree failed > > It's log recovery causing problem. 
> > You could just use "btrfs rescue zero-log" to recovery it. > > Thanks, > Qu > > > > > I've searched for a solution on the web, but most articles tell to do > > nothing, but write to this mailing list. So my hopes are that you can > > shed some light into what I can do. > > > > I've found a quite recent thread here > > (https://lore.kernel.org/linux-btrfs/5b0d2e94-6e4e-aecd-3eda-459c4a96b...@mokrynskyi.com/) > > but this just mentions a fix for 'Fix missing reference aborts when > > resuming snapshot delete' and is not further specific. > > > > Setup of my SSD looks like: > > > > * efi > > * dm-crypt plain. Contains BTRFS (w/o lvm or similar). Several > > subvolumes (/, /home, ...) > > * swap > > > > I've already run btrfs restore on volid 258 (home) and gathered lots > > of data from the disk (>200GB). I also have a dd backup of the > > cryptroot after the failure happened (in case something goes wrong). > > Besides I did not do any fix attempts yet. If there is anything I can > >
Btrfs corruption: Cannot mount partition
Hello, After a reboot I am lost with an unmountable BTRFS partition. Before reboot I first had compile problems with freezing IntelliJ. These persisted after a first reboot; after a second reboot I am faced with the following error after entering the dm-crypt password (also after manual mount with -o ro,recovery, see attached dmesg): [ 6098.921985] BTRFS error (device dm-0): unable to find ref byte nr 390335463424 parent 0 root 2 [ 6098.922473] BTRFS: error (device dm-0) in __btrfs_free_extent:6828: errno=-2 No such entry [ 6098.922526] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2978: errno=-2 No such entry [ 6098.922601] BTRFS: error (device dm-0) in btrfs_replay_log:2267: errno=-2 No such entry (Failed to recover log tree) [ 6098.972326] BTRFS error (device dm-0): open_ctree failed I've searched for a solution on the web, but most articles tell you to do nothing but write to this mailing list. So my hope is that you can shed some light on what I can do. I've found a quite recent thread here (https://lore.kernel.org/linux-btrfs/5b0d2e94-6e4e-aecd-3eda-459c4a96b...@mokrynskyi.com/) but this just mentions a fix for 'Fix missing reference aborts when resuming snapshot delete' and is not further specific. Setup of my SSD looks like: * efi * dm-crypt plain. Contains BTRFS (w/o lvm or similar). Several subvolumes (/, /home, ...) * swap I've already run btrfs restore on volid 258 (home) and gathered lots of data from the disk (>200GB). I also have a dd backup of the cryptroot after the failure happened (in case something goes wrong). Besides that I did not make any fix attempts yet. If there is anything I can do to get the system working again, I'm happy to hear. Thanks! My Linux system is Arch Linux (up to date), logs below come from the Arch install medium. 
# uname -a Linux archiso 4.20.6-arch1-1-ARCH #1 SMP PREEMPT Thu Jan 31 08:22:01 UTC 2019 x86_64 GNU/Linux # btrfs --version btrfs-progs v4.19.1 # btrfs fi show Label: 'root' uuid: bbd941a4-5525-4ba6-a4d8-3ead02b8aae1 Total devices 1 FS bytes used 381.56GiB devid1 size 460.39GiB used 393.01GiB path /dev/mapper/cryptroot # btrfs check --readonly /dev/mapper/cryptroot [1/7] checking root items [2/7] checking extents [3/7] checking free space cache [4/7] checking fs roots root 258 inode 776 errors 200, dir isize wrong root 258 inode 1131031 errors 1, no inode item unresolved ref dir 776 index 87215 namelen 17 name TransportSecurity filetype 1 errors 5, no dir item, no inode ref root 258 inode 2911226 errors 1, no inode item unresolved ref dir 776 index 160611 namelen 17 name TransportSecurity filetype 1 errors 5, no dir item, no inode ref ERROR: errors found in fs roots Opening filesystem to check... Checking filesystem on /dev/mapper/cryptroot UUID: bbd941a4-5525-4ba6-a4d8-3ead02b8aae1 found 409699909636 bytes used, error(s) found total csum bytes: 390595732 total tree bytes: 5061541888 total fs tree bytes: 4224024576 total extent tree bytes: 339312640 btree space waste bytes: 892618468 file data blocks allocated: 529336496128 referenced 490479570944 [ 6098.200152] BTRFS warning (device dm-0): 'recovery' is deprecated, use 'usebackuproot' instead [ 6098.200155] BTRFS info (device dm-0): trying to use backup root at mount time [ 6098.200158] BTRFS info (device dm-0): disk space caching is enabled [ 6098.200161] BTRFS info (device dm-0): has skinny extents [ 6098.318699] BTRFS info (device dm-0): enabling ssd optimizations [ 6098.920655] WARNING: CPU: 2 PID: 1581 at fs/btrfs/extent-tree.c:6822 __btrfs_free_extent.isra.25+0x61e/0x940 [btrfs] [ 6098.920657] Modules linked in: btrfs libcrc32c crc32c_generic xor raid6_pq dm_crypt algif_skcipher af_alg dm_mod hid_multitouch hid_generic snd_hda_codec_hdmi joydev mousedev arc4 wl(POE) ath10k_pci ath10k_core snd_soc_skl ath 
snd_soc_hdac_hda intel_rapl snd_hda_ext_core mac80211 snd_soc_skl_ipc x86_pkg_temp_thermal intel_powerclamp coretemp snd_soc_sst_ipc btusb snd_soc_sst_dsp kvm_intel snd_soc_acpi_intel_match btrtl snd_soc_acpi btbcm btintel snd_hda_codec_realtek snd_soc_core snd_hda_codec_generic bluetooth snd_compress ac97_bus cfg80211 snd_pcm_dmaengine uvcvideo mei_wdt crct10dif_pclmul snd_hda_intel videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 iTCO_wdt iTCO_vendor_support videobuf2_common ghash_clmulni_intel dell_laptop snd_hda_codec videodev ecdh_generic rtsx_pci_ms intel_cstate snd_hda_core psmouse tpm_crb intel_uncore rfkill media memstick intel_rapl_perf crc16 snd_hwdep snd_pcm pcspkr input_leds i2c_hid snd_timer intel_wmi_thunderbolt snd hid idma64 soundcore mei_me [ 6098.920703] soc_button_array i2c_i801 tpm_tis mei tpm_tis_core intel_lpss_pci intel_vbtn intel_lpss intel_hid dell_wmi tpm battery dell_smbios processor_thermal_device evdev sparse_keymap intel_pch_thermal dcdbas intel_soc_dts_iosf rng_core ac ucsi_acpi int3400_thermal typec_ucsi int3403_thermal acpi_thermal_rel int340x_thermal_zone wmi_bmof dell_wmi_descriptor typec mac_hid pcc_cpufreq
Re: btrfs as / filesystem in RAID1
Chris Murphy - 07.02.19, 18:15: > > So please change the normal behavior > > In the case of no device loss, but device delay, with 'degraded' set > in fstab you risk a non-deterministic degraded mount. And there is no > automatic balance (sync) after recovering from a degraded mount. And > as far as I know there's no automatic transition from degraded to > normal operation upon later discovery of a previously missing device. > It's just begging for data loss. That's why it's not the default. > That's why it's not recommended. Still, the current behavior is not really user-friendly. And it does not meet the expectations that users usually have about how RAID 1 works. I know BTRFS RAID 1 is no RAID 1, although it is called that. I also somewhat get that with the current state of BTRFS the current behavior of not allowing a degraded mount may be better… however… I clearly see room for improvement here. And there will very likely be discussions like this on this list… until BTRFS acts in a more user-friendly way here. I faced this myself during recovery from a failure of one SSD of a dual-SSD BTRFS RAID 1, and it caused me to spend *hours*, instead of what in my eyes could have been minutes, to recover the machine to a working state again. Luckily the SSDs I use do not tend to fail all that often. And the Intel SSD 320 that has this "Look, I am 8 MiB big and all your data is gone" firmware bug – even with the firmware version that was supposed to fix this issue – is out of service now. Although I was able to bring it back to a working (but blank) state with a secure erase, I am just not going to use such an SSD for anything serious. Thanks, -- Martin
Re: New hang (Re: Kernel traces), sysreq+w output
On 06.02.2019 01:22 Qu Wenruo wrote: > On 2019/2/6 6:18 AM, Stephen R. van den Berg wrote: >> Are these Sysreq+w dumps not usable? >> > Sorry for the late reply. > > The hang looks pretty strange, and doesn't really look like previous > deadlocks caused by tree block locking. > But some strange behavior around metadata dirty pages: > > This looks to be the cause of the problem. > > kworker/u16:1 D0 19178 2 0x8000 > Workqueue: btrfs-endio-write btrfs_endio_write_helper > Call Trace: > ? __schedule+0x4db/0x524 > ? schedule+0x60/0x71 > ? schedule_timeout+0xb2/0xec > ? __next_timer_interrupt+0xae/0xae > ? io_schedule_timeout+0x1b/0x3d > ? balance_dirty_pages+0x7a7/0x861 > ? usleep_range+0x7e/0x7e > ? schedule+0x60/0x71 > ? schedule_timeout+0x32/0xec > ? balance_dirty_pages_ratelimited+0x204/0x225 > ? btrfs_finish_ordered_io+0x584/0x5ac > ? normal_work_helper+0xfe/0x243 > ? process_one_work+0x18d/0x271 > ? rescuer_thread+0x278/0x278 > ? worker_thread+0x194/0x23f > ? kthread+0xeb/0xf0 > ? kthread_associate_blkcg+0x86/0x86 > ? ret_from_fork+0x35/0x40 > > But I'm not familiar with the balance_dirty_pages part, thus can't provide > much detail about this. That balance_dirty_pages call was removed with the latest stable kernels ( https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/fs/btrfs?h=linux-4.20.y&id=480c6fb23eb80e88eba7e4603304710ee7a9416f ).
Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
On 14.12.2018 09:07 ethanlien wrote: > Martin Raiber wrote on 2018-12-12 23:22: >> On 12.12.2018 15:47 Chris Mason wrote: >>> On 28 May 2018, at 1:48, Ethan Lien wrote: >>> >>> It took me a while to trigger, but this actually deadlocks ;) More >>> below. >>> >>>> [Problem description and how we fix it] >>>> We should balance dirty metadata pages at the end of >>>> btrfs_finish_ordered_io, since a small, unmergeable random write can >>>> potentially produce dirty metadata which is multiple times larger than >>>> the data itself. For example, a small, unmergeable 4KiB write may >>>> produce: >>>> >>>> 16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree >>>> 16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree >>>> 16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree >>>> >>>> Although we do call balance dirty pages in write side, but in the >>>> buffered write path, most metadata are dirtied only after we reach the >>>> dirty background limit (which by far only counts dirty data pages) and >>>> wakeup the flusher thread. If there are many small, unmergeable random >>>> writes spread in a large btree, we'll find a burst of dirty pages >>>> exceeds the dirty_bytes limit after we wakeup the flusher thread - >>>> which >>>> is not what we expect. In our machine, it caused out-of-memory problem >>>> since a page cannot be dropped if it is marked dirty. >>>> >>>> Someone may worry about we may sleep in >>>> btrfs_btree_balance_dirty_nodelay, >>>> but since we do btrfs_finish_ordered_io in a separate worker, it will >>>> not >>>> stop the flusher consuming dirty pages. Also, we use different worker >>>> for >>>> metadata writeback endio, sleep in btrfs_finish_ordered_io help us >>>> throttle >>>> the size of dirty metadata pages. >>> In general, slowing down btrfs_finish_ordered_io isn't ideal because it >>> adds latency to places we need to finish quickly. Also, >>> btrfs_finish_ordered_io is used by the free space cache. 
Even though >>> this happens from its own workqueue, it means completing free space >>> cache writeback may end up waiting on balance_dirty_pages, something >>> like this stack trace: >>> >>> [..] >>> >>> Eventually, we have every process in the system waiting on >>> balance_dirty_pages(), and nobody is able to make progress on >>> page writeback. >>> >> I had lockups with this patch as well. If you put e.g. a loop device on >> top of a btrfs file, loop sets PF_LESS_THROTTLE to avoid a feed back >> loop causing delays. The task balancing dirty pages in >> btrfs_finish_ordered_io doesn't have the flag and causes slow-downs. In >> my case it managed to cause a feedback loop where it queues other >> btrfs_finish_ordered_io and gets stuck completely. >> > > The data writepage endio will queue a work for > btrfs_finish_ordered_io() in a separate workqueue and clear page's > writeback, so throttling in btrfs_finish_ordered_io() should not slow > down flusher thread. One suspicious point is while the caller is > waiting a range of ordered_extents to complete, they will be > blocked until balance_dirty_pages_ratelimited() make some > progress, since we finish ordered_extents in > btrfs_finish_ordered_io(). > Do you have call stack information for stuck processes or using > fsync/sync frequently? If this is the case, maybe we should pull > this thing out and try balance dirty metadata pages somewhere. Yeah like, [875317.071433] Call Trace: [875317.071438] ? __schedule+0x306/0x7f0 [875317.071442] schedule+0x32/0x80 [875317.071447] btrfs_start_ordered_extent+0xed/0x120 [875317.071450] ? remove_wait_queue+0x60/0x60 [875317.071454] btrfs_wait_ordered_range+0xa0/0x100 [875317.071457] btrfs_sync_file+0x1d6/0x400 [875317.071461] ? do_fsync+0x38/0x60 [875317.071463] ? 
btrfs_fdatawrite_range+0x50/0x50 [875317.071465] do_fsync+0x38/0x60 [875317.071468] __x64_sys_fsync+0x10/0x20 [875317.071470] do_syscall_64+0x55/0x100 [875317.071473] entry_SYSCALL_64_after_hwframe+0x44/0xa9 so I guess the problem is that calling balance_dirty_pages causes fsyncs to the same btrfs (via my unusual setup of loop+fuse)? Those fsyncs are deadlocked because they are called indirectly from btrfs_finish_ordered_io... It is an unusual setup, which is why I did not post it to the mailing list initially.
Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
On 12.12.2018 15:47 Chris Mason wrote: > On 28 May 2018, at 1:48, Ethan Lien wrote: > > It took me a while to trigger, but this actually deadlocks ;) More > below. > >> [Problem description and how we fix it] >> We should balance dirty metadata pages at the end of >> btrfs_finish_ordered_io, since a small, unmergeable random write can >> potentially produce dirty metadata which is multiple times larger than >> the data itself. For example, a small, unmergeable 4KiB write may >> produce: >> >> 16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree >> 16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree >> 16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree >> >> Although we do call balance dirty pages in write side, but in the >> buffered write path, most metadata are dirtied only after we reach the >> dirty background limit (which by far only counts dirty data pages) and >> wakeup the flusher thread. If there are many small, unmergeable random >> writes spread in a large btree, we'll find a burst of dirty pages >> exceeds the dirty_bytes limit after we wakeup the flusher thread - >> which >> is not what we expect. In our machine, it caused out-of-memory problem >> since a page cannot be dropped if it is marked dirty. >> >> Someone may worry about we may sleep in >> btrfs_btree_balance_dirty_nodelay, >> but since we do btrfs_finish_ordered_io in a separate worker, it will >> not >> stop the flusher consuming dirty pages. Also, we use different worker >> for >> metadata writeback endio, sleep in btrfs_finish_ordered_io help us >> throttle >> the size of dirty metadata pages. > In general, slowing down btrfs_finish_ordered_io isn't ideal because it > adds latency to places we need to finish quickly. Also, > btrfs_finish_ordered_io is used by the free space cache. 
Even though > this happens from its own workqueue, it means completing free space > cache writeback may end up waiting on balance_dirty_pages, something > like this stack trace: > > [..] > > Eventually, we have every process in the system waiting on > balance_dirty_pages(), and nobody is able to make progress on page > writeback. > I had lockups with this patch as well. If you put e.g. a loop device on top of a btrfs file, loop sets PF_LESS_THROTTLE to avoid a feedback loop causing delays. The task balancing dirty pages in btrfs_finish_ordered_io doesn't have the flag and causes slow-downs. In my case it managed to cause a feedback loop where it queues other btrfs_finish_ordered_io and gets stuck completely. Regards, Martin Raiber
Re: Possible deadlock when writing
I was having the same issue with kernels 4.19.2 and 4.19.4. I don’t appear to have the issue with 4.20.0-0.rc1 on Fedora Server 29. The issue is very easy to reproduce on my setup, not sure how much of it is actually relevant, but here it is: - 3 drive RAID5 created - Some data moved to it - Expanded to 7 drives - No balancing The issue is easily reproduced (within 30 mins) by starting multiple transfers to the volume (several TB in the form of many 30GB+ files). Multiple concurrent ‘rsync’ transfers seems to take a bit longer to trigger the issue, but multiple ‘cp’ commands will do it much quicker (again not sure if relevant). I have not seen the issue occur with a single ‘rsync’ or ‘cp’ transfer, but I haven’t left one running alone for too long (copying the data from multiple drives, so there is a lot to be gained from parallelizing the transfers). I’m not sure what state the FS is left in after Magic SysRq reboot after it deadlocks, but seemingly it’s fine. No problems mounting and ‘btrfs check’ passes OK. I’m sure some of the data doesn’t get flushed, but it’s no problem for my use case. I’ve been running nonstop concurrent transfers with kernel 4.20.0-0.rc1 for 24hr nonstop and I haven’t experienced the issue. Hope this helps.
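[Editorial note] The reproduction described above — several concurrent transfers to the same volume — boils down to a shell job-control pattern. A hedged sketch using tiny temporary files (the real trigger needs a btrfs RAID5 volume and 30GB+ files, which this does not attempt to recreate):

```shell
# Start n concurrent cp streams into dst, then wait for all of them.
# In the report above, four-plus parallel 30GB+ copies deadlocked the
# filesystem within 30 minutes on 4.19.x kernels.
run_parallel_copies() {
    local src=$1 dst=$2 n=$3
    for i in $(seq 1 "$n"); do
        cp -r "$src" "$dst/copy$i" &
    done
    wait
}

src=$(mktemp -d); dst=$(mktemp -d)
echo data > "$src/file"
run_parallel_copies "$src" "$dst" 4
echo "copies: $(ls "$dst" | wc -l)"
```

Against the real volume one would point `src` at the drives being consolidated and `dst` at the RAID5 mount point.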
Re: [PATCH RESEND 0/8] btrfs-progs: sub: Relax the privileges of "subvolume list/show"
Misono Tomohiro - 27.11.18, 06:24: > Importantly, in order to make output consistent for both root and > non-privileged user, this changes the behavior of "subvolume list": > - (default) Only list in subvolume under the specified path. >Path needs to be a subvolume. Does that work recursively? I would find it quite unexpected if I did btrfs subvol list in or on the root directory of a BTRFS filesystem and it would not display any subvolumes on that filesystem, no matter where they are. Thanks, -- Martin
Re: Interpreting `btrfs filesystem show'
Hugo Mills - 15.10.18, 16:26:
> On Mon, Oct 15, 2018 at 05:24:08PM +0300, Anton Shepelev wrote:
> > Hello, all
> >
> > While trying to resolve free space problems, I found that
> > I cannot interpret the output of:
> >
> > btrfs filesystem show
> >
> > Label: none  uuid: 8971ce5b-71d9-4e46-ab25-ca37485784c8
> >   Total devices 1  FS bytes used 34.06GiB
> >   devid 1  size 40.00GiB  used 37.82GiB  path /dev/sda2
> >
> > How come the total used value is less than the value listed
> > for the only device?
>
> "Used" on the device is the amount of space allocated. "Used" on the
> FS is the total amount of actual data and metadata in that
> allocation.
>
> You will also need to look at the output of "btrfs fi df" to see
> the breakdown of the 37.82 GiB into data, metadata and currently
> unused.

I usually use btrfs fi usage -T, cause 1. It has all the information. 2. It differentiates between used and allocated.

% btrfs fi usage -T /
Overall:
    Device size:          100.00GiB
    Device allocated:      54.06GiB
    Device unallocated:    45.94GiB
    Device missing:           0.00B
    Used:                  46.24GiB
    Free (estimated):      25.58GiB  (min: 25.58GiB)
    Data ratio:                2.00
    Metadata ratio:            2.00
    Global reserve:        70.91MiB  (used: 0.00B)

                             Data      Metadata   System
Id Path                      RAID1     RAID1      RAID1     Unallocated
-- ------------------------  --------  ---------  --------  -----------
 2 /dev/mapper/msata-debian  25.00GiB    2.00GiB  32.00MiB     22.97GiB
 1 /dev/mapper/sata-debian   25.00GiB    2.00GiB  32.00MiB     22.97GiB
-- ------------------------  --------  ---------  --------  -----------
   Total                     25.00GiB    2.00GiB  32.00MiB     45.94GiB
   Used                      22.38GiB  754.66MiB  16.00KiB

For RAID it in some places reports the raw size and sometimes the logical size. Especially in the "Total" line I find this a bit inconsistent. "RAID1" columns show logical size, "Unallocated" shows raw size. Also "Used:" in the global section shows raw size and "Free (estimated):" shows logical size. Thanks -- Martin
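As an aside, the way "Free (estimated)" relates to the other figures can be approximated with a short calculation. This is only a rough sketch of the accounting (the helper name and formula are my own illustration, not btrfs's actual code, which also deals with the global reserve and metadata chunks):

```python
# Rough model of "Free (estimated)" from the `btrfs fi usage -T` output
# above. Unallocated raw space can still become data chunks, so it counts
# divided by the data ratio (2.0 for RAID1); to that we add the slack
# still left inside already-allocated data chunks (logical figures).

def free_estimated(unallocated_raw_gib, data_alloc_gib, data_used_gib, data_ratio):
    return unallocated_raw_gib / data_ratio + (data_alloc_gib - data_used_gib)

est = free_estimated(45.94, 25.00, 22.38, 2.0)
print(f"{est:.2f} GiB")  # ~25.59, close to the 25.58 GiB reported
```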
Re: BTRFS related kernel backtrace on boot on 4.18.7 after blackout due to discharged battery
Filipe Manana - 05.10.18, 17:21: > On Fri, Oct 5, 2018 at 3:23 PM Martin Steigerwald wrote: > > Hello! > > > > On ThinkPad T520 after battery was discharged and machine just > > blacked out. > > > > Is that some sign of regular consistency check / replay or something > > to investigate further? > > I think it's harmless, if anything were messed up with link counts or > mismatches between those and dir entries, fsck (btrfs check) should > have reported something. > I'll dig a bit further and remove the warning if it's really harmless. I just scrubbed the filesystem. I did not run btrfs check on it. > > I already scrubbed all data and there are no errors. Also btrfs > > device stats reports no errors. SMART status appears to be okay as > > well on both SSD. > > > > [4.524355] BTRFS info (device dm-4): disk space caching is > > enabled [… backtrace …] -- Martin
BTRFS related kernel backtrace on boot on 4.18.7 after blackout due to discharged battery
83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3e e4 0b 00 f7 d8 64 89 01 48
[    6.123872] RSP: 002b:00007ffc0e3466a8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
[    6.131285] RAX: ffffffffffffffda RBX: 000055f3ed7ee9c0 RCX: 00007f0715b89a1a
[    6.131286] RDX: 000055f3ed7eebc0 RSI: 000055f3ed7eec40 RDI: 000055f3ed7ef900
[    6.131287] RBP: 00007f0715ecff04 R08: 000055f3ed7eec00 R09: 000055f3ed7eebc0
[    6.131288] R10: 00000000c0ed0400 R11: 0000000000000202 R12: 0000000000000000
[    6.131289] R13: 00000000c0ed0400 R14: 000055f3ed7ef900 R15: 000055f3ed7eebc0
[    6.131292] ---[ end trace bd5d30b2fea7fb77 ]---
[    6.251219] BTRFS info (device dm-3): checking UUID tree
Thanks, -- Martin
Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)
Hans van Kranenburg - 19.09.18, 19:58: > However, as soon as we remount the filesystem with space_cache=v2 - > > > writes drop to just around 3-10 MB/s to each disk. If we remount to > > space_cache - lots of writes, system unresponsive. Again remount to > > space_cache=v2 - low writes, system responsive. > > > > That's a huuge, 10x overhead! Is it expected? Especially that > > space_cache=v1 is still the default mount option? > > Yes, that does not surprise me. > > https://events.static.linuxfound.org/sites/events/files/slides/vault20 > 16_0.pdf > > Free space cache v1 is the default because of issues with btrfs-progs, > not because it's unwise to use the kernel code. I can totally > recommend using it. The linked presentation above gives some good > background information. What issues in btrfs-progs are those? I am wondering whether to switch to the free space tree (space_cache=v2). Would it provide benefit for regular / and /home filesystems as dual SSD BTRFS RAID-1 on a laptop? Thanks, -- Martin
Re: Transactional btrfs
Am 08.09.2018 um 18:24 schrieb Adam Borowski: > On Thu, Sep 06, 2018 at 06:08:33AM -0400, Austin S. Hemmelgarn wrote: >> On 2018-09-06 03:23, Nathan Dehnel wrote: >>> So I guess my question is, does btrfs support atomic writes across >>> multiple files? Or is anyone interested in such a feature? >>> >> I'm fairly certain that it does not currently, but in theory it would not be >> hard to add. >> >> Realistically, the only cases I can think of where cross-file atomic >> _writes_ would be of any benefit are database systems. >> >> However, if this were extended to include rename, unlink, touch, and a >> handful of other VFS operations, then I can easily think of a few dozen use >> cases. Package managers in particular would likely be very interested in >> being able to atomically rename a group of files as a single transaction, as >> it would make their job _much_ easier. > I wonder, what about: > sync; mount -o remount,commit=999,flushoncommit > eatmydata apt dist-upgrade > sync; mount -o remount,commit=30,noflushoncommit > > Obviously, this gets fooled by fsyncs, and makes the transaction affect the > whole system (if you have unrelated writes they won't get committed until > the end of transaction). Then there are nocow files, but you already made > the decision to disable most features of btrfs for them. > > So unless something forces a commit, this should already work, giving > cross-file atomic writes, renames and so on. Now combine this with snapshotting root, then on success rename-exchange to root and you are there. Btrfs had in the past TRANS_START and TRANS_END ioctls (for ceph, I think), but no rollback (and therefore no error handling incl. ENOSPC). If you want to look at a working file system transaction mechanism, you should look at transactional NTFS (TxF). They write that they are deprecating it, so it's perhaps not very widely used. Windows uses it for updates, I think.
Specifically for btrfs, the problem would be that it really needs to support multiple simultaneous writers, otherwise one transaction can block the whole system.
Re: lazytime mount option—no support in Btrfs
waxhead - 18.08.18, 22:45: > Adam Hunt wrote: > > Back in 2014 Ted Tso introduced the lazytime mount option for ext4 > > and shortly thereafter a more generic VFS implementation which was > > then merged into mainline. His early patches included support for > > Btrfs but those changes were removed prior to the feature being > > merged. His > > changelog includes the following note about the removal: > > - Per Christoph's suggestion, drop support for btrfs and xfs for > > now, > > issues with how btrfs and xfs handle dirty inode tracking. We > > can add btrfs and xfs support back later or at the end of this > > series if we want to revisit this decision. > > > > My reading of the current mainline shows that Btrfs still lacks any > > support for lazytime. Has any thought been given to adding support > > for lazytime to Btrfs? […] > Is there anything new regarding this? I´d like to know whether there is any news about this as well. If I understand it correctly this could even help BTRFS performance a lot cause it is COW´ing metadata. Thanks, -- Martin
Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
Roman Mamedov - 18.08.18, 09:12: > On Fri, 17 Aug 2018 23:17:33 +0200 > > Martin Steigerwald wrote: > > > Do not consider SSD "compression" as a factor in any of your > > > calculations or planning. Modern controllers do not do it anymore, > > > the last ones that did are SandForce, and that's 2010 era stuff. > > > You > > > can check for yourself by comparing write speeds of compressible > > > vs > > > incompressible data, it should be the same. At most, the modern > > > ones > > > know to recognize a stream of binary zeroes and have a special > > > case > > > for that. > > > > Interesting. Do you have any backup for your claim? > > Just "something I read". I follow quite a bit of SSD-related articles > and reviews which often also include a section to talk about the > controller utilized, its background and technological > improvements/changes -- and the compression going out of fashion > after SandForce seems to be considered a well-known fact. > > Incidentally, your old Intel 320 SSDs actually seem to be based on > that old SandForce controller (or at least license some of that IP to > extend on it), and hence those indeed might perform compression. Interesting. Back then I read the Intel SSD 320 would not compress. I think it's difficult to know for sure with those proprietary controllers. > > As the data still needs to be transferred to the SSD at least when > > the SATA connection is maxed out I bet you won´t see any difference > > in write speed whether the SSD compresses in real time or not. > > Most controllers expose two readings in SMART: > > - Lifetime writes from host (SMART attribute 241) > - Lifetime writes to flash (attribute 233, or 177, or 173...) > > It might be difficult to get the second one, as often it needs to be > decoded from others such as "Average block erase count" or "Wear > leveling count". (And seems to be impossible on Samsung NVMe ones, > for example) I got the impression every manufacturer does their own thing here.
And I would not even be surprised if it's different between different generations of SSDs by one manufacturer.

# Crucial mSATA
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   000    Pre-fail Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age  Always       -       16345
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age  Always       -       4193
171 Program_Fail_Count      0x0032   100   100   000    Old_age  Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age  Always       -       0
173 Wear_Leveling_Count     0x0032   078   078   000    Old_age  Always       -       663
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age  Always       -       362
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   000   000   000    Pre-fail Always       -       8219
183 SATA_Iface_Downshift    0x0032   100   100   000    Old_age  Always       -       1
184 End-to-End_Error        0x0032   100   100   000    Old_age  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age  Always       -       0
194 Temperature_Celsius     0x0022   046   020   000    Old_age  Always       -       54 (Min/Max -10/80)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age  Always       -       16
197 Current_Pending_Sector  0x0032   100   100   000    Old_age  Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age  Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age  Always       -       0
202 Percent_Lifetime_Used   0x0031   078   078   000    Pre-fail Offline      -       22

I expect the raw value of this to rise more slowly now there are almost 100 GiB completely unused and there is lots of free space in the filesystems. But even if not, the SSD is in use since March 2014. So it has plenty of time to go.

206 Write_Error_Rate        0x000e   100   100   000    Old_age  Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age  Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   ---    Old_age  Always       -       91288276930

^^ In sectors.
91288276930 * 512 / 1024 / 1024 / 1024 ~= 43529 GiB Could be 4 KiB… but as it's telling about Host_Sector and the value multiplied by eight does not make any sense, I bet it's 512 Bytes. % smartctl /dev/sdb --all |grep
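The sector-to-GiB conversion above is easy to double-check (a quick sketch; the attribute name is taken from the SMART output quoted earlier):

```python
# Checking the arithmetic: SMART attribute 246 (Total_Host_Sector_Write)
# in 512-byte sectors, expressed in GiB.

sectors = 91288276930
gib_512 = sectors * 512 / 1024**3
print(int(gib_512))  # 43529, matching the figure above

# If the unit were 4 KiB instead, the total would be exactly eight
# times larger, roughly 340 TiB of host writes, which is implausible:
gib_4k = sectors * 4096 / 1024**3
```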
Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
Austin S. Hemmelgarn - 17.08.18, 14:55: > On 2018-08-17 08:28, Martin Steigerwald wrote: > > Thanks for your detailed answer. > > > > Austin S. Hemmelgarn - 17.08.18, 13:58: > >> On 2018-08-17 05:08, Martin Steigerwald wrote: […] > >>> Anyway, creating a new filesystem may have been better here > >>> anyway, > >>> cause it replaced an BTRFS that aged over several years with a new > >>> one. Due to the increased capacity and due to me thinking that > >>> Samsung 860 Pro compresses itself, I removed LZO compression. This > >>> would also give larger extents on files that are not fragmented or > >>> only slightly fragmented. I think that Intel SSD 320 did not > >>> compress, but Crucial m500 mSATA SSD does. That has been the > >>> secondary SSD that still had all the data after the outage of the > >>> Intel SSD 320. > >> > >> First off, keep in mind that the SSD firmware doing compression > >> only > >> really helps with wear-leveling. Doing it in the filesystem will > >> help not only with that, but will also give you more space to work > >> with.> > > While also reducing the ability of the SSD to wear-level. The more > > data I fit on the SSD, the less it can wear-level. And the better I > > compress that data, the less it can wear-level. > > No, the better you compress the data, the _less_ data you are > physically putting on the SSD, just like compressing a file makes it > take up less space. This actually makes it easier for the firmware > to do wear-leveling. Wear-leveling is entirely about picking where > to put data, and by reducing the total amount of data you are writing > to the SSD, you're making that decision easier for the firmware, and > also reducing the number of blocks of flash memory needed (which also > helps with SSD life expectancy because it translates to fewer erase > cycles). 
On one hand I can go with this, but: If I fill the SSD 99% with already compressed data, in case it compresses itself for wear leveling, it has less chance to wear level than with 99% of not yet compressed data that it could compress itself. That was the point I was trying to make. Sure, with a fill rate of about 46% for home, compression would help the wear leveling. And if the controller does not compress at all, it would also. Hmmm, maybe I enable "zstd", but on the other hand I save CPU cycles by not enabling it. > > However… I am not all that convinced that it would benefit me as > > long as I have enough space. That SSD replacement more than doubled > > capacity from about 680 GB to 1480 GB. I have a ton of free space in > > the filesystems – usage of /home is only 46% for example – and > > there are 96 GiB completely unused in LVM on the Crucial SSD and > > even more than 183 GiB completely unused on Samsung SSD. The system > > is doing weekly "fstrim" on all filesystems. I think that this is > > more than is needed for the longevity of the SSDs, but well > > actually I just don´t need the space, so… > > > > Of course, in case I manage to fill up all that space, I consider > > using compression. Until then, I am not all that convinced that I´d > > benefit from it. > > > > Of course it may increase read speeds and in case of nicely > > compressible data also write speeds, I am not sure whether it even > > matters. Also it uses up some CPU cycles on a dual core (+ > > hyperthreading) Sandybridge mobile i5. While I am not sure about > > it, I bet also having larger possible extent sizes may help a bit. > > As well as no compression may also help a bit with fragmentation. > > It generally does actually. Less data physically on the device means > lower chances of fragmentation. In your case, it may not improve I thought "no compression" may help with fragmentation, but I think you think that "compression" helps with fragmentation and misunderstood what I wrote.
> speed much though (your i5 _probably_ can't compress data much faster > than it can access your SSD's, which means you likely won't see much > performance benefit other than reducing fragmentation). > > > Well putting this to a (non-scientific) test: > > > > […]/.local/share/akonadi/db_data/akonadi> du -sh * | sort -rh | head > > -5 3,1Gparttable.ibd > > > > […]/.local/share/akonadi/db_data/akonadi> filefrag parttable.ibd > > parttable.ibd: 11583 extents found > > > > Hmmm, already quite many extents after just about one week with the > > new filesystem. On the old filesystem I had somewhat around >
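The wear-leveling argument running through this thread can be made concrete with a toy calculation. This is a deliberately simplified model, and the function name, compression ratio, and sizes are all made up for illustration: the point is only that filesystem compression shrinks the physical bytes stored, leaving more spare flash for the controller to rotate erase blocks through.

```python
# Toy model: spare flash available for wear-leveling after storing data.
# A fs_compression_ratio below 1.0 means the filesystem compresses the
# data before it ever reaches the SSD. Real controllers and real
# compression ratios differ; this is purely illustrative.

def spare_flash_gib(capacity_gib, logical_data_gib, fs_compression_ratio=1.0):
    physical = logical_data_gib * fs_compression_ratio
    return capacity_gib - physical

nearly_full_raw = spare_flash_gib(100, 99)        # ~1 GiB left to level across
nearly_full_lzo = spare_flash_gib(100, 99, 0.6)   # ~40.6 GiB left
assert nearly_full_lzo > nearly_full_raw
```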
Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
Hi Roman. Now with proper CC. Roman Mamedov - 17.08.18, 14:50: > On Fri, 17 Aug 2018 14:28:25 +0200 > > Martin Steigerwald wrote: > > > First off, keep in mind that the SSD firmware doing compression > > > only > > > really helps with wear-leveling. Doing it in the filesystem will > > > help not only with that, but will also give you more space to > > > work with.> > > While also reducing the ability of the SSD to wear-level. The more > > data I fit on the SSD, the less it can wear-level. And the better I > > compress that data, the less it can wear-level. > > Do not consider SSD "compression" as a factor in any of your > calculations or planning. Modern controllers do not do it anymore, > the last ones that did are SandForce, and that's 2010 era stuff. You > can check for yourself by comparing write speeds of compressible vs > incompressible data, it should be the same. At most, the modern ones > know to recognize a stream of binary zeroes and have a special case > for that. Interesting. Do you have any backup for your claim? > As for general comment on this thread, always try to save the exact > messages you get when troubleshooting or getting failures from your > system. Saying just "was not able to add" or "btrfs replace not > working" without any exact details isn't really helpful as a bug > report or even as a general "experiences" story, as we don't know > what was the exact cause of those, could that have been avoided or > worked around, not to mention what was your FS state at the time (as > in "btrfs fi show" and "fi df"). I had a screen.log, but I put it on the filesystem after the backup was made, so it was lost. Anyway, the reason for not being able to add the device was the read only state of the BTRFS, as I wrote. Same goes for replace. I was able to read the error message just fine. AFAIR the exact wording was "read only filesystem". 
In any case: It was an experience report, not a request for help, so I don´t see why exact error messages are absolutely needed. If I had a support inquiry that would be different, I agree. Thanks, -- Martin
Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
Austin S. Hemmelgarn - 17.08.18, 15:01: > On 2018-08-17 08:50, Roman Mamedov wrote: > > On Fri, 17 Aug 2018 14:28:25 +0200 > > > > Martin Steigerwald wrote: > >>> First off, keep in mind that the SSD firmware doing compression > >>> only > >>> really helps with wear-leveling. Doing it in the filesystem will > >>> help not only with that, but will also give you more space to > >>> work with.>> > >> While also reducing the ability of the SSD to wear-level. The more > >> data I fit on the SSD, the less it can wear-level. And the better > >> I compress that data, the less it can wear-level. > > > > Do not consider SSD "compression" as a factor in any of your > > calculations or planning. Modern controllers do not do it anymore, > > the last ones that did are SandForce, and that's 2010 era stuff. > > You can check for yourself by comparing write speeds of > > compressible vs incompressible data, it should be the same. At > > most, the modern ones know to recognize a stream of binary zeroes > > and have a special case for that. > > All that testing write speeds for compressible versus incompressible > data tells you is if the SSD is doing real-time compression of data, > not if they are doing any compression at all. Also, this test only > works if you turn the write-cache on the device off. As the data still needs to be transferred to the SSD at least when the SATA connection is maxed out I bet you won´t see any difference in write speed whether the SSD compresses in real time or not. > Besides, you can't prove 100% for certain that any manufacturer who > does not sell their controller chips isn't doing this, which means > there are a few manufacturers that may still be doing it. Who really knows what SSD controller manufacturers are doing? I have not seen any Open Channel SSD stuff for laptops so far. Thanks, -- Martin
Hang after growing file system (4.14.48)
Hi, after growing a single btrfs file system (on a loop device, with btrfs fi resize max /fs ) it hangs later (sometimes much later). Symptoms:

* An unkillable btrfs process using 100% (of one) CPU in R state (no kernel trace, cannot attach with strace, gdb or run linux perf)
* Other processes with the following stack trace:

[Fri Aug 17 16:21:06 2018] INFO: task python3:46794 blocked for more than 120 seconds.
[Fri Aug 17 16:21:06 2018]       Not tainted 4.14.48 #2
[Fri Aug 17 16:21:06 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Aug 17 16:21:06 2018] python3         D    0 46794  46702 0x00000000
[Fri Aug 17 16:21:06 2018] Call Trace:
[Fri Aug 17 16:21:06 2018]  ? __schedule+0x2de/0x7b0
[Fri Aug 17 16:21:06 2018]  schedule+0x32/0x80
[Fri Aug 17 16:21:06 2018]  schedule_preempt_disabled+0xa/0x10
[Fri Aug 17 16:21:06 2018]  __mutex_lock.isra.1+0x295/0x4c0
[Fri Aug 17 16:21:06 2018]  ? btrfs_show_devname+0x25/0xd0
[Fri Aug 17 16:21:06 2018]  btrfs_show_devname+0x25/0xd0
[Fri Aug 17 16:21:06 2018]  show_vfsmnt+0x44/0x150
[Fri Aug 17 16:21:06 2018]  seq_read+0x314/0x3d0
[Fri Aug 17 16:21:06 2018]  __vfs_read+0x26/0x130
[Fri Aug 17 16:21:06 2018]  vfs_read+0x91/0x130
[Fri Aug 17 16:21:06 2018]  SyS_read+0x42/0x90
[Fri Aug 17 16:21:06 2018]  do_syscall_64+0x6e/0x120
[Fri Aug 17 16:21:06 2018]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Fri Aug 17 16:21:06 2018] RIP: 0033:0x7f67fd41b6d0
[Fri Aug 17 16:21:06 2018] RSP: 002b:00007ffd80be2678 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Fri Aug 17 16:21:06 2018] RAX: ffffffffffffffda RBX: 000056521bf7bb00 RCX: 00007f67fd41b6d0
[Fri Aug 17 16:21:06 2018] RDX: 0000000000000400 RSI: 000056521bf7bd30 RDI: 0000000000000004
[Fri Aug 17 16:21:06 2018] RBP: 0000000000000d68 R08: 00007f67fe655700 R09: 0000000000000101
[Fri Aug 17 16:21:06 2018] R10: 000056521bf7c0cc R11: 0000000000000246 R12: 00007f67fd6d6440
[Fri Aug 17 16:21:06 2018] R13: 00007f67fd6d5900 R14: 0000000000000064 R15: 0000000000000000

Regards, Martin Raiber
Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
Thanks for your detailed answer. Austin S. Hemmelgarn - 17.08.18, 13:58: > On 2018-08-17 05:08, Martin Steigerwald wrote: […] > > I have seen a discussion about the limitation in point 2. That > > allowing to add a device and make it into RAID 1 again might be > > dangerous, cause of system chunk and probably other reasons. I did > > not completely read and understand it though. > > > > So I still don´t get it, cause: > > > > Either it is a RAID 1, then, one disk may fail and I still have > > *all* > > data. Also for the system chunk, which according to btrfs fi df / > > btrfs fi sh was indeed RAID 1. If so, then period. Then I don´t see > > why it would need to disallow me to make it into a RAID 1 again > > after one device has been lost. > > > > Or it is no RAID 1 and then what is the point to begin with? As I > > was > > able to copy all data off the degraded mount, I´d say it was a > > RAID 1. > > > > (I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just > > does two copies regardless of how many drives you use.) > > So, what's happening here is a bit complicated. The issue is entirely > with older kernels that are missing a couple of specific patches, but > it appears that not all distributions have their kernels updated to > include those patches yet. > > In short, when you have a volume consisting of _exactly_ two devices > using raid1 profiles that is missing one device, and you mount it > writable and degraded on such a kernel, newly created chunks will be > single-profile chunks instead of raid1 chunks with one half missing. > Any write has the potential to trigger allocation of a new chunk, and > more importantly any _read_ has the potential to trigger allocation of > a new chunk if you don't use the `noatime` mount option (because a > read will trigger an atime update, which results in a write).
> > When older kernels then go and try to mount that volume a second time, > they see that there are single-profile chunks (which can't tolerate > _any_ device failures), and refuse to mount at all (because they > can't guarantee that metadata is intact). Newer kernels fix this > part by checking per-chunk if a chunk is degraded/complete/missing, > which avoids this because all the single chunks are on the remaining > device. How new does the kernel need to be for that to happen? Do I get this right that it would be the kernel used for recovery, i.e. the one on the live distro, that needs to be new enough? The one on this laptop is meanwhile already at 4.18.1. I used the latest GRML stable release 2017.05, which has a 4.9 kernel. > As far as avoiding this in the future: I hope that with the new Samsung Pro 860 together with the existing Crucial m500 I am spared from this for years to come. That Crucial SSD according to SMART status about lifetime used has still quite some time to go. > * If you're just pulling data off the device, mark the device > read-only in the _block layer_, not the filesystem, before you mount > it. If you're using LVM, just mark the LV read-only using LVM > commands. This will make 100% certain that nothing gets written to > the device, and thus makes sure that you won't accidentally cause > issues like this. > * If you're going to convert to a single device, > just do it and don't stop it part way through. In particular, make > sure that your system will not lose power. > * Otherwise, don't mount the volume unless you know you're going to > repair it. Thanks for those. Good to keep in mind. > > For this laptop it was not all that important but I wonder about > > BTRFS RAID 1 in enterprise environment, cause restoring from backup > > adds a significantly higher downtime. > > > > Anyway, creating a new filesystem may have been better here anyway, > > cause it replaced an BTRFS that aged over several years with a new > > one.
Due to the increased capacity and due to me thinking that > > Samsung 860 Pro compresses itself, I removed LZO compression. This > > would also give larger extents on files that are not fragmented or > > only slightly fragmented. I think that Intel SSD 320 did not > > compress, but Crucial m500 mSATA SSD does. That has been the > > secondary SSD that still had all the data after the outage of the > > Intel SSD 320. > > First off, keep in mind that the SSD firmware doing compression only > really helps with wear-leveling. Doing it in the filesystem will help > not only with that, but will also give you more space to work with. While also reducing the ability of the SSD to wear-level. The more data I fit on the SSD, the less it can wear-level. And the better I compress that data, the less it can wear-level.
Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD
Hi! This happened about two weeks ago. I already dealt with it and all is well. Linux hung on suspend so I switched off this ThinkPad T520 forcefully. After that it did not boot the operating system anymore. The Intel SSD 320 (latest firmware, which should patch this known bug, but apparently does not) reported a size of only 8 MiB. Those 8 MiB just contain zeros. Access via GRML and "mount -fo degraded" worked. I initially was even able to write onto this degraded filesystem. First I copied all data to a backup drive. I even started a balance to "single" so that it would work with one SSD. But later I learned that secure erase may recover the Intel SSD 320 and since I had no other SSD at hand, did that. And yes, it did. So I canceled the balance. I partitioned the Intel SSD 320 and put LVM on it, just as I had it. But at that time I was not able to mount the degraded BTRFS on the other SSD as writable anymore, not even with "-f" "I know what I am doing". Thus I was not able to add a device to it and btrfs balance it to RAID 1. Even "btrfs replace" was not working. I thus formatted a new BTRFS RAID 1 and restored. A week later I migrated the Intel SSD 320 to a Samsung 860 Pro. Again via one full backup and restore cycle. However, this time I was able to copy most of the data off the Intel SSD 320 with "mount -fo degraded" via eSATA and thus the copy operation was way faster. So conclusion:
1. Pro: BTRFS RAID 1 really protected my data against a complete SSD outage.
2. Con: It does not allow me to add a device and balance to RAID 1 or replace one device that is already missing at this time.
3. I keep using BTRFS RAID 1 on two SSDs for often changed, critical data.
4. And yes, I know it does not replace a backup. As it was holidays and I was lazy, backup was two weeks old already, so I was happy to have all my data still on the other SSD.
5. The error messages in kernel when mounting without "-o degraded" are less than helpful.
They indicate a corrupted filesystem instead of just telling that one device is missing and "-o degraded" would help here. I have seen a discussion about the limitation in point 2. That allowing to add a device and make it into RAID 1 again might be dangerous, cause of system chunk and probably other reasons. I did not completely read and understand it though. So I still don´t get it, cause: Either it is a RAID 1, then, one disk may fail and I still have *all* data. Also for the system chunk, which according to btrfs fi df / btrfs fi sh was indeed RAID 1. If so, then period. Then I don´t see why it would need to disallow me to make it into a RAID 1 again after one device has been lost. Or it is no RAID 1 and then what is the point to begin with? As I was able to copy all data off the degraded mount, I´d say it was a RAID 1. (I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just does two copies regardless of how many drives you use.) For this laptop it was not all that important but I wonder about BTRFS RAID 1 in enterprise environment, cause restoring from backup adds a significantly higher downtime. Anyway, creating a new filesystem may have been better here anyway, cause it replaced a BTRFS that aged over several years with a new one. Due to the increased capacity and due to me thinking that Samsung 860 Pro compresses itself, I removed LZO compression. This would also give larger extents on files that are not fragmented or only slightly fragmented. I think that Intel SSD 320 did not compress, but Crucial m500 mSATA SSD does. That has been the secondary SSD that still had all the data after the outage of the Intel SSD 320. Overall I am happy, cause BTRFS RAID 1 gave me access to the data after the SSD outage. That is the most important thing about it for me. Thanks, -- Martin
Re: BTRFS and databases
On 02.08.2018 14:27 Austin S. Hemmelgarn wrote: > On 2018-08-02 06:56, Qu Wenruo wrote: >> >> On 2018-08-02 18:45, Andrei Borzenkov wrote: >>> >>> Sent from my iPhone >>> On 2 Aug 2018, at 10:02, Qu Wenruo wrote: > On 2018-08-01 11:45, MegaBrutal wrote: > Hi all, > > I know it's a decade-old question, but I'd like to hear your thoughts > of today. By now, I became a heavy BTRFS user. Almost everywhere I > use > BTRFS, except in situations when it is obvious there is no benefit > (e.g. /var/log, /boot). At home, all my desktop, laptop and server > computers are mainly running on BTRFS with only a few file systems on > ext4. I even installed BTRFS in corporate productive systems (in > those > cases, the systems were mainly on ext4; but there were some specific > file systems those exploited BTRFS features). > > But there is still one question that I can't get over: if you store a > database (e.g. MySQL), would you prefer having a BTRFS volume mounted > with nodatacow, or would you just simply use ext4? > > I know that with nodatacow, I take away most of the benefits of BTRFS > (those are actually hurting database performance – the exact CoW > nature that is elsewhere a blessing, with databases it's a drawback). > But are there any advantages of still sticking to BTRFS for a > database > albeit CoW is disabled, or should I just return to the old and > reliable ext4 for those applications? Since I'm not an expert in databases, I can totally be wrong, but what about completely disabling the database write-ahead-log (WAL), and letting btrfs' data CoW handle data consistency completely? >>> >>> This would make the content of the database after a crash completely >>> unpredictable, thus making it impossible to reliably roll back a >>> transaction. >> >> Btrfs itself (with datacow) can ensure the fs is updated completely. >> >> That's to say, even if a crash happens, the content of the fs will be the >> same state as the previous btrfs transaction (btrfs sync).
>> >> Thus there is no need to rollback database transaction though. >> (Unless database transaction is not sync to btrfs transaction) >> > Two issues with this statement: > > 1. Not all database software properly groups logically related > operations that need to be atomic as a unit into transactions. > 2. Even aside from point 1 and the possibility of database corruption, > there are other legitimate reasons that you might need to roll-back a > transaction (for example, the rather obvious case of a transaction > that should not have happened in the first place). I thought of a database transaction scheme that is based on btrfs features before. It has practical issues, though. One would put a b-tree database file into a subvolume (e.g. trans_0). When changing the b-tree database one would create a snapshot (trans_1), then change the file in the snapshot. On commit sync trans_1, then delete trans_0. On rollback, delete trans_1. Problems: * Large overhead for small transactions (OLTP) -- a problem in general for copy-on-write b-tree databases * Only root can create or destroy snapshots * Per default the Linux memory system starts write-back pretty much immediately, so pages that get overwritten more than once in a transaction are written back multiple times rather than kept in RAM, unless Linux is tuned not to do this. I have used this method, albeit by reflinking the database, then modifying the reflink, but I think reflinking is slower than creating a snapshot? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
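The snapshot-per-transaction scheme described above (trans_0/trans_1, commit by deleting the old generation, rollback by deleting the snapshot) can be illustrated in a few lines. On btrfs the snapshot and delete steps would be `btrfs subvolume snapshot` and `btrfs subvolume delete` (root-only, as noted); this hedged sketch stands in a plain directory copy for the snapshot so the commit/rollback flow can be shown without a btrfs mount, and the `SnapshotTxn` name is made up:

```python
import os
import shutil
import tempfile

class SnapshotTxn:
    """Simulate the snapshot-per-transaction scheme: changes go into a
    'snapshot' copy (here a plain directory copy, standing in for
    'btrfs subvolume snapshot'); commit keeps the copy, rollback drops it."""

    def __init__(self, subvol):
        self.subvol = subvol                # current generation, e.g. trans_0
        self.snap = subvol + ".txn"         # would be trans_1 on btrfs
        shutil.copytree(subvol, self.snap)  # stand-in for taking the snapshot

    def commit(self):
        # On btrfs: sync the snapshot, then delete the old subvolume.
        shutil.rmtree(self.subvol)
        os.rename(self.snap, self.subvol)

    def rollback(self):
        # On btrfs: just delete the snapshot; the old generation is untouched.
        shutil.rmtree(self.snap)

base = tempfile.mkdtemp()
vol = os.path.join(base, "trans_0")
os.mkdir(vol)
with open(os.path.join(vol, "db.file"), "w") as f:
    f.write("v1")

# A rolled-back transaction leaves the original data intact.
txn = SnapshotTxn(vol)
with open(os.path.join(txn.snap, "db.file"), "w") as f:
    f.write("v2")
txn.rollback()
print(open(os.path.join(vol, "db.file")).read())   # v1

# A committed transaction replaces the old generation.
txn = SnapshotTxn(vol)
with open(os.path.join(txn.snap, "db.file"), "w") as f:
    f.write("v2")
txn.commit()
print(open(os.path.join(vol, "db.file")).read())   # v2
```

The reflink variant mentioned above would replace the copytree/rename with `cp --reflink=always` on the database file, trading the snapshot for a per-file clone.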
Re: BTRFS and databases
Andrei Borzenkov - 02.08.18, 12:35: > Отправлено с iPhone > > > 2 авг. 2018 г., в 12:16, Martin Steigerwald > > написал(а):> > > Hugo Mills - 01.08.18, 10:56: > >>> On Wed, Aug 01, 2018 at 05:45:15AM +0200, MegaBrutal wrote: > >>> I know it's a decade-old question, but I'd like to hear your > >>> thoughts > >>> of today. By now, I became a heavy BTRFS user. Almost everywhere I > >>> use BTRFS, except in situations when it is obvious there is no > >>> benefit (e.g. /var/log, /boot). At home, all my desktop, laptop > >>> and > >>> server computers are mainly running on BTRFS with only a few file > >>> systems on ext4. I even installed BTRFS in corporate productive > >>> systems (in those cases, the systems were mainly on ext4; but > >>> there > >>> were some specific file systems those exploited BTRFS features). > >>> > >>> But there is still one question that I can't get over: if you > >>> store > >>> a > >>> database (e.g. MySQL), would you prefer having a BTRFS volume > >>> mounted > >>> with nodatacow, or would you just simply use ext4? > >>> > >> Personally, I'd start with btrfs with autodefrag. It has some > >> > >> degree of I/O overhead, but if the database isn't > >> performance-critical and already near the limits of the hardware, > >> it's unlikely to make much difference. Autodefrag should keep the > >> fragmentation down to a minimum. > > > > I read that autodefrag would only help with small databases. > > I wonder if anyone actually > > a) quantified performance impact > b) analyzed the cause > > I work with NetApp for a long time and I can say from first hand > experience that fragmentation had zero impact on OLTP workload. It > did affect backup performance as was expected, but this could be > fixed by periodic reallocation (defragmentation). > > And even that needed quite some time to observe (years) on pretty high > load database with regular backup and replication snapshots. 
> > If btrfs is so susceptible to fragmentation, what is the reason for > it? At the end of my original mail I mentioned a blog article that also had some performance graphs. Did you actually read it? Thanks, -- Martin
Re: BTRFS and databases
Hugo Mills - 01.08.18, 10:56: > On Wed, Aug 01, 2018 at 05:45:15AM +0200, MegaBrutal wrote: > > I know it's a decade-old question, but I'd like to hear your > > thoughts > > of today. By now, I became a heavy BTRFS user. Almost everywhere I > > use BTRFS, except in situations when it is obvious there is no > > benefit (e.g. /var/log, /boot). At home, all my desktop, laptop and > > server computers are mainly running on BTRFS with only a few file > > systems on ext4. I even installed BTRFS in corporate productive > > systems (in those cases, the systems were mainly on ext4; but there > > were some specific file systems those exploited BTRFS features). > > > > But there is still one question that I can't get over: if you store > > a > > database (e.g. MySQL), would you prefer having a BTRFS volume > > mounted > > with nodatacow, or would you just simply use ext4? > > Personally, I'd start with btrfs with autodefrag. It has some > degree of I/O overhead, but if the database isn't performance-critical > and already near the limits of the hardware, it's unlikely to make > much difference. Autodefrag should keep the fragmentation down to a > minimum. I read that autodefrag would only help with small databases. I also read that even on SSDs there is a notable performance penalty. A 4.2 GiB akonadi database for tons of mails appears to work okayish here on dual SSD BTRFS RAID 1 with LZO compression. However I have no comparison, for example how it would run on XFS. And it's fragmented quite a bit; for example, the largest file of 3 GiB – I know this is in part also due to LZO compression. […].local/share/akonadi/db_data/akonadi> time /usr/sbin/filefrag parttable.ibd parttable.ibd: 45380 extents found /usr/sbin/filefrag parttable.ibd 0,00s user 0,86s system 41% cpu 2,054 total However it digs out those extents quite fast. I would not feel comfortable with setting this file to nodatacow. However I wonder: Is this it?
Is there nothing that can be improved in BTRFS to handle database and VM files in a better way, without altering any default settings? Is it also an issue on ZFS? ZFS also does copy on write. How does ZFS handle this? Can anything be learned from it? I never heard people complain about poor database performance on ZFS, but… I don't use it and I am not subscribed to any ZFS mailing lists, so they may have similar issues and I just do not know it. Well, there seems to be a performance penalty at least when compared to XFS: About ZFS Performance Yves Trudeau, May 15, 2018 https://www.percona.com/blog/2018/05/15/about-zfs-performance/ The article describes how you can use NVMe devices as cache to mitigate the performance impact. That would hint that BTRFS with VFS Hot Data Tracking and relocating data to SSD or NVMe devices could be a way to set this up. But as said, I read about bad database performance on BTRFS even on SSDs. I cannot find the original reference at the moment, but here is an example, although it is from 2015 (on kernel 4.0, which is a bit old): Friends don't let friends use BTRFS for OLTP 2015/09/16 by Tomas Vondra https://blog.pgaddict.com/posts/friends-dont-let-friends-use-btrfs-for-oltp Interestingly it also compares with ZFS, which does much better. So maybe there is really something to be learned from ZFS. I could not tell clearly whether the benchmark was run on an SSD; since Tomas mentions the "ssd" mount option, it might have been. Thanks, -- Martin
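Fragmentation checks like the filefrag run quoted above are easy to script when watching a database file over time. A small hedged helper that parses filefrag's summary line (the output format shown matches what filefrag prints, but treat the parsing as an assumption; requires e2fsprogs only when actually invoking the tool):

```python
import re
import subprocess

def extent_count(path=None, output=None):
    """Return the extent count for a file, either by running filefrag
    (requires e2fsprogs) or by parsing a previously captured output line.
    Returns None when no count can be found."""
    if output is None:
        output = subprocess.run(["filefrag", path],
                                capture_output=True, text=True).stdout
    m = re.search(r":\s*(\d+)\s+extents?\s+found", output)
    return int(m.group(1)) if m else None

# Parsing the line quoted above:
print(extent_count(output="parttable.ibd: 45380 extents found"))  # 45380
```

Run periodically, this gives a cheap fragmentation trend without setting nodatacow.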
Re: Healthy amount of free space?
Nikolay Borisov - 17.07.18, 10:16: > On 17.07.2018 11:02, Martin Steigerwald wrote: > > Nikolay Borisov - 17.07.18, 09:20: > >> On 16.07.2018 23:58, Wolf wrote: > >>> Greetings, > >>> I would like to ask what what is healthy amount of free space to > >>> keep on each device for btrfs to be happy? > >>> > >>> This is how my disk array currently looks like > >>> > >>> [root@dennas ~]# btrfs fi usage /raid > >>> > >>> Overall: > >>> Device size: 29.11TiB > >>> Device allocated: 21.26TiB > >>> Device unallocated:7.85TiB > >>> Device missing: 0.00B > >>> Used: 21.18TiB > >>> Free (estimated): 3.96TiB (min: 3.96TiB) > >>> Data ratio: 2.00 > >>> Metadata ratio: 2.00 > >>> Global reserve: 512.00MiB (used: 0.00B) > > > > […] > > > >>> Btrfs does quite good job of evenly using space on all devices. > >>> No, > >>> how low can I let that go? In other words, with how much space > >>> free/unallocated remaining space should I consider adding new > >>> disk? > >> > >> Btrfs will start running into problems when you run out of > >> unallocated space. So the best advice will be monitor your device > >> unallocated, once it gets really low - like 2-3 gb I will suggest > >> you run balance which will try to free up unallocated space by > >> rewriting data more compactly into sparsely populated block > >> groups. If after running balance you haven't really freed any > >> space then you should consider adding a new drive and running > >> balance to even out the spread of data/metadata. > > > > What are these issues exactly? > > For example if you have plenty of data space but your metadata is full > then you will be getting ENOSPC. Of that one I am aware. This just did not happen so far. I did not yet add it explicitly to the training slides, but I just make myself a note to do that. Anything else? 
> > I have > > > > % btrfs fi us -T /home > > > > Overall: > > Device size: 340.00GiB > > Device allocated: 340.00GiB > > Device unallocated: 2.00MiB > > Device missing: 0.00B > > Used: 308.37GiB > > Free (estimated): 14.65GiB (min: 14.65GiB) > > Data ratio: 2.00 > > Metadata ratio: 2.00 > > Global reserve: 512.00MiB (used: 0.00B) > > > > Data Metadata System > > > > Id Path RAID1 RAID1 RAID1 Unallocated > > -- -- - --- > > > > 1 /dev/mapper/msata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB > > 2 /dev/mapper/sata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB > > > > -- -- - --- > > > > Total 165.89GiB 4.08GiB 32.00MiB 2.00MiB > > Used 151.24GiB 2.95GiB 48.00KiB > > You already have only 33% of your metadata full so if your workload > turned out to actually be making more metadata-heavy changed i.e > snapshots you could exhaust this and get ENOSPC, despite having around > 14gb of free data space. Furthermore this data space is spread around > multiple data chunks, depending on how populated they are a balance > could be able to free up unallocated space which later could be > re-purposed for metadata (again, depending on what you are doing). The filesystem above IMO is not fit for snapshots. It would fill up rather quickly, I think even if I balance metadata. Actually I tried this, and as I remember it took at most a day until it was full. If I read the above figures correctly, at maximum I could gain one additional GiB by balancing metadata. That would not make a huge difference. I bet I am already running this filesystem beyond recommendation, as I bet many would argue it is too full already for regular usage… I do not see the benefit of squeezing the last free space out of it just to fit in another GiB. So I still do not get why it would make sense to balance it at this point in time. Especially as the 1 GiB I could regain is not even needed. And I do not s
Re: Healthy amount of free space?
Hi Nikolay. Nikolay Borisov - 17.07.18, 09:20: > On 16.07.2018 23:58, Wolf wrote: > > Greetings, > > I would like to ask what what is healthy amount of free space to > > keep on each device for btrfs to be happy? > > > > This is how my disk array currently looks like > > > > [root@dennas ~]# btrfs fi usage /raid > > > > Overall: > > Device size: 29.11TiB > > Device allocated: 21.26TiB > > Device unallocated:7.85TiB > > Device missing: 0.00B > > Used: 21.18TiB > > Free (estimated): 3.96TiB (min: 3.96TiB) > > Data ratio: 2.00 > > Metadata ratio: 2.00 > > Global reserve: 512.00MiB (used: 0.00B) […] > > Btrfs does quite good job of evenly using space on all devices. No, > > how low can I let that go? In other words, with how much space > > free/unallocated remaining space should I consider adding new disk? > > Btrfs will start running into problems when you run out of unallocated > space. So the best advice will be monitor your device unallocated, > once it gets really low - like 2-3 gb I will suggest you run balance > which will try to free up unallocated space by rewriting data more > compactly into sparsely populated block groups. If after running > balance you haven't really freed any space then you should consider > adding a new drive and running balance to even out the spread of > data/metadata. What are these issues exactly? 
I have % btrfs fi us -T /home Overall: Device size: 340.00GiB Device allocated: 340.00GiB Device unallocated: 2.00MiB Device missing: 0.00B Used: 308.37GiB Free (estimated): 14.65GiB (min: 14.65GiB) Data ratio: 2.00 Metadata ratio: 2.00 Global reserve: 512.00MiB (used: 0.00B) Data Metadata System Id Path RAID1 RAID1 RAID1 Unallocated -- -- - --- 1 /dev/mapper/msata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB 2 /dev/mapper/sata-home 165.89GiB 4.08GiB 32.00MiB 1.00MiB -- -- - --- Total 165.89GiB 4.08GiB 32.00MiB 2.00MiB Used 151.24GiB 2.95GiB 48.00KiB on a RAID-1 filesystem to which one, part of the time two, Plasma desktops + KDEPIM and Akonadi + Baloo desktop search + you name it write like mad. Since kernel 4.5 or 4.6 this simply works. Before that, BTRFS sometimes crawled to a halt searching for free blocks, and I had to switch off the laptop uncleanly. If that happened, a balance helped for a while. But since 4.5 or 4.6 this has not happened anymore. I found that with SLES 12 SP 3 or so, btrfsmaintenance runs a balance weekly, which created an issue on our Proxmox + Ceph on Intel NUC based opensource demo lab. This is for sure no recommended configuration for Ceph, and Ceph is quite slow on these 2.5 inch hard disks and 1 GBit network link, despite the (albeit somewhat minimal) 5 GiB m.2 SSD caching. What happened is that the VM crawled to a halt and the kernel gave "task hung for more than 120 seconds" messages. The VM was basically unusable during the balance. Sure, that should not happen with a "proper" setup, but it also did not happen without the automatic balance. Also, what would happen on a hypervisor setup with several thousands of VMs on BTRFS, when several hundred of them decide to start a balance at a similar time? It could probably bring the underlying I/O system to a halt, as many enterprise storage systems are designed to sustain burst I/O loads, but not maximum utilization over an extended period of time.
I am really wondering what to recommend in my Linux performance tuning and analysis courses. So far I do not do regular balances on my own laptop, following the thinking: if it is not broken, do not fix it. My personal opinion here also is: if the filesystem degrades so much that it becomes unusable without regular maintenance from user space, the filesystem needs to be fixed. Ideally I would not have to worry about whether to regularly balance a BTRFS or not. In other words: I should not have to attend a performance analysis and tuning course in order to use a computer with a BTRFS filesystem. Thanks, -- Martin
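Nikolay's suggestion earlier in the thread — watch unallocated space and only balance once it gets down to a few GiB — can be automated rather than run on a blind weekly schedule. A hedged sketch that parses `btrfs fi usage` output (field names as in the output quoted above; the 3 GiB threshold follows the "2-3 gb" advice but is otherwise arbitrary):

```python
import re

SIZES = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

def parse_size(s):
    """Turn a btrfs-progs size string like '7.85TiB' into bytes."""
    num, unit = re.match(r"([\d.]+)(\w+)", s).groups()
    return float(num) * SIZES[unit]

def unallocated_bytes(usage_output):
    """Extract 'Device unallocated' from `btrfs fi usage` output."""
    m = re.search(r"Device unallocated:\s*([\d.]+\s*\w+)", usage_output)
    return parse_size(m.group(1).replace(" ", ""))

def needs_balance(usage_output, threshold=3 * SIZES["GiB"]):
    # Only balance when unallocated space is nearly exhausted.
    return unallocated_bytes(usage_output) < threshold

sample = """Overall:
    Device size:          29.11TiB
    Device allocated:     21.26TiB
    Device unallocated:    7.85TiB
"""
print(needs_balance(sample))  # False: 7.85 TiB unallocated is plenty
```

Wired into cron with `subprocess.run(["btrfs", "fi", "usage", mnt], ...)`, this would fire a filtered balance only when actually warranted, avoiding the always-on weekly balance that caused the Ceph stalls described above.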
Re: Transaction aborted (error -28) btrfs_run_delayed_refs*0x163/0x190
On 10.07.2018 09:04 Pete wrote: > I've just had the error in the subject which caused the file system to > go read-only. > > Further part of error message: > WARNING: CPU: 14 PID: 1351 at fs/btrfs/extent-tree.c:3076 > btrfs_run_delayed_refs*0x163/0x190 > > 'Screenshot' here: > https://drive.google.com/file/d/1qw7TE1bec8BKcmffrOmg2LS15IOq8Jwc/view?usp=sharing > > The kernel is 4.17.4. There are three hard drives in the file system. > dmcrypt (luks) is used between btrfs and the disks. This is probably a known issue. See https://www.spinics.net/lists/linux-btrfs/msg75647.html You could apply the patch in this thread and mount with enospc_debug to confirm it is the same issue.
Re: [PATCH v2 1/2] btrfs: Check each block group has corresponding chunk at mount time
Nikolay Borisov - 03.07.18, 11:08: > On 3.07.2018 11:47, Qu Wenruo wrote: > > On 2018年07月03日 16:33, Nikolay Borisov wrote: > >> On 3.07.2018 11:08, Qu Wenruo wrote: > >>> Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199837, if > >>> a > >>> crafted btrfs with incorrect chunk<->block group mapping, it could > >>> leads to a lot of unexpected behavior. > >>> > >>> Although the crafted image can be catched by block group item > >>> checker > >>> added in "[PATCH] btrfs: tree-checker: Verify block_group_item", > >>> if one crafted a valid enough block group item which can pass > >>> above check but still mismatch with existing chunk, it could > >>> cause a lot of undefined behavior. > >>> > >>> This patch will add extra block group -> chunk mapping check, to > >>> ensure we have a completely matching (start, len, flags) chunk > >>> for each block group at mount time. > >>> > >>> Reported-by: Xu Wen > >>> Signed-off-by: Qu Wenruo > >>> --- > >>> changelog: > >>> > >>> v2: > >>> Add better error message for each mismatch case. > >>> Rename function name, to co-operate with later patch. > >>> Add flags mismatch check. > >>> > >>> --- > >> > >> It's getting really hard to keep track of the various validation > >> patches you sent with multiple versions + new checks. Please batch > >> everything in a topic series i.e "Making checks stricter" or some > >> such and send everything again nicely packed, otherwise the risk > >> of mis-merging is increased. > > > > Indeed, I'll send the branch and push it to github. > > > >> I now see that Gu Jinxiang from fujitsu also started sending > >> validation fixes. > > > > No need to worry, that will be the only patch related to that thread > > of bugzilla from Fujitsu. 
> > As all the other cases can be addressed by my patches, sorry Fujitsu > > guys :)> > >> Also for evry patch which fixes a specific issue from one of the > >> reported on bugzilla.kernel.org just use the Link: tag to point to > >> the original report on bugzilla that will make it easier to relate > >> the fixes to the original report. > > > > Never heard of "Link:" tag. > > Maybe it's a good idea to added it to "submitting-patches.rst"? > > I guess it's not officially documented but if you do git log --grep > "Link:" you'd see quite a lot of patches actually have a Link pointing > to the original thread if it has sparked some pertinent discussion. > In this case those patches are a direct result of a bugzilla > bugreport so having a Link: tag makes sense. For Bugzilla reports I saw something like Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=43511 in a patch I was Cc'd on. Of course that only applies if the patch in question fixes the reported bug. > In the example of the qgroup patch I sent yesterday resulting from > Misono's report there was also an involved discussion hence I added a > link to the original thread. […] -- Martin
Re: "decompress failed" in 1-2 files always causes kernel oops, check/scrub pass
Hey James. james harvey - 12.05.18, 07:08: > 100% reproducible, booting from disk, or even Arch installation ISO. > Kernel 4.16.7. btrfs-progs v4.16. > > Reading one of two journalctl files causes a kernel oops. Initially > ran into it from "journalctl --list-boots", but cat'ing the file does > it too. I believe this shows there's compressed data that is invalid, > but its btrfs checksum is valid. I've cat'ed every file on the > disk, and luckily have the problems narrowed down to only these 2 > files in /var/log/journal. > > This volume has always been mounted with lzo compression. > > scrub has never found anything, and have ran it since the oops. > > Found a user a few years ago who also ran into this, without > resolution, at: > https://www.spinics.net/lists/linux-btrfs/msg52218.html > > 1. Cat'ing a (non-essential) file shouldn't be able to bring down the > system. > > 2. If this is infact invalid compressed data, there should be a way to > check for that. Btrfs check and scrub pass. I think systemd-journald sets those files to nocow on BTRFS in order to reduce fragmentation. That means no checksums, no snapshots, no nothing. I just removed /var/log/journal and thus disabled journalling to disk. It's sufficient for me to have the recent state in /run/journal. Can you confirm nocow being set via lsattr on those files? Still, they should be decompressible just fine. > Hardware is fine. Passes memtest86+ in SMP mode. Works fine on all > other files. > > > > [ 381.869940] BUG: unable to handle kernel paging request at > 00390e50 [ 381.870881] BTRFS: decompress failed […] -- Martin
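The lsattr check Martin asks for (nocow shows up as the 'C' attribute) can also be done programmatically via the FS_IOC_GETFLAGS ioctl. A hedged sketch — the constants are copied from linux/fs.h, the ioctl number assumes a 64-bit Linux, and the helper returns None where the filesystem does not support inode flags:

```python
import fcntl
import struct
import tempfile

# From <linux/fs.h>; FS_IOC_GETFLAGS is _IOR('f', 1, long) on 64-bit Linux.
FS_IOC_GETFLAGS = 0x80086601
FS_NOCOW_FL = 0x00800000        # shown as 'C' by lsattr

def is_nocow(path):
    """Return True/False when the inode flags could be read, or None when
    the filesystem doesn't support them (the ioctl fails, e.g. on tmpfs)."""
    try:
        with open(path, "rb") as f:
            buf = bytearray(struct.calcsize("l"))
            fcntl.ioctl(f.fileno(), FS_IOC_GETFLAGS, buf)
            (flags,) = struct.unpack("l", buf)
            return bool(flags & FS_NOCOW_FL)
    except OSError:
        return None

# A fresh temp file should not have NOCOW set (None if flags unsupported):
with tempfile.NamedTemporaryFile() as tmp:
    print(is_nocow(tmp.name))
```

Running it over /var/log/journal would confirm whether journald chattr'ed the files, which in turn explains the missing checksums.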
Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs cont. [Was: Re: metadata_ratio mount option?]
Hello Chris, Dne 7.5.2018 v 18:37 Chris Mason napsal(a): > > > On 7 May 2018, at 12:16, Martin Svec wrote: > >> Hello Chris, >> >> Dne 7.5.2018 v 16:49 Chris Mason napsal(a): >>> On 7 May 2018, at 7:40, Martin Svec wrote: >>> >>>> Hi, >>>> >>>> According to man btrfs [1], I assume that metadata_ratio=1 mount option >>>> should >>>> force allocation of one metadata chunk after every allocated data chunk. >>>> However, >>>> when I set this option and start filling btrfs with "dd if=/dev/zero >>>> of=dummyfile.dat", >>>> only data chunks are allocated but no metadata ones. So, how does the >>>> metadata_ratio >>>> option really work? >>>> >>>> Note that I'm trying to use this option as a workaround of the bug >>>> reported here: >>>> >>> >>> [ urls that FB email server eats, sorry ] >> >> It's link to "Btrfs remounted read-only due to ENOSPC in >> btrfs_run_delayed_refs" thread :) > > Oh yeah, the link worked fine, it just goes through this url defense monster > that munges it in > replies. > >> >>> >>>> >>>> i.e. I want to manually preallocate metadata chunks to avoid nightly >>>> ENOSPC errors. >>> >>> >>> metadata_ratio is almost but not quite what you want. It sets a flag on >>> the space_info to force a >>> chunk allocation the next time we decide to call should_alloc_chunk(). >>> Thanks to the overcommit >>> code, we usually don't call that until the metadata we think we're going to >>> need is bigger than >>> the metadata space available. In other words, by the time we're into the >>> code that honors the >>> force flag, reservations are already high enough to make us allocate the >>> chunk anyway. >> >> Yeah, that's how I understood the code. So I think metadata_ratio man >> section is quite confusing >> because it implies that btrfs guarantees given metadata to data chunk space >> ratio, which isn't true. >> >>> >>> I tried to use metadata_ratio to experiment with forcing more metadata slop >>> space, but really I >>> have to tweak the overcommit code first. 
>>> Omar beat me to a better solution, tracking down our transient ENOSPC >>> problems here at FB to >>> reservations done for orphans. Do you have a lot of deleted files still >>> being held open? lsof >>> /mntpoint | grep deleted will list them. >> >> I'll take a look during backup window. The initial bug report describes our >> rsync workload in >> detail, for your reference. No, there are no lingering deleted files during backup. However, I noticed something interesting in the strace output: rsync does an ftruncate() of every transferred file before closing it. In 99.9% of cases the file is truncated to its own size, so it should be a no-op. But these ftruncates are by far the slowest syscalls according to strace timing, and btrfs_truncate() comments itself as "indeed ugly". Could it be the root cause of the global reservation pressure? I've found this patch from Filipe (Cc'd): https://patchwork.kernel.org/patch/10205013/. Should I apply it to our 4.14.y kernel and test the impact on intensive rsync workloads? Thank you Martin
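The no-op ftruncate pattern Martin describes is easy to sidestep in tools one controls: compare st_size first and only issue the syscall when the length actually changes. A minimal sketch (the helper name is made up; this illustrates the workaround, not rsync's actual code path):

```python
import os
import tempfile

def truncate_if_needed(fd, length):
    """Skip the syscall when the file already has the target length --
    the common case described above, where ftruncate() is a no-op
    size-wise but still goes through the expensive btrfs_truncate()."""
    if os.fstat(fd).st_size == length:
        return False                 # already the right size, nothing done
    os.ftruncate(fd, length)
    return True

fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
print(truncate_if_needed(fd, 5))     # False: already 5 bytes, syscall skipped
print(truncate_if_needed(fd, 3))     # True: actually truncated
print(os.fstat(fd).st_size)          # 3
os.close(fd)
os.unlink(path)
```

Whether the kernel-side fix (Filipe's patch) or a userspace skip like this helps more would have to be measured under the real rsync workload.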
Re: metadata_ratio mount option?
Hello Chris, Dne 7.5.2018 v 16:49 Chris Mason napsal(a): > On 7 May 2018, at 7:40, Martin Svec wrote: > >> Hi, >> >> According to man btrfs [1], I assume that metadata_ratio=1 mount option >> should >> force allocation of one metadata chunk after every allocated data chunk. >> However, >> when I set this option and start filling btrfs with "dd if=/dev/zero >> of=dummyfile.dat", >> only data chunks are allocated but no metadata ones. So, how does the >> metadata_ratio >> option really work? >> >> Note that I'm trying to use this option as a workaround of the bug reported >> here: >> > > [ urls that FB email server eats, sorry ] It's link to "Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs" thread :) > >> >> i.e. I want to manually preallocate metadata chunks to avoid nightly ENOSPC >> errors. > > > metadata_ratio is almost but not quite what you want. It sets a flag on the > space_info to force a > chunk allocation the next time we decide to call should_alloc_chunk(). > Thanks to the overcommit > code, we usually don't call that until the metadata we think we're going to > need is bigger than > the metadata space available. In other words, by the time we're into the > code that honors the > force flag, reservations are already high enough to make us allocate the > chunk anyway. Yeah, that's how I understood the code. So I think metadata_ratio man section is quite confusing because it implies that btrfs guarantees given metadata to data chunk space ratio, which isn't true. > > I tried to use metadata_ratio to experiment with forcing more metadata slop > space, but really I > have to tweak the overcommit code first. > Omar beat me to a better solution, tracking down our transient ENOSPC > problems here at FB to > reservations done for orphans. Do you have a lot of deleted files still > being held open? lsof > /mntpoint | grep deleted will list them. I'll take a look during backup window. 
The initial bug report describes our rsync workload in detail, for your reference. > > We're working through a patch for the orphans here. You've got a ton of > bytes pinned, which isn't > a great match for the symptoms we see: > > [285169.096630] BTRFS info (device sdb): space_info 4 has > 18446744072120172544 free, is not full > [285169.096633] BTRFS info (device sdb): space_info total=273804165120, > used=269218267136, > pinned=3459629056, reserved=52396032, may_use=2663120896, readonly=131072 > > But, your may_use count is high enough that you might be hitting this > problem. Otherwise I'll > work out a patch to make some more metadata chunks while Josef is perfecting > his great delayed ref > update. As mentioned in the bug report, we have a custom patch that dedicates SSDs for metadata chunks and HDDs for data chunks. So, all we need is to preallocate metadata chunks to occupy all of the SSD space and our issues will be gone. Note that btrfs with SSD-backed metadata works absolutely great for rsync backups, even if there are billions of files and thousands of snapshots. The global reservation ENOSPC is the last issue we're struggling with. Thank you Martin
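Chris's description of the mechanism — metadata_ratio merely sets a force flag on the space_info, and should_alloc_chunk() only consults that flag later, by which point overcommit has usually driven reservations high enough to allocate anyway — can be modeled in a few lines. This is an illustrative sketch, not the kernel code; the names mirror the kernel's but the logic is deliberately simplified:

```python
class SpaceInfo:
    """Minimal stand-in for the kernel's btrfs space_info."""
    def __init__(self):
        self.force_alloc = False

def force_metadata_allocation(meta_sinfo):
    # This is effectively all metadata_ratio does: flag the metadata
    # space_info whenever a data chunk gets allocated.
    meta_sinfo.force_alloc = True

def should_alloc_chunk(sinfo, bytes_needed, bytes_free):
    # The flag is only honored here -- and overcommit usually delays this
    # call until reservations would force an allocation anyway, which is
    # why the option doesn't behave like eager preallocation.
    if sinfo.force_alloc:
        sinfo.force_alloc = False
        return True
    return bytes_needed > bytes_free

meta = SpaceInfo()
print(should_alloc_chunk(meta, 1, 100))   # False: space left, no force flag
force_metadata_allocation(meta)           # what metadata_ratio triggers
print(should_alloc_chunk(meta, 1, 100))   # True: flag forces the allocation
```

The model makes the documentation confusion concrete: nothing in this path guarantees a metadata-to-data chunk ratio; the flag only changes the outcome of a check that may not run until much later.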
metadata_ratio mount option?
Hi, According to man btrfs [1], I assume that metadata_ratio=1 mount option should force allocation of one metadata chunk after every allocated data chunk. However, when I set this option and start filling btrfs with "dd if=/dev/zero of=dummyfile.dat", only data chunks are allocated but no metadata ones. So, how does the metadata_ratio option really work? Note that I'm trying to use this option as a workaround of the bug reported here: https://www.spinics.net/lists/linux-btrfs/msg75104.html i.e. I want to manually preallocate metadata chunks to avoid nightly ENOSPC errors. Best regards. Martin [1] https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs(5)#MOUNT_OPTIONS
Re: extent-tree.c no space left (4.9.77 + 4.16.2)
Hi David, this looks like the bug that I already reported two times: https://www.spinics.net/lists/linux-btrfs/msg54394.html https://www.spinics.net/lists/linux-btrfs/msg75104.html The second thread contains Nikolay's debug patch that can confirm if you run out of global metadata reservations too. Martin Dne 21.4.2018 v 9:38 David Goodwin napsal(a): > Hi, > > I'm running a 3TiB EBS based (2+1TiB devices) volume in EC2 which contains > about 500 read-only > snapshots. > > btrfs-progs v4.7.3 > > There are two dmesg trace things below. The first one from a 4.9.77 kernel - > > [ cut here ] > BTRFS: error (device xvdg) in btrfs_run_delayed_refs:2967: errno=-28 No space > left > BTRFS info (device xvdg): forced readonlyApr 19 11:44:40 gateway1 kernel: > [7648104.300115] > WARNING: CPU: 2 PID: 963 at fs/btrfs/extent-tree.c:2967 > btrfs_run_delayed_refs+0x27e/0x2b0 > [btrfs]Apr 19 11:44:40 gateway1 kernel: [7648104.313268] BTRFS: Transaction > aborted (error -28) > Modules linked in: dm_mod nfsv3 ipt_REJECT nf_reject_ipv4 ipt_MASQUERADE > nf_nat_masquerade_ipv4 > iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat_ftp > nf_conntrack_ftp nf_nat > nf_conntrack xt_mu > nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc evdev > intel_rapl > crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_pcsp snd_pcm > aesni_intel aes_x86_64 lrw > gf128mul glue_helper snd_timer ablk_helper snd cryptd soundcore ext4 crc16 > jbd2 mbcache btrfs xor > raid6_pq xen_netfront xen_blkfront crc32c_intel > CPU: 2 PID: 963 Comm: btrfs-transacti Not tainted 4.9.77-dg1 #1Apr 19 > 11:44:40 gateway1 kernel: > [7648104.408561] 812f17a4 c90043203d08 > > 8107389e a0157d5a c90043203d58 8802ccfd7170 > 880394684800 880394684800 0007315c 8107390f > Call Trace: > [] ? dump_stack+0x5c/0x78 > [] ? __warn+0xbe/0xe0 > [] ? warn_slowpath_fmt+0x4f/0x60 > [] ? btrfs_run_delayed_refs+0x27e/0x2b0 [btrfs] > [] ? btrfs_release_path+0x13/0x80 [btrfs] > [] ? 
btrfs_start_dirty_block_groups+0x2c2/0x450 [btrfs] > [] ? btrfs_commit_transaction+0x14c/0xa30 [btrfs] > [] ? start_transaction+0x96/0x480 [btrfs] > [] ? transaction_kthread+0x1dc/0x200 [btrfs] > [] ? btrfs_cleanup_transaction+0x550/0x550 [btrfs] > [] ? kthread+0xc7/0xe0 > [] ? kthread_park+0x60/0x60 > [] ? ret_from_fork+0x54/0x60 > ---[ end trace 69ca1332d91b4310 ]--- > BTRFS: error (device xvdg) in btrfs_run_delayed_refs:2967: errno=-28 No space > left > BTRFS error (device xvdg): parent transid verify failed on 5400398217216 > wanted 1893543 found 1893366 > > > > On checking btrfs fi us there was plenty of unallocated space left. > > % btrfs fi us /broken/ > > Overall: > Device size: 3.06TiB > Device allocated: 2.43TiB > Device unallocated: 643.09GiB > Device missing: 0.00B > Used: 2.43TiB > Free (estimated): 646.41GiB (min: 646.41GiB) > Data ratio: 1.00 > Metadata ratio: 1.00 > Global reserve: 512.00MiB (used: 0.00B) > > > > The VM was then rebooted with a 4.16.2 kernel, which encountered what I > assume is the same problem: > > > [ cut here ] > BTRFS: Transaction aborted (error -28) > WARNING: CPU: 2 PID: 981 at fs/btrfs/extent-tree.c:6990 > __btrfs_free_extent.isra.63+0x3d2/0xd20 > [btrfs] > Modules linked in: nfsv3 ipt_REJECT nf_reject_ipv4 ipt_MASQUERADE > nf_nat_masquerade_ipv4 > iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat_ftp > nf_conntrack_ftp nf_nat > nf_conntrack libcrc32c crc32c_generic xt_multiport iptable_filter ip_tables > x_tables autofs4 nfsd > auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc intel_rapl > crct10dif_pclmul crc32_pclmul > ghash_clmulni_intel evdev pcbc snd_pcsp aesni_intel snd_pcm aes_x86_64 > snd_timer crypto_simd > glue_helper snd cryptd soundcore ext4 crc16 mbcache jbd2 btrfs xor > zstd_decompress zstd_compress > xxhash raid6_pq xen_netfront xen_blkfront crc32c_intel > CPU: 2 PID: 981 Comm: btrfs-transacti Not tainted 4.16.2-dg1 #1 > RIP: e030:__btrfs_free_extent.isra.63+0x3d2/0xd20 [btrfs] > RSP: 
e02b:c900428d7c68 EFLAGS: 00010292 > RAX: 0026 RBX: 01fb8031c000 RCX: 0006 > RDX: 0007 RSI: 0001 RDI: 88039a916650 > RBP: ffe4 R08: 0001 R09: 010a > R10: 0001 R11: 010a R12: 8803957e6000 > R13: 88036f5a9e70 R14: R15:
Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs
Dne 10.3.2018 v 15:51 Martin Svec napsal(a): > Dne 10.3.2018 v 13:13 Nikolay Borisov napsal(a): >> >> >>>>> And then report back on the output of the extra debug >>>>> statements. >>>>> >>>>> Your global rsv is essentially unused, this means >>>>> in the worst case the code should fallback to using the global rsv >>>>> for satisfying the memory allocation for delayed refs. So we should >>>>> figure out why this isn't' happening. >>>> Patch applied. Thank you very much, Nikolay. I'll let you know as soon as >>>> we hit ENOSPC again. >>> There is the output: >>> >>> [24672.573075] BTRFS info (device sdb): space_info 4 has >>> 18446744072971649024 free, is not full >>> [24672.573077] BTRFS info (device sdb): space_info total=308163903488, >>> used=304593289216, pinned=2321940480, reserved=174800896, >>> may_use=1811644416, readonly=131072 >>> [24672.573079] use_block_rsv: Not using global blockrsv! Current >>> blockrsv->type = 1 blockrsv->space_info = 999a57db7000 >>> global_rsv->space_info = 999a57db7000 >>> [24672.573083] BTRFS: Transaction aborted (error -28) >> Bummer, so you are indeed running out of global space reservations in >> context which can't really use any other reservation type, thus the >> ENOSPC. Was the stacktrace again during processing of running delayed refs? > Yes, the stacktrace is below. 
> > [24672.573132] WARNING: CPU: 3 PID: 808 at fs/btrfs/extent-tree.c:3089 > btrfs_run_delayed_refs+0x259/0x270 [btrfs] > [24672.573132] Modules linked in: binfmt_misc xt_comment xt_tcpudp > iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw > ip6table_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 > nf_nat nf_conntrack ip6table_mangle ip6table_raw ip6_tables iptable_mangle > intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul > ghash_clmulni_intel pcbc aesni_intel snd_pcm aes_x86_64 snd_timer crypto_simd > glue_helper snd cryptd soundcore iTCO_wdt intel_cstate joydev > iTCO_vendor_support pcspkr dcdbas intel_uncore sg serio_raw evdev lpc_ich > mgag200 ttm drm_kms_helper drm i2c_algo_bit shpchp mfd_core i7core_edac > ipmi_si ipmi_devintf acpi_power_meter ipmi_msghandler button acpi_cpufreq > ip_tables x_tables autofs4 xfs libcrc32c crc32c_generic btrfs xor > zstd_decompress zstd_compress > [24672.573161] xxhash hid_generic usbhid hid raid6_pq sd_mod crc32c_intel > psmouse uhci_hcd ehci_pci ehci_hcd megaraid_sas usbcore scsi_mod bnx2 > [24672.573170] CPU: 3 PID: 808 Comm: btrfs-transacti Tainted: GW I > 4.14.23-znr8+ #73 > [24672.573171] Hardware name: Dell Inc. 
PowerEdge R510/0DPRKF, BIOS 1.6.3 > 02/01/2011 > [24672.573172] task: 999a23229140 task.stack: a85642094000 > [24672.573186] RIP: 0010:btrfs_run_delayed_refs+0x259/0x270 [btrfs] > [24672.573187] RSP: 0018:a85642097de0 EFLAGS: 00010282 > [24672.573188] RAX: 0026 RBX: 99975c75c3c0 RCX: > 0006 > [24672.573189] RDX: RSI: 0082 RDI: > 999a6fcd66f0 > [24672.573190] RBP: 95c24d68 R08: 0001 R09: > 0479 > [24672.573190] R10: 99974b1960e0 R11: 0479 R12: > 999a5a65 > [24672.573191] R13: 999a5a6511f0 R14: R15: > > [24672.573192] FS: () GS:999a6fcc() > knlGS: > [24672.573193] CS: 0010 DS: ES: CR0: 80050033 > [24672.573194] CR2: 558bfd56dfd0 CR3: 00030a60a005 CR4: > 000206e0 > [24672.573195] Call Trace: > [24672.573215] btrfs_commit_transaction+0x3e1/0x950 [btrfs] > [24672.573231] ? start_transaction+0x89/0x410 [btrfs] > [24672.573246] transaction_kthread+0x195/0x1b0 [btrfs] > [24672.573249] kthread+0xfc/0x130 > [24672.573265] ? btrfs_cleanup_transaction+0x580/0x580 [btrfs] > [24672.573266] ? kthread_create_on_node+0x70/0x70 > [24672.573269] ret_from_fork+0x35/0x40 > [24672.573270] Code: c7 c6 20 e8 37 c0 48 89 df 44 89 04 24 e8 59 bc 09 00 44 > 8b 04 24 eb 86 44 89 c6 48 c7 c7 30 58 38 c0 44 89 04 24 e8 82 30 3f cf <0f> > 0b 44 8b 04 24 eb c4 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 > [24672.573292] ---[ end trace b17d927a946cb02e ]--- > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Again, another ENOSPC due to lack of global rsv space in the context of d
Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs
Dne 10.3.2018 v 13:13 Nikolay Borisov napsal(a): > > > And then report back on the output of the extra debug statements. Your global rsv is essentially unused, this means in the worst case the code should fallback to using the global rsv for satisfying the memory allocation for delayed refs. So we should figure out why this isn't' happening. >>> Patch applied. Thank you very much, Nikolay. I'll let you know as soon as >>> we hit ENOSPC again. >> There is the output: >> >> [24672.573075] BTRFS info (device sdb): space_info 4 has >> 18446744072971649024 free, is not full >> [24672.573077] BTRFS info (device sdb): space_info total=308163903488, >> used=304593289216, pinned=2321940480, reserved=174800896, >> may_use=1811644416, readonly=131072 >> [24672.573079] use_block_rsv: Not using global blockrsv! Current >> blockrsv->type = 1 blockrsv->space_info = 999a57db7000 >> global_rsv->space_info = 999a57db7000 >> [24672.573083] BTRFS: Transaction aborted (error -28) > Bummer, so you are indeed running out of global space reservations in > context which can't really use any other reservation type, thus the > ENOSPC. Was the stacktrace again during processing of running delayed refs? Yes, the stacktrace is below. 
[24672.573132] WARNING: CPU: 3 PID: 808 at fs/btrfs/extent-tree.c:3089 btrfs_run_delayed_refs+0x259/0x270 [btrfs] [24672.573132] Modules linked in: binfmt_misc xt_comment xt_tcpudp iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw ip6table_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip6table_mangle ip6table_raw ip6_tables iptable_mangle intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel snd_pcm aes_x86_64 snd_timer crypto_simd glue_helper snd cryptd soundcore iTCO_wdt intel_cstate joydev iTCO_vendor_support pcspkr dcdbas intel_uncore sg serio_raw evdev lpc_ich mgag200 ttm drm_kms_helper drm i2c_algo_bit shpchp mfd_core i7core_edac ipmi_si ipmi_devintf acpi_power_meter ipmi_msghandler button acpi_cpufreq ip_tables x_tables autofs4 xfs libcrc32c crc32c_generic btrfs xor zstd_decompress zstd_compress [24672.573161] xxhash hid_generic usbhid hid raid6_pq sd_mod crc32c_intel psmouse uhci_hcd ehci_pci ehci_hcd megaraid_sas usbcore scsi_mod bnx2 [24672.573170] CPU: 3 PID: 808 Comm: btrfs-transacti Tainted: GW I 4.14.23-znr8+ #73 [24672.573171] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.6.3 02/01/2011 [24672.573172] task: 999a23229140 task.stack: a85642094000 [24672.573186] RIP: 0010:btrfs_run_delayed_refs+0x259/0x270 [btrfs] [24672.573187] RSP: 0018:a85642097de0 EFLAGS: 00010282 [24672.573188] RAX: 0026 RBX: 99975c75c3c0 RCX: 0006 [24672.573189] RDX: RSI: 0082 RDI: 999a6fcd66f0 [24672.573190] RBP: 95c24d68 R08: 0001 R09: 0479 [24672.573190] R10: 99974b1960e0 R11: 0479 R12: 999a5a65 [24672.573191] R13: 999a5a6511f0 R14: R15: [24672.573192] FS: () GS:999a6fcc() knlGS: [24672.573193] CS: 0010 DS: ES: CR0: 80050033 [24672.573194] CR2: 558bfd56dfd0 CR3: 00030a60a005 CR4: 000206e0 [24672.573195] Call Trace: [24672.573215] btrfs_commit_transaction+0x3e1/0x950 [btrfs] [24672.573231] ? 
start_transaction+0x89/0x410 [btrfs] [24672.573246] transaction_kthread+0x195/0x1b0 [btrfs] [24672.573249] kthread+0xfc/0x130 [24672.573265] ? btrfs_cleanup_transaction+0x580/0x580 [btrfs] [24672.573266] ? kthread_create_on_node+0x70/0x70 [24672.573269] ret_from_fork+0x35/0x40 [24672.573270] Code: c7 c6 20 e8 37 c0 48 89 df 44 89 04 24 e8 59 bc 09 00 44 8b 04 24 eb 86 44 89 c6 48 c7 c7 30 58 38 c0 44 89 04 24 e8 82 30 3f cf <0f> 0b 44 8b 04 24 eb c4 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 [24672.573292] ---[ end trace b17d927a946cb02e ]--- -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
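The failure mode Nikolay describes — a context that may only draw from the global reserve, which is itself exhausted — can be sketched in miniature. The following is an illustrative userspace model, not the kernel's use_block_rsv(); the struct, field names, and the may_use_global flag are all stand-ins for the real block-rsv types and fallback rules:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy model of a block reservation: size is what was set aside,
 * used is what has already been consumed from it. */
struct block_rsv {
    uint64_t size;
    uint64_t used;
};

/* Take bytes from a reservation, or fail with -ENOSPC. */
static int rsv_take(struct block_rsv *rsv, uint64_t bytes)
{
    if (rsv->size - rsv->used >= bytes) {
        rsv->used += bytes;
        return 0;
    }
    return -ENOSPC;
}

/* Try the local (transaction) reservation first; only privileged
 * contexts may fall back to the global reserve. If the global
 * reserve is also exhausted, the allocation aborts the transaction
 * with -ENOSPC even though the disk still has unallocated space. */
static int reserve_tree_block(struct block_rsv *local,
                              struct block_rsv *global,
                              uint64_t bytes, int may_use_global)
{
    if (rsv_take(local, bytes) == 0)
        return 0;
    if (may_use_global)
        return rsv_take(global, bytes);
    return -ENOSPC;
}
```

The point of the sketch: "plenty of unallocated space" in `btrfs fi usage` is irrelevant once a transaction-critical path has exhausted both its own reservation and the 512 MiB global reserve.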
Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs
Dne 9.3.2018 v 20:03 Martin Svec napsal(a): > Dne 9.3.2018 v 17:36 Nikolay Borisov napsal(a): >> On 23.02.2018 16:28, Martin Svec wrote: >>> Hello, >>> >>> we have a btrfs-based backup system using btrfs snapshots and rsync. >>> Sometimes, >>> we hit ENOSPC bug and the filesystem is remounted read-only. However, >>> there's >>> still plenty of unallocated space according to "btrfs fi usage". So I think >>> this >>> isn't another edge condition when btrfs runs out of space due to fragmented >>> chunks, >>> but a bug in disk space allocation code. It suffices to umount the >>> filesystem and >>> remount it back and it works fine again. The frequency of ENOSPC seems to be >>> dependent on metadata chunks usage. When there's a lot of free space in >>> existing >>> metadata chunks, the bug doesn't happen for months. If most metadata chunks >>> are >>> above ~98%, we hit the bug every few days. Below are details regarding the >>> backup >>> server and btrfs. >>> >>> The backup works as follows: >>> >>> * Every night, we create a btrfs snapshot on the backup server and rsync >>> data >>> from a production server into it. This snapshot is then marked >>> read-only and >>> will be used as a base subvolume for the next backup snapshot. >>> * Every day, expired snapshots are removed and their space is freed. >>> Cleanup >>> is scheduled in such a way that it doesn't interfere with the backup >>> window. >>> * Multiple production servers are backed up in parallel to one backup >>> server. >>> * The backed up servers are mostly webhosting servers and mail servers, >>> i.e. >>> hundreds of billions of small files. (Yes, we push btrfs to the limits >>> :-)) >>> * Backup server contains ~1080 snapshots, Zlib compression is enabled. >>> * Rsync is configured to use whole file copying. 
>>> >>> System configuration: >>> >>> Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch >>> (see below) and >>> Nikolay's patch 1b816c23e9 (btrfs: Add enospc_debug printing in >>> metadata_reserve_bytes) >>> >>> btrfs mount options: >>> noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15 >>> >>> $ btrfs fi df /backup: >>> >>> Data, single: total=28.05TiB, used=26.37TiB >>> System, single: total=32.00MiB, used=3.53MiB >>> Metadata, single: total=255.00GiB, used=250.73GiB >>> GlobalReserve, single: total=512.00MiB, used=0.00B >>> >>> $ btrfs fi show /backup: >>> >>> Label: none uuid: a52501a9-651c-4712-a76b-7b4238cfff63 >>> Total devices 2 FS bytes used 26.62TiB >>> devid1 size 416.62GiB used 255.03GiB path /dev/sdb >>> devid2 size 36.38TiB used 28.05TiB path /dev/sdc >>> >>> $ btrfs fi usage /backup: >>> >>> Overall: >>> Device size: 36.79TiB >>> Device allocated: 28.30TiB >>> Device unallocated:8.49TiB >>> Device missing: 0.00B >>> Used: 26.62TiB >>> Free (estimated): 10.17TiB (min: 10.17TiB) >>> Data ratio: 1.00 >>> Metadata ratio: 1.00 >>> Global reserve: 512.00MiB (used: 0.00B) >>> >>> Data,single: Size:28.05TiB, Used:26.37TiB >>>/dev/sdc 28.05TiB >>> >>> Metadata,single: Size:255.00GiB, Used:250.73GiB >>>/dev/sdb 255.00GiB >>> >>> System,single: Size:32.00MiB, Used:3.53MiB >>>/dev/sdb 32.00MiB >>> >>> Unallocated: >>>/dev/sdb 161.59GiB >>>/dev/sdc8.33TiB >>> >>> Btrfs filesystem uses two logical drives in single mode, backed by >>> hardware RAID controller PERC H710; /dev/sdb is HW RAID1 consisting >>> of two SATA SSDs and /dev/sdc is HW RAID6 SATA volume. >>> >>> Please note that we have a simple custom patch in btrfs which ensures >>> that metadata chunks are allocated preferably on SSD volume and data >>> chunks are allocated only on SATA volume. The patch slightly modifies >>> __btrfs_alloc_chunk() so that its loop over devices ignores rotating >>> dev
Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs
Dne 9.3.2018 v 17:36 Nikolay Borisov napsal(a): > > On 23.02.2018 16:28, Martin Svec wrote: >> Hello, >> >> we have a btrfs-based backup system using btrfs snapshots and rsync. >> Sometimes, >> we hit ENOSPC bug and the filesystem is remounted read-only. However, >> there's >> still plenty of unallocated space according to "btrfs fi usage". So I think >> this >> isn't another edge condition when btrfs runs out of space due to fragmented >> chunks, >> but a bug in disk space allocation code. It suffices to umount the >> filesystem and >> remount it back and it works fine again. The frequency of ENOSPC seems to be >> dependent on metadata chunks usage. When there's a lot of free space in >> existing >> metadata chunks, the bug doesn't happen for months. If most metadata chunks >> are >> above ~98%, we hit the bug every few days. Below are details regarding the >> backup >> server and btrfs. >> >> The backup works as follows: >> >> * Every night, we create a btrfs snapshot on the backup server and rsync >> data >> from a production server into it. This snapshot is then marked read-only >> and >> will be used as a base subvolume for the next backup snapshot. >> * Every day, expired snapshots are removed and their space is freed. >> Cleanup >> is scheduled in such a way that it doesn't interfere with the backup >> window. >> * Multiple production servers are backed up in parallel to one backup >> server. >> * The backed up servers are mostly webhosting servers and mail servers, >> i.e. >> hundreds of billions of small files. (Yes, we push btrfs to the limits >> :-)) >> * Backup server contains ~1080 snapshots, Zlib compression is enabled. >> * Rsync is configured to use whole file copying. 
>> >> System configuration: >> >> Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch >> (see below) and >> Nikolay's patch 1b816c23e9 (btrfs: Add enospc_debug printing in >> metadata_reserve_bytes) >> >> btrfs mount options: >> noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15 >> >> $ btrfs fi df /backup: >> >> Data, single: total=28.05TiB, used=26.37TiB >> System, single: total=32.00MiB, used=3.53MiB >> Metadata, single: total=255.00GiB, used=250.73GiB >> GlobalReserve, single: total=512.00MiB, used=0.00B >> >> $ btrfs fi show /backup: >> >> Label: none uuid: a52501a9-651c-4712-a76b-7b4238cfff63 >> Total devices 2 FS bytes used 26.62TiB >> devid1 size 416.62GiB used 255.03GiB path /dev/sdb >> devid2 size 36.38TiB used 28.05TiB path /dev/sdc >> >> $ btrfs fi usage /backup: >> >> Overall: >> Device size: 36.79TiB >> Device allocated: 28.30TiB >> Device unallocated:8.49TiB >> Device missing: 0.00B >> Used: 26.62TiB >> Free (estimated): 10.17TiB (min: 10.17TiB) >> Data ratio: 1.00 >> Metadata ratio: 1.00 >> Global reserve: 512.00MiB (used: 0.00B) >> >> Data,single: Size:28.05TiB, Used:26.37TiB >>/dev/sdc 28.05TiB >> >> Metadata,single: Size:255.00GiB, Used:250.73GiB >>/dev/sdb 255.00GiB >> >> System,single: Size:32.00MiB, Used:3.53MiB >>/dev/sdb 32.00MiB >> >> Unallocated: >>/dev/sdb 161.59GiB >>/dev/sdc8.33TiB >> >> Btrfs filesystem uses two logical drives in single mode, backed by >> hardware RAID controller PERC H710; /dev/sdb is HW RAID1 consisting >> of two SATA SSDs and /dev/sdc is HW RAID6 SATA volume. >> >> Please note that we have a simple custom patch in btrfs which ensures >> that metadata chunks are allocated preferably on SSD volume and data >> chunks are allocated only on SATA volume. The patch slightly modifies >> __btrfs_alloc_chunk() so that its loop over devices ignores rotating >> devices when a metadata chunk is requested and vice versa. 
However, >> I'm quite sure that this patch doesn't cause the reported bug because >> we log every call of the modified code and there're no __btrfs_alloc_chunk() >> calls when ENOSPC is triggered. Moreover, we observed the same bug before >> we developed the patch. (IIRC, Chris Mason mentioned that they work on >> a similar feature in faceb
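The device-steering patch Martin describes (metadata chunks preferred on SSD, data chunks only on rotating disks) boils down to a per-device predicate inside the chunk allocator's device loop. A minimal sketch of that predicate, assuming it applies to __btrfs_alloc_chunk()'s candidate filtering — the enum, struct, and "rotating" field here are illustrative stand-ins for the kernel's BTRFS_BLOCK_GROUP_* flags and the per-device rotational queue flag, not the actual patch:

```c
#include <stdbool.h>

/* Chunk classes, mirroring data/metadata/system block group types. */
enum chunk_type { CHUNK_DATA, CHUNK_METADATA, CHUNK_SYSTEM };

struct device_info {
    bool rotating;   /* true for spinning disks, false for SSDs */
};

/* Skip rotating devices for metadata/system chunks, and skip SSDs
 * for data chunks, so each chunk class lands on the intended tier
 * (in the report: metadata on /dev/sdb, data on /dev/sdc). */
static bool device_usable_for(const struct device_info *dev,
                              enum chunk_type type)
{
    if (type == CHUNK_METADATA || type == CHUNK_SYSTEM)
        return !dev->rotating;
    return dev->rotating;
}
```

In the chunk allocator's loop over devices, a device failing this predicate would simply be passed over for the requested chunk type.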
Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs
Nobody knows? I'm particularly interested why debug space_info 4 shows negative (unsigned 18446744072120172544) value as free metadata space, please see the original report. Is it a bug in dump_space_info(), or metadata reservations can temporarily exceed the total space, or is it an indication of a damaged filesystem? Also note that rebuilding free space cache doesn't help. Thank you. Martin Dne 23.2.2018 v 15:28 Martin Svec napsal(a): > Hello, > > we have a btrfs-based backup system using btrfs snapshots and rsync. > Sometimes, > we hit ENOSPC bug and the filesystem is remounted read-only. However, there's > still plenty of unallocated space according to "btrfs fi usage". So I think > this > isn't another edge condition when btrfs runs out of space due to fragmented > chunks, > but a bug in disk space allocation code. It suffices to umount the filesystem > and > remount it back and it works fine again. The frequency of ENOSPC seems to be > dependent on metadata chunks usage. When there's a lot of free space in > existing > metadata chunks, the bug doesn't happen for months. If most metadata chunks > are > above ~98%, we hit the bug every few days. Below are details regarding the > backup > server and btrfs. > > The backup works as follows: > > * Every night, we create a btrfs snapshot on the backup server and rsync > data > from a production server into it. This snapshot is then marked read-only > and > will be used as a base subvolume for the next backup snapshot. > * Every day, expired snapshots are removed and their space is freed. Cleanup > is scheduled in such a way that it doesn't interfere with the backup > window. > * Multiple production servers are backed up in parallel to one backup > server. > * The backed up servers are mostly webhosting servers and mail servers, i.e. > hundreds of billions of small files. (Yes, we push btrfs to the limits > :-)) > * Backup server contains ~1080 snapshots, Zlib compression is enabled. 
> * Rsync is configured to use whole file copying. > > System configuration: > > Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch > (see below) and > Nikolay's patch 1b816c23e9 (btrfs: Add enospc_debug printing in > metadata_reserve_bytes) > > btrfs mount options: > noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15 > > $ btrfs fi df /backup: > > Data, single: total=28.05TiB, used=26.37TiB > System, single: total=32.00MiB, used=3.53MiB > Metadata, single: total=255.00GiB, used=250.73GiB > GlobalReserve, single: total=512.00MiB, used=0.00B > > $ btrfs fi show /backup: > > Label: none uuid: a52501a9-651c-4712-a76b-7b4238cfff63 > Total devices 2 FS bytes used 26.62TiB > devid1 size 416.62GiB used 255.03GiB path /dev/sdb > devid2 size 36.38TiB used 28.05TiB path /dev/sdc > > $ btrfs fi usage /backup: > > Overall: > Device size: 36.79TiB > Device allocated: 28.30TiB > Device unallocated:8.49TiB > Device missing: 0.00B > Used: 26.62TiB > Free (estimated): 10.17TiB (min: 10.17TiB) > Data ratio: 1.00 > Metadata ratio: 1.00 > Global reserve: 512.00MiB (used: 0.00B) > > Data,single: Size:28.05TiB, Used:26.37TiB >/dev/sdc 28.05TiB > > Metadata,single: Size:255.00GiB, Used:250.73GiB >/dev/sdb 255.00GiB > > System,single: Size:32.00MiB, Used:3.53MiB >/dev/sdb 32.00MiB > > Unallocated: >/dev/sdb 161.59GiB >/dev/sdc8.33TiB > > Btrfs filesystem uses two logical drives in single mode, backed by > hardware RAID controller PERC H710; /dev/sdb is HW RAID1 consisting > of two SATA SSDs and /dev/sdc is HW RAID6 SATA volume. > > Please note that we have a simple custom patch in btrfs which ensures > that metadata chunks are allocated preferably on SSD volume and data > chunks are allocated only on SATA volume. The patch slightly modifies > __btrfs_alloc_chunk() so that its loop over devices ignores rotating > devices when a metadata chunk is requested and vice versa. 
However, > I'm quite sure that this patch doesn't cause the reported bug because > we log every call of the modified code and there're no __btrfs_alloc_chunk() > calls when ENOSPC is triggered. Moreover, we observed the same bug before > we developed the patch. (IIRC, Chris Mason mentioned that they work on > a similar feature in facebook, but I've found no official patches yet.) > > Dmesg dump: > >
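The "negative" free value Martin asks about is consistent with plain unsigned 64-bit underflow: free space is derived by subtracting the committed counters from the total, and when outstanding reservations (may_use) temporarily push the sum past the allocated total, the u64 subtraction wraps instead of going negative. Plugging the exact numbers from the space_info dump in this thread reproduces the printed value (the field list and subtraction order are an assumption about how the debug line is computed, but the arithmetic matches):

```c
#include <stdint.h>

/* Free space as a space_info dump would derive it: total minus all
 * committed/reserved counters, in unsigned 64-bit arithmetic. When
 * the counters exceed the total, this wraps to a huge u64 instead
 * of a negative number. */
static uint64_t space_info_free(uint64_t total, uint64_t used,
                                uint64_t pinned, uint64_t reserved,
                                uint64_t may_use, uint64_t readonly)
{
    return total - used - pinned - reserved - may_use - readonly;
}
```

With the reported values (total=273804165120, used=269218267136, pinned=3459629056, reserved=174800896 in one dump, 52396032 in the other, may_use=2663120896, readonly=131072) the result is 18446744072120172544, i.e. about -1.48 GiB reinterpreted as unsigned — roughly the amount by which reservations overshot the allocated metadata space, not filesystem damage. The same formula also reproduces the 18446744072971649024 value from the earlier dump in this thread.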
Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs
[285167.750879] RSP: 0018:ba48c1ecf958 EFLAGS: 00010282 [285167.750880] RAX: 001d RBX: 9c4a1c2ce128 RCX: 0006 [285167.750881] RDX: RSI: 0096 RDI: 9c4a2fd566f0 [285167.750882] RBP: 4000 R08: 0001 R09: 03dc [285167.750883] R10: 0001 R11: 03dc R12: 9c4a1c2ce000 [285167.750883] R13: 9c4a17692800 R14: 0001 R15: ffe4 [285167.750885] FS: () GS:9c4a2fd4() knlGS: [285167.750885] CS: 0010 DS: ES: CR0: 80050033 [285167.750886] CR2: 56250e55bfd0 CR3: 0ee0a003 CR4: 000206e0 [285167.750887] Call Trace: [285167.750903] __btrfs_cow_block+0x125/0x5c0 [btrfs] [285167.750917] btrfs_cow_block+0xcb/0x1b0 [btrfs] [285167.750929] btrfs_search_slot+0x1fd/0x9e0 [btrfs] [285167.750943] lookup_inline_extent_backref+0x105/0x610 [btrfs] [285167.750961] ? set_extent_bit+0x19/0x20 [btrfs] [285167.750974] __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs] [285167.750992] ? btrfs_merge_delayed_refs+0x63/0x560 [btrfs] [285167.751006] __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs] [285167.751021] btrfs_run_delayed_refs+0x7a/0x270 [btrfs] [285167.751037] btrfs_commit_transaction+0x3e1/0x950 [btrfs] [285167.751053] ? start_transaction+0x89/0x410 [btrfs] [285167.751068] transaction_kthread+0x195/0x1b0 [btrfs] [285167.751071] kthread+0xfc/0x130 [285167.751087] ? btrfs_cleanup_transaction+0x580/0x580 [btrfs] [285167.751088] ? 
kthread_create_on_node+0x70/0x70 [285167.751091] ret_from_fork+0x35/0x40 [285167.751092] Code: ff 48 c7 c6 28 d7 44 c0 48 c7 c7 a0 21 4a c0 e8 3c a5 4b cb 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 c0 4c 45 c0 e8 80 fd f1 ca <0f> ff e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 4e eb [285167.751114] ---[ end trace 8721883b5af677ec ]--- [285169.096630] BTRFS info (device sdb): space_info 4 has 18446744072120172544 free, is not full [285169.096633] BTRFS info (device sdb): space_info total=273804165120, used=269218267136, pinned=3459629056, reserved=52396032, may_use=2663120896, readonly=131072 [285169.096638] BTRFS: Transaction aborted (error -28) [285169.096664] [ cut here ] [285169.096691] WARNING: CPU: 7 PID: 443 at fs/btrfs/extent-tree.c:3089 btrfs_run_delayed_refs+0x259/0x270 [btrfs] [285169.096692] Modules linked in: binfmt_misc xt_comment xt_tcpudp iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw ip6table_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntr [285169.096722] zstd_compress xxhash raid6_pq sd_mod crc32c_intel psmouse uhci_hcd ehci_pci ehci_hcd megaraid_sas usbcore scsi_mod bnx2 [285169.096729] CPU: 7 PID: 443 Comm: btrfs-transacti Tainted: GW I 4.14.20-znr1+ #69 [285169.096730] Hardware name: Dell Inc. 
PowerEdge R510/0DPRKF, BIOS 1.6.3 02/01/2011 [285169.096731] task: 9c4a1740e280 task.stack: ba48c1ecc000 [285169.096745] RIP: 0010:btrfs_run_delayed_refs+0x259/0x270 [btrfs] [285169.096746] RSP: 0018:ba48c1ecfde0 EFLAGS: 00010282 [285169.096747] RAX: 0026 RBX: 9c47990c0780 RCX: 0006 [285169.096748] RDX: RSI: 0082 RDI: 9c4a2fdd66f0 [285169.096749] RBP: 9c493d509b68 R08: 0001 R09: 0403 [285169.096749] R10: 9c49731d6620 R11: 0403 R12: 9c4a1c2ce000 [285169.096750] R13: 9c4a1c2cf1f0 R14: R15: [285169.096751] FS: () GS:9c4a2fdc() knlGS: [285169.096752] CS: 0010 DS: ES: CR0: 80050033 [285169.096753] CR2: 55e70555bfe0 CR3: 0ee0a005 CR4: 000206e0 [285169.096754] Call Trace: [285169.096774] btrfs_commit_transaction+0x3e1/0x950 [btrfs] [285169.096790] ? start_transaction+0x89/0x410 [btrfs] [285169.096806] transaction_kthread+0x195/0x1b0 [btrfs] [285169.096809] kthread+0xfc/0x130 [285169.096825] ? btrfs_cleanup_transaction+0x580/0x580 [btrfs] [285169.096826] ? kthread_create_on_node+0x70/0x70 [285169.096828] ret_from_fork+0x35/0x40 [285169.096830] Code: c7 c6 20 d8 44 c0 48 89 df 44 89 04 24 e8 19 bb 09 00 44 8b 04 24 eb 86 44 89 c6 48 c7 c7 30 48 45 c0 44 89 04 24 e8 d2 40 f2 ca <0f> ff 44 8b 04 24 eb c4 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 [285169.096852] ---[ end trace 8721883b5af677ed ]--- [285169.096918] BTRFS: error (device sdb) in btrfs_run_delayed_refs:3089: errno=-28 No space left [285169.096976] BTRFS info (device sdb): forced readonly [285169.096979] BTRFS warning (device sdb): Skipping commit of aborted transaction. [285169.096981] BTRFS: error (device sdb) in cleanup_transaction:1873: errno=-28 No space left How can I help you to fix this issue? Regards, Martin Svec -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a m
Re: Recommendations for balancing as part of regular maintenance?
On 08.01.2018 19:34 Austin S. Hemmelgarn wrote: > On 2018-01-08 13:17, Graham Cobb wrote: >> On 08/01/18 16:34, Austin S. Hemmelgarn wrote: >>> Ideally, I think it should be as generic as reasonably possible, >>> possibly something along the lines of: >>> >>> A: While not strictly necessary, running regular filtered balances (for >>> example `btrfs balance start -dusage=50 -dlimit=2 -musage=50 >>> -mlimit=4`, >>> see `man btrfs-balance` for more info on what the options mean) can >>> help >>> keep a volume healthy by mitigating the things that typically cause >>> ENOSPC errors. Full balances by contrast are long and expensive >>> operations, and should be done only as a last resort. >> >> That recommendation is similar to what I do and it works well for my use >> case. I would recommend it to anyone with my usage, but cannot say how >> well it would work for other uses. In my case, I run balances like that >> once a week: some weeks nothing happens, other weeks 5 or 10 blocks may >> get moved. > > In my own usage I've got a pretty varied mix of other stuff going on. > All my systems are Gentoo, so system updates mean that I'm building > software regularly (though on most of the systems that happens on > tmpfs in RAM), I run a home server with a dozen low use QEMU VM's and > a bunch of transient test VM's, all of which I'm currently storing > disk images for raw on top of BTRFS (which is actually handling all of > it pretty well, though that may be thanks to all the VM's using > PV-SCSI for their disks), I run a BOINC client system that sees pretty > heavy filesystem usage, and have a lot of personal files that get > synced regularly across systems, and all of this is on raid1 with > essentially no snapshots. 
> For me the balance command I mentioned
> above run daily seems to help, even if the balance doesn't move much
> most of the time on most filesystems, and the actual balance
> operations take at most a few seconds most of the time (I've got
> reasonably nice SSD's in everything).

There have been reports of (rare) corruption caused by balance (which won't be detected by a scrub) here on the mailing list. So I would stay away from btrfs balance unless it is absolutely needed (ENOSPC), and while it is running I would try not to do any other writes simultaneously.
Btrfs blocked by too many delayed refs
Hi,

I have the problem that too many delayed refs block a btrfs filesystem. I have one thread that does work:

[] io_schedule+0x16/0x40
[] wait_on_page_bit+0x116/0x150
[] read_extent_buffer_pages+0x1c5/0x290
[] btree_read_extent_buffer_pages+0x9d/0x100
[] read_tree_block+0x32/0x50
[] read_block_for_search.isra.30+0x120/0x2e0
[] btrfs_search_slot+0x385/0x990
[] btrfs_insert_empty_items+0x71/0xc0
[] insert_extent_data_ref.isra.49+0x11b/0x2a0
[] __btrfs_inc_extent_ref.isra.59+0x1ee/0x220
[] __btrfs_run_delayed_refs+0x924/0x12c0
[] btrfs_run_delayed_refs+0x7a/0x260
[] create_pending_snapshot+0x5e4/0xf00
[] create_pending_snapshots+0x97/0xc0
[] btrfs_commit_transaction+0x395/0x930
[] btrfs_mksubvol+0x4a6/0x4f0
[] btrfs_ioctl_snap_create_transid+0x185/0x190
[] btrfs_ioctl_snap_create_v2+0x104/0x150
[] btrfs_ioctl+0x5e1/0x23b0
[] do_vfs_ioctl+0x92/0x5a0
[] SyS_ioctl+0x79/0x9

The others are in 'D' state, e.g. with:

[] call_rwsem_down_write_failed+0x17/0x30
[] filename_create+0x6b/0x150
[] SyS_mkdir+0x44/0xe0

Slabtop shows 2423910 btrfs_delayed_ref_head structs, slowly decreasing. What I think is happening is that delayed refs are added without being throttled by btrfs_should_throttle_delayed_refs(). This could happen by creating a snapshot of a file and then modifying it, i.e. some action that creates delayed refs but is neither a truncate (which is already throttled) nor something that commits a transaction (which is also throttled).

Regards,
Martin Raiber
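The throttling Martin suspects is being bypassed can be sketched as a backlog check: a writer should start helping to run delayed refs once flushing the accumulated ref heads would take longer than a transaction commit can absorb. This models the idea only — the real btrfs_should_throttle_delayed_refs() bases its estimate on a measured average runtime per ref head, and the one-second bound and constants below are made up for the example:

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Throttle writers once the estimated time to flush the delayed-ref
 * backlog (heads * average nanoseconds per head) exceeds ~1 second.
 * A path that adds refs without ever consulting a check like this
 * can grow the backlog unboundedly, as in the report. */
static bool should_throttle_delayed_refs(uint64_t num_heads,
                                         uint64_t avg_run_nsec)
{
    return num_heads * avg_run_nsec >= NSEC_PER_SEC;
}
```

At the 2423910 ref heads seen in slabtop, even an optimistic 10 microseconds per head gives an estimated flush time of roughly 24 seconds — far past any reasonable throttle point.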
Re: again "out of space" and remount read only, with 4.14
On 03.12.2017 16:39 Martin Raiber wrote: > Am 26.11.2017 um 17:02 schrieb Tomasz Chmielewski: >> On 2017-11-27 00:37, Martin Raiber wrote: >>> On 26.11.2017 08:46 Tomasz Chmielewski wrote: >>>> Got this one on a 4.14-rc7 filesystem with some 400 GB left: >>> I guess it is too late now, but I guess the "btrfs fi usage" output of >>> the file system (especially after it went ro) would be useful. >> It was more or less similar as it went ro: >> >> # btrfs fi usage /srv >> Overall: >> Device size: 5.25TiB >> Device allocated: 4.45TiB >> Device unallocated: 823.97GiB >> Device missing: 0.00B >> Used: 4.33TiB >> Free (estimated): 471.91GiB (min: 471.91GiB) >> Data ratio: 2.00 >> Metadata ratio: 2.00 >> Global reserve: 512.00MiB (used: 0.00B) >> >> Unallocated: >> /dev/sda4 411.99GiB >> /dev/sdb4 411.99GiB > I wanted to check if is the same issue I have, e.g. with 4.14.1 > space_cache=v2: > > [153245.341823] BTRFS: error (device loop0) in > btrfs_run_delayed_refs:3089: errno=-28 No space left > [153245.341845] BTRFS: error (device loop0) in btrfs_drop_snapshot:9317: > errno=-28 No space left > [153245.341848] BTRFS info (device loop0): forced readonly > [153245.341972] BTRFS warning (device loop0): Skipping commit of aborted > transaction. > [153245.341975] BTRFS: error (device loop0) in cleanup_transaction:1873: > errno=-28 No space left > # btrfs fi usage /media/backup > Overall: > Device size: 49.60TiB > Device allocated: 38.10TiB > Device unallocated: 11.50TiB > Device missing: 0.00B > Used: 36.98TiB > Free (estimated): 12.59TiB (min: 12.59TiB) > Data ratio: 1.00 > Metadata ratio: 1.00 > Global reserve: 2.00GiB (used: 1.99GiB) > > Data,single: Size:37.70TiB, Used:36.61TiB > /dev/loop0 37.70TiB > > Metadata,single: Size:411.01GiB, Used:380.98GiB > /dev/loop0 411.01GiB > > System,single: Size:36.00MiB, Used:4.00MiB > /dev/loop0 36.00MiB > > Unallocated: > /dev/loop0 11.50TiB > > Note the global reserve being at maximum. 
I already increased that in > the code to 2G and that seems to make this issue appear more rarely. This time with enospc_debug mount option: With Linux 4.14.3. Single large device. [15179.739038] [ cut here ] [15179.739059] WARNING: CPU: 0 PID: 28694 at fs/btrfs/extent-tree.c:8458 btrfs_alloc_tree_block+0x38f/0x4a0 [15179.739060] Modules linked in: bcache loop dm_crypt algif_skcipher af_alg st sr_mod cdrom xfs libcrc32c zbud intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt kvm_intel kvm iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc raid1 mgag200 snd_pcm aesni_intel ttm snd_timer drm_kms_helper snd soundcore aes_x86_64 crypto_simd glue_helper cryptd pcspkr i2c_i801 joydev drm mei_me evdev lpc_ich mei mfd_core ipmi_si ipmi_devintf ipmi_msghandler tpm_tis tpm_tis_core tpm wmi ioatdma button shpchp fuse autofs4 hid_generic usbhid hid sg sd_mod dm_mod dax md_mod crc32c_intel isci ahci mpt3sas libsas libahci igb raid_class ehci_pci i2c_algo_bit libata dca ehci_hcd scsi_transport_sas ptp nvme pps_core scsi_mod usbcore nvme_core [15179.739133] CPU: 0 PID: 28694 Comm: btrfs Not tainted 4.14.3 #2 [15179.739134] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015 [15179.739136] task: 8813e4f02ac0 task.stack: c9000aea [15179.739140] RIP: 0010:btrfs_alloc_tree_block+0x38f/0x4a0 [15179.739141] RSP: 0018:c9000aea3558 EFLAGS: 00010292 [15179.739144] RAX: 001d RBX: 4000 RCX: [15179.739146] RDX: 880c4fa15b38 RSI: 880c4fa0de58 RDI: 880c4fa0de58 [15179.739147] RBP: c9000aea35d0 R08: 0001 R09: 0662 [15179.739149] R10: 1600 R11: 0662 R12: 880c0a454000 [15179.739151] R13: 880c4ba33800 R14: 0001 R15: 880c0a454128 [15179.739153] FS: 7f0d699128c0() GS:880c4fa0() knlGS: [15179.739155] CS: 0010 DS: ES: CR0: 80050033 [15179.739156] CR2: 7bbfcdf2c6e8 CR3: 00151da91003 CR4: 000606f0 [15179.739158] Call Trace: [15179.739166] __btrfs_cow_block+0x117/0x580 [15179.739169] 
btrfs_cow_block+0xdf/0x200 [15179.739171] btrfs_search_slot+0x1ea/0x990 [15179.739174] lookup_inline_extent_backref+0x
Re: again "out of space" and remount read only, with 4.14
Am 26.11.2017 um 17:02 schrieb Tomasz Chmielewski: > On 2017-11-27 00:37, Martin Raiber wrote: >> On 26.11.2017 08:46 Tomasz Chmielewski wrote: >>> Got this one on a 4.14-rc7 filesystem with some 400 GB left: >> I guess it is too late now, but I guess the "btrfs fi usage" output of >> the file system (especially after it went ro) would be useful. > It was more or less similar as it went ro: > > # btrfs fi usage /srv > Overall: > Device size: 5.25TiB > Device allocated: 4.45TiB > Device unallocated: 823.97GiB > Device missing: 0.00B > Used: 4.33TiB > Free (estimated): 471.91GiB (min: 471.91GiB) > Data ratio: 2.00 > Metadata ratio: 2.00 > Global reserve: 512.00MiB (used: 0.00B) > > Unallocated: > /dev/sda4 411.99GiB > /dev/sdb4 411.99GiB I wanted to check if it is the same issue I have, e.g. with 4.14.1 space_cache=v2: [153245.341823] BTRFS: error (device loop0) in btrfs_run_delayed_refs:3089: errno=-28 No space left [153245.341845] BTRFS: error (device loop0) in btrfs_drop_snapshot:9317: errno=-28 No space left [153245.341848] BTRFS info (device loop0): forced readonly [153245.341972] BTRFS warning (device loop0): Skipping commit of aborted transaction. [153245.341975] BTRFS: error (device loop0) in cleanup_transaction:1873: errno=-28 No space left # btrfs fi usage /media/backup Overall: Device size: 49.60TiB Device allocated: 38.10TiB Device unallocated: 11.50TiB Device missing: 0.00B Used: 36.98TiB Free (estimated): 12.59TiB (min: 12.59TiB) Data ratio: 1.00 Metadata ratio: 1.00 Global reserve: 2.00GiB (used: 1.99GiB) Data,single: Size:37.70TiB, Used:36.61TiB /dev/loop0 37.70TiB Metadata,single: Size:411.01GiB, Used:380.98GiB /dev/loop0 411.01GiB System,single: Size:36.00MiB, Used:4.00MiB /dev/loop0 36.00MiB Unallocated: /dev/loop0 11.50TiB Note the global reserve being at maximum. I already increased that in the code to 2G and that seems to make this issue appear more rarely. 
Regards, Martin Raiber -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
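[Editor's note] A quick cross-check of the second "btrfs fi usage" dump quoted above (single profiles, data ratio 1.00): the "Free (estimated)" figure is simply the unallocated space plus the unused slack inside already-allocated data chunks. A sketch with the reported numbers:

```python
# Values copied from the "btrfs fi usage /media/backup" output above, in TiB.
unallocated = 11.50
data_size = 37.70   # Data,single: Size
data_used = 36.61   # Data,single: Used

# With data ratio 1.00, Free (estimated) = unallocated space plus the
# slack inside already-allocated data chunks.
free_estimated = unallocated + (data_size - data_used)
print(f"{free_estimated:.2f} TiB")  # 12.59 TiB, matching the reported value
```

Note that the filesystem still aborted with ENOSPC despite 12.59 TiB estimated free; as Martin points out, the global reserve (2.00 GiB) was almost fully used, and that reserve is carved out of metadata space, not out of the headline free figure.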
Re: Read before you deploy btrfs + zstd
David Sterba - 15.11.17, 15:39: > On Tue, Nov 14, 2017 at 07:53:31PM +0100, David Sterba wrote: > > On Mon, Nov 13, 2017 at 11:50:46PM +0100, David Sterba wrote: > > > Up to now, there are no bootloaders supporting ZSTD. > > > > I've tried to implement the support in GRUB, still incomplete and hacky > > but most of the code is there. The ZSTD implementation is copied from > > kernel. The allocators need to be properly set up, as it needs to use > > grub_malloc/grub_free for the workspace that's called from some ZSTD_* > > functions. > > > > https://github.com/kdave/grub/tree/btrfs-zstd > > The branch is now in a state that can be tested. Turns out the memory > requirements are too much for grub, so the boot fails with "not enough > memory". The calculated values: > > ZSTD_BTRFS_MAX_INPUT: 131072 > ZSTD_DStreamWorkspaceBound with ZSTD_BTRFS_MAX_INPUT: 549424 > > This is not something I could fix easily, we'd probably need a tuned > version of ZSTD for grub constraints. Adding Nick to CC. Somehow I am happy that I still have a plain Ext4 for /boot. :) Thanks for looking into Grub support anyway. Thanks, -- Martin
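[Editor's note] To put the two quoted numbers in perspective (both taken from the message above), the decompression workspace bound is over four times the largest compressed input btrfs ever hands to the decompressor, which is what overwhelms GRUB's small heap:

```python
# Numbers quoted above from the grub btrfs-zstd branch.
max_input = 131072   # ZSTD_BTRFS_MAX_INPUT (128 KiB)
workspace = 549424   # ZSTD_DStreamWorkspaceBound(ZSTD_BTRFS_MAX_INPUT)

print(f"{workspace / max_input:.1f}x, {workspace / 1024:.0f} KiB")  # 4.2x, 537 KiB
```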
Re: Read before you deploy btrfs + zstd
David Sterba - 14.11.17, 19:49: > On Tue, Nov 14, 2017 at 08:34:37AM +0100, Martin Steigerwald wrote: > > Hello David. > > > > David Sterba - 13.11.17, 23:50: > > > while 4.14 is still fresh, let me address some concerns I've seen on > > > linux > > > forums already. > > > > > > The newly added ZSTD support is a feature that has broader impact than > > > just the runtime compression. The btrfs-progs understand filesystem with > > > ZSTD since 4.13. The remaining key part is the bootloader. > > > > > > Up to now, there are no bootloaders supporting ZSTD. This could lead to > > > an > > > unmountable filesystem if the critical files under /boot get > > > accidentally > > > or intentionally compressed by ZSTD. > > > > But otherwise ZSTD is safe to use? Are you aware of any other issues? > > No issues from my own testing or reported by other users. Thanks to you and the others. I think I'll try this soon. Thanks, -- Martin
Re: Read before you deploy btrfs + zstd
Hello David. David Sterba - 13.11.17, 23:50: > while 4.14 is still fresh, let me address some concerns I've seen on linux > forums already. > > The newly added ZSTD support is a feature that has broader impact than > just the runtime compression. The btrfs-progs understand filesystem with > ZSTD since 4.13. The remaining key part is the bootloader. > > Up to now, there are no bootloaders supporting ZSTD. This could lead to an > unmountable filesystem if the critical files under /boot get accidentally > or intentionally compressed by ZSTD. But otherwise ZSTD is safe to use? Are you aware of any other issues? I'm considering switching from LZO to ZSTD on this ThinkPad T520 with Sandybridge. Thank you, -- Martin
Re: how to run balance successfully (No space left on device)?
On 10.11.2017 22:51 Chris Murphy wrote: >> Combined with evidence that "No space left on device" during balance can >> lead to various file corruption (we've witnessed it with MySQL), I'd say >> btrfs balance is a dangerous operation and the decision to use it should be >> considered very thoroughly. > I've never heard of this. Balance is COW at the chunk level. The old > chunk is not dereferenced until it's written in the new location > correctly. Corruption during balance shouldn't be possible so if you > have a reproducer, the devs need to know about it. I didn't say anything before, because I could not reproduce the problem. I had (I guess) a corruption caused by balance as well. It had ENOSPC in spite of enough free space (4.9.x), which made me balance it regularly to keep unallocated space around. The corruption probably occurred after or shortly before a power reset during a balance -- no skip_balance specified, so it continued directly after mount -- and data was moved relatively fast after the mount operation (copy file then delete old file). I think space_cache=v2 was active at the time. I'm of course not completely sure it was btrfs's fault, and as usual not all the conditions may be relevant. It could instead be an upper-layer error (Hyper-V storage), a memory issue or an application error. Regards, Martin Raiber
Re: Multiple btrfs-cleaner threads per volume
On 02.11.2017 16:10 Hans van Kranenburg wrote: > On 11/02/2017 04:02 PM, Martin Raiber wrote: >> snapshot cleanup is a little slow in my case (50TB volume). Would it >> help to have multiple btrfs-cleaner threads? The block layer underneath >> would have higher throughput with more simultaneous read/write requests. > Just curious: > * How many subvolumes/snapshots are you removing, and what's the > complexity level (like, how many other subvolumes/snapshots reference > the same data extents?) > * Do you see a lot of cpu usage, or mainly a lot of disk I/O? If it's > disk IO, is it mainly random read IO, or is it a lot of write traffic? > * What mount options are you running with (from /proc/mounts)? It is a single block device (not a multi-device btrfs), so optimizations in that area wouldn't help. It is a UrBackup system with about 200 snapshots per client, 20009 snapshots total. UrBackup reflinks files between them, but btrfs-cleaner doesn't use much CPU (so it doesn't seem like the backref walking is the problem). btrfs-cleaner is probably limited mainly by random read/write IO. The device has a cache, so parallel accesses would help, as some of them may hit the cache. Looking at the code it seems easy enough to do. The question is whether there are any obvious reasons why this wouldn't work (like some lock etc.).
Multiple btrfs-cleaner threads per volume
Hi, snapshot cleanup is a little slow in my case (50TB volume). Would it help to have multiple btrfs-cleaner threads? The block layer underneath would have higher throughput with more simultaneous read/write requests. Regards, Martin Raiber
Re: Data and metadata extent allocators [1/2]: Recap: The data story
virtual address space) I see a difference in behavior but I do not yet fully understand what I am looking at. > Q: But what if all my chunks have badly fragmented free space right now? > A: If your situation allows for it, the simplest way is running a full > balance of the data, as some sort of big reset button. If you only want > to clean up chunks with excessive free space fragmentation, then you can > use the helper I used to identify them, which is > show_free_space_fragmentation.py in [8]. Just feed the chunks to balance > starting with the one with the highest score. The script requires the > free space tree to be used, which is a good idea anyway. Okay, if I understand this correctly I don't need to use "nossd" with kernel 4.14, but it would be good to do a full "btrfs filesystem balance" run on all the SSD BTRFS filesystems or all other ones with rotational=0. What would be the benefit of that? Would the filesystem run faster again? My subjective impression is that performance got worse over time. *However* all my previous full balance attempts made performance even worse. So… is a full balance safe for filesystem performance these days? I still have the issue that fstrim on /home only works with the patch from Lutz Euler from 2014, which is still not in mainline BTRFS. Maybe it would be a good idea to recreate /home in order to get rid of that special "anomaly" of the BTRFS that fstrim doesn't work without this patch. Maybe at least a part of this should go into the BTRFS kernel wiki, as it would be easier for users to find there. I wonder about an "upgrade notes for users" / "BTRFS maintenance" page that gives recommendations in case some step is recommended after a major kernel update, plus general recommendations for maintenance. Ideally most of this would be integrated into BTRFS or a userspace daemon for it and be handled transparently and automatically. 
Yet a full balance is an expensive operation time-wise and probably should not be started without user consent. I do wonder about the ton of tools here and there, and I would love some btrfsd or… maybe an even more generic fsd filesystem maintenance daemon which would do regular scrubs and whatever else makes sense. It could use some configuration in the root directory of a filesystem and work for BTRFS and other filesystems that have beneficial online / background upgrades, like XFS, which also has online scrubbing by now (at least for metadata). > [0] https://www.spinics.net/lists/linux-btrfs/msg64446.html > [1] https://www.spinics.net/lists/linux-btrfs/msg64771.html > [2] https://github.com/knorrie/btrfs-heatmap/ > [3] https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-10-27-shotgunblast.png > [4] https://syrinx.knorrie.org/~knorrie/btrfs/keep/2016-12-18-heatmap-scripting/fsid_ed10a358-c846-4e76-a071-3821d423a99d_startat_320029589504_at_1482095269.png > [5] https://www.spinics.net/lists/linux-btrfs/msg64418.html > [6] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=583b723151794e2ff1691f1510b4e43710293875 > [7] https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-ssd-to-nossd.mp4 > [8] https://github.com/knorrie/python-btrfs/tree/develop/examples Thanks, -- Martin
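[Editor's note] For readers wondering what "highest score" means above: show_free_space_fragmentation.py (in [8]) walks the free space tree and scores each chunk by how fragmented its free space is. As a purely illustrative sketch of the idea (the function and formula below are made up for this illustration and are NOT the actual metric the script uses), a score could penalize splitting the same amount of free space into many small extents:

```python
# Toy fragmentation score: more and smaller free-space extents in a chunk
# => higher score => feed that chunk to balance first.
def frag_score(free_extent_sizes):
    total = sum(free_extent_sizes)
    if total == 0:
        return 0.0
    # 0.0 when the free space is one contiguous hole, approaching
    # len(free_extent_sizes) as the extents become uniformly tiny.
    return len(free_extent_sizes) * (1 - max(free_extent_sizes) / total)

chunks = {
    "chunk A": [512 * 1024 * 1024],  # one contiguous 512 MiB hole
    "chunk B": [4096] * 1000,        # 1000 scattered 4 KiB holes
}
worst_first = sorted(chunks, key=lambda c: frag_score(chunks[c]), reverse=True)
print(worst_first)  # ['chunk B', 'chunk A']
```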
Re: 4.13: "error in btrfs_run_delayed_refs:3009: errno=-28 No space left" with 1.3TB unallocated / 737G free?
On 19.10.2017 10:16 Vladimir Panteleev wrote: > On Tue, 17 Oct 2017 16:21:04 -0700, Duncan wrote: >> * try the balance on 4.14-rc5+, where the known bug should be fixed > > Thanks! However, I'm getting the same error on > 4.14.0-rc5-g9aa0d2dde6eb. The stack trace is different, though: > > Aside from rebuilding the filesystem, what are my options? Should I > try to temporarily add a file from another volume as a device and > retry the balance? If so, what would be a good size for the temporary > device? Hi, for me a work-around for something like this has been to reduce the amount of dirty memory via e.g. sysctl vm.dirty_background_bytes=$((100*1024*1024)) sysctl vm.dirty_bytes=$((400*1024*1024)) This reduces performance, however. You could also mount with "enospc_debug" to give the devs more info about this issue. I am having more ENOSPC issues with 4.9.x than with the latest 4.14. Regards, Martin
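[Editor's note] For reference, the two sysctls in Martin's workaround set absolute byte caps on the dirty page cache (the *_bytes variants override the percentage-based vm.dirty_background_ratio / vm.dirty_ratio); the shell arithmetic expands to:

```python
# The $((...)) expressions from the sysctl commands above, spelled out.
dirty_background_bytes = 100 * 1024 * 1024  # start background writeback at 100 MiB
dirty_bytes = 400 * 1024 * 1024             # block writers once 400 MiB is dirty

print(dirty_background_bytes, dirty_bytes)  # 104857600 419430400
```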
Something like ZFS Channel Programs for BTRFS & probably XFS or even VFS?
[repost. I didn't notice autocompletion gave me the wrong address for fsdevel, blacklisted now] Hello. What do you think of http://open-zfs.org/wiki/Projects/ZFS_Channel_Programs ? There are quite a few BTRFS maintenance programs, like the deduplication stuff. Also regular scrubs… and in certain circumstances balances can probably make sense. In addition to this, XFS got scrub functionality as well. Now, putting the foundation for such a functionality in the kernel I think would only be reasonable if it cannot be done purely within user space, so I wonder about the safety from other concurrent ZFS modification and the atomicity that are mentioned on the wiki page. The second set of slides, those from the OpenZFS Developer Summit 2014, which are linked to on the wiki page, explain this more. (I didn't look at the first ones, as I am no fan of slideshare.net and prefer a simple PDF to download and view locally anytime, not for privacy reasons alone, but also to avoid using a crappy webpage over a wonderfully functional PDF viewer fat client like Okular.) Also I wonder about putting a Lua interpreter into the kernel, but it seems at least the NetBSD developers added one to their kernel with version 7.0 [1]. I also ask this because I have wondered about a kind of fsmaintd or volmaintd for quite a while, and thought… it would be nice to do this in a generic way, as BTRFS is not the only filesystem which supports maintenance operations. However, if it can all just nicely be done in userspace, I am all for it. [1] http://www.netbsd.org/releases/formal-7/NetBSD-7.0.html (tons of presentation PDFs on their site as well) Thanks, -- Martin
Re: Regarding handling of file renames in Btrfs
Hi, On 16.09.2017 14:27 Hans van Kranenburg wrote: > On 09/10/2017 01:50 AM, Rohan Kadekodi wrote: >> I was trying to understand how file renames are handled in Btrfs. I >> read the code documentation, but had a problem understanding a few >> things. >> >> During a file rename, btrfs_commit_transaction() is called which is >> because Btrfs has to commit the whole FS before storing the >> information related to the new renamed file. > Can you point to which lines of code you're looking at? > >> It has to commit the FS >> because a rename first does an unlink, which is not recorded in the >> btrfs_rename() transaction and so is not logged in the log tree. Is my >> understanding correct? [...] > Can you also point to where exactly you see this happening? I'd also > like to understand more about this. > > The whole mail thread following this message continues about what a > transaction commit is and does etc, but the above question is never > answered I think. > > And I think it's an interesting question. Is a rename a "heavier" > operation relative to other file operations? As far as I can see, it only uses the log tree in some cases where the log tree was already used for the file or the parent directory. The cases are documented here: https://github.com/torvalds/linux/blob/master/fs/btrfs/tree-log.c#L45 . So rename isn't much heavier than unlink+create. Regards, Martin Raiber
Re: qemu-kvm VM died during partial raid1 problems of btrfs
Hi, On 12.09.2017 23:13 Adam Borowski wrote: > On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote: >> On 2017-09-12 16:00, Adam Borowski wrote: >>> Noted. Both Marat's and my use cases, though, involve VMs that are off most >>> of the time, and at least for me, turned on only to test something. >>> Touching mtime makes rsync run again, and it's freaking _slow_: worse than >>> 40 minutes for a 40GB VM (source:SSD target:deduped HDD). >> 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if >> you're going direct to a hard drive. I get better performance than that on >> my somewhat pathetic NUC based storage cluster (I get roughly 20 MB/s there, >> but it's for archival storage so I don't really care). I'm actually curious >> what the exact rsync command you are using is (you can obviously redact >> paths as you see fit), as the only way I can think of that it should be that >> slow is if you're using both --checksum (but if you're using this, you can >> tell rsync to skip the mtime check, and that issue goes away) and --inplace, >> _and_ your HDD is slow to begin with. > rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ mordor:$BASE/qemu > The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but > with nothing notable on SMART, in a Qnap 253a, kernel 4.9. > > Both source and target are btrfs, but here switching to send|receive > wouldn't give much as this particular guest is Win10 Insider Edition -- > a thingy that shows what the folks from Redmond have cooked up, with roughly > weekly updates to the tune of ~10GB writes / 10GB deletions (if they do > incremental transfers, installation still rewrites everything). > > Lemme look a bit more, rsync performance is indeed really abysmal compared > to what it should be. Self-promo, but consider using UrBackup (OSS software, too) instead? For Windows VMs I would install the client in the VM. It excludes unnecessary stuff like e.g. 
page files or the shadow storage area from the image backups, and it has a mode to store image backups as raw btrfs files. Linux VMs I'd back up as files, either from the hypervisor or from inside the VM. If you want to back up big btrfs image files it can do that too, and faster than rsync, plus it can do incremental backups with sparse files. Regards, Martin Raiber
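[Editor's note] Austin's "just short of 18 MB/s" figure above checks out if the 40GB image is read as a binary 40 GiB and the rate is reported in decimal megabytes per second:

```python
size_bytes = 40 * 2**30   # the 40GB (GiB) VM image
seconds = 40 * 60         # "worse than 40 minutes"

rate_mb_s = size_bytes / seconds / 1e6
print(f"{rate_mb_s:.1f} MB/s")  # 17.9 MB/s
```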
Re: Regarding handling of file renames in Btrfs
Hi, On 10.09.2017 08:45 Qu Wenruo wrote: > > > On 2017年09月10日 14:41, Qu Wenruo wrote: >> >> >> On 2017年09月10日 07:50, Rohan Kadekodi wrote: >>> Hello, >>> >>> I was trying to understand how file renames are handled in Btrfs. I >>> read the code documentation, but had a problem understanding a few >>> things. >>> >>> During a file rename, btrfs_commit_transaction() is called which is >>> because Btrfs has to commit the whole FS before storing the >>> information related to the new renamed file. It has to commit the FS >>> because a rename first does an unlink, which is not recorded in the >>> btrfs_rename() transaction and so is not logged in the log tree. Is my >>> understanding correct? If yes, my questions are as follows: >> >> Not familiar with rename kernel code, so not much help for the rename >> operation. >> >>> >>> 1. What does committing the whole FS mean? >> >> Committing the whole fs means a lot of things, but generally >> speaking, it makes the on-disk data consistent with each other. > >> For obvious part, it writes modified fs/subvolume trees to disk (with >> handling of tree operations so no half modified trees). >> >> Also other trees like extent tree (very hot since every CoW will >> update it, and the most complicated one), csum tree if modified. >> >> After transaction is committed, the on-disk btrfs will represent the >> states when commit trans is called, and every tree should match each >> other. >> >> Despite of this, after a transaction is committed, generation of the >> fs get increased and modified tree blocks will have the same >> generation number. >> >>> Blktrace shows that there >>> are 2 256KB writes, which are essentially writes to the data of >>> the root directory of the file system (which I found out through >>> btrfs-debug-tree). >> >> I'd say you didn't check btrfs-debug-tree output carefully enough. >> I strongly recommend to do vimdiff to get what tree is modified. 
>> >> At least the following trees are modified: >> >> 1) fs/subvolume tree >> Rename modified the DIR_INDEX/DIR_ITEM/INODE_REF at least, and >> updated inode time. >> So fs/subvolume tree must be CoWed. >> >> 2) extent tree >> CoW of above metadata operation will definitely cause extent >> allocation and freeing, extent tree will also get updated. >> >> 3) root tree >> Both extent tree and fs/subvolume tree modified, their root bytenr >> needs to be updated and root tree must be updated. >> >> And finally superblocks. >> >> I just verified the behavior with empty btrfs created on a 1G file, >> only one file to do the rename. >> >> In that case (with 4K sectorsize and 16K nodesize), the total IO >> should be (3 * 16K) * 2 + 4K * 2 = 104K. >> >> "3" = number of tree blocks get modified >> "16K" = nodesize >> 1st "*2" = DUP profile for metadata >> "4K" = superblock size >> 2nd "*2" = 2 superblocks for 1G fs. >> >> If your extent/root/fs trees have higher level, then more tree blocks >> needs to be updated. >> And if your fs is very large, you may have 3 superblocks. >> >>> Is this equivalent to doing a shell sync, as the >>> same block groups are written during a shell sync too? >> >> For shell "sync" the difference is that, "sync" will write all dirty >> data pages to disk, and then commit transaction. >> While only calling btrfs_commit_transaction() doesn't trigger dirty >> page writeback. >> >> So there is a difference. This conversation made me realize why btrfs has sub-optimal meta-data performance. CoW b-trees are not the best data structure for such small changes. In my application I have multiple operations (e.g. renames) which can be bundled up and (mostly) one writer. I guess using BTRFS_IOC_TRANS_START and BTRFS_IOC_TRANS_END would be one way to reduce the CoW overhead, but those are dangerous wrt. ENOSPC and there have been discussions about removing them. 
Best would be if there were delayed metadata, where metadata is handled the same as delayed allocations and data changes, i.e. commit on fsync, commit interval or fssync. I assumed this was already the case... Please correct me if I got this wrong. Regards, Martin Raiber
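[Editor's note] Qu's 104K total for the single-file rename commit can be checked directly from the quantities he lists (16K nodesize, three modified tree blocks, DUP metadata, two 4K superblocks on a 1G filesystem):

```python
nodesize = 16 * 1024       # bytes per tree block
modified_tree_blocks = 3   # fs/subvolume tree, extent tree, root tree
metadata_copies = 2        # DUP profile for metadata
superblock_size = 4 * 1024
superblock_copies = 2      # a 1G filesystem carries 2 superblocks

total = (modified_tree_blocks * nodesize * metadata_copies
         + superblock_size * superblock_copies)
print(total // 1024, "KiB")  # 104 KiB
```

Deeper trees multiply the first term (more blocks CoWed per modified tree), which is why the estimate grows on large, aged filesystems.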
Re: 4.11.6 / more corruption / root 15455 has a root item with a more recent gen (33682) compared to the found root node (0)
Hello Duncan. Duncan - 09.07.17, 11:17: > Paul Jones posted on Sun, 09 Jul 2017 09:16:36 + as excerpted: > >> Marc MERLIN - 08.07.17, 21:34: > >> > This is now the 3rd filesystem I have (on 3 different machines) that > >> > is getting corruption of some kind (on 4.11.6). > >> > >> Anyone else getting corruptions with 4.11? > >> > >> I'll happily switch back to 4.10.17 or even 4.9 if that is the case. I may > >> even do so just from your reports. Well, yes, I will do exactly that. I'll > >> just switch back to 4.10 for now. Better safe than sorry. > > > > No corruption for me - I've been on 4.11 since about .2 and everything > > seems fine. Currently on 4.11.8 > > No corruptions here either. 4.12.0 now, previously 4.12-rc5(ish, git), > before that 4.11.0. > > I have however just upgraded to new ssds then wiped and setup the old […] > Also, all my btrfs are raid1 or dup for checksummed redundancy, and > relatively small, the largest now 80 GiB per device, after the upgrade. > And my use-case doesn't involve snapshots or subvolumes. > > So any bug that is most likely on older filesystems, say those without > the no-holes feature, for instance, or that doesn't tend to hit raid1 or > dup mode, or that is less likely on small filesystems on fast ssds, or > that triggers most often with reflinks and thus on filesystems with > snapshots, is unlikely to hit me. Hmmm, the BTRFS filesystems on my laptop are 3 to 5 or even more years old. I'll stick with 4.10 for now, I think. The older ones are RAID 1 across two SSDs, the newer one is single device, on one SSD. These filesystems didn't fail me in years and since 4.5 or 4.6 even the "I search for free space" kernel hang (hung tasks and all that) is gone as well. Thanks, -- Martin
Re: 4.11.6 / more corruption / root 15455 has a root item with a more recent gen (33682) compared to the found root node (0)
Hello Marc. Marc MERLIN - 08.07.17, 21:34: > Sigh, > > This is now the 3rd filesystem I have (on 3 different machines) that is > getting corruption of some kind (on 4.11.6). Anyone else getting corruptions with 4.11? I'll happily switch back to 4.10.17 or even 4.9 if that is the case. I may even do so just from your reports. Well, yes, I will do exactly that. I'll just switch back to 4.10 for now. Better safe than sorry. I know how you feel, Marc. I posted here some time ago about a corruption on one of my backup harddisks that btrfs check --repair wasn't able to handle. I redid that disk from scratch and it took a long, long time. I agree with you that this has to stop. Until then I will never *ever* recommend this to a customer. Ideally no corruptions in stable kernels, especially when it's a .6 at the end of the version number. But if so… then fixable. Other filesystems like Ext4 and XFS can do it… so this should be possible with BTRFS as well. Thanks, -- Martin
Re: [PATCH 02/13] scsi/osd: don't save block errors into req_results
Christoph, > We will only have sense data if the command exectured and got a SCSI > result, so this is pointless. "executed" Reviewed-by: Martin K. Petersen -- Martin K. Petersen Oracle Linux Engineering
Re: [dm-devel] [PATCH 08/15] dm mpath: merge do_end_io_bio into multipath_end_io_bio
On Thu, 2017-05-18 at 15:18 +0200, Christoph Hellwig wrote: > This simplifies the code and especially the error passing a bit and > will help with the next patch. > > Signed-off-by: Christoph Hellwig > --- > drivers/md/dm-mpath.c | 42 - > - > 1 file changed, 16 insertions(+), 26 deletions(-) > > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c > index 3df056b73b66..b1cb0273b081 100644 > --- a/drivers/md/dm-mpath.c > +++ b/drivers/md/dm-mpath.c > @@ -1510,24 +1510,26 @@ static int multipath_end_io(struct dm_target > *ti, struct request *clone, > return r; > } > > -static int do_end_io_bio(struct multipath *m, struct bio *clone, > - int error, struct dm_mpath_io *mpio) > +static int multipath_end_io_bio(struct dm_target *ti, struct bio > *clone, int error) > { > + struct multipath *m = ti->private; > + struct dm_mpath_io *mpio = get_mpio_from_bio(clone); > + struct pgpath *pgpath = mpio->pgpath; > unsigned long flags; > > - if (!error) > - return 0; /* I/O complete */ > + BUG_ON(!mpio); You dereferenced mpio already above. 
Regards, Martin > > - if (noretry_error(error)) > - return error; > + if (!error || noretry_error(error)) > + goto done; > > - if (mpio->pgpath) > - fail_path(mpio->pgpath); > + if (pgpath) > + fail_path(pgpath); > > if (atomic_read(&m->nr_valid_paths) == 0 && > !test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) { > dm_report_EIO(m); > - return -EIO; > + error = -EIO; > + goto done; > } > > /* Queue for the daemon to resubmit */ > @@ -1539,28 +1541,16 @@ static int do_end_io_bio(struct multipath *m, > struct bio *clone, > if (!test_bit(MPATHF_QUEUE_IO, &m->flags)) > queue_work(kmultipathd, &m->process_queued_bios); > > - return DM_ENDIO_INCOMPLETE; > -} > - > -static int multipath_end_io_bio(struct dm_target *ti, struct bio > *clone, int error) > -{ > - struct multipath *m = ti->private; > - struct dm_mpath_io *mpio = get_mpio_from_bio(clone); > - struct pgpath *pgpath; > - struct path_selector *ps; > - int r; > - > - BUG_ON(!mpio); > - > - r = do_end_io_bio(m, clone, error, mpio); > - pgpath = mpio->pgpath; > + error = DM_ENDIO_INCOMPLETE; > +done: > if (pgpath) { > - ps = &pgpath->pg->ps; > + struct path_selector *ps = &pgpath->pg->ps; > + > if (ps->type->end_io) > ps->type->end_io(ps, &pgpath->path, mpio- > >nr_bytes); > } > > - return r; > + return error; > } > > /* -- Dr. Martin Wilck , Tel. +49 (0)911 74053 2107 SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)
Re: runtime btrfsck
Stefan Priebe - Profihost AG - 10.05.17, 09:02:
> I'm now trying btrfs progs 4.10.2. Is anybody out there who can tell me
> something about the expected runtime or how to fix bad key ordering?

I had a similar issue which remained unresolved, but I clearly saw that
btrfs check was running in a loop; see the thread:

[4.9] btrfs check --repair looping over file extent discount errors

So it would be interesting to see the exact output of btrfs check; maybe
there is something like repeated numbers that also indicates a loop.

I was about to say that BTRFS is production ready before this issue
happened. I still think it mostly is for a lot of setups, as at least the
"I get stuck on the CPU while searching for free space" issue seems to be
gone since somewhere around the 4.5/4.6 kernels. I also think so regarding
absence of data loss: I was able to copy over all of the data I needed off
the broken filesystem.

Yet, when it comes to btrfs check? It's still quite rudimentary if you ask
me. So unless someone has a clever idea here and shares it with you, you
may need to back up anything you can from this filesystem and then start
over from scratch. In my past experience, something like xfs_repair
surpasses btrfs check in its ability to actually fix a broken filesystem
by a great extent.

Ciao,
-- 
Martin
Re: [4.9] btrfs check --repair looping over file extent discount errors
Martin Steigerwald - 22.04.17, 20:01:
> Chris Murphy - 22.04.17, 09:31:
> > Is the file system created with no-holes?
>
> I have how to find out about it and while doing accidentally set that

I didn't find out how to find out about it and…

> feature on another filesystem (btrfstune only seems to be able to enable
> the feature, not show the current state of it).
>
> But as there is no notice of the feature being set as standard in the
> manpage of mkfs.btrfs as of BTRFS tools 4.9.1, and as I didn't set it
> myself, my best bet is that the feature is not enabled on the filesystem.
>
> Now I wonder… how to disable the feature on that other filesystem again.

-- 
Martin
Re: [4.9] btrfs check --repair looping over file extent discount errors
Hello Chris.

Chris Murphy - 22.04.17, 09:31:
> Is the file system created with no-holes?

I didn't find out how to find out about it, and while trying I accidentally
set that feature on another filesystem (btrfstune only seems to be able to
enable the feature, not show the current state of it).

But as there is no notice of the feature being set as standard in the
manpage of mkfs.btrfs as of BTRFS tools 4.9.1, and as I didn't set it
myself, my best bet is that the feature is not enabled on the filesystem.

Now I wonder… how to disable the feature on that other filesystem again.

Thanks,
Re: [4.9] btrfs check --repair looping over file extent discount errors
Hello.

I am planning to copy the important data on the disk with the broken
filesystem over to the disk with the good filesystem, and then reformat the
disk with the broken filesystem soon, probably in the course of the day…
so in case you want any debug information before that, let me know ASAP.

Thanks,
Martin

Martin Steigerwald - 14.04.17, 21:35:
> Hello,
>
> backup harddisk connected via eSATA. Hard kernel hang, mouse pointer
> freezing, two times, seemingly after finishing the /home backup and
> creating a new snapshot on the source BTRFS SSD RAID 1 for / in order to
> back it up. I did scrub / and it appears to be okay, but I didn't run
> btrfs check on it. Anyway, deleting that subvolume works, and as I
> suspected an issue with the backup disk, I started with that one.
>
> I got
>
> merkaba:~> btrfs --version
> btrfs-progs v4.9.1
>
> merkaba:~> cat /proc/version
> Linux version 4.9.20-tp520-btrfstrim+ (martin@merkaba) (gcc version 6.3.0
> 20170321 (Debian 6.3.0-11) ) #6 SMP PREEMPT Mon Apr 3 11:42:17 CEST 2017
>
> merkaba:~> btrfs fi sh feenwald
> Label: 'feenwald'  uuid: […]
>         Total devices 1 FS bytes used 1.26TiB
>         devid 1 size 2.73TiB used 1.27TiB path /dev/sdc1
>
> on Debian unstable on a ThinkPad T520, connected via eSATA port on
> Minidock.
>
> I am now running btrfs check --repair on it, after the command without
> --repair reported file extent discount errors, and it appears to loop on
> the same file extent discount errors for ages. Any advice?
>
> I do have another backup harddisk with BTRFS that worked fine today, so I
> do not need to recover that drive immediately. I may let it run for a
> little more time, but then I will abort the repair process, as I really
> think it is looping over and over the same issues again. At some point I
> may just copy all the stuff that is on that harddisk, but not on the other
> one, over to the other one and mkfs.btrfs the filesystem again, but I'd
> rather like to know what is happening here.
> Here is the output:
>
> merkaba:~> btrfs check --repair /dev/sdc1
> enabling repair mode
> Checking filesystem on /dev/sdc1
> [… UUID omitted …]
> checking extents
> Fixed 0 roots.
> checking free space cache
> cache and super generation don't match, space cache will be invalidated
> checking fs roots
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 4227072
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 4227072
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 4227072
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 4227072
> [… hours later …]
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 4227072
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 4227072
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
>         start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found fil
[4.9] btrfs check --repair looping over file extent discount errors
Hello,

backup harddisk connected via eSATA. Hard kernel hang, mouse pointer
freezing, two times, seemingly after finishing the /home backup and
creating a new snapshot on the source BTRFS SSD RAID 1 for / in order to
back it up. I did scrub / and it appears to be okay, but I didn't run
btrfs check on it. Anyway, deleting that subvolume works, and as I
suspected an issue with the backup disk, I started with that one.

I got

merkaba:~> btrfs --version
btrfs-progs v4.9.1

merkaba:~> cat /proc/version
Linux version 4.9.20-tp520-btrfstrim+ (martin@merkaba) (gcc version 6.3.0
20170321 (Debian 6.3.0-11) ) #6 SMP PREEMPT Mon Apr 3 11:42:17 CEST 2017

merkaba:~> btrfs fi sh feenwald
Label: 'feenwald'  uuid: […]
        Total devices 1 FS bytes used 1.26TiB
        devid 1 size 2.73TiB used 1.27TiB path /dev/sdc1

on Debian unstable on a ThinkPad T520, connected via eSATA port on Minidock.

I am now running btrfs check --repair on it, after the command without
--repair reported file extent discount errors, and it appears to loop on
the same file extent discount errors for ages. Any advice?

I do have another backup harddisk with BTRFS that worked fine today, so I
do not need to recover that drive immediately. I may let it run for a
little more time, but then I will abort the repair process, as I really
think it is looping over and over the same issues again. At some point I
may just copy all the stuff that is on that harddisk, but not on the other
one, over to the other one and mkfs.btrfs the filesystem again, but I'd
rather like to know what is happening here.

Here is the output:

merkaba:~> btrfs check --repair /dev/sdc1
enabling repair mode
Checking filesystem on /dev/sdc1
[… UUID omitted …]
checking extents
Fixed 0 roots.
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 4227072
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 4227072
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 4227072
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 4227072
[… hours later …]
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 4227072
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
        start: 0, len: 4227072

This basically seems to go on like this forever.

Thanks,
-- 
Martin
Re: Do different btrfs volumes compete for CPU?
On 05/04/17 08:04, Marat Khalili wrote:
> On 04/04/17 20:36, Peter Grandi wrote:
>> SATA works for external use, eSATA works well, but what really
>> matters is the chipset of the adapter card.
> eSATA might be sound electrically, but mechanically it is awful. Try to
> run it for months in a crowded server room, and inevitably you'll get
> disconnections and data corruption. Tried different cables, brackets --
> same result. If you ever used an eSATA connector, you'd feel it.

I've been using eSATA here for multiple disk packs, continuously connected
for a few years now, for 48TB of data (not enough room in the host for the
disks). Never suffered an eSATA disconnect. Had the usual cooling fan
failures and HDD failures due to old age.

All just a case of ensuring undisturbed clean cabling and a good UPS?...

(BTRFS spanning four disks per external pack has worked well also.)

Good luck,

Martin
Re: Root volume (ID 5) in deleting state
It looks you're right! On a different machine:

# btrfs sub list / | grep -v lxc
ID 327 gen 1959587 top level 5 path mnt/reaver
ID 498 gen 593655 top level 5 path var/lib/machines
# btrfs sub list / -d | wc -l
0

Ok, apparently it's a regression in one of the latest versions then. But
it seems quite harmless. I'm glad my data are safe :)

# uname -a
Linux interceptor 4.9.6-1-ARCH #1 SMP PREEMPT Thu Jan 26 09:22:26 CET 2017 x86_64 GNU/Linux
# btrfs fi show /
Label: none  uuid: 859dec5c-850c-4660-ad99-bc87456aa309
        Total devices 1 FS bytes used 132.89GiB
        devid 1 size 200.00GiB used 200.00GiB path /dev/mapper/vg0-btrfsroot

> As a side note, all of your disk space is allocated (200GiB of 200GiB).
> Even while there's still 70GiB of free space scattered around inside,
> this might lead to out-of-space issues, depending on how badly
> fragmented that free space is.

I have not noticed this at all!

# btrfs fi show /
Label: none  uuid: 859dec5c-850c-4660-ad99-bc87456aa309
        Total devices 1 FS bytes used 134.23GiB
        devid 1 size 200.00GiB used 200.00GiB path /dev/mapper/vg0-btrfsroot
# btrfs fi df /
Data, single: total=195.96GiB, used=131.58GiB
System, single: total=3.00MiB, used=48.00KiB
Metadata, single: total=4.03GiB, used=2.64GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

After btrfs defrag there is no difference. btrfs fi show still says
200/200. I'll try to play with it.

[ ... ]

> So, to get the numbers of total raw disk space allocation down, you need
> to defragment free space (compact the data), not defrag used space.
>
> You can even create pictures of space utilization in your btrfs
> filesystem, which might help understanding what it looks like right now:
>
> \o/ https://github.com/knorrie/btrfs-heatmap/

I've run into your tool yesterday while googling around this -- thanks,
it's a really nice tool.

Now rebalance is running and it seems to work well. Thank you for the
excellent responses and help!
Re: Root volume (ID 5) in deleting state
On 13.2.2017 21:03, Hans van Kranenburg wrote: On 02/13/2017 12:26 PM, Martin Mlynář wrote: I've currently run into strange problem with BTRFS. I'm using it as my daily driver as root FS. Nothing complicated, just few subvolumes and incremental backups using btrbk. Now I've noticed that my btrfs root volume (absolute top, ID 5) is in "deleting" state. As I've done some testing and googling it seems that this should not be possible. [...] # btrfs sub list -ad /mnt/btrfs_root/ ID 5 gen 257505 top level 0 path /DELETED I have heard rumours that this is actually a bug in the output of sub list itself. What's the version of your btrfs-progs? (output of `btrfs version`) Sorry, I've lost this part: $ btrfs version btrfs-progs v4.9 # mount | grep btr /dev/mapper/vg0-btrfsroot on / type btrfs (rw,noatime,nodatasum,nodatacow,ssd,discard,space_cache,subvolid=1339,subvol=/rootfs) /dev/mapper/vg0-btrfsroot on /mnt/btrfs_root type btrfs (rw,noatime,nodatasum,nodatacow,ssd,discard,space_cache,subvolid=5,subvol=/) The rumour was that it had something to do with using space_cache=v2, which this example does not confirm. It looks you're right! 
On a different machine: # btrfs sub list / | grep -v lxc ID 327 gen 1959587 top level 5 path mnt/reaver ID 498 gen 593655 top level 5 path var/lib/machines # btrfs sub list / -d | wc -l 0 # btrfs version btrfs-progs v4.8.2 # uname -a Linux nxserver 4.8.6-1-ARCH #1 SMP PREEMPT Mon Oct 31 18:51:30 CET 2016 x86_64 GNU/Linux # mount | grep btrfs /dev/vda1 on / type btrfs (rw,relatime,nodatasum,nodatacow,space_cache,subvolid=5,subvol=/) Then I've upgraded this machine and: # btrfs sub list / | grep -v lxc ID 327 gen 1959587 top level 5 path mnt/reaver ID 498 gen 593655 top level 5 path var/lib/machines # btrfs sub list / -d | wc -l 1 # btrfs sub list / -d ID 5 gen 2186037 top level 0 path DELETED<== 1 # btrfs version btrfs-progs v4.9 # uname -a Linux nxserver 4.9.8-1-ARCH #1 SMP PREEMPT Mon Feb 6 12:59:40 CET 2017 x86_64 GNU/Linux # mount | grep btrfs /dev/vda1 on / type btrfs (rw,relatime,nodatasum,nodatacow,space_cache,subvolid=5,subvol=/) # uname -a Linux interceptor 4.9.6-1-ARCH #1 SMP PREEMPT Thu Jan 26 09:22:26 CET 2017 x86_64 GNU/Linux # btrfs fi show / Label: none uuid: 859dec5c-850c-4660-ad99-bc87456aa309 Total devices 1 FS bytes used 132.89GiB devid1 size 200.00GiB used 200.00GiB path /dev/mapper/vg0-btrfsroot As a side note, all of your disk space is allocated (200GiB of 200GiB). Even while there's still 70GiB of free space scattered around inside, this might lead to out-of-space issues, depending on how badly fragmented that free space is. I have not noticed this at all! # btrfs fi show / Label: none uuid: 859dec5c-850c-4660-ad99-bc87456aa309 Total devices 1 FS bytes used 134.23GiB devid1 size 200.00GiB used 200.00GiB path /dev/mapper/vg0-btrfsroot # btrfs fi df / Data, single: total=195.96GiB, used=131.58GiB System, single: total=3.00MiB, used=48.00KiB Metadata, single: total=4.03GiB, used=2.64GiB GlobalReserve, single: total=512.00MiB, used=0.00B After btrfs defrag there is no difference. btrfs fi show says still 200/200. I'll try to play with it. 
-- 
Martin Mlynář
Root volume (ID 5) in deleting state
Hello,

I've currently run into a strange problem with BTRFS. I'm using it as my
daily driver as root FS. Nothing complicated, just a few subvolumes and
incremental backups using btrbk.

Now I've noticed that my btrfs root volume (absolute top, ID 5) is in
"deleting" state. From the testing and googling I've done, it seems that
this should not be possible. I've tried scrubbing and checking, but
nothing changed. The volume is not actually being deleted; it just sits
there in this state.

Is there anything I can do to fix this?

# btrfs sub list -a /mnt/btrfs_root/
ID 1339 gen 262150 top level 5 path rootfs
ID 1340 gen 262101 top level 5 path .btrbk
ID 1987 gen 262149 top level 5 path no_backup
ID 4206 gen 255869 top level 1340 path /.btrbk/rootfs.20170121T1829
ID 4272 gen 257460 top level 1340 path /.btrbk/rootfs.20170123T0933
ID 4468 gen 259194 top level 1340 path /.btrbk/rootfs.20170131T1132
ID 4474 gen 260911 top level 1340 path /.btrbk/rootfs.20170207T0927
ID 4476 gen 261712 top level 1340 path /.btrbk/rootfs.20170211T
ID 4477 gen 261970 top level 1340 path /.btrbk/rootfs.20170212T1331
ID 4478 gen 262102 top level 1340 path /.btrbk/rootfs.20170213T

# btrfs sub list -ad /mnt/btrfs_root/
ID 5 gen 257505 top level 0 path /DELETED

# mount | grep btr
/dev/mapper/vg0-btrfsroot on / type btrfs (rw,noatime,nodatasum,nodatacow,ssd,discard,space_cache,subvolid=1339,subvol=/rootfs)
/dev/mapper/vg0-btrfsroot on /mnt/btrfs_root type btrfs (rw,noatime,nodatasum,nodatacow,ssd,discard,space_cache,subvolid=5,subvol=/)

# uname -a
Linux interceptor 4.9.6-1-ARCH #1 SMP PREEMPT Thu Jan 26 09:22:26 CET 2017 x86_64 GNU/Linux

# btrfs fi show /
Label: none  uuid: 859dec5c-850c-4660-ad99-bc87456aa309
        Total devices 1 FS bytes used 132.89GiB
        devid 1 size 200.00GiB used 200.00GiB path /dev/mapper/vg0-btrfsroot

Thank you for your time,

Best regards
-- 
Martin Mlynář
Re: BTRFS for OLTP Databases
On 08.02.2017 14:08 Austin S. Hemmelgarn wrote: > On 2017-02-08 07:14, Martin Raiber wrote: >> Hi, >> >> On 08.02.2017 03:11 Peter Zaitsev wrote: >>> Out of curiosity, I see one problem here: >>> If you're doing snapshots of the live database, each snapshot leaves >>> the database files like killing the database in-flight. Like shutting >>> the system down in the middle of writing data. >>> >>> This is because I think there's no API for user space to subscribe to >>> events like a snapshot - unlike e.g. the VSS API (volume snapshot >>> service) in Windows. You should put the database into frozen state to >>> prepare it for a hotcopy before creating the snapshot, then ensure all >>> data is flushed before continuing. >>> >>> I think I've read that btrfs snapshots do not guarantee single point in >>> time snapshots - the snapshot may be smeared across a longer period of >>> time while the kernel is still writing data. So parts of your writes >>> may still end up in the snapshot after issuing the snapshot command, >>> instead of in the working copy as expected. >>> >>> How is this going to be addressed? Is there some snapshot aware API to >>> let user space subscribe to such events and do proper preparation? Is >>> this planned? LVM could be a user of such an API, too. I think this >>> could have nice enterprise-grade value for Linux. >>> >>> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But >>> still, also this needs to be integrated with MySQL to properly work. I >>> once (years ago) researched on this but gave up on my plans when I >>> planned database backups for our web server infrastructure. We moved to >>> creating SQL dumps instead, although there're binlogs which can be used >>> to recover to a clean and stable transactional state after taking >>> snapshots. But I simply didn't want to fiddle around with properly >>> cleaning up binlogs which accumulate horribly much space usage over >>> time. 
The cleanup process requires to create a cold copy or dump of the >>> complete database from time to time, only then it's safe to remove all >>> binlogs up to that point in time. >> >> little bit off topic, but I for one would be on board with such an >> effort. It "just" needs coordination between the backup >> software/snapshot tools, the backed up software and the various snapshot >> providers. If you look at the Windows VSS API, this would be a >> relatively large undertaking if all the corner cases are taken into >> account, like e.g. a database having the database log on a separate >> volume from the data, dependencies between different components etc. >> >> You'll know more about this, but databases usually fsync quite often in >> their default configuration, so btrfs snapshots shouldn't be much behind >> the properly snapshotted state, so I see the advantages more with >> usability and taking care of corner cases automatically. > Just my perspective, but BTRFS (and XFS, and OCFS2) already provide > reflinking to userspace, and therefore it's fully possible to > implement this in userspace. Having a version of the fsfreeze (the > generic form of xfs_freeze) stuff that worked on individual sub-trees > would be nice from a practical perspective, but implementing it would > not be easy by any means, and would be essentially necessary for a > VSS-like API. In the meantime though, it is fully possible for the > application software to implement this itself without needing anything > more from the kernel. VSS snapshots whole volumes, not individual files (so comparable to an LVM snapshot). The sub-folder freeze would be something useful in some situations, but duplicating the files+extends might also take too long in a lot of situations. You are correct that the kernel features are there and what is missing is a user-space daemon, plus a protocol that facilitates/coordinates the backups/snapshots. 
Sending a FIFREEZE ioctl, taking a snapshot and then thawing does not
really help in some situations, as e.g. MySQL InnoDB uses O_DIRECT and
manages its own buffer pool, which won't see the FIFREEZE and flush; but
as said, the default configuration is to flush/fsync on every commit.
Re: BTRFS for OLTP Databases
Hi, On 08.02.2017 03:11 Peter Zaitsev wrote: > Out of curiosity, I see one problem here: > If you're doing snapshots of the live database, each snapshot leaves > the database files like killing the database in-flight. Like shutting > the system down in the middle of writing data. > > This is because I think there's no API for user space to subscribe to > events like a snapshot - unlike e.g. the VSS API (volume snapshot > service) in Windows. You should put the database into frozen state to > prepare it for a hotcopy before creating the snapshot, then ensure all > data is flushed before continuing. > > I think I've read that btrfs snapshots do not guarantee single point in > time snapshots - the snapshot may be smeared across a longer period of > time while the kernel is still writing data. So parts of your writes > may still end up in the snapshot after issuing the snapshot command, > instead of in the working copy as expected. > > How is this going to be addressed? Is there some snapshot aware API to > let user space subscribe to such events and do proper preparation? Is > this planned? LVM could be a user of such an API, too. I think this > could have nice enterprise-grade value for Linux. > > XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But > still, also this needs to be integrated with MySQL to properly work. I > once (years ago) researched on this but gave up on my plans when I > planned database backups for our web server infrastructure. We moved to > creating SQL dumps instead, although there're binlogs which can be used > to recover to a clean and stable transactional state after taking > snapshots. But I simply didn't want to fiddle around with properly > cleaning up binlogs which accumulate horribly much space usage over > time. The cleanup process requires to create a cold copy or dump of the > complete database from time to time, only then it's safe to remove all > binlogs up to that point in time. 
A little bit off topic, but I for one would be on board with such an
effort. It "just" needs coordination between the backup software/snapshot
tools, the backed-up software and the various snapshot providers. If you
look at the Windows VSS API, this would be a relatively large undertaking
if all the corner cases are taken into account, e.g. a database having its
log on a separate volume from the data, dependencies between different
components, etc.

You'll know more about this, but databases usually fsync quite often in
their default configuration, so btrfs snapshots shouldn't be much behind
the properly snapshotted state; I see the advantages more in usability and
in taking care of corner cases automatically.

Regards,
Martin Raiber
Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?
On 04.01.2017 00:43 Hans van Kranenburg wrote: > On 01/04/2017 12:12 AM, Peter Becker wrote: >> Good hint, this would be an option and i will try this. >> >> Regardless of this the curiosity has packed me and I will try to >> figure out where the problem with the low transfer rate is. >> >> 2017-01-04 0:07 GMT+01:00 Hans van Kranenburg >> : >>> On 01/03/2017 08:24 PM, Peter Becker wrote: All invocations are justified, but not relevant in (offline) backup and archive scenarios. For example you have multiple version of append-only log-files or append-only db-files (each more then 100GB in size), like this: > Snapshot_01_01_2017 -> file1.log .. 201 GB > Snapshot_02_01_2017 -> file1.log .. 205 GB > Snapshot_03_01_2017 -> file1.log .. 221 GB The first 201 GB would be every time the same. Files a copied at night from windows, linux or bsd systems and snapshoted after copy. >>> XY problem? >>> >>> Why not use rsync --inplace in combination with btrfs snapshots? Even if >>> the remote does not support rsync and you need to pull the full file >>> first, you could again use rsync locally. > please don't toppost > > Also, there is a rather huge difference in the two approaches, given the > way how btrfs works internally. > > Say, I have a subvolume with thousands of directories and millions of > files with random data in it, and I want to have a second deduped copy > of it. > > Approach 1: > > Create a full copy of everything (compare: retrieving remote file again) > (now 200% of data storage is used), and after that do deduplication, so > that again only 100% of data storage is used. > > Approach 2: > > cp -av --reflink original/ copy/ > > By doing this, you end up with the same as doing approach 1 if your > deduper is the most ideal in the world (and the files are so random they > don't contain duplicate blocks inside them). 
>
> Approach 3:
>
> btrfs sub snap original copy
>
> W00t, that was fast, and the only thing that happened was writing a few
> 16kB metadata pages again. (1 for the toplevel tree page that got cloned
> into a new filesystem tree, and a few for the blocks one level lower to
> add backreferences to the new root).
>
> So:
>
> The big difference in the end result between approaches 1, 2 and otoh 3
> is that while deduplicating your data, you're actually duplicating all
> your metadata at the same time.
>
> In your situation, if possible, doing an rsync --inplace from the remote,
> so that only changed appended data gets stored, and then using native
> btrfs snapshotting would seem the most effective.

Or use UrBackup as backup software. It uses the snapshot-then-modify
approach with btrfs, plus you get file-level deduplication between clients
using reflinks.