Re: xfstests generic/476 failed on btrfs(errno=-12 Out of memory, kernel 5.11.10)

2021-04-13 Thread Martin Raiber

On 30.03.2021 09:16 Wang Yugui wrote:

Hi,


On 30.03.21 at 9:24, Wang Yugui wrote:

Hi, Nikolay Borisov

With a number of dump_stack()/printk calls inserted around ENOMEM sites in the
btrfs code, we found the call stack leading to the ENOMEM.
See the file -btrfs-dump_stack-when-ENOMEM.patch


#cat /usr/hpc-bio/xfstests/results//generic/476.dmesg
...
[ 5759.102929] ENOMEM btrfs_drew_lock_init
[ 5759.102943] ENOMEM btrfs_init_fs_root
[ 5759.102947] [ cut here ]
[ 5759.102950] BTRFS: Transaction aborted (error -12)
[ 5759.103052] WARNING: CPU: 14 PID: 2741468 at 
/ssd/hpc-bio/linux-5.10.27/fs/btrfs/transaction.c:1705 
create_pending_snapshot+0xb8c/0xd50 [btrfs]
...


btrfs_drew_lock_init() returns -ENOMEM;
this is the source:

 /*
  * We might be called under a transaction (e.g. indirect backref
  * resolution) which could deadlock if it triggers memory reclaim
  */
 nofs_flag = memalloc_nofs_save();
 ret = btrfs_drew_lock_init(&root->snapshot_lock);
 memalloc_nofs_restore(nofs_flag);
 if (ret == -ENOMEM) printk("ENOMEM btrfs_drew_lock_init\n");
 if (ret)
 goto fail;

And the source comes from:

commit dcc3eb9638c3c927f1597075e851d0a16300a876
Author: Nikolay Borisov 
Date:   Thu Jan 30 14:59:45 2020 +0200

 btrfs: convert snapshot/nocow exlcusion to drew lock


Any advice to fix this ENOMEM problem?

This is likely coming from changed behavior in MM and doesn't seem related
to btrfs. We have multiple places where memalloc_nofs_save() is called. By
the same token the failure might have occurred in any other place, in any
other piece of code which uses memalloc_nofs_save; there is no indication
that this is directly related to btrfs.


The top command shows that this server has enough memory.

The hardware of this server:
CPU:  Xeon(R) CPU E5-2660 v2(10 core)  *2
memory:  192G, no swap

You are showing that the server has 192G of installed memory, but you have
not shown any stats that prove what the state of the MM subsystem was at the
time of failure. At the very least, at the time of failure inspect the
output of:

cat /proc/meminfo

and the "free -m" command.



Only one xfstest job is running in this server.


Had what looks like the same issue happening on a server:

[19146.391015] [ cut here ]
[19146.391017] BTRFS: Transaction aborted (error -12)
[19146.391035] WARNING: CPU: 13 PID: 1825871 at 
fs/btrfs/transaction.c:1684 create_pending_snapshot+0x912/0xd10
[19146.391036] Modules linked in: bcache crc64 loop dm_crypt bfq xfs 
dm_mod st sr_mod cdrom intel_powerclamp coretemp dcdbas kvm_intel 
snd_pcm snd_timer kvm snd irqbypass soundcore mgag200 serio_raw pcspkr 
drm_kms_helper evdev joydev iTCO_wdt iTCO_vendor_support i2c_algo_bit 
i7core_edac sg ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter 
button ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp 
libiscsi scsi_transport_iscsi drm configfs ip_tables x_tables autofs4 
raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor 
async_tx raid1 raid0 multipath linear md_mod sd_mod hid_generic usbhid 
hid crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel 
aesni_intel crypto_simd ahci cryptd glue_helper mpt3sas libahci uhci_hcd 
ehci_pci psmouse ehci_hcd lpc_ich raid_class libata nvme 
scsi_transport_sas mfd_core usbcore nvme_core scsi_mod t10_pi bnx2
[19146.391092] CPU: 13 PID: 1825871 Comm: btrfs Tainted: G W I   
5.10.26 #1
[19146.391093] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 
1.14.0 05/30/2018

[19146.391095] RIP: 0010:create_pending_snapshot+0x912/0xd10
[19146.391097] Code: 48 0f ba aa 40 0a 00 00 02 72 28 83 f8 fb 74 48 83 
f8 e2 74 43 89 c6 48 c7 c7 70 2d 10 82 48 89 85 78 ff ff ff e8 d5 65 55 
00 <0f> 0b 48 8b 85 78 ff ff ff 89 c1 ba 94 06 00 00 48 c7 c6 70 46 e4

[19146.391098] RSP: 0018:c900201c3b00 EFLAGS: 00010286
[19146.391099] RAX:  RBX: 8881ba393200 RCX: 
0fb98b88
[19146.391100] RDX: ffd8 RSI: 0027 RDI: 
0fb98b80
[19146.391101] RBP: c900201c3bd0 R08: 825e2148 R09: 
00027ffb
[19146.391101] R10: 8000 R11: 3fff R12: 
888119dd39c0
[19146.391102] R13: 888248c36800 R14: 888a1bf69800 R15: 
fff4
[19146.391103] FS:  7f1d7c9488c0() GS:0fb8() 
knlGS:

[19146.391104] CS:  0010 DS:  ES:  CR0: 80050033
[19146.391105] CR2: 7fffef58d000 CR3: 00028c988004 CR4: 
000206e0

[19146.391106] Call Trace:
[19146.39]  ? create_pending_snapshots+0xa2/0xc0
[19146.391112]  create_pending_snapshots+0xa2/0xc0
[19146.391114]  btrfs_commit_transaction+0x4b9/0xb40
[19146.391116]  ? start_transaction+0xd2/0x580
[19146.391119]  btrfs_mksubvol+0x29e/0x450
[19146.391122]  btrfs_mksnapshot+0x7b/0xb0
[19146.391124]  __btrfs_ioctl_snap_create+0x16f/0x180
[19146.391126]  btrfs_ioctl_snap_create_v2+0xb3/0x130
[19146.391128]  btrfs_ioctl+0x15f/0x3040
[19146.391131]  ? __x64_sy

Re: ENOSPC in btrfs_run_delayed_refs with 5.10.8

2021-04-08 Thread Martin Raiber

On 11.03.2021 18:58 Martin Raiber wrote:

On 01.02.2021 23:08 Martin Raiber wrote:

On 27.01.2021 22:03 Chris Murphy wrote:

On Wed, Jan 27, 2021 at 10:27 AM Martin Raiber  wrote:

Hi,

seems 5.10.8 still has the ENOSPC issue when compression is used 
(compress-force=zstd,space_cache=v2):

Jan 27 11:02:14  kernel: [248571.569840] [ cut here ]
Jan 27 11:02:14  kernel: [248571.569843] BTRFS: Transaction aborted (error -28)
Jan 27 11:02:14  kernel: [248571.569845] BTRFS: error (device dm-0) in 
add_to_free_space_tree:1039: errno=-28 No space left
Jan 27 11:02:14  kernel: [248571.569848] BTRFS info (device dm-0): forced 
readonly
Jan 27 11:02:14  kernel: [248571.569851] BTRFS: error (device dm-0) in 
add_to_free_space_tree:1039: errno=-28 No space left
Jan 27 11:02:14  kernel: [248571.569852] BTRFS: error (device dm-0) in 
__btrfs_free_extent:3270: errno=-28 No space left
Jan 27 11:02:14  kernel: [248571.569854] BTRFS: error (device dm-0) in 
btrfs_run_delayed_refs:2191: errno=-28 No space left
Jan 27 11:02:14  kernel: [248571.569898] WARNING: CPU: 3 PID: 21255 at 
fs/btrfs/free-space-tree.c:1039 add_to_free_space_tree+0xe8/0x130
Jan 27 11:02:14  kernel: [248571.569913] BTRFS: error (device dm-0) in 
__btrfs_free_extent:3270: errno=-28 No space left
Jan 27 11:02:14  kernel: [248571.569939] Modules linked in:
Jan 27 11:02:14  kernel: [248571.569966] BTRFS: error (device dm-0) in 
btrfs_run_delayed_refs:2191: errno=-28 No space left
Jan 27 11:02:14  kernel: [248571.569992]  bfq zram bcache crc64 loop dm_crypt 
xfs dm_mod st sr_mod cdrom nf_tables nfnetlink iptable_filter bridge stp llc 
intel_powerclamp coretemp k$
Jan 27 11:02:14  kernel: [248571.570075] CPU: 3 PID: 21255 Comm: kworker/u50:22 
Tainted: G  I   5.10.8 #1
Jan 27 11:02:14  kernel: [248571.570076] Hardware name: Dell Inc. PowerEdge 
R510/0DPRKF, BIOS 1.13.0 03/02/2018
Jan 27 11:02:14  kernel: [248571.570079] Workqueue: events_unbound 
btrfs_async_reclaim_metadata_space
Jan 27 11:02:14  kernel: [248571.570081] RIP: 
0010:add_to_free_space_tree+0xe8/0x130
Jan 27 11:02:14  kernel: [248571.570082] Code: 55 50 f0 48 0f ba aa 40 0a 00 00 02 72 
22 83 f8 fb 74 4c 83 f8 e2 74 47 89 c6 48 c7 c7 b8 39 49 82 89 44 24 04 e8 8a 99 4a 
00 <0f> 0b 8$
Jan 27 11:02:14  kernel: [248571.570083] RSP: 0018:c90009c57b88 EFLAGS: 
00010282
Jan 27 11:02:14  kernel: [248571.570084] RAX:  RBX: 
4000 RCX: 0027
Jan 27 11:02:14  kernel: [248571.570085] RDX: 0027 RSI: 
0004 RDI: 888617a58b88
Jan 27 11:02:14  kernel: [248571.570086] RBP: 8889ecb874e0 R08: 
888617a58b80 R09: 
Jan 27 11:02:14  kernel: [248571.570087] R10: 0001 R11: 
822372e0 R12: 00574151
Jan 27 11:02:14  kernel: [248571.570087] R13: 8884e05727e0 R14: 
88815ae4fc00 R15: 88815ae4fdd8
Jan 27 11:02:14  kernel: [248571.570088] FS:  () 
GS:888617a4() knlGS:
Jan 27 11:02:14  kernel: [248571.570089] CS:  0010 DS:  ES:  CR0: 
80050033
Jan 27 11:02:14  kernel: [248571.570090] CR2: 7eb4a3a4f00a CR3: 
0260a005 CR4: 000206e0
Jan 27 11:02:14  kernel: [248571.570091] Call Trace:
Jan 27 11:02:14  kernel: [248571.570097]  __btrfs_free_extent.isra.0+0x56a/0xa10
Jan 27 11:02:14  kernel: [248571.570100]  __btrfs_run_delayed_refs+0x659/0xf20
Jan 27 11:02:14  kernel: [248571.570102]  btrfs_run_delayed_refs+0x73/0x200
Jan 27 11:02:14  kernel: [248571.570103]  flush_space+0x4e8/0x5e0
Jan 27 11:02:14  kernel: [248571.570105]  ? btrfs_get_alloc_profile+0x66/0x1b0
Jan 27 11:02:14  kernel: [248571.570106]  ? btrfs_get_alloc_profile+0x66/0x1b0
Jan 27 11:02:14  kernel: [248571.570107]  
btrfs_async_reclaim_metadata_space+0x107/0x3a0
Jan 27 11:02:14  kernel: [248571.570111]  process_one_work+0x1b6/0x350
Jan 27 11:02:14  kernel: [248571.570112]  worker_thread+0x50/0x3b0
Jan 27 11:02:14  kernel: [248571.570114]  ? process_one_work+0x350/0x350
Jan 27 11:02:14  kernel: [248571.570116]  kthread+0xfe/0x140
Jan 27 11:02:14  kernel: [248571.570117]  ? kthread_park+0x90/0x90
Jan 27 11:02:14  kernel: [248571.570120]  ret_from_fork+0x22/0x30
Jan 27 11:02:14  kernel: [248571.570122] ---[ end trace 568d2f30de65b1c0 ]---
Jan 27 11:02:14  kernel: [248571.570123] BTRFS: error (device dm-0) in 
add_to_free_space_tree:1039: errno=-28 No space left
Jan 27 11:02:14  kernel: [248571.570151] BTRFS: error (device dm-0) in 
__btrfs_free_extent:3270: errno=-28 No space left
Jan 27 11:02:14  kernel: [248571.570178] BTRFS: error (device dm-0) in 
btrfs_run_delayed_refs:2191: errno=-28 No space left


btrfs fi usage:

Overall:
    Device size:                 931.49GiB
    Device allocated:            931.49GiB
    Device unallocated:            1.00MiB
    Device missing:                  0.00B
    Used:                        786.39GiB
    Free (estimated):            107.69GiB  (min: 107.69GiB)
   

Re: btrfs-send format that contains binary diffs

2021-03-29 Thread Martin Raiber
On 29.03.2021 19:25 Henning Schild wrote:
> On Mon, 29 Mar 2021 19:30:34 +0300, Andrei Borzenkov wrote:
>
>> On 29.03.2021 16:16, Claudius Heine wrote:
>>> Hi,
>>>
>>> I am currently investigating the possibility to use `btrfs-stream`
>>> files (generated by `btrfs send`) for deploying a image based
>>> update to systems (probably embedded ones).
>>>
>>> One of the issues I encountered here is that btrfs-send does not
>>> use any diff algorithm on files that have changed from one snapshot
>>> to the next. 
>> btrfs send works on block level. It sends blocks that differ between
>> two snapshots.
>>
>>> One way to implement this would be to add some sort of 'patch'
>>> command to the `btrfs-stream` format.
>>>   
>> This would require reading complete content of both snapshots instead
>> if just computing block diff using metadata. Unless I misunderstand
>> what you mean.
> On embedded systems it is common to update complete "firmware" images
> as opposed to package based partial updates. You often have two root
> filesystems to be able to always fall back to a working state in case
> of any sort or error.
>
> Take the picture from
> https://sbabic.github.io/swupdate/overview.html#double-copy
>
> and assume that "Application software" is a full blown OS with
> everything that makes your device.
>
> That approach offers great "control" but unfortunately can also lead to
> great downloads required for an update. The basic idea is to download
> the binary-diff between the future and the current rootfs only.
> Given a filesystem supports snapshots, it would be great to
> "send/receive" them as diffs.
>
> Today most people that do such things with other fss script around with
> xdelta etc. But btrfs is more "integrated", so when considering it for
> such embedded usecases native support would most likely be better than
> hacks on top.
>
> We have several use-cases in mind with btrfs.
>  - ro-base with rw overlays
>  - binary diff updates against such a ro-base
>  - backup/restore with snapshots of certain subvolumes
>  - factory reset with wiping certain submodules
>
> regards,
> Henning

I think I know what you want to accomplish and I've been doing it for a while 
now. But I'm not sure what the problem with btrfs send is. Do you want a 
non-block-based diff to make updates smaller? Have you overwritten files 
completely and need to dedupe or reflink them before sending them? 
Theoretically the btrfs send format would be able to support something like 
bsdiff (a non-block-based diff -- the stream is just a set of e.g. write 
commands with offset and binary data, or reflink commands that copy data from 
one file to another), but there currently isn't a tool to create this.

How I've done it is:

 - Create a btrfs image with a rw sys_root_current subvol
 - E.g. debootstrap a Linux system into it
 - Create sys_root_v1 as ro snapshot of sys_root_current

Use that system image on different systems.

On update on the original image:

 - Modify sys_root_current
 - Create ro snapshot sys_root_v2 of sys_root_current
 - Create a btrfs send update that modifies sys_root_v1 to sys_root_v2: btrfs 
send -p sys_root_v1 sys_root_v2 | xz -c > update_v1.btrfs.xz
 - Publish update_v1.btrfs.xz

On the systems:

 - Download update_v1.btrfs.xz (verify signature)
 - Create sys_root_v2 by applying differences to sys_root_v1: cat 
update_v1.btrfs.xz | xz -d -c | btrfs receive /rootfs
 - Rename (exchange) sys_root_current to sys_root_last
 - Create rw snapshot of sys_root_v2 as sys_root_current
 - Reboot into new system

>>> Is this something upstream would be interested in?
>>>
>>> Lets say we introduce a new `btrfs-send` format, lets call it
>>> `btrfs-delta-stream`, which can be created from a
>>> `btrfs-stream`:
>>>
>>> 1. For all `write` commands, check the requirements:
>>>    - Does the file already exist in the old snapshot?
>>>    - Is the file smaller than xMiB (this depends on the diff-algo
>>> and the available resources)
>>> 2. If the file fulfills those requirements, replace 'write' command
>>> with 'patch' command, and calculate the binary delta.  Also check
>>> if the delta is actually smaller than the data of the new file.
>>> Possibly add the used binary diff algo as well as a checksum of the
>>> 'old' file to the command as well.
>>>
>>> This file format can of course be converted back to `btrfs-stream`
>>> and then applied with `btrfs-receive`.
>>>
>>> I would probably start with `bsdiff` for the diff algorithm, but
>>> maybe we want to be flexible here.
>>>
>>> Of course if `btrfs-delta-stream` is implemented in `btrfs-progs`
>>> then, we can create and apply this format directly.
>>>
>>> regards,
>>> Claudius  




Re: Multiple files with the same name in one directory

2021-03-11 Thread Martin Raiber
On 11.03.2021 15:43 Filipe Manana wrote:
> On Wed, Mar 10, 2021 at 5:18 PM Martin Raiber  wrote:
>> Hi,
>>
>> I have this in a btrfs directory. Linux kernel 5.10.16, no errors in dmesg, 
>> no scrub errors:
>>
>> ls -lh
>> total 19G
>> -rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
>> -rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
>> ...
>>
>> disk_config.dat gets written to using fsync rename ( write new version to 
>> disk_config.dat.new, fsync disk_config.dat.new, then rename to 
>> disk_config.dat -- it is missing the parent directory fsync).
> That's interesting.
>
> I've just tried something like the following on 5.10.15 (and 5.12-rc2):
>
> create disk_config.dat
> sync
> for ((i = 0; i < 10; i++)); do
> create disk_config.dat.new
> write to disk_config.dat.new
> fsync disk_config.dat.new
> mv -f disk_config.dat.new disk_config.dat
> done
> 
> mount fs
> list directory
>
> I only get one file with the name disk_config.dat and one file with
> the name disk_config.dat.new.
> File disk_config.dat has the data written at iteration 9 and
> disk_config.dat.new has the data written at iteration 10 (expected).
>
> You haven't mentioned, but I suppose you had a power failure / unclean
> shutdown somewhere after an fsync, right?
> Is this something you can reproduce at will?

I think I rebooted via "echo b > /proc/sysrq-trigger". But at that point it 
probably hadn't written to disk_config.dat anymore (for more than the commit 
interval). I'm also not sure how long it took me to notice the multiple files 
(since they don't cause any problems) -- I can't reproduce it.

This is the same machine and file system as the ENOSPC in 
btrfs_async_reclaim_metadata_space -> flush_space -> btrfs_run_delayed_refs. 
Could it be that something went wrong with the error handling/remount-ro w.r.t. 
the tree log?

>
>> So far no negative consequences... (except that programs might get confused).
>>
>> echo 3 > /proc/sys/vm/drop_caches doesn't help.
>>
>> Regards,
>> Martin Raiber




Re: ENOSPC in btrfs_run_delayed_refs with 5.10.8

2021-03-11 Thread Martin Raiber
On 01.02.2021 23:08 Martin Raiber wrote:
> On 27.01.2021 22:03 Chris Murphy wrote:
>> On Wed, Jan 27, 2021 at 10:27 AM Martin Raiber  wrote:
>>> Hi,
>>>
>>> seems 5.10.8 still has the ENOSPC issue when compression is used 
>>> (compress-force=zstd,space_cache=v2):
>>>
>>> Jan 27 11:02:14  kernel: [248571.569840] [ cut here 
>>> ]
>>> Jan 27 11:02:14  kernel: [248571.569843] BTRFS: Transaction aborted (error 
>>> -28)
>>> Jan 27 11:02:14  kernel: [248571.569845] BTRFS: error (device dm-0) in 
>>> add_to_free_space_tree:1039: errno=-28 No space left
>>> Jan 27 11:02:14  kernel: [248571.569848] BTRFS info (device dm-0): forced 
>>> readonly
>>> Jan 27 11:02:14  kernel: [248571.569851] BTRFS: error (device dm-0) in 
>>> add_to_free_space_tree:1039: errno=-28 No space left
>>> Jan 27 11:02:14  kernel: [248571.569852] BTRFS: error (device dm-0) in 
>>> __btrfs_free_extent:3270: errno=-28 No space left
>>> Jan 27 11:02:14  kernel: [248571.569854] BTRFS: error (device dm-0) in 
>>> btrfs_run_delayed_refs:2191: errno=-28 No space left
>>> Jan 27 11:02:14  kernel: [248571.569898] WARNING: CPU: 3 PID: 21255 at 
>>> fs/btrfs/free-space-tree.c:1039 add_to_free_space_tree+0xe8/0x130
>>> Jan 27 11:02:14  kernel: [248571.569913] BTRFS: error (device dm-0) in 
>>> __btrfs_free_extent:3270: errno=-28 No space left
>>> Jan 27 11:02:14  kernel: [248571.569939] Modules linked in:
>>> Jan 27 11:02:14  kernel: [248571.569966] BTRFS: error (device dm-0) in 
>>> btrfs_run_delayed_refs:2191: errno=-28 No space left
>>> Jan 27 11:02:14  kernel: [248571.569992]  bfq zram bcache crc64 loop 
>>> dm_crypt xfs dm_mod st sr_mod cdrom nf_tables nfnetlink iptable_filter 
>>> bridge stp llc intel_powerclamp coretemp k$
>>> Jan 27 11:02:14  kernel: [248571.570075] CPU: 3 PID: 21255 Comm: 
>>> kworker/u50:22 Tainted: G  I   5.10.8 #1
>>> Jan 27 11:02:14  kernel: [248571.570076] Hardware name: Dell Inc. PowerEdge 
>>> R510/0DPRKF, BIOS 1.13.0 03/02/2018
>>> Jan 27 11:02:14  kernel: [248571.570079] Workqueue: events_unbound 
>>> btrfs_async_reclaim_metadata_space
>>> Jan 27 11:02:14  kernel: [248571.570081] RIP: 
>>> 0010:add_to_free_space_tree+0xe8/0x130
>>> Jan 27 11:02:14  kernel: [248571.570082] Code: 55 50 f0 48 0f ba aa 40 0a 
>>> 00 00 02 72 22 83 f8 fb 74 4c 83 f8 e2 74 47 89 c6 48 c7 c7 b8 39 49 82 89 
>>> 44 24 04 e8 8a 99 4a 00 <0f> 0b 8$
>>> Jan 27 11:02:14  kernel: [248571.570083] RSP: 0018:c90009c57b88 EFLAGS: 
>>> 00010282
>>> Jan 27 11:02:14  kernel: [248571.570084] RAX:  RBX: 
>>> 4000 RCX: 0027
>>> Jan 27 11:02:14  kernel: [248571.570085] RDX: 0027 RSI: 
>>> 0004 RDI: 888617a58b88
>>> Jan 27 11:02:14  kernel: [248571.570086] RBP: 8889ecb874e0 R08: 
>>> 888617a58b80 R09: 
>>> Jan 27 11:02:14  kernel: [248571.570087] R10: 0001 R11: 
>>> 822372e0 R12: 00574151
>>> Jan 27 11:02:14  kernel: [248571.570087] R13: 8884e05727e0 R14: 
>>> 88815ae4fc00 R15: 88815ae4fdd8
>>> Jan 27 11:02:14  kernel: [248571.570088] FS:  () 
>>> GS:888617a4() knlGS:
>>> Jan 27 11:02:14  kernel: [248571.570089] CS:  0010 DS:  ES:  CR0: 
>>> 80050033
>>> Jan 27 11:02:14  kernel: [248571.570090] CR2: 7eb4a3a4f00a CR3: 
>>> 0260a005 CR4: 000206e0
>>> Jan 27 11:02:14  kernel: [248571.570091] Call Trace:
>>> Jan 27 11:02:14  kernel: [248571.570097]  
>>> __btrfs_free_extent.isra.0+0x56a/0xa10
>>> Jan 27 11:02:14  kernel: [248571.570100]  
>>> __btrfs_run_delayed_refs+0x659/0xf20
>>> Jan 27 11:02:14  kernel: [248571.570102]  btrfs_run_delayed_refs+0x73/0x200
>>> Jan 27 11:02:14  kernel: [248571.570103]  flush_space+0x4e8/0x5e0
>>> Jan 27 11:02:14  kernel: [248571.570105]  ? 
>>> btrfs_get_alloc_profile+0x66/0x1b0
>>> Jan 27 11:02:14  kernel: [248571.570106]  ? 
>>> btrfs_get_alloc_profile+0x66/0x1b0
>>> Jan 27 11:02:14  kernel: [248571.570107]  
>>> btrfs_async_reclaim_metadata_space+0x107/0x3a0
>>> Jan 27 11:02:14  kernel: [248571.570111]  process_one_work+0x1b6/0x350
>>> Jan 27 11:02:14  kernel: [248571.570112]  worker_thread+0x50/0x3b0
>>> Jan 27 11:02:14  kernel: [248571.570114]  ? process_one_work+0x350/0x350
>>> Jan 27 11:02:14

Multiple files with the same name in one directory

2021-03-10 Thread Martin Raiber
Hi,

I have this in a btrfs directory. Linux kernel 5.10.16, no errors in dmesg, no 
scrub errors:

ls -lh
total 19G
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
...

disk_config.dat gets written to using fsync rename ( write new version to 
disk_config.dat.new, fsync disk_config.dat.new, then rename to disk_config.dat 
-- it is missing the parent directory fsync).
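
For reference, a minimal userspace sketch of that update pattern, including
the parent-directory fsync this report notes is missing (file names taken
from the example above, error handling trimmed):

#include <fcntl.h>
#include <unistd.h>

/* Write the new config to disk_config.dat.new, persist it, then rename it
 * over disk_config.dat. */
static int write_config(const char *data, size_t len)
{
    int fd = open("disk_config.dat.new", O_WRONLY | O_CREAT | O_TRUNC, 0750);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    close(fd);

    if (rename("disk_config.dat.new", "disk_config.dat") != 0)
        return -1;

    /* The step noted as missing above: fsync the containing directory so
     * the rename itself is persisted, not only at the next commit. */
    int dirfd = open(".", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;
    int ret = fsync(dirfd);
    close(dirfd);
    return ret;
}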

So far no negative consequences... (except that programs might get confused).

echo 3 > /proc/sys/vm/drop_caches doesn't help.

Regards,
Martin Raiber



Re: [PATCH] btrfs: Prevent nowait or async read from doing sync IO

2021-03-08 Thread Martin Raiber
On 26.02.2021 18:00 David Sterba wrote:
> On Fri, Jan 08, 2021 at 12:02:48AM +0000, Martin Raiber wrote:
>> When reading from btrfs file via io_uring I get following
>> call traces:
>>
>> [<0>] wait_on_page_bit+0x12b/0x270
>> [<0>] read_extent_buffer_pages+0x2ad/0x360
>> [<0>] btree_read_extent_buffer_pages+0x97/0x110
>> [<0>] read_tree_block+0x36/0x60
>> [<0>] read_block_for_search.isra.0+0x1a9/0x360
>> [<0>] btrfs_search_slot+0x23d/0x9f0
>> [<0>] btrfs_lookup_csum+0x75/0x170
>> [<0>] btrfs_lookup_bio_sums+0x23d/0x630
>> [<0>] btrfs_submit_data_bio+0x109/0x180
>> [<0>] submit_one_bio+0x44/0x70
>> [<0>] extent_readahead+0x37a/0x3a0
>> [<0>] read_pages+0x8e/0x1f0
>> [<0>] page_cache_ra_unbounded+0x1aa/0x1f0
>> [<0>] generic_file_buffered_read+0x3eb/0x830
>> [<0>] io_iter_do_read+0x1a/0x40
>> [<0>] io_read+0xde/0x350
>> [<0>] io_issue_sqe+0x5cd/0xed0
>> [<0>] __io_queue_sqe+0xf9/0x370
>> [<0>] io_submit_sqes+0x637/0x910
>> [<0>] __x64_sys_io_uring_enter+0x22e/0x390
>> [<0>] do_syscall_64+0x33/0x80
>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> Prevent those by setting IOCB_NOIO before calling
>> generic_file_buffered_read.
>>
>> Async read has the same problem. So disable that by removing
>> FMODE_BUF_RASYNC. This was added with commit
>> 8730f12b7962b21ea9ad2756abce1e205d22db84 ("btrfs: flag files as
>> supporting buffered async reads") with 5.9. Io_uring will read
>> the data via worker threads if it can't be read without sync IO
>> this way.
>>
>> Signed-off-by: Martin Raiber 
>> ---
>>  fs/btrfs/file.c | 15 +--
>>  1 file changed, 13 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 0e41459b8..8bb561f6d 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -3589,7 +3589,7 @@ static loff_t btrfs_file_llseek(struct file *file, 
>> loff_t offset, int whence)
>>  
>>  static int btrfs_file_open(struct inode *inode, struct file *filp)
>>  {
>> -filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
>> +filp->f_mode |= FMODE_NOWAIT;
>>  return generic_file_open(inode, filp);
>>  }
>>  
>> @@ -3639,7 +3639,18 @@ static ssize_t btrfs_file_read_iter(struct kiocb 
>> *iocb, struct iov_iter *to)
>>  return ret;
>>  }
>>  
>> -return generic_file_buffered_read(iocb, to, ret);
>> +if (iocb->ki_flags & IOCB_NOWAIT)
>> +iocb->ki_flags |= IOCB_NOIO;
>> +
>> +ret = generic_file_buffered_read(iocb, to, ret);
>> +
>> +if (iocb->ki_flags & IOCB_NOWAIT) {
>> +iocb->ki_flags &= ~IOCB_NOIO;
>> +if (ret == 0)
>> +ret = -EAGAIN;
>> +}
> Christoph has some doubts about the code,
> https://lore.kernel.org/lkml/20210226051626.ga2...@lst.de/
>
> The patch has been in for-next but as I'm not sure it's correct and
> don't have a reproducer, I'll remove it again. We do want to fix the
> warning, maybe there's only something trivial missing but we need to be
> sure, I don't have enough expertise here.

The general gist of the criticism is kind of correct. It is 
generic_file_buffered_read/filemap_read that handles IOCB_NOIO, however. It is 
only used by gfs2 since 5.8, and IOCB_NOIO was added in 5.8 with 
41da51bce36f44eefc1e3d0f47d18841cbd065ba.

However, I cannot see how to find out whether readahead was called with 
IOCB_NOWAIT from extent_readahead/btrfs_readahead/readahead_control. So add an 
additional parameter to address_space_operations.readahead? As mentioned, this 
is not too relevant to btrfs (because of the CRC calculation), but making 
readahead async in all cases (incl. IOCB_WAITQ) would be the proper solution.

W.r.t. testing: The most low-effort way I can think of is to add an io_uring 
switch to xfs_io, so that xfstests can be run using io_uring (where possible). 
Then check via tracing/perf that there aren't any call stacks with both 
io_uring_enter and wait_on_page_bit (or any other blocking call) in them.
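
For example, a minimal io_uring read reproducer of the kind described (a
sketch assuming liburing is installed, link with -luring; not an
xfs_io/xfstests patch):

#include <fcntl.h>
#include <stdio.h>
#include <liburing.h>

int main(int argc, char **argv)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[4096];

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0); /* buffered read at offset 0 */
    io_uring_submit(&ring);            /* submission should not block on disk IO */

    io_uring_wait_cqe(&ring, &cqe);    /* completion may come from a worker */
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}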



Space cache

2021-02-03 Thread Martin Raiber
Hi,

I've been looking a bit into the btrfs space cache and came to the following 
conclusions. Please correct me if I'm wrong:

1. The space cache mount option only modifies how the space cache is persisted 
and not the in-memory structures (hence why I have a 2.3 GiB 
btrfs_free_space_bitmap slab with a file system mounted with space_cache=v2).
2. In memory it is mostly kept as bitmaps. space_cache=v1 persists those bitmaps 
directly to disk.
3. If it's mounted with nospace_cache it still gets all the benefits of "space 
cache" _after_ those in-memory bitmaps have been filled; it just isn't persisted.
4. The in-memory space cache doesn't react to memory pressure/is unevictable.

This leads me to:

If one can live with slow startup/initial performance, mounting with 
nospace_cache has the highest performance.

Especially if I have a 1TB NVMe in a long-running server, I don't really care 
if it has to iterate over all block group metadata after mount for a few 
seconds, if that means it has fewer write IOs for every write. The calculus 
obviously changes for a hard disk, where reading this metadata would take 
forever due to low IOPS.

Regards,
Martin Raiber



Re: ENOSPC in btrfs_run_delayed_refs with 5.10.8

2021-02-01 Thread Martin Raiber
On 27.01.2021 22:03 Chris Murphy wrote:
> On Wed, Jan 27, 2021 at 10:27 AM Martin Raiber  wrote:
>> Hi,
>>
>> seems 5.10.8 still has the ENOSPC issue when compression is used 
>> (compress-force=zstd,space_cache=v2):
>>
>> Jan 27 11:02:14  kernel: [248571.569840] [ cut here ]
>> Jan 27 11:02:14  kernel: [248571.569843] BTRFS: Transaction aborted (error 
>> -28)
>> Jan 27 11:02:14  kernel: [248571.569845] BTRFS: error (device dm-0) in 
>> add_to_free_space_tree:1039: errno=-28 No space left
>> Jan 27 11:02:14  kernel: [248571.569848] BTRFS info (device dm-0): forced 
>> readonly
>> Jan 27 11:02:14  kernel: [248571.569851] BTRFS: error (device dm-0) in 
>> add_to_free_space_tree:1039: errno=-28 No space left
>> Jan 27 11:02:14  kernel: [248571.569852] BTRFS: error (device dm-0) in 
>> __btrfs_free_extent:3270: errno=-28 No space left
>> Jan 27 11:02:14  kernel: [248571.569854] BTRFS: error (device dm-0) in 
>> btrfs_run_delayed_refs:2191: errno=-28 No space left
>> Jan 27 11:02:14  kernel: [248571.569898] WARNING: CPU: 3 PID: 21255 at 
>> fs/btrfs/free-space-tree.c:1039 add_to_free_space_tree+0xe8/0x130
>> Jan 27 11:02:14  kernel: [248571.569913] BTRFS: error (device dm-0) in 
>> __btrfs_free_extent:3270: errno=-28 No space left
>> Jan 27 11:02:14  kernel: [248571.569939] Modules linked in:
>> Jan 27 11:02:14  kernel: [248571.569966] BTRFS: error (device dm-0) in 
>> btrfs_run_delayed_refs:2191: errno=-28 No space left
>> Jan 27 11:02:14  kernel: [248571.569992]  bfq zram bcache crc64 loop 
>> dm_crypt xfs dm_mod st sr_mod cdrom nf_tables nfnetlink iptable_filter 
>> bridge stp llc intel_powerclamp coretemp k$
>> Jan 27 11:02:14  kernel: [248571.570075] CPU: 3 PID: 21255 Comm: 
>> kworker/u50:22 Tainted: G  I   5.10.8 #1
>> Jan 27 11:02:14  kernel: [248571.570076] Hardware name: Dell Inc. PowerEdge 
>> R510/0DPRKF, BIOS 1.13.0 03/02/2018
>> Jan 27 11:02:14  kernel: [248571.570079] Workqueue: events_unbound 
>> btrfs_async_reclaim_metadata_space
>> Jan 27 11:02:14  kernel: [248571.570081] RIP: 
>> 0010:add_to_free_space_tree+0xe8/0x130
>> Jan 27 11:02:14  kernel: [248571.570082] Code: 55 50 f0 48 0f ba aa 40 0a 00 
>> 00 02 72 22 83 f8 fb 74 4c 83 f8 e2 74 47 89 c6 48 c7 c7 b8 39 49 82 89 44 
>> 24 04 e8 8a 99 4a 00 <0f> 0b 8$
>> Jan 27 11:02:14  kernel: [248571.570083] RSP: 0018:c90009c57b88 EFLAGS: 
>> 00010282
>> Jan 27 11:02:14  kernel: [248571.570084] RAX:  RBX: 
>> 4000 RCX: 0027
>> Jan 27 11:02:14  kernel: [248571.570085] RDX: 0027 RSI: 
>> 0004 RDI: 888617a58b88
>> Jan 27 11:02:14  kernel: [248571.570086] RBP: 8889ecb874e0 R08: 
>> 888617a58b80 R09: 
>> Jan 27 11:02:14  kernel: [248571.570087] R10: 0001 R11: 
>> 822372e0 R12: 00574151
>> Jan 27 11:02:14  kernel: [248571.570087] R13: 8884e05727e0 R14: 
>> 88815ae4fc00 R15: 88815ae4fdd8
>> Jan 27 11:02:14  kernel: [248571.570088] FS:  () 
>> GS:888617a4() knlGS:
>> Jan 27 11:02:14  kernel: [248571.570089] CS:  0010 DS:  ES:  CR0: 
>> 80050033
>> Jan 27 11:02:14  kernel: [248571.570090] CR2: 7eb4a3a4f00a CR3: 
>> 0260a005 CR4: 000206e0
>> Jan 27 11:02:14  kernel: [248571.570091] Call Trace:
>> Jan 27 11:02:14  kernel: [248571.570097]  
>> __btrfs_free_extent.isra.0+0x56a/0xa10
>> Jan 27 11:02:14  kernel: [248571.570100]  
>> __btrfs_run_delayed_refs+0x659/0xf20
>> Jan 27 11:02:14  kernel: [248571.570102]  btrfs_run_delayed_refs+0x73/0x200
>> Jan 27 11:02:14  kernel: [248571.570103]  flush_space+0x4e8/0x5e0
>> Jan 27 11:02:14  kernel: [248571.570105]  ? 
>> btrfs_get_alloc_profile+0x66/0x1b0
>> Jan 27 11:02:14  kernel: [248571.570106]  ? 
>> btrfs_get_alloc_profile+0x66/0x1b0
>> Jan 27 11:02:14  kernel: [248571.570107]  
>> btrfs_async_reclaim_metadata_space+0x107/0x3a0
>> Jan 27 11:02:14  kernel: [248571.570111]  process_one_work+0x1b6/0x350
>> Jan 27 11:02:14  kernel: [248571.570112]  worker_thread+0x50/0x3b0
>> Jan 27 11:02:14  kernel: [248571.570114]  ? process_one_work+0x350/0x350
>> Jan 27 11:02:14  kernel: [248571.570116]  kthread+0xfe/0x140
>> Jan 27 11:02:14  kernel: [248571.570117]  ? kthread_park+0x90/0x90
>> Jan 27 11:02:14  kernel: [248571.570120]  ret_from_fork+0x22/0x30
>> Jan 27 11:02:14  kernel: [248571.570122] ---[ end trace 568d2f30de65b1c0 ]---
>> Jan 27 11:02:14  kernel: [248571.570123] BTRFS: error (device dm-0)

ENOSPC in btrfs_run_delayed_refs with 5.10.8 + zstd

2021-01-27 Thread Martin Raiber

Data,single: Size:884.48GiB, Used:776.79GiB (87.82%)
   /dev/mapper/LUKS-RC-a6414fd731ce4f878af44c3987bce533  884.48GiB

Metadata,single: Size:47.01GiB, Used:9.59GiB (20.41%)
   /dev/mapper/LUKS-RC-a6414fd731ce4f878af44c3987bce533   47.01GiB

System,single: Size:4.00MiB, Used:144.00KiB (3.52%)
   /dev/mapper/LUKS-RC-a6414fd731ce4f878af44c3987bce533    4.00MiB

Unallocated:
   /dev/mapper/LUKS-RC-a6414fd731ce4f878af44c3987bce533    1.00MiB


Regards,
Martin Raiber



Re: [PATCH] btrfs: Prevent nowait or async read from doing sync IO

2021-01-24 Thread Martin Raiber

On 12.01.2021 18:01 Pavel Begunkov wrote:

On 12/01/2021 15:36, David Sterba wrote:

On Fri, Jan 08, 2021 at 12:02:48AM +, Martin Raiber wrote:

When reading from btrfs file via io_uring I get following
call traces:

Is there a way to reproduce by common tools (fio) or is a specialized
one needed?

I'm not familiar with this particular issue, but:
should _probably_ be reproducible with fio with io_uring engine
or fio/t/io_uring tool.


[<0>] wait_on_page_bit+0x12b/0x270
[<0>] read_extent_buffer_pages+0x2ad/0x360
[<0>] btree_read_extent_buffer_pages+0x97/0x110
[<0>] read_tree_block+0x36/0x60
[<0>] read_block_for_search.isra.0+0x1a9/0x360
[<0>] btrfs_search_slot+0x23d/0x9f0
[<0>] btrfs_lookup_csum+0x75/0x170
[<0>] btrfs_lookup_bio_sums+0x23d/0x630
[<0>] btrfs_submit_data_bio+0x109/0x180
[<0>] submit_one_bio+0x44/0x70
[<0>] extent_readahead+0x37a/0x3a0
[<0>] read_pages+0x8e/0x1f0
[<0>] page_cache_ra_unbounded+0x1aa/0x1f0
[<0>] generic_file_buffered_read+0x3eb/0x830
[<0>] io_iter_do_read+0x1a/0x40
[<0>] io_read+0xde/0x350
[<0>] io_issue_sqe+0x5cd/0xed0
[<0>] __io_queue_sqe+0xf9/0x370
[<0>] io_submit_sqes+0x637/0x910
[<0>] __x64_sys_io_uring_enter+0x22e/0x390
[<0>] do_syscall_64+0x33/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Prevent those by setting IOCB_NOIO before calling
generic_file_buffered_read.

Async read has the same problem. So disable that by removing
FMODE_BUF_RASYNC. This was added with commit
8730f12b7962b21ea9ad2756abce1e205d22db84 ("btrfs: flag files as

Oh yeah that's the commit that went to btrfs code out-of-band. I am not
familiar with the io_uring support and have no good idea what the new
flag was supposed to do.

iirc, Jens did make buffered IO asynchronous by waiting on a page
with wait_page_queue, but don't remember well enough.


supporting buffered async reads") with 5.9. Io_uring will read
the data via worker threads if it can't be read without sync IO
this way.

What are the implications of that? Like more context switching (due to
the worker threads) or other potential performance related problems?

io_uring splits submission and completion steps and usually expect
submissions to happen quick and not block (at least for long),
otherwise it can't submit other requests, that reduces QD and so
forth. In the worst case it can serialise it to QD1. I guess the
same can be applied to AIO.


io_submit historically had the problem that it is truly async only for 
certain operations. That's why everyone uses it only for async direct I/O with 
preallocated files (and even then e.g. MySQL has innodb_use_native_aio as a 
tuning option that replaces io_submit with a userspace thread pool). io_uring 
is fixing that by making everything async, so the thread calling 
io_uring_enter should never do any IO (only read from the page cache etc.). 
The idea is that one can build e.g. a web server that uses only one thread, 
does all (formerly blocking) syscalls via io_uring, and handles a large number 
of connections. If btrfs does blocking IO in this one thread, such a web 
server wouldn't work well with btrfs, since the blocking call would e.g. delay 
accepting new connections.


Specifically w.r.t. read(), io_uring has the following logic:

 * Try read_iter with RWF_NOWAIT/IOCB_NOWAIT.
 * If read_iter returns -EAGAIN, look at the FMODE_BUF_RASYNC flag. If it is
   set, do the read with IOCB_WAITQ and a callback set (AIO).
 * If FMODE_BUF_RASYNC is not set, do a sync read in an io_uring worker thread.

My guess is that since btrfs needs to do the checksum calculations in a 
worker anyway, it's best/simpler not to support the AIO submission (i.e. not 
set FMODE_BUF_RASYNC).


W.r.t. RWF_NOWAIT, the problem is that the csum is read synchronously before 
the page reads are submitted asynchronously. When reading randomly from a 
(large) file this means at least one synchronous read per io_uring 
submission. I guess the same happens for preadv2 with RWF_NOWAIT (and 
io_submit); man page:



RWF_NOWAIT (since Linux 4.14)
   Do not wait for data which is not immediately available.
   If this flag is specified, the preadv2() system call will
   return instantly if it would have to read data from the
   backing storage or wait for a lock.  If some data was
   successfully read, it will return the number of bytes
   read.  If no bytes were read, it will return -1 and set
   errno to EAGAIN.  Currently, this flag is meaningful only
   for preadv2().


I haven't tested this, but the same would probably happen if it doesn't 
have the extents in cache, though that might happen seldom enough that 
it's not worth fixing (for now).
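
For illustration, a small userspace sketch of that pattern with preadv2()
(hypothetical file path from argv; the blocking fallback stands in for what
io_uring would otherwise do in a worker thread):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[4096];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0)
        return 1;

    /* First attempt: do not block; returns EAGAIN if the data would have
     * to come from backing storage. */
    ssize_t n = preadv2(fd, &iov, 1, 0, RWF_NOWAIT);
    if (n < 0 && errno == EAGAIN)
        n = preadv2(fd, &iov, 1, 0, 0);  /* fall back to a normal blocking read */

    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}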


I did also look at how ext4 with fs-ve

Re: [RFC][PATCH V5] btrfs: preferred_metadata: preferred device for metadata

2021-01-21 Thread Martin Svec
Hi all,

On 20.01.2021 at 0:12, Zygo Blaxell wrote:
> With the 4 device types we can trivially specify this arrangement.
>
> The sorted device lists would be:
>
> Metadata sort order            Data sort order
> metadata only (3)              data only (2)
> metadata preferred (1)         data preferred (0)
> data preferred (0)             metadata preferred (1)
> other devices (2 or other)     other devices (3 or other)
>
> We keep 3 device counts for the first 3 sort orders.  If the number of all
> preferred devices (type != 0) is zero, we just return ndevs; otherwise,
> we pick the first device count that is >= mindevs.  If all the device
> counts are < mindevs then we return the 3rd count (metadata only +
> metadata preferred + data preferred) and the caller will find ENOSPC.
>
> More sophisticated future implementations can alter the sort order, or
> operate in entirely separate parts of btrfs, without conflicting with
> this scheme.  If there is no mount option, then future implementations
> can't conflict with it.

I agree with Zygo and Josef that the mount option is ugly and needless. This 
should be a _per-device_ setting as suggested by Zygo (metadata only, metadata 
preferred, data preferred, data only, unspecified). Maybe in the future it 
might be useful to generalize this setting to something like a 0..255 
priority, but the 4 device types look like a sufficient solution for now. I 
would personally prefer a read-write sysfs option to change the device 
preference, but the btrfs-progs approach is fine for me too.
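
For illustration, a minimal standalone sketch of the device-count selection
rule Zygo describes above (hypothetical names, not code from the RFC patch):

/* counts[] holds the cumulative device counts for the first three entries of
 * the relevant sort order; ndevs is the total number of usable devices and
 * mindevs the minimum required for the chunk allocation. */
static int pick_ndevs(int ndevs, int num_preferred, const int counts[3], int mindevs)
{
    if (num_preferred == 0)
        return ndevs;              /* no device has a preference set: old behavior */

    for (int i = 0; i < 3; i++)
        if (counts[i] >= mindevs)
            return counts[i];      /* first count that can satisfy mindevs */

    return counts[2];              /* nothing fits: the caller will hit ENOSPC */
}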

Anyway, I'm REALLY happy that there's finally a patchset being actively 
discussed. I have maintained a naive patch implementing a 
"preferred_metadata=metadata" option for years, and the impact, e.g. for 
rsync backups, is huge.

Martin




[PATCH] btrfs: Prevent nowait or async read from doing sync IO

2021-01-07 Thread Martin Raiber
When reading from btrfs file via io_uring I get following
call traces:

[<0>] wait_on_page_bit+0x12b/0x270
[<0>] read_extent_buffer_pages+0x2ad/0x360
[<0>] btree_read_extent_buffer_pages+0x97/0x110
[<0>] read_tree_block+0x36/0x60
[<0>] read_block_for_search.isra.0+0x1a9/0x360
[<0>] btrfs_search_slot+0x23d/0x9f0
[<0>] btrfs_lookup_csum+0x75/0x170
[<0>] btrfs_lookup_bio_sums+0x23d/0x630
[<0>] btrfs_submit_data_bio+0x109/0x180
[<0>] submit_one_bio+0x44/0x70
[<0>] extent_readahead+0x37a/0x3a0
[<0>] read_pages+0x8e/0x1f0
[<0>] page_cache_ra_unbounded+0x1aa/0x1f0
[<0>] generic_file_buffered_read+0x3eb/0x830
[<0>] io_iter_do_read+0x1a/0x40
[<0>] io_read+0xde/0x350
[<0>] io_issue_sqe+0x5cd/0xed0
[<0>] __io_queue_sqe+0xf9/0x370
[<0>] io_submit_sqes+0x637/0x910
[<0>] __x64_sys_io_uring_enter+0x22e/0x390
[<0>] do_syscall_64+0x33/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Prevent those by setting IOCB_NOIO before calling
generic_file_buffered_read.

Async read has the same problem. So disable that by removing
FMODE_BUF_RASYNC. This was added with commit
8730f12b7962b21ea9ad2756abce1e205d22db84 ("btrfs: flag files as
supporting buffered async reads") with 5.9. Io_uring will read
the data via worker threads if it can't be read without sync IO
this way.

Signed-off-by: Martin Raiber 
---
 fs/btrfs/file.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0e41459b8..8bb561f6d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3589,7 +3589,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t 
offset, int whence)
 
 static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
-   filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
+   filp->f_mode |= FMODE_NOWAIT;
return generic_file_open(inode, filp);
 }
 
@@ -3639,7 +3639,18 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, 
struct iov_iter *to)
return ret;
}
 
-   return generic_file_buffered_read(iocb, to, ret);
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   iocb->ki_flags |= IOCB_NOIO;
+
+   ret = generic_file_buffered_read(iocb, to, ret);
+
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   iocb->ki_flags &= ~IOCB_NOIO;
+   if (ret == 0)
+   ret = -EAGAIN;
+   }
+
+   return ret;
 }
 
 const struct file_operations btrfs_file_operations = {
-- 
2.30.0



Re: 5.6-5.10 balance regression?

2020-12-29 Thread Martin Steigerwald
Qu Wenruo - 29.12.20, 01:44:07 CET:
> So what I can do is only to add a warning message to the problem.
> 
> To solve your problem, I also submitted a patch to btrfs-progs, to
> force v1 space cache cleaning even if the fs has v2 space cache
> enabled.
> 
> Or, you can disable v2 space cache first, using "btrfs check
> --clear-space-cache v2" first, then "btrfs check --clear-space_cache
> v1", and finally mount the fs with "space_cache=v2" again.
> 
> To verify there is no space cache v1 left, you can run the following
> command to verify:
> 
> # btrfs ins dump-tree -t root  | grep EXTENT_DATA
> 
> It should output nothing.

I have v1 space cache leftovers on filesystems which use the v2 space cache 
as well, so…

is what you wrote above the fully working way to completely switch any BTRFS 
filesystem with a v1 space cache over to space_cache=v2?

Or would it be more straightforward than that with a newer kernel?

Best,
-- 
Martin




Re: syncfs() returns no error on fs failure

2019-07-07 Thread Martin Raiber
On 07.07.2019 14:15 Qu Wenruo wrote:
>
> On 2019/7/6 at 4:28 AM, Martin Raiber wrote:
>> More research on this. Seems a generic error reporting mechanism for
>> this is in the works https://lkml.org/lkml/2018/6/1/640 .
> sync() system call is defined as void sync(void); thus it has no error
> reporting.
>
> syncfs() could report error.
>
>> Wrt. to btrfs one should always use BTRFS_IOC_SYNC because only this one
>> seems to wait for delalloc work to finish:
>> https://patchwork.kernel.org/patch/2927491/ (five year old patch where
>> Filipe Manana added this to BTRFS_IOC_SYNC and with v2->v3 not to
>> syncfs() ).
>>
>> I was smart enough to check if the filesystem is still writable after a
>> syncfs() (so the missing error return doesn't matter that much) but I
>> guess the missing wait for delalloc can cause the application to think
>> data is on disk even though it isn't.
> Isn't syncfs() enough to return error for your use case?
>
> Another solution is fsync(). It's ensured to return error if data
> writeback or metadata update path has something wrong.
> IIRC there are quite some fstests test cases using this way to detect fs
> misbehavior.
>
> Testing if the fs can be written after sync() is not enough in fact.
> If you're doing buffer write, it only covers the buffered write part,
> which normally just includes space preallocation and copy data to page
> cache, doesn't include the data write back nor metadata update.
>
> So I'd recommend to stick to fsync() if you want to make sure your data
> reach disk. This does not only apply to btrfs, but all Linux filesystems.
>
> Thanks,
> Qu

This is for UrBackup (open source backup software). What it does is: create a
btrfs snapshot of the client's last backup, sync the current client fs into
the btrfs snapshot, call syncfs() on the btrfs snapshot, check if the snapshot
is still writable, and then set the backup to complete in its internal
database. Calling fsync() on every file would kill performance (especially on
btrfs).
The problem I had was that there was a backup (complete in the database) that
had files with wrong checksums (UrBackup does its own checksums; the btrfs
ones were okay) and missing files. On the day the corrupted backup completed,
the btrfs went read-only a few hours after the backup completed and after the
syncfs() had been called, with:

[253018.670661] BTRFS: error (device md1) in
btrfs_run_delayed_refs:2950: errno=-5 IO failure

So my guess is that using BTRFS_IOC_SYNC instead of syncfs() fixes the
problem in my case, while it would be nice if syncfs() returned an error when
the fs fails (it doesn't -- I tested it) and waited for everything to be
written to disk (as expected, and as the man page somewhat confirms).
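
For illustration, a minimal sketch of that workaround (the mount point path
is hypothetical; it assumes BTRFS_IOC_SYNC from linux/btrfs.h reports the
failed commit, as described above):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

/* Sync the btrfs filesystem containing 'path' (e.g. the backup snapshot
 * directory) and return non-zero on failure. */
int sync_backup_fs(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    int ret = syncfs(fd);                 /* currently reports no error on fs failure */
    if (ret == 0)
        ret = ioctl(fd, BTRFS_IOC_SYNC);  /* waits for the commit and can return an error */

    close(fd);
    return ret;
}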

>
>> On 05.07.2019 16:22 Martin Raiber wrote:
>>> Hi,
>>>
>>> I realize this isn't a btrfs specific problem but syncfs() returns no
>>> error even on complete fs failure. The problem is (I think) that the
>>> return value of sb->s_op->sync_fs is being ignored in fs/sync.c. I kind
>>> of assumed it would return an error if it fails to write the file system
>>> changes to disk.
>>>
>>> For btrfs there is a work-around of using BTRFS_IOC_SYNC (which I am
>>> going to use now) but that is obviously less user friendly than syncfs().
>>>
>>> Regards,
>>> Martin Raiber
>>



Re: syncfs() returns no error on fs failure

2019-07-05 Thread Martin Raiber
More research on this. Seems a generic error reporting mechanism for
this is in the works https://lkml.org/lkml/2018/6/1/640 .

W.r.t. btrfs, one should always use BTRFS_IOC_SYNC because only it seems to
wait for delalloc work to finish:
https://patchwork.kernel.org/patch/2927491/ (a five-year-old patch where
Filipe Manana added this to BTRFS_IOC_SYNC and, with v2->v3, not to
syncfs()).

I was smart enough to check if the filesystem is still writable after a
syncfs() (so the missing error return doesn't matter that much) but I
guess the missing wait for delalloc can cause the application to think
data is on disk even though it isn't.

On 05.07.2019 16:22 Martin Raiber wrote:
> Hi,
>
> I realize this isn't a btrfs specific problem but syncfs() returns no
> error even on complete fs failure. The problem is (I think) that the
> return value of sb->s_op->sync_fs is being ignored in fs/sync.c. I kind
> of assumed it would return an error if it fails to write the file system
> changes to disk.
>
> For btrfs there is a work-around of using BTRFS_IOC_SYNC (which I am
> going to use now) but that is obviously less user friendly than syncfs().
>
> Regards,
> Martin Raiber




syncfs() returns no error on fs failure

2019-07-05 Thread Martin Raiber
Hi,

I realize this isn't a btrfs specific problem but syncfs() returns no
error even on complete fs failure. The problem is (I think) that the
return value of sb->s_op->sync_fs is being ignored in fs/sync.c. I kind
of assumed it would return an error if it fails to write the file system
changes to disk.

For btrfs there is a work-around of using BTRFS_IOC_SYNC (which I am
going to use now) but that is obviously less user friendly than syncfs().

Regards,
Martin Raiber


Re: Global reserve and ENOSPC while deleting snapshots on 5.0.9 - still happens on 5.1.11

2019-06-24 Thread Martin Raiber
I've fixed the same problem(s) by increasing the global metadata reserve size
as well, though I haven't encountered them since Josef Bacik's block rsv
rework in 5.0.
Another problem with increasing the global metadata reserve is that, as far
as I can tell, it is the only way dirty metadata is throttled. If it is
increased too much (as a percentage of RAM) the system goes OOM depending on
the workload. As far as I can see, dirty metadata isn't included in the
dirty_ratio calculation either, causing issues on that front as well.
Another thing that I think helps is running with "nodatasum" -- probably
because then less metadata needs to be changed when deleting.

On 23.06.2019 16:14 Zygo Blaxell wrote:
> On Tue, Apr 23, 2019 at 07:06:51PM -0400, Zygo Blaxell wrote:
>> I had a test filesystem that ran out of unallocated space, then ran
>> out of metadata space during a snapshot delete, and forced readonly.
>> The workload before the failure was a lot of rsync and bees dedupe
>> combined with random snapshot creates and deletes.
> Had this happen again on a production filesystem, this time on 5.1.11,
> and it happened during orphan inode cleanup instead of snapshot delete:
>
>   [14303.076134][T20882] BTRFS: error (device dm-21) in 
> add_to_free_space_tree:1037: errno=-28 No space left
>   [14303.076144][T20882] BTRFS: error (device dm-21) in 
> __btrfs_free_extent:7196: errno=-28 No space left
>   [14303.076157][T20882] BTRFS: error (device dm-21) in 
> btrfs_run_delayed_refs:3008: errno=-28 No space left
>   [14303.076203][T20882] BTRFS error (device dm-21): Error removing 
> orphan entry, stopping orphan cleanup
>   [14303.076210][T20882] BTRFS error (device dm-21): could not do orphan 
> cleanup -22
>   [14303.076281][T20882] BTRFS error (device dm-21): commit super ret -30
>   [14303.357337][T20882] BTRFS error (device dm-21): open_ctree failed
>
> Same fix:  I bumped the reserved size limit from 512M to 2G and mounted
> normally.  (OK, technically, I booted my old 5.0.21 kernel--but my 5.0.21
> kernel has the 2G reserved space patch below in it.)
>
> I've not been able to repeat this ENOSPC behavior under test conditions
> in the last two months of trying, but it's now happened twice in different
> places, so it has non-zero repeatability.
>
>> I tried the usual fix strategies:
>>
>>  1.  Immediately after mount, try to balance to free space for
>>  metadata
>>
>>  2.  Immediately after mount, add additional disks to provide
>>  unallocated space for metadata
>>
>>  3.  Mount -o nossd to increase metadata density
>>
>> #3 had no effect.  #1 failed consistently.
>>
>> #2 was successful, but the additional space was not used because
>> btrfs couldn't allocate chunks for metadata because it ran out of
>> metadata space for new metadata chunks.
>>
>> When btrfs-cleaner tried to remove the first pending deleted snapshot,
>> it started a transaction that failed due to lack of metadata space.
>> Since the transaction failed, the filesystem reverts to its earlier state,
>> and exactly the same thing happens on the next mount.  The 'btrfs dev
>> add' in #2 is successful only if it is executed immediately after mount,
>> before the btrfs-cleaner thread wakes up.
>>
>> Here's what the kernel said during one of the attempts:
>>
>>  [41263.822252] BTRFS info (device dm-3): use zstd compression, level 0
>>  [41263.825135] BTRFS info (device dm-3): using free space tree
>>  [41263.827319] BTRFS info (device dm-3): has skinny extents
>>  [42046.463356] [ cut here ]
>>  [42046.463387] BTRFS: error (device dm-3) in __btrfs_free_extent:7056: 
>> errno=-28 No space left
>>  [42046.463404] BTRFS: error (device dm-3) in __btrfs_free_extent:7056: 
>> errno=-28 No space left
>>  [42046.463407] BTRFS info (device dm-3): forced readonly
>>  [42046.463414] BTRFS: error (device dm-3) in 
>> btrfs_run_delayed_refs:3011: errno=-28 No space left
>>  [42046.463429] BTRFS: error (device dm-3) in 
>> btrfs_create_pending_block_groups:10517: errno=-28 No space left
>>  [42046.463548] BTRFS: error (device dm-3) in 
>> btrfs_create_pending_block_groups:10520: errno=-28 No space left
>>  [42046.471363] BTRFS: error (device dm-3) in 
>> btrfs_run_delayed_refs:3011: errno=-28 No space left
>>  [42046.471475] BTRFS: error (device dm-3) in 
>> btrfs_create_pending_block_groups:10517: errno=-28 No space left
>>  [42046.471506] BTRFS: error (device dm-3) in 
>> btrfs_create_pending_block_groups:10520: errno=-28 No space left
>>  [42046.473672] BTRFS: error (device dm-3) in btrfs_drop_snapshot:9489: 
>> errno=-28 No space left
>>  [42046.475643] WARNING: CPU: 0 PID: 10187 at 
>> fs/btrfs/extent-tree.c:7056 __btrfs_free_extent+0x364/0xf60
>>  [42046.475645] Modules linked in: mq_deadline bfq dm_cache_smq dm_cache 
>> dm_persistent_data dm_bio_prison dm_bufio joydev ppdev crct10dif_pclmul 
>> crc32_pclmul crc32c_intel ghash_clmulni_intel dm_mod snd_p

Re: Citation Needed: BTRFS Failure Resistance

2019-05-24 Thread Martin Raiber
On 23.05.2019 19:41 Austin S. Hemmelgarn wrote:
> On 2019-05-23 13:31, Martin Raiber wrote:
>> On 23.05.2019 19:13 Austin S. Hemmelgarn wrote:
>>> On 2019-05-23 12:24, Chris Murphy wrote:
>>>> On Thu, May 23, 2019 at 5:19 AM Austin S. Hemmelgarn
>>>>  wrote:
>>>>>
>>>>> On 2019-05-22 14:46, Cerem Cem ASLAN wrote:
>>>>>> Could you confirm or disclaim the following explanation:
>>>>>> https://unix.stackexchange.com/a/520063/65781
>>>>>>
>>>>> Aside from what Hugo mentioned (which is correct), it's worth
>>>>> mentioning
>>>>> that the example listed in the answer of how hardware issues could
>>>>> screw
>>>>> things up assumes that for some reason write barriers aren't honored.
>>>>> BTRFS explicitly requests write barriers to prevent that type of
>>>>> reordering of writes from happening, and it's actually pretty
>>>>> unusual on
>>>>> modern hardware for those write barriers to not be honored unless the
>>>>> user is doing something stupid (like mounting with 'nobarrier' or
>>>>> using
>>>>> LVM with write barrier support disabled).
>>>>
>>>> 'man xfs'
>>>>
>>>>  barrier|nobarrier
>>>>     Note: This option has been deprecated as of kernel
>>>> v4.10; in that version, integrity operations are always performed and
>>>> the mount option is ignored.  These mount options will be removed no
>>>> earlier than kernel v4.15.
>>>>
>>>> Since they're getting rid of it, I wonder if it's sane for most any
>>>> sane file system use case.
>>>>
>>> As Adam mentioned, it's mostly volatile storage that benefits from
>>> this.  For example, on the systems where I have /var/cache configured
>>> as a separate filesystem, I mount it with barriers disabled because
>>> the data there just doesn't matter (all of it can be regenerated
>>> easily) and it gives me a few percent better performance.  In essence,
>>> it's the mostly same type of stuff where you might consider running
>>> ext4 without a journal for performance reasons.
>>>
>>> In the case of XFS, it probably got removed to keep people who fancy
>>> themselves to be power users but really have no clue what they're
>>> doing from shooting themselves in the foot to try and get some more
>>> performance.
>>>
>>> IIRC, the option originally got added to both XFS and ext* because
>>> early write barrier support was a bigger performance hit than it is
>>> today, and BTRFS just kind of inherited it.
>>
>> When I google for it I find that flushing the device can also be
>> disabled via
>>
>> echo "write through" > /sys/block/$device/queue/write_cache
> Disabling write caching (which is what that does) is not really the
> same as mounting with 'nobarrier'.  Write caching actually improves
> performance in most cases, it just makes things a bit riskier because
> of the possibility of write reordering (which barriers prevent).

According to the documentation it doesn't change any caching. It changes how
the kernel sees what kind of caching the device does. If the device claims it
does "write through" caching (e.g. a battery-backed RAID card), the kernel
doesn't need to send device cache flushes; otherwise it does. If you set a
device that reports "write back" there to "write through", the kernel will
think it does not require flushes and won't send any, thus causing data loss
on power loss (because the device obviously still does write-back caching).

>>
>> I actually used nobarrier recently (albeit with ext4), because a steam
>> download was taking forever (hours), when remounting with nobarrier it
>> went down to minutes (next time I started it with eatmydata). But ext4
>> fsck is probably able to recover nobarrier file systems with unfortunate
>> powerlosses and btrfs fsck... isn't. So combined with the above I'd
>> remove nobarrier.
>>
> Yeah, Steam is another pathological case actually, though that's
> mostly because their distribution format is generously described as
> 'excessively segmented' and they fsync after _every single file_.  If
> you ever use Steam's game backup feature, you'll see similar results
> because it actually serializes the data to the same format that is
> used when downloading the game in the first place.




Re: Citation Needed: BTRFS Failure Resistance

2019-05-23 Thread Martin Raiber
On 23.05.2019 19:13 Austin S. Hemmelgarn wrote:
> On 2019-05-23 12:24, Chris Murphy wrote:
>> On Thu, May 23, 2019 at 5:19 AM Austin S. Hemmelgarn
>>  wrote:
>>>
>>> On 2019-05-22 14:46, Cerem Cem ASLAN wrote:
 Could you confirm or disclaim the following explanation:
 https://unix.stackexchange.com/a/520063/65781

>>> Aside from what Hugo mentioned (which is correct), it's worth
>>> mentioning
>>> that the example listed in the answer of how hardware issues could
>>> screw
>>> things up assumes that for some reason write barriers aren't honored.
>>> BTRFS explicitly requests write barriers to prevent that type of
>>> reordering of writes from happening, and it's actually pretty
>>> unusual on
>>> modern hardware for those write barriers to not be honored unless the
>>> user is doing something stupid (like mounting with 'nobarrier' or using
>>> LVM with write barrier support disabled).
>>
>> 'man xfs'
>>
>>     barrier|nobarrier
>>    Note: This option has been deprecated as of kernel
>> v4.10; in that version, integrity operations are always performed and
>> the mount option is ignored.  These mount options will be removed no
>> earlier than kernel v4.15.
>>
>> Since they're getting rid of it, I wonder if it's sane for most any
>> sane file system use case.
>>
> As Adam mentioned, it's mostly volatile storage that benefits from
> this.  For example, on the systems where I have /var/cache configured
> as a separate filesystem, I mount it with barriers disabled because
> the data there just doesn't matter (all of it can be regenerated
> easily) and it gives me a few percent better performance.  In essence,
> it's mostly the same type of stuff where you might consider running
> ext4 without a journal for performance reasons.
>
> In the case of XFS, it probably got removed to keep people who fancy
> themselves to be power users but really have no clue what they're
> doing from shooting themselves in the foot to try and get some more
> performance.
>
> IIRC, the option originally got added to both XFS and ext* because
> early write barrier support was a bigger performance hit than it is
> today, and BTRFS just kind of inherited it.

When I google for it I find that flushing the device can also be
disabled via

echo "write through" > /sys/block/$device/queue/write_cache

I actually used nobarrier recently (albeit with ext4), because a Steam
download was taking forever (hours); after remounting with nobarrier it
went down to minutes (next time I started it with eatmydata). But ext4
fsck is probably able to recover nobarrier file systems after unfortunate
power losses, and btrfs fsck... isn't. So combined with the above I'd
remove nobarrier.
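
For reference, roughly what I did back then (mount point hypothetical, ext4 in
my case; only do this for data you can afford to lose):

  mount -o remount,nobarrier /mnt/steam
  # ... let the download run ...
  mount -o remount,barrier /mnt/steam
  # or leave the mount alone and just neuter fsync for the one program:
  eatmydata steam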



Re: backup uuid_tree generation not consistent across multi device (raid0) btrfs - won´t mount

2019-03-27 Thread Martin Raiber
On 26.03.2019 14:37 Qu Wenruo wrote:
> On 2019/3/26 6:24 PM, berodual_xyz wrote:
>> Mount messages below.
>>
>> Thanks for your input, Qu!
>>
>> ##
>> [42763.884134] BTRFS info (device sdd): disabling free space tree
>> [42763.884138] BTRFS info (device sdd): force clearing of disk cache
>> [42763.884140] BTRFS info (device sdd): has skinny extents
>> [42763.885207] BTRFS error (device sdd): parent transid verify failed on 
>> 1048576 wanted 60234 found 60230
> So btrfs is using the latest superblock while the good one should be the
> old superblock.
>
> Btrfs-progs is able to just ignore the transid mismatch, but the kernel
> doesn't and shouldn't.
>
> In fact we should allow btrfs rescue super to use super blocks from
> other devices to replace the old one.
>
> So my patch won't help at all, the failure happens at the very beginning
> of the devices list initialization.
>
> BTW, if btrfs restore can't recover certain files, I don't believe any
> rescue kernel mount option can do more.
>
> Thanks,
> Qu

I have made btrfs limp along (till a rebuild) in the past by commenting
out/removing the transid checks. Obviously you should still mount it
read-only (and with no log replay) and it might crash, but there is a
small chance this would work.
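
Before any of that, a read-only mount without log replay is worth a try; a
sketch using the device names from this thread (adjust as needed, and
usebackuproot can be added as a further attempt):

  mount -o ro,nologreplay,device=/dev/sda,device=/dev/sdb,device=/dev/sdd,device=/dev/sde /dev/sdd /mnt/recover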

>
>> [42763.885263] BTRFS error (device sdd): failed to read chunk root
>> [42763.900922] BTRFS error (device sdd): open_ctree failed
>> ##
>>
>>
>>
>>
>> Sent with ProtonMail Secure Email.
>>
>> ‐‐‐ Original Message ‐‐‐
>> On Tuesday, 26. March 2019 10:21, Qu Wenruo  wrote:
>>
>>> On 2019/3/26 4:52 PM, berodual_xyz wrote:
>>>
 Thank you both for your input.
 see below.

>> You sda and sdb are at gen 60233 while sdd and sde are at gen 60234.
>> It's possible to allow kernel to manually assemble its device list using
>> "device=" mount option.
>> Since you're using RAID6, it's possible to recover using 2 devices only,
>> but in that case you need "degraded" mount option.
> He has btrfs raid0 profile on top of hardware RAID6 devices.
 Correct, my FS is a "raid0" across four hardware-raid based raid6 devices. 
 The underlying devices of the raid controller are fine, same as the 
 volumes themselves.
>>> Then there is not much we can do.
>>>
>>> The super blocks show all your 4 devices are in 2 different states
>>> (older generation with dirty log, newer generation without log).
>>>
>>> This means some writes didn't reach all devices.
>>>
 Only corruption seems to be on the btrfs side.
>>> Please provide the kernel message when trying to mount the fs.
>>>
 Does your tip regarding mounting by explicitly specifying the devices 
 still make sense?
>>> Not really. For RAID0 case, it doesn't make much sense.
>>>
 Will this figure out automatically which generation to use?
>>> You could try, as all the mount options make btrfs completely RO (no
>>> log replay), so it should be pretty safe.
>>>
 I am at the moment in the process of using "btrfs restore" to pull more 
 data from the filesystem without making any further changes.
 After that I am happy to continue testing, and will happily test your 
 mentioned "skip_bg" patch - but if you think that there is some other way 
 to mount (just for recovery purpose - read only is fine!) while having 
 different gens on the devices, I highly appreciate it.
>>> With mounting failure dmesg, it should be pretty easy to determine
>>> whether my skip_bg will work.
>>>
>>> Thanks,
>>> Qu
>>>
 Thanks Qu and Andrei!




Re: psa, wiki needs updating now that Btrfs supports swapfiles in 5.0

2019-03-14 Thread Martin Raiber
On 14.03.2019 23:20 Chris Murphy wrote:
> If you install btrfs-progs 4.20+ you'll see the documentation for
> supporting swapfiles on Btrfs, supported in kernel 5.0+. `man 5 btrfs`
>
> Anyone with access to the wiki should update the FAQ
> https://btrfs.wiki.kernel.org/index.php/FAQ#Does_btrfs_support_swap_files.3F

Yeah, and remove that tip about swap file via loop device. That will
only cause memory allocation lock-ups and is not advisable.
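
For kernel 5.0+ the supported way is, as far as I understand the
documentation, a fully allocated NOCOW (and therefore uncompressed) file — a
sketch, size and path hypothetical:

  truncate -s 0 /swapfile
  chattr +C /swapfile        # NOCOW, which also keeps compression away
  fallocate -l 8G /swapfile
  chmod 600 /swapfile
  mkswap /swapfile
  swapon /swapfile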



Allow sending of rw-subvols if file system is mounted ro

2019-03-12 Thread Martin Raiber
Hi,

I know there are corner cases that probably make this difficult (such as
remounting the file system rw while a send is in progress), but it would
be nice if one could send all subvolumes as long as a file system is
mounted read-only (pretend every subvol is read-only if the file system
is mounted read-only).

Background/use case:

Through no fault of btrfs, metadata got damaged, which makes the file
system go read-only after a while, and I'd like to btrfs send/receive the
subvolumes and snapshots that are still readable to another btrfs file
system (btrfs send/receive being the only option that does this somewhat
efficiently). However, I cannot send the subvolumes that were not set to
read-only before the file system went read-only.
I patched the kernel and btrfs-tools (just commenting the checks out) to
support this in my case, but it would be great if this would be possible
without patching.
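
For context, the workflow I am after is just the normal one, only with the
source file system mounted read-only (paths hypothetical):

  # normally: btrfs property set -ts /mnt/broken/subvol ro true
  # but that fails once the file system itself has gone read-only
  btrfs send /mnt/broken/subvol | btrfs receive /mnt/new/
  # incremental variant against an existing snapshot on the target:
  btrfs send -p /mnt/broken/snap.old /mnt/broken/snap.new | btrfs receive /mnt/new/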

Regards,
Martin Raiber



Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

2019-02-22 Thread Martin K. Petersen


Roman,

>> Consequently, many of the modern devices that claim to support
>> discard to make us software folks happy (or to satisfy a purchase
>> order requirements) complete the commands without doing anything at
>> all.  We're simply wasting queue slots.
>
> Any example of such devices? Let alone "many"? Where you would issue a
> full-device blkdiscard, but then just read back old data.

I obviously can't mention names or go into implementation details. But
there are many drives out there that return old data. And that's
perfectly within spec.
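
Anyone can test a given drive for this; a destructive sketch (it overwrites
the start of /dev/sdX, device name hypothetical):

  dd if=/dev/urandom of=/tmp/pattern bs=1M count=16
  dd if=/tmp/pattern of=/dev/sdX bs=1M oflag=direct
  blkdiscard --offset 0 --length 16MiB /dev/sdX
  dd if=/dev/sdX of=/tmp/readback bs=1M count=16 iflag=direct
  cmp /tmp/pattern /tmp/readback && echo "discarded range still returns old data"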

At least some of the pain in the industry in this department can be
attributed to us Linux folks and RAID device vendors. We all wanted
deterministic zeroes on completion of DSM TRIM, UNMAP, or DEALLOCATE.
The device vendors weren't happy about that and we ended up with weasel
language in the specs. This led to the current libata whitelist mess
for SATA SSDs and ongoing vendor implementation confusion in SCSI and
NVMe devices.

On the Linux side the problem was that we originally used discard for
two distinct purposes: Clearing block ranges and deallocating block
ranges. We cleaned that up a while back and now have BLKZEROOUT and
BLKDISCARD. Those operations get translated to different operations
depending on the device. We also cleaned up several of the
inconsistencies in the SCSI and NVMe specs to facilitate making this
distinction possible in the kernel.
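
From userspace the two map onto different calls, e.g. via blkdiscard (device
and range hypothetical):

  blkdiscard --offset 0 --length 1GiB /dev/sdX      # BLKDISCARD: deallocate, content afterwards undefined
  blkdiscard -z --offset 0 --length 1GiB /dev/sdX   # BLKZEROOUT: range must read back as zeroes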

In the meantime the SSD vendors made great strides in refining their
flash management. To the point where pretty much all enterprise device
vendors will ask you not to issue discards. The benefits simply do not
outweigh the costs.

If you have special workloads where write amplification is a major
concern it may still be advantageous to do the discards and reduce WA
and prolong drive life. However, these workloads are increasingly moving
away from the classic LBA read/write model. Open Channel originally
targeted this space. Right now work is underway on Zoned Namespaces and
Key-Value command sets in NVMe.

These curated application workload protocols are fundamental departures
from the traditional way of accessing storage. And my postulate is that
where tail latency and drive lifetime management is important, those new
command sets offer much better bang for the buck. And they make the
notion of discard completely moot. That's why I don't think it's going
to be terribly important in the long term.

This leaves consumer devices and enterprise devices using the
traditional LBA I/O model.

For consumer devices I still think fstrim is a good compromise. Lack of
queuing for DSM hurt us for a long time. And when it was finally added
to the ATA command set, many device vendors got their implementations
wrong. So it sucked for a lot longer than it should have. And of course
FTL implementations differ.
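
In practice that means a periodic batched trim rather than the discard mount
option, e.g.:

  fstrim -av                            # trim all mounted filesystems that support it
  systemctl enable --now fstrim.timer   # weekly timer, where the distribution ships it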

For enterprise devices we're still in the situation where vendors
generally prefer for us not to use discard. I would love for the
DEALLOCATE/WRITE ZEROES mess to be sorted out in their FTLs, but I have
fairly low confidence that it's going to happen. Case in point: Despite
a lot of leverage and purchasing power, the cloud industry has not been
terribly successful in compelling the drive manufacturers to make
DEALLOCATE perform well for typical application workloads. So I'm not
holding my breath...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

2019-02-21 Thread Martin K. Petersen


Jeff,

> We've always been told "don't worry about what the internal block size
> is, that only matters to the FTL."  That's obviously not true, but
> when devices only report a 512 byte granularity, we believe them and
> will issue discard for the smallest size that makes sense for the file
> system regardless of whether it makes sense (internally) for the SSD.
> That means 4k for pretty much anything except btrfs metadata nodes,
> which are 16k.

The devices are free to report a bigger discard granularity. We already
support and honor that (for SCSI, anyway). It's completely orthogonal to
reported the logical block size, although it obviously needs to be a
multiple.

The real problem is that vendors have zero interest in optimizing for
discard. They are so confident in their FTL and overprovisioning that
they don't view it as an important feature. At all.

Consequently, many of the modern devices that claim to support discard
to make us software folks happy (or to satisfy purchase order
requirements) complete the commands without doing anything at all.
We're simply wasting queue slots.

Personally, I think discard is dead on anything but the cheapest
devices.  And on those it is probably going to be
performance-prohibitive to use it in any other way than a weekly fstrim.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [LSF/MM TOPIC] More async operations for file systems - async discard?

2019-02-21 Thread Martin K. Petersen


Keith,

> With respect to fs block sizes, one thing making discards suck is that
> many high capacity SSDs' physical page sizes are larger than the fs
> block size, and a sub-page discard is worse than doing nothing.

That ties into the whole zeroing as a side-effect thing.

The devices really need to distinguish between discard-as-a-hint where
it is free to ignore anything that's not a whole multiple of whatever
the internal granularity is, and the WRITE ZEROES use case where the end
result needs to be deterministic.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: Btrfs corruption: Cannot mount partition

2019-02-17 Thread Martin Pöhlmann
DS:  ES:  CR0: 80050033
[   51.357362] CR2: 7f252cd16af0 CR3: 00020140a006 CR4: 003606e0
[   51.357362] Call Trace:
[   51.357364]  ? _raw_spin_lock+0x13/0x30
[   51.357365]  ? _raw_spin_unlock+0x16/0x30
[   51.357376]  ? btrfs_merge_delayed_refs+0x315/0x350 [btrfs]
[   51.357401]  __btrfs_run_delayed_refs+0x6f2/0x10e0 [btrfs]
[   51.357403]  ? preempt_count_add+0x79/0xb0
[   51.357411]  btrfs_run_delayed_refs+0x64/0x180 [btrfs]
[   51.357418]  delayed_ref_async_start+0x81/0x90 [btrfs]
[   51.357428]  normal_work_helper+0xbd/0x350 [btrfs]
[   51.357430]  process_one_work+0x1eb/0x410
[   51.357432]  worker_thread+0x2d/0x3d0
[   51.357433]  ? process_one_work+0x410/0x410
[   51.357434]  kthread+0x112/0x130
[   51.357435]  ? kthread_park+0x80/0x80
[   51.357437]  ret_from_fork+0x35/0x40
[   51.357438] ---[ end trace 0be7e900e0369796 ]---
[   51.357439] BTRFS: error (device dm-0) in __btrfs_free_extent:6828:
errno=-2 No such entry
[   51.357441] BTRFS info (device dm-0): forced readonly
[   51.357442] BTRFS: error (device dm-0) in
btrfs_run_delayed_refs:2978: errno=-2 No such entry

On Sun, Feb 17, 2019 at 5:27 PM Martin Pöhlmann  wrote:
>
> Tried zero-log. After reboot the system booted again. But all
> sub-volumes are mounted read-only.
>
> This should be the relevant dmesg excerpt (note to last lines, there
> it mentions forced to ro mode)
>
> [   51.356769] WARNING: CPU: 3 PID: 54 at fs/btrfs/extent-tree.c:6822
> __btrfs_free_extent.isra.25+0x61e/0x940 [btrfs]
> [   51.356770] Modules linked in: isofs thunderbolt ccm rfcomm fuse
> cmac snd_hda_codec_hdmi bnep snd_hda_codec_realtek
> snd_hda_codec_generic hid_multitouch joydev arc4 iTCO_wdt
> iTCO_vendor_support nls_iso8859_1 nls_cp437 vfat fat uvcvideo btusb
> btrtl videobuf2_vmalloc btbcm videobuf2_memops videobuf2_v4l2 btintel
> videobuf2_common ath10k_pci bluetooth ath10k_core videodev mousedev
> i915 intel_rapl ath snd_soc_skl snd_soc_hdac_hda ecdh_generic
> x86_pkg_temp_thermal intel_powerclamp snd_hda_ext_core media crc16
> coretemp snd_soc_skl_ipc mac80211 uas snd_soc_sst_ipc kvm_intel
> snd_soc_sst_dsp snd_soc_acpi_intel_match kvmgt snd_soc_acpi vfio_mdev
> mdev mei_wdt dell_laptop vfio_iommu_type1 dell_wmi wmi_bmof
> snd_soc_core vfio intel_wmi_thunderbolt dell_smbios
> dell_wmi_descriptor snd_compress i2c_algo_bit dcdbas kvm ac97_bus
> snd_pcm_dmaengine drm_kms_helper snd_hda_intel snd_hda_codec cfg80211
> irqbypass intel_cstate intel_uncore snd_hda_core snd_hwdep input_leds
> snd_pcm intel_rapl_perf drm snd_timer
> [   51.356792]  psmouse rtsx_pci_ms pcspkr memstick idma64 rfkill snd
> intel_gtt mei_me processor_thermal_device agpgart intel_soc_dts_iosf
> soundcore mei syscopyarea sysfillrect i2c_i801 sysimgblt fb_sys_fops
> intel_lpss_pci intel_lpss intel_pch_thermal ucsi_acpi tpm_crb
> typec_ucsi wmi typec i2c_hid battery soc_button_array intel_vbtn
> tpm_tis tpm_tis_core tpm int3403_thermal evdev int340x_thermal_zone
> mac_hid intel_hid rng_core ac int3400_thermal acpi_thermal_rel
> sparse_keymap pcc_cpufreq vboxnetflt(OE) vboxnetadp(OE) vboxpci(OE)
> vboxdrv(OE) sg crypto_user ip_tables x_tables btrfs libcrc32c
> crc32c_generic xor raid6_pq algif_skcipher af_alg sd_mod usb_storage
> scsi_mod hid_generic usbhid hid dm_crypt dm_mod crct10dif_pclmul
> crc32_pclmul crc32c_intel ghash_clmulni_intel rtsx_pci_sdmmc mmc_core
> serio_raw atkbd libps2 aesni_intel aes_x86_64 xhci_pci crypto_simd
> cryptd glue_helper xhci_hcd rtsx_pci i8042 serio
> [   51.356817] CPU: 3 PID: 54 Comm: kworker/u8:1 Tainted: G U
> OE 4.20.6-arch1-1-ARCH #1
> [   51.356817] Hardware name: Dell Inc. XPS 13 9360/0PF86Y, BIOS 2.1.0
> 08/02/2017
> [   51.356830] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
> [   51.356839] RIP: 0010:__btrfs_free_extent.isra.25+0x61e/0x940 [btrfs]
> [   51.356840] Code: b8 00 00 00 48 8b 7c 24 08 e8 ae 1a ff ff 41 89
> c5 58 c6 44 24 2c 00 45 85 ed 0f 84 f2 fa ff ff 41 83 fd fe 0f 85 e1
> fb ff ff <0f> 0b 49 8b 3c 24 e8 87 32 00 00 49 89 d9 4d 89 f8 4c 89 f1
> ff b4
> [   51.356841] RSP: 0018:acf5c1b37c38 EFLAGS: 00010246
> [   51.356842] RAX: fffe RBX:  RCX: 
> 
> [   51.356842] RDX: fffe RSI:  RDI: 
> 9b03dbb6f3b0
> [   51.356843] RBP: 005ae1ce8000 R08:  R09: 
> 009b
> [   51.356844] R10: 003c R11:  R12: 
> 9b03da498ee0
> [   51.356844] R13: fffe R14:  R15: 
> 0002
> [   51.356845] FS:  () GS:9b04ae38()
> knlGS:
> [   51.356846] CS:  0010 DS:  ES:  CR0: 80050033
> [   51.356847] CR2: 7f252cd16af0 CR3: 00020140a006 CR4: 
> 003606e0
> [

Re: Btrfs corruption: Cannot mount partition

2019-02-17 Thread Martin Pöhlmann
ror -2)
[   51.357304] WARNING: CPU: 3 PID: 54 at fs/btrfs/extent-tree.c:6828
__btrfs_free_extent.isra.25+0x67b/0x940 [btrfs]
[   51.357304] Modules linked in: isofs thunderbolt ccm rfcomm fuse
cmac snd_hda_codec_hdmi bnep snd_hda_codec_realtek
snd_hda_codec_generic hid_multitouch joydev arc4 iTCO_wdt
iTCO_vendor_support nls_iso8859_1 nls_cp437 vfat fat uvcvideo btusb
btrtl videobuf2_vmalloc btbcm videobuf2_memops videobuf2_v4l2 btintel
videobuf2_common ath10k_pci bluetooth ath10k_core videodev mousedev
i915 intel_rapl ath snd_soc_skl snd_soc_hdac_hda ecdh_generic
x86_pkg_temp_thermal intel_powerclamp snd_hda_ext_core media crc16
coretemp snd_soc_skl_ipc mac80211 uas snd_soc_sst_ipc kvm_intel
snd_soc_sst_dsp snd_soc_acpi_intel_match kvmgt snd_soc_acpi vfio_mdev
mdev mei_wdt dell_laptop vfio_iommu_type1 dell_wmi wmi_bmof
snd_soc_core vfio intel_wmi_thunderbolt dell_smbios
dell_wmi_descriptor snd_compress i2c_algo_bit dcdbas kvm ac97_bus
snd_pcm_dmaengine drm_kms_helper snd_hda_intel snd_hda_codec cfg80211
irqbypass intel_cstate intel_uncore snd_hda_core snd_hwdep input_leds
snd_pcm intel_rapl_perf drm snd_timer
[   51.357319]  psmouse rtsx_pci_ms pcspkr memstick idma64 rfkill snd
intel_gtt mei_me processor_thermal_device agpgart intel_soc_dts_iosf
soundcore mei syscopyarea sysfillrect i2c_i801 sysimgblt fb_sys_fops
intel_lpss_pci intel_lpss intel_pch_thermal ucsi_acpi tpm_crb
typec_ucsi wmi typec i2c_hid battery soc_button_array intel_vbtn
tpm_tis tpm_tis_core tpm int3403_thermal evdev int340x_thermal_zone
mac_hid intel_hid rng_core ac int3400_thermal acpi_thermal_rel
sparse_keymap pcc_cpufreq vboxnetflt(OE) vboxnetadp(OE) vboxpci(OE)
vboxdrv(OE) sg crypto_user ip_tables x_tables btrfs libcrc32c
crc32c_generic xor raid6_pq algif_skcipher af_alg sd_mod usb_storage
scsi_mod hid_generic usbhid hid dm_crypt dm_mod crct10dif_pclmul
crc32_pclmul crc32c_intel ghash_clmulni_intel rtsx_pci_sdmmc mmc_core
serio_raw atkbd libps2 aesni_intel aes_x86_64 xhci_pci crypto_simd
cryptd glue_helper xhci_hcd rtsx_pci i8042 serio
[   51.357335] CPU: 3 PID: 54 Comm: kworker/u8:1 Tainted: G U  W
OE 4.20.6-arch1-1-ARCH #1
[   51.357335] Hardware name: Dell Inc. XPS 13 9360/0PF86Y, BIOS 2.1.0
08/02/2017
[   51.357347] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
[   51.357355] RIP: 0010:__btrfs_free_extent.isra.25+0x67b/0x940 [btrfs]
[   51.357356] Code: 08 48 8b 40 50 f0 48 0f ba a8 90 12 00 00 02 0f
92 c0 5f 84 c0 0f 85 cc 0f 09 00 44 89 ee 48 c7 c7 70 82 35 c0 e8 af
7e 5d df <0f> 0b e9 b6 0f 09 00 4c 89 e7 e8 a6 7d fe ff 48 8b 3c 24 4d
89 f8
[   51.357356] RSP: 0018:acf5c1b37c38 EFLAGS: 00010282
[   51.357357] RAX:  RBX:  RCX: 
[   51.357358] RDX: 0007 RSI: a08a427e RDI: 
[   51.357358] RBP: 005ae1ce8000 R08: 0001 R09: 06b2
[   51.357359] R10: 0004 R11:  R12: 9b03da498ee0
[   51.357359] R13: fffe R14:  R15: 0002
[   51.357360] FS:  () GS:9b04ae38()
knlGS:
[   51.357361] CS:  0010 DS:  ES:  CR0: 80050033
[   51.357362] CR2: 7f252cd16af0 CR3: 00020140a006 CR4: 003606e0
[   51.357362] Call Trace:
[   51.357364]  ? _raw_spin_lock+0x13/0x30
[   51.357365]  ? _raw_spin_unlock+0x16/0x30
[   51.357376]  ? btrfs_merge_delayed_refs+0x315/0x350 [btrfs]
[   51.357401]  __btrfs_run_delayed_refs+0x6f2/0x10e0 [btrfs]
[   51.357403]  ? preempt_count_add+0x79/0xb0
[   51.357411]  btrfs_run_delayed_refs+0x64/0x180 [btrfs]
[   51.357418]  delayed_ref_async_start+0x81/0x90 [btrfs]
[   51.357428]  normal_work_helper+0xbd/0x350 [btrfs]
[   51.357430]  process_one_work+0x1eb/0x410
[   51.357432]  worker_thread+0x2d/0x3d0
[   51.357433]  ? process_one_work+0x410/0x410
[   51.357434]  kthread+0x112/0x130
[   51.357435]  ? kthread_park+0x80/0x80
[   51.357437]  ret_from_fork+0x35/0x40
[   51.357438] ---[ end trace 0be7e900e0369796 ]---
[   51.357439] BTRFS: error (device dm-0) in __btrfs_free_extent:6828:
errno=-2 No such entry
[   51.357441] BTRFS info (device dm-0): forced readonly
[   51.357442] BTRFS: error (device dm-0) in
btrfs_run_delayed_refs:2978: errno=-2 No such entry

On Sat, Feb 16, 2019 at 9:46 PM Martin Pöhlmann  wrote:
>
> Thanks a lot for your help.
>
> @Qu Wenruo: Will zero log after completing the backup
> @Chris Murphy: First of all, mount -ro,nologreplay works.
>
> dump-tree displays two items:
>
> # btrfs insp dump-tree -b 88560877568 --follow /dev/mapper/cryptroot
> btrfs-progs v4.19.1
> leaf 88560877568 items 2 free space 15355 generation 554510 owner TREE_LOG
> leaf 88560877568 flags 0x1(WRITTEN) backref revision 1
> fs uuid bbd941a4-5525-4ba6-a4d8-3ead02b8aae1
> chunk uuid 25cacaa1-59ec-4c71-92e0-4b31f7937521
> item 0 key (TREE_LOG ROOT_ITEM 258) itemoff 15844 itemsize 439
> generation 554510

Re: Btrfs corruption: Cannot mount partition

2019-02-16 Thread Martin Pöhlmann
Thanks a lot for your help.

@Qu Wenruo: Will zero log after completing the backup
@Chris Murphy: First of all, mount -ro,nologreplay works.

dump-tree displays two items:

# btrfs insp dump-tree -b 88560877568 --follow /dev/mapper/cryptroot
btrfs-progs v4.19.1
leaf 88560877568 items 2 free space 15355 generation 554510 owner TREE_LOG
leaf 88560877568 flags 0x1(WRITTEN) backref revision 1
fs uuid bbd941a4-5525-4ba6-a4d8-3ead02b8aae1
chunk uuid 25cacaa1-59ec-4c71-92e0-4b31f7937521
item 0 key (TREE_LOG ROOT_ITEM 258) itemoff 15844 itemsize 439
generation 554510 root_dirid 0 bytenr 88560812032 level 1 refs 0
lastsnap 0 byte_limit 0 bytes_used 376832 flags 0x0(none)
uuid ----
drop key (0 UNKNOWN.0 0) level 0
item 1 key (TREE_LOG ROOT_ITEM 259) itemoff 15405 itemsize 439
generation 554510 root_dirid 0 bytenr 917389312 level 0 refs 0
lastsnap 0 byte_limit 0 bytes_used 0 flags 0x0(none)
uuid ----
drop key (0 UNKNOWN.0 0) level 0


Regards 2nd mail:

1. as mentioned, mount with nologreplay works. Will update backups with that.
2. Used btrfs restore already for initial backup. Did a good job.
3. Have to figure out how to get a usb-bootable recovery system w/ 5.0rc6 first.

On Sat, Feb 16, 2019 at 1:54 AM Qu Wenruo  wrote:
>
>
>
> > On 2019/2/16 5:31 AM, Martin Pöhlmann wrote:
> > Hello,
> >
> > After a reboot I am lost with an unmountable BTRFS partition. Before
> > reboot I had first compile problems with freezing IntelliJ. These
> > persisted after a first reboot, after a second reboot I am faced with
> > the following error after entering the dm-crypt password (also after
> > manual mount with -o ro,recovery, see attached dmesg):
>
> [Move check result here]
> > # btrfs check --readonly /dev/mapper/cryptroot
> > [1/7] checking root items
> > [2/7] checking extents
> > [3/7] checking free space cache
> > [4/7] checking fs roots
> > root 258 inode 776 errors 200, dir isize wrong
> > root 258 inode 1131031 errors 1, no inode item
> > unresolved ref dir 776 index 87215 namelen 17 name
> > TransportSecurity filetype 1 errors 5, no dir item, no inode ref
> > root 258 inode 2911226 errors 1, no inode item
> > unresolved ref dir 776 index 160611 namelen 17 name
> > TransportSecurity filetype 1 errors 5, no dir item, no inode ref
> > ERROR: errors found in fs roots
> > Opening filesystem to check...
> > Checking filesystem on /dev/mapper/cryptroot
> > UUID: bbd941a4-5525-4ba6-a4d8-3ead02b8aae1
> > found 409699909636 bytes used, error(s) found
> > total csum bytes: 390595732
> > total tree bytes: 5061541888
> > total fs tree bytes: 4224024576
> > total extent tree bytes: 339312640
> > btree space waste bytes: 892618468
> > file data blocks allocated: 529336496128
> >  referenced 490479570944
> >
> So there is just some minor problem in fs trees, not a big problem, and
> your extent tree passes the check, so it's not on-disk data corruption.
>
> >
> > [ 6098.921985] BTRFS error (device dm-0): unable to find ref byte nr
> > 390335463424 parent 0 root 2
> > [ 6098.922473] BTRFS: error (device dm-0) in __btrfs_free_extent:6828:
> > errno=-2 No such entry
> > [ 6098.922526] BTRFS: error (device dm-0) in
> > btrfs_run_delayed_refs:2978: errno=-2 No such entry
> > [ 6098.922601] BTRFS: error (device dm-0) in btrfs_replay_log:2267:
> > errno=-2 No such entry (Failed to recover log tree)
> > [ 6098.972326] BTRFS error (device dm-0): open_ctree failed
>
> It's log recovery causing the problem.
>
> You could just use "btrfs rescue zero-log" to recover it.
>
> Thanks,
> Qu
>
> >
> > I've searched for a solution on the web, but most articles tell to do
> > nothing, but write to this mailing list. So my hopes are that you can
> > shed some light into what I can do.
> >
> > I've found a quite recent thread here
> > (https://lore.kernel.org/linux-btrfs/5b0d2e94-6e4e-aecd-3eda-459c4a96b...@mokrynskyi.com/)
> > but this just mentions a fix for 'Fix missing reference aborts when
> > resuming snapshot delete' and is not further specific.
> >
> > Setup of my SSD looks like:
> >
> > * efi
> > * dm-crypt plain. Contains BTRFS (w/o lvm or similar). Several
> > subvolumes (/, /home, ...)
> > * swap
> >
> > I've already run btrfs restore on volid 258 (home) and gathered lots
> > of data from the disk (>200GB). I also have a dd backup of the
> > cryptroot after the failure happened (in case something goes wrong).
> > Besides I did not do any fix attempts yet. If there is anything I can
> > 

Btrfs corruption: Cannot mount partition

2019-02-15 Thread Martin Pöhlmann
Hello,

After a reboot I am lost with an unmountable BTRFS partition. Before the
reboot I first had compile problems and a freezing IntelliJ. These
persisted after a first reboot; after a second reboot I am faced with
the following error after entering the dm-crypt password (also after a
manual mount with -o ro,recovery, see attached dmesg):

[ 6098.921985] BTRFS error (device dm-0): unable to find ref byte nr
390335463424 parent 0 root 2
[ 6098.922473] BTRFS: error (device dm-0) in __btrfs_free_extent:6828:
errno=-2 No such entry
[ 6098.922526] BTRFS: error (device dm-0) in
btrfs_run_delayed_refs:2978: errno=-2 No such entry
[ 6098.922601] BTRFS: error (device dm-0) in btrfs_replay_log:2267:
errno=-2 No such entry (Failed to recover log tree)
[ 6098.972326] BTRFS error (device dm-0): open_ctree failed

I've searched for a solution on the web, but most articles say to do
nothing except write to this mailing list. So my hope is that you can
shed some light on what I can do.

I've found a quite recent thread here
(https://lore.kernel.org/linux-btrfs/5b0d2e94-6e4e-aecd-3eda-459c4a96b...@mokrynskyi.com/)
but this just mentions a fix for 'Fix missing reference aborts when
resuming snapshot delete' and is not further specific.

Setup of my SSD looks like:

* efi
* dm-crypt plain. Contains BTRFS (w/o lvm or similar). Several
subvolumes (/, /home, ...)
* swap

I've already run btrfs restore on volid 258 (home) and gathered lots
of data from the disk (>200GB). I also have a dd backup of the
cryptroot after the failure happened (in case something goes wrong).
Besides I did not do any fix attempts yet. If there is anything I can
do to get the system working again, I'm happy to hear.
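
For completeness, roughly what I have been running so far (target paths
hypothetical):

  # raw image of the opened dm-crypt device, kept aside in case a repair goes wrong
  dd if=/dev/mapper/cryptroot of=/backup/cryptroot.img bs=4M status=progress
  # read-only file extraction from subvolume id 258 (home), no writes to the damaged fs
  btrfs restore -v -r 258 /dev/mapper/cryptroot /backup/restore-home/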

Thanks!

My Linux system is Arch Linux (up to date), logs below come from the
Arch install medium .

# uname -a
Linux archiso 4.20.6-arch1-1-ARCH #1 SMP PREEMPT Thu Jan 31 08:22:01
UTC 2019 x86_64 GNU/Linux

# btrfs --version
btrfs-progs v4.19.1

# btrfs fi show
Label: 'root'  uuid: bbd941a4-5525-4ba6-a4d8-3ead02b8aae1
Total devices 1 FS bytes used 381.56GiB
devid1 size 460.39GiB used 393.01GiB path /dev/mapper/cryptroot

# btrfs check --readonly /dev/mapper/cryptroot
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
root 258 inode 776 errors 200, dir isize wrong
root 258 inode 1131031 errors 1, no inode item
unresolved ref dir 776 index 87215 namelen 17 name
TransportSecurity filetype 1 errors 5, no dir item, no inode ref
root 258 inode 2911226 errors 1, no inode item
unresolved ref dir 776 index 160611 namelen 17 name
TransportSecurity filetype 1 errors 5, no dir item, no inode ref
ERROR: errors found in fs roots
Opening filesystem to check...
Checking filesystem on /dev/mapper/cryptroot
UUID: bbd941a4-5525-4ba6-a4d8-3ead02b8aae1
found 409699909636 bytes used, error(s) found
total csum bytes: 390595732
total tree bytes: 5061541888
total fs tree bytes: 4224024576
total extent tree bytes: 339312640
btree space waste bytes: 892618468
file data blocks allocated: 529336496128
 referenced 490479570944
[ 6098.200152] BTRFS warning (device dm-0): 'recovery' is deprecated, use 'usebackuproot' instead
[ 6098.200155] BTRFS info (device dm-0): trying to use backup root at mount time
[ 6098.200158] BTRFS info (device dm-0): disk space caching is enabled
[ 6098.200161] BTRFS info (device dm-0): has skinny extents
[ 6098.318699] BTRFS info (device dm-0): enabling ssd optimizations
[ 6098.920655] WARNING: CPU: 2 PID: 1581 at fs/btrfs/extent-tree.c:6822 __btrfs_free_extent.isra.25+0x61e/0x940 [btrfs]
[ 6098.920657] Modules linked in: btrfs libcrc32c crc32c_generic xor raid6_pq dm_crypt algif_skcipher af_alg dm_mod hid_multitouch hid_generic snd_hda_codec_hdmi joydev mousedev arc4 wl(POE) ath10k_pci ath10k_core snd_soc_skl ath snd_soc_hdac_hda intel_rapl snd_hda_ext_core mac80211 snd_soc_skl_ipc x86_pkg_temp_thermal intel_powerclamp coretemp snd_soc_sst_ipc btusb snd_soc_sst_dsp kvm_intel snd_soc_acpi_intel_match btrtl snd_soc_acpi btbcm btintel snd_hda_codec_realtek snd_soc_core snd_hda_codec_generic bluetooth snd_compress ac97_bus cfg80211 snd_pcm_dmaengine uvcvideo mei_wdt crct10dif_pclmul snd_hda_intel videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 iTCO_wdt iTCO_vendor_support videobuf2_common ghash_clmulni_intel dell_laptop snd_hda_codec videodev ecdh_generic rtsx_pci_ms intel_cstate snd_hda_core psmouse tpm_crb intel_uncore rfkill media memstick intel_rapl_perf crc16 snd_hwdep snd_pcm pcspkr input_leds i2c_hid snd_timer intel_wmi_thunderbolt snd hid idma64 soundcore mei_me
[ 6098.920703]  soc_button_array i2c_i801 tpm_tis mei tpm_tis_core intel_lpss_pci intel_vbtn intel_lpss intel_hid dell_wmi tpm battery dell_smbios processor_thermal_device evdev sparse_keymap intel_pch_thermal dcdbas intel_soc_dts_iosf rng_core ac ucsi_acpi int3400_thermal typec_ucsi int3403_thermal acpi_thermal_rel int340x_thermal_zone wmi_bmof dell_wmi_descriptor typec mac_hid pcc_cpufreq

Re: btrfs as / filesystem in RAID1

2019-02-07 Thread Martin Steigerwald
Chris Murphy - 07.02.19, 18:15:
> > So please change the normal behavior
> 
> In the case of no device loss, but device delay, with 'degraded' set
> in fstab you risk a non-deterministic degraded mount. And there is no
> automatic balance (sync) after recovering from a degraded mount. And
> as far as I know there's no automatic transition from degraded to
> normal operation upon later discovery of a previously missing device.
> It's just begging for data loss. That's why it's not the default.
> That's why it's not recommended.

Still, the current behavior is not really user-friendly. And it does not
meet the expectations users usually have about how RAID 1 works. I know
BTRFS RAID 1 is not a classic RAID 1, although it is called that.

I also somewhat get that with the current state of BTRFS the current
behavior of not allowing a degraded mount may be better… however… I see
clear room for improvement here. And there very likely will be
discussions like this on this list… until BTRFS acts in a more
user-friendly way here.

I faced this myself during recovery from a failure of one SSD of a dual
SSD BTRFS RAID 1, and it caused me to spend *hours* instead of what in
my eyes could have been minutes to bring the machine back to a working
state. Luckily the SSDs I use do not tend to fail all that often. And
the Intel SSD 320 that has this "Look, I am 8 MiB big and all your data
is gone" firmware bug – even with the firmware version that was supposed
to fix this issue – is out of service now. Although I was able to bring
it back to a working (but blank) state with a secure erase, I am just
not going to use such an SSD for anything serious.
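
For reference, the manual steps that were needed in my case, roughly (device
names and devid hypothetical) — this is exactly the part I would like to see
become less manual:

  mount -o degraded /dev/sdb2 /mnt                 # mount from the surviving device
  btrfs replace start -B 1 /dev/sdc2 /mnt          # put a new device in place of missing devid 1
  # convert back any chunks written as "single" while degraded:
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt
  btrfs scrub start -Bd /mnt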

Thanks,
-- 
Martin




Re: New hang (Re: Kernel traces), sysreq+w output

2019-02-05 Thread Martin Raiber
On 06.02.2019 01:22 Qu Wenruo wrote:
> On 2019/2/6 6:18 AM, Stephen R. van den Berg wrote:
>> Are these Sysreq+w dumps not usable?
>>
> Sorry for the late reply.
>
> The hang looks pretty strange, and doesn't really look like the previous
> deadlocks caused by tree block locking.
> But there is some strange behavior around metadata dirty pages:
>
> This looks like to be the cause of the problem.
>
> kworker/u16:1   D0 19178  2 0x8000
> Workqueue: btrfs-endio-write btrfs_endio_write_helper
> Call Trace:
>  ? __schedule+0x4db/0x524
>  ? schedule+0x60/0x71
>  ? schedule_timeout+0xb2/0xec
>  ? __next_timer_interrupt+0xae/0xae
>  ? io_schedule_timeout+0x1b/0x3d
>  ? balance_dirty_pages+0x7a7/0x861
>  ? usleep_range+0x7e/0x7e
>  ? schedule+0x60/0x71
>  ? schedule_timeout+0x32/0xec
>  ? balance_dirty_pages_ratelimited+0x204/0x225
>  ? btrfs_finish_ordered_io+0x584/0x5ac
>  ? normal_work_helper+0xfe/0x243
>  ? process_one_work+0x18d/0x271
>  ? rescuer_thread+0x278/0x278
>  ? worker_thread+0x194/0x23f
>  ? kthread+0xeb/0xf0
>  ? kthread_associate_blkcg+0x86/0x86
>  ? ret_from_fork+0x35/0x40
>
> But I'm not familiar with the balance_dirty_pages part, thus can't provide
> many details about this.
That balance_dirty_pages call was removed with the latest stable kernels
(
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/fs/btrfs?h=linux-4.20.y&id=480c6fb23eb80e88eba7e4603304710ee7a9416f
).


Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io

2018-12-17 Thread Martin Raiber
On 14.12.2018 09:07 ethanlien wrote:
> Martin Raiber wrote on 2018-12-12 23:22:
>> On 12.12.2018 15:47 Chris Mason wrote:
>>> On 28 May 2018, at 1:48, Ethan Lien wrote:
>>>
>>> It took me a while to trigger, but this actually deadlocks ;)  More
>>> below.
>>>
>>>> [Problem description and how we fix it]
>>>> We should balance dirty metadata pages at the end of
>>>> btrfs_finish_ordered_io, since a small, unmergeable random write can
>>>> potentially produce dirty metadata which is multiple times larger than
>>>> the data itself. For example, a small, unmergeable 4KiB write may
>>>> produce:
>>>>
>>>>     16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
>>>>     16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
>>>>     16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
>>>>
>>>> Although we do call balance dirty pages in write side, but in the
>>>> buffered write path, most metadata are dirtied only after we reach the
>>>> dirty background limit (which by far only counts dirty data pages) and
>>>> wakeup the flusher thread. If there are many small, unmergeable random
>>>> writes spread in a large btree, we'll find a burst of dirty pages
>>>> exceeds the dirty_bytes limit after we wakeup the flusher thread -
>>>> which
>>>> is not what we expect. In our machine, it caused out-of-memory problem
>>>> since a page cannot be dropped if it is marked dirty.
>>>>
>>>> Someone may worry about we may sleep in
>>>> btrfs_btree_balance_dirty_nodelay,
>>>> but since we do btrfs_finish_ordered_io in a separate worker, it will
>>>> not
>>>> stop the flusher consuming dirty pages. Also, we use different worker
>>>> for
>>>> metadata writeback endio, sleep in btrfs_finish_ordered_io help us
>>>> throttle
>>>> the size of dirty metadata pages.
>>> In general, slowing down btrfs_finish_ordered_io isn't ideal because it
>>> adds latency to places we need to finish quickly.  Also,
>>> btrfs_finish_ordered_io is used by the free space cache.  Even though
>>> this happens from its own workqueue, it means completing free space
>>> cache writeback may end up waiting on balance_dirty_pages, something
>>> like this stack trace:
>>>
>>> [..]
>>>
>>> Eventually, we have every process in the system waiting on
>>> balance_dirty_pages(), and nobody is able to make progress on
>>> page writeback.
>>>
>> I had lockups with this patch as well. If you put e.g. a loop device on
>> top of a btrfs file, loop sets PF_LESS_THROTTLE to avoid a feedback
>> loop causing delays. The task balancing dirty pages in
>> btrfs_finish_ordered_io doesn't have the flag and causes slow-downs. In
>> my case it managed to cause a feedback loop where it queues other
>> btrfs_finish_ordered_io and gets stuck completely.
>>
>
> The data writepage endio will queue work for
> btrfs_finish_ordered_io() in a separate workqueue and clear the page's
> writeback, so throttling in btrfs_finish_ordered_io() should not slow
> down the flusher thread. One suspicious point is that while the caller is
> waiting for a range of ordered_extents to complete, it will be
> blocked until balance_dirty_pages_ratelimited() makes some
> progress, since we finish ordered_extents in
> btrfs_finish_ordered_io().
> Do you have call stack information for the stuck processes, or are you
> using fsync/sync frequently? If this is the case, maybe we should pull
> this thing out and try to balance dirty metadata pages somewhere.

Yeah like,

[875317.071433] Call Trace:
[875317.071438]  ? __schedule+0x306/0x7f0
[875317.071442]  schedule+0x32/0x80
[875317.071447]  btrfs_start_ordered_extent+0xed/0x120
[875317.071450]  ? remove_wait_queue+0x60/0x60
[875317.071454]  btrfs_wait_ordered_range+0xa0/0x100
[875317.071457]  btrfs_sync_file+0x1d6/0x400
[875317.071461]  ? do_fsync+0x38/0x60
[875317.071463]  ? btrfs_fdatawrite_range+0x50/0x50
[875317.071465]  do_fsync+0x38/0x60
[875317.071468]  __x64_sys_fsync+0x10/0x20
[875317.071470]  do_syscall_64+0x55/0x100
[875317.071473]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

so I guess the problem is that calling balance_dirty_pages causes
fsyncs to the same btrfs (via my unusual setup of loop+fuse)? Those
fsyncs are deadlocked because they are called indirectly from
btrfs_finish_ordered_io... It is an unusual setup, which is why I did not
post it to the mailing list initially.




Re: [PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io

2018-12-12 Thread Martin Raiber
On 12.12.2018 15:47 Chris Mason wrote:
> On 28 May 2018, at 1:48, Ethan Lien wrote:
>
> It took me a while to trigger, but this actually deadlocks ;)  More 
> below.
>
>> [Problem description and how we fix it]
>> We should balance dirty metadata pages at the end of
>> btrfs_finish_ordered_io, since a small, unmergeable random write can
>> potentially produce dirty metadata which is multiple times larger than
>> the data itself. For example, a small, unmergeable 4KiB write may
>> produce:
>>
>> 16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
>> 16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
>> 16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
>>
>> Although we do call balance dirty pages in write side, but in the
>> buffered write path, most metadata are dirtied only after we reach the
>> dirty background limit (which by far only counts dirty data pages) and
>> wakeup the flusher thread. If there are many small, unmergeable random
>> writes spread in a large btree, we'll find a burst of dirty pages
>> exceeds the dirty_bytes limit after we wakeup the flusher thread - 
>> which
>> is not what we expect. In our machine, it caused out-of-memory problem
>> since a page cannot be dropped if it is marked dirty.
>>
>> Someone may worry about we may sleep in 
>> btrfs_btree_balance_dirty_nodelay,
>> but since we do btrfs_finish_ordered_io in a separate worker, it will 
>> not
>> stop the flusher consuming dirty pages. Also, we use different worker 
>> for
>> metadata writeback endio, sleep in btrfs_finish_ordered_io help us 
>> throttle
>> the size of dirty metadata pages.
> In general, slowing down btrfs_finish_ordered_io isn't ideal because it 
> adds latency to places we need to finish quickly.  Also, 
> btrfs_finish_ordered_io is used by the free space cache.  Even though 
> this happens from its own workqueue, it means completing free space 
> cache writeback may end up waiting on balance_dirty_pages, something 
> like this stack trace:
>
> [..]
>
> Eventually, we have every process in the system waiting on 
> balance_dirty_pages(), and nobody is able to make progress on page 
> writeback.
>
I had lockups with this patch as well. If you put e.g. a loop device on
top of a btrfs file, loop sets PF_LESS_THROTTLE to avoid a feedback
loop causing delays. The task balancing dirty pages in
btrfs_finish_ordered_io doesn't have the flag and causes slow-downs. In
my case it managed to cause a feedback loop where it queues other
btrfs_finish_ordered_io and gets stuck completely.

Regards,
Martin Raiber



Re: Possible deadlock when writing

2018-12-01 Thread Martin Bakiev
I was having the same issue with kernels 4.19.2 and 4.19.4. I don’t appear to 
have the issue with 4.20.0-0.rc1 on Fedora Server 29.

The issue is very easy to reproduce on my setup, not sure how much of it is 
actually relevant, but here it is:

- 3 drive RAID5 created
- Some data moved to it
- Expanded to 7 drives
- No balancing

The issue is easily reproduced (within 30 mins) by starting multiple transfers 
to the volume (several TB in the form of many 30GB+ files). Multiple concurrent 
‘rsync’ transfers seem to take a bit longer to trigger the issue, but multiple 
‘cp’ commands will do it much quicker (again not sure if relevant).
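
For completeness, the reproducer is nothing more than a handful of parallel
copies, e.g. (paths hypothetical):

  for src in /mnt/old1 /mnt/old2 /mnt/old3; do
      cp -a "$src"/. /mnt/raid5/ &
  done
  wait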

I have not seen the issue occur with a single ‘rsync’ or ‘cp’ transfer, but I 
haven’t left one running alone for too long (copying the data from multiple 
drives, so there is a lot to be gained from parallelizing the transfers).

I’m not sure what state the FS is left in after Magic SysRq reboot after it 
deadlocks, but seemingly it’s fine. No problems mounting and ‘btrfs check’ 
passes OK. I’m sure some of the data doesn’t get flushed, but it’s no problem 
for my use case.

I’ve been running concurrent transfers nonstop with kernel 4.20.0-0.rc1 for
24 hours and I haven’t experienced the issue.

Hope this helps.

Re: [PATCH RESEND 0/8] btrfs-progs: sub: Relax the privileges of "subvolume list/show"

2018-11-27 Thread Martin Steigerwald
Misono Tomohiro - 27.11.18, 06:24:
> Importantly, in order to make output consistent for both root and
> non-privileged user, this changes the behavior of "subvolume list":
>  - (default) Only list in subvolume under the specified path.
>Path needs to be a subvolume.

Does that work recursively?

I would find it quite unexpected if btrfs subvol list, run in or on the
root directory of a BTRFS filesystem, did not display any subvolumes on
that filesystem, no matter where they are.
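
For comparison, the behavior I am used to today (a sketch, as far as I
understand the current tools):

  btrfs subvolume list /mnt      # all subvolumes of the whole filesystem, wherever they sit
  btrfs subvolume list -o /mnt   # only subvolumes directly below /mnt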

Thanks,
-- 
Martin




Re: Interpreting `btrfs filesystem show'

2018-10-15 Thread Martin Steigerwald
Hugo Mills - 15.10.18, 16:26:
> On Mon, Oct 15, 2018 at 05:24:08PM +0300, Anton Shepelev wrote:
> > Hello, all
> > 
> > While trying to resolve free space problems, and found that
> > 
> > I cannot interpret the output of:
> > > btrfs filesystem show
> > 
> > Label: none  uuid: 8971ce5b-71d9-4e46-ab25-ca37485784c8
> > Total devices 1 FS bytes used 34.06GiB
> > devid1 size 40.00GiB used 37.82GiB path /dev/sda2
> > 
> > How come the total used value is less than the value listed
> > for the only device?
> 
>"Used" on the device is the mount of space allocated. "Used" on the
> FS is the total amount of actual data and metadata in that
> allocation.
> 
>You will also need to look at the output of "btrfs fi df" to see
> the breakdown of the 37.82 GiB into data, metadata and currently
> unused.

I usually use btrfs fi usage -T, because

1. It has all the information.

2. It differentiates between used and allocated.

% btrfs fi usage -T /
Overall:
Device size: 100.00GiB
Device allocated: 54.06GiB
Device unallocated:   45.94GiB
Device missing:  0.00B
Used: 46.24GiB
Free (estimated): 25.58GiB  (min: 25.58GiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:   70.91MiB  (used: 0.00B)

Data Metadata  System  
Id Path RAID1RAID1 RAID1Unallocated
--   -  ---
 2 /dev/mapper/msata-debian 25.00GiB   2.00GiB 32.00MiB22.97GiB
 1 /dev/mapper/sata-debian  25.00GiB   2.00GiB 32.00MiB22.97GiB
--   -  ---
   Total25.00GiB   2.00GiB 32.00MiB45.94GiB
   Used 22.38GiB 754.66MiB 16.00KiB  


For RAID it in some places reports the raw size and in others the logical 
size. Especially in the "Total" line I find this a bit inconsistent. 
"RAID1" columns show logical size, "Unallocated" shows raw size.

Also "Used:" in the global section shows raw size and "Free 
(estimated):" shows logical size.

Thanks
-- 
Martin




Re: BTRFS related kernel backtrace on boot on 4.18.7 after blackout due to discharged battery

2018-10-05 Thread Martin Steigerwald
Filipe Manana - 05.10.18, 17:21:
> On Fri, Oct 5, 2018 at 3:23 PM Martin Steigerwald 
 wrote:
> > Hello!
> > 
> > On ThinkPad T520 after battery was discharged and machine just
> > blacked out.
> > 
> > Is that some sign of regular consistency check / replay or something
> > to investigate further?
> 
> I think it's harmless, if anything were messed up with link counts or
> mismatches between those and dir entries, fsck (btrfs check) should
> have reported something.
> I'll dig a bit further and remove the warning if it's really harmless.

I just scrubbed the filesystem. I did not run btrfs check on it.

> > I already scrubbed all data and there are no errors. Also btrfs
> > device stats reports no errors. SMART status appears to be okay as
> > well on both SSD.
> > 
> > [4.524355] BTRFS info (device dm-4): disk space caching is
> > enabled [… backtrace …]
-- 
Martin




BTRFS related kernel backtrace on boot on 4.18.7 after blackout due to discharged battery

2018-10-05 Thread Martin Steigerwald
 83 c8 ff c3 66 2e 
0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d 3e e4 0b 00 f7 d8 64 89 01 48 
[6.123872] RSP: 002b:7ffc0e3466a8 EFLAGS: 0202 ORIG_RAX: 
00a5
[6.131285] RAX: ffda RBX: 55f3ed7ee9c0 RCX: 7f0715b89a1a
[6.131286] RDX: 55f3ed7eebc0 RSI: 55f3ed7eec40 RDI: 55f3ed7ef900
[6.131287] RBP: 7f0715ecff04 R08: 55f3ed7eec00 R09: 55f3ed7eebc0
[6.131288] R10: c0ed0400 R11: 0202 R12: 
[6.131289] R13: c0ed0400 R14: 55f3ed7ef900 R15: 55f3ed7eebc0
[6.131292] ---[ end trace bd5d30b2fea7fb77 ]---
[6.251219] BTRFS info (device dm-3): checking UUID tree

Thanks,
-- 
Martin




Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)

2018-09-19 Thread Martin Steigerwald
Hans van Kranenburg - 19.09.18, 19:58:
> However, as soon as we remount the filesystem with space_cache=v2 -
> 
> > writes drop to just around 3-10 MB/s to each disk. If we remount to
> > space_cache - lots of writes, system unresponsive. Again remount to
> > space_cache=v2 - low writes, system responsive.
> > 
> > That's a huuge, 10x overhead! Is it expected? Especially that
> > space_cache=v1 is still the default mount option?
> 
> Yes, that does not surprise me.
> 
> https://events.static.linuxfound.org/sites/events/files/slides/vault20
> 16_0.pdf
> 
> Free space cache v1 is the default because of issues with btrfs-progs,
> not because it's unwise to use the kernel code. I can totally
> recommend using it. The linked presentation above gives some good
> background information.

Which issues in btrfs-progs are those?

I am wondering whether to switch to the free space tree (v2). Would it
provide a benefit for regular / and /home filesystems on a dual SSD BTRFS
RAID-1 on a laptop?
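
As far as I understand it, switching is a one-time mount with the option, and
reverting needs btrfs-progs on the unmounted device — a sketch with the
device name from my layout above:

  mount -o space_cache=v2 /dev/mapper/sata-debian /mnt   # builds the free space tree on this mount
  # going back to v1 later, with the filesystem unmounted:
  btrfs check --clear-space-cache v2 /dev/mapper/sata-debian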

Thanks,
-- 
Martin




Re: Transactional btrfs

2018-09-08 Thread Martin Raiber
Am 08.09.2018 um 18:24 schrieb Adam Borowski:
> On Thu, Sep 06, 2018 at 06:08:33AM -0400, Austin S. Hemmelgarn wrote:
>> On 2018-09-06 03:23, Nathan Dehnel wrote:
>>> So I guess my question is, does btrfs support atomic writes across
>>> multiple files? Or is anyone interested in such a feature?
>>>
>> I'm fairly certain that it does not currently, but in theory it would not be
>> hard to add.
>>
>> Realistically, the only cases I can think of where cross-file atomic
>> _writes_ would be of any benefit are database systems.
>>
>> However, if this were extended to include rename, unlink, touch, and a
>> handful of other VFS operations, then I can easily think of a few dozen use
>> cases.  Package managers in particular would likely be very interested in
>> being able to atomically rename a group of files as a single transaction, as
>> it would make their job _much_ easier.
> I wonder, what about:
> sync; mount -o remount,commit=999,flushoncommit
> eatmydata apt dist-upgrade
> sync; mount -o remount,commit=30,noflushoncommit
>
> Obviously, this gets fooled by fsyncs, and makes the transaction affect the
> whole system (if you have unrelated writes they won't get committed until
> the end of transaction).  Then there are nocow files, but you already made
> the decision to disable most features of btrfs for them.
>
> So unless something forces a commit, this should already work, giving
> cross-file atomic writes, renames and so on.

Now combine this with snapshot root, then on success rename exchange to
root and you are there.

Btrfs had in the past TRANS_START and TRANS_END ioctls (for ceph, I
think), but no rollback (and therefore no error handling incl. ENOSPC).

If you want to look at a working file system transaction mechanism, you
should look at transactional NTFS (TxF). They are writing they are
deprecating it, so it's perhaps not very widely used. Windows uses it
for updates, I think.

Specifically for btrfs, the problem would be that it really needs to
support multiple simultaneous writers, otherwise one transaction can
block the whole system.




Re: lazytime mount option—no support in Btrfs

2018-08-19 Thread Martin Steigerwald
waxhead - 18.08.18, 22:45:
> Adam Hunt wrote:
> > Back in 2014 Ted Tso introduced the lazytime mount option for ext4
> > and shortly thereafter a more generic VFS implementation which was
> > then merged into mainline. His early patches included support for
> > Btrfs but those changes were removed prior to the feature being
> > merged. His> 
> > changelog includes the following note about the removal:
> >- Per Christoph's suggestion, drop support for btrfs and xfs for
> >now,
> >
> >  issues with how btrfs and xfs handle dirty inode tracking.  We
> >  can add btrfs and xfs support back later or at the end of this
> >  series if we want to revisit this decision.
> > 
> > My reading of the current mainline shows that Btrfs still lacks any
> > support for lazytime. Has any thought been given to adding support
> > for lazytime to Btrfs?
[…]
> Is there anything new regarding this?

I'd like to know whether there is any news about this as well.

If I understand it correctly this could even help BTRFS performance a
lot because it is COWing metadata.

Thanks,
-- 
Martin




Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-18 Thread Martin Steigerwald
Roman Mamedov - 18.08.18, 09:12:
> On Fri, 17 Aug 2018 23:17:33 +0200
> 
> Martin Steigerwald  wrote:
> > > Do not consider SSD "compression" as a factor in any of your
> > > calculations or planning. Modern controllers do not do it anymore,
> > > the last ones that did are SandForce, and that's 2010 era stuff.
> > > You
> > > can check for yourself by comparing write speeds of compressible
> > > vs
> > > incompressible data, it should be the same. At most, the modern
> > > ones
> > > know to recognize a stream of binary zeroes and have a special
> > > case
> > > for that.
> > 
> > Interesting. Do you have any backup for your claim?
> 
> Just "something I read". I follow quote a bit of SSD-related articles
> and reviews which often also include a section to talk about the
> controller utilized, its background and technological
> improvements/changes -- and the compression going out of fashion
> after SandForce seems to be considered a well-known fact.
> 
> Incidentally, your old Intel 320 SSDs actually seem to be based on
> that old SandForce controller (or at least license some of that IP to
> extend on it), and hence those indeed might perform compression.

Interesting. Back then I read the Intel SSD 320 would not compress.
I think it's difficult to know for sure with those proprietary controllers.

> > As the data still needs to be transferred to the SSD at least when
> > the SATA connection is maxed out I bet you won't see any difference
> > in write speed whether the SSD compresses in real time or not.
> 
> Most controllers expose two readings in SMART:
> 
>   - Lifetime writes from host (SMART attribute 241)
>   - Lifetime writes to flash (attribute 233, or 177, or 173...)
>
> It might be difficult to get the second one, as often it needs to be
> decoded from others such as "Average block erase count" or "Wear
> leveling count". (And seems to be impossible on Samsung NVMe ones,
> for example)

I got the impression every manufacturer does their own thing here. And I
would not even be surprised if it's different between different generations
of SSDs from one manufacturer.
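
What I look at, roughly (attribute names and IDs differ per vendor, so only a
sketch):

  smartctl -A /dev/sda | grep -Ei 'write|wear|erase'
  # write amplification is then roughly: lifetime writes to flash / lifetime writes from host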

# Crucial mSATA

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   000    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       16345
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4193
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Wear_Leveling_Count     0x0032   078   078   000    Old_age   Always       -       663
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       362
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   000   000   000    Pre-fail  Always       -       8219
183 SATA_Iface_Downshift    0x0032   100   100   000    Old_age   Always       -       1
184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   046   020   000    Old_age   Always       -       54 (Min/Max -10/80)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       16
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Percent_Lifetime_Used   0x0031   078   078   000    Pre-fail  Offline      -       22

I expect the raw value of this to rise more slowly now that there are almost
100 GiB completely unused and there is lots of free space in the filesystems.
But even if not, the SSD has been in use since March 2014, so it has plenty of
time to go.

206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_Host_Sector_Write 0x0032   100   100   ---    Old_age   Always       -       91288276930

^^ In sectors. 91288276930 * 512 / 1024 / 1024 / 1024 ~= 43529 GiB

Could be 4 KiB… but as it's talking about Host_Sector and the value multiplied
by eight does not make any sense, I bet it's 512 bytes.
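For reference, the same arithmetic as a one-liner, assuming the raw value
really is in 512-byte sectors:

% smartctl -A /dev/sdb | awk '/Total_Host_Sector_Write/ {printf "%.0f GiB written\n", $NF * 512 / 1024^3}'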

% smartctl /dev/sdb --all |grep 

Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Martin Steigerwald
Austin S. Hemmelgarn - 17.08.18, 14:55:
> On 2018-08-17 08:28, Martin Steigerwald wrote:
> > Thanks for your detailed answer.
> > 
> > Austin S. Hemmelgarn - 17.08.18, 13:58:
> >> On 2018-08-17 05:08, Martin Steigerwald wrote:
[…]
> >>> Anyway, creating a new filesystem may have been better here
> >>> anyway,
> >>> cause it replaced an BTRFS that aged over several years with a new
> >>> one. Due to the increased capacity and due to me thinking that
> >>> Samsung 860 Pro compresses itself, I removed LZO compression. This
> >>> would also give larger extents on files that are not fragmented or
> >>> only slightly fragmented. I think that Intel SSD 320 did not
> >>> compress, but Crucial m500 mSATA SSD does. That has been the
> >>> secondary SSD that still had all the data after the outage of the
> >>> Intel SSD 320.
> >> 
> >> First off, keep in mind that the SSD firmware doing compression
> >> only
> >> really helps with wear-leveling.  Doing it in the filesystem will
> >> help not only with that, but will also give you more space to work
> >> with.> 
> > While also reducing the ability of the SSD to wear-level. The more
> > data I fit on the SSD, the less it can wear-level. And the better I
> > compress that data, the less it can wear-level.
> 
> No, the better you compress the data, the _less_ data you are
> physically putting on the SSD, just like compressing a file makes it
> take up less space.  This actually makes it easier for the firmware
> to do wear-leveling.  Wear-leveling is entirely about picking where
> to put data, and by reducing the total amount of data you are writing
> to the SSD, you're making that decision easier for the firmware, and
> also reducing the number of blocks of flash memory needed (which also
> helps with SSD life expectancy because it translates to fewer erase
> cycles).

On one hand I can go with this, but:

If I fill the SSD to 99% with already compressed data, then, in case it 
compresses data itself for wear leveling, it has less room to wear-level 
than with 99% of not yet compressed data that it could compress itself.

That was the point I was trying to make.

Sure, with a fill rate of about 46% for /home, compression would help the 
wear leveling. And if the controller does not compress at all, it would 
help as well.

Hmmm, maybe I should enable "zstd", but on the other hand I save CPU cycles 
by not enabling it.
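If I ever do enable it, it would be something along these lines (a sketch; 
zstd needs kernel 4.14 or newer, and the defragment step is optional, 
rewrites existing data and can break reflink sharing with snapshots):

% mount -o remount,compress=zstd /home
% btrfs filesystem defragment -r -czstd /home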

> > However… I am not all that convinced that it would benefit me as
> > long as I have enough space. That SSD replacement more than doubled
> > capacity from about 680 GB to 1480 GB. I have a ton of free space in
> > the filesystems – usage of /home is only 46% for example – and
> > there are 96 GiB completely unused in LVM on the Crucial SSD and
> > even more than 183 GiB completely unused on Samsung SSD. The system
> > is doing weekly "fstrim" on all filesystems. I think that this is
> > more than is needed for the longevity of the SSDs, but well
> > actually I just don´t need the space, so…
> > 
> > Of course, in case I manage to fill up all that space, I consider
> > using compression. Until then, I am not all that convinced that I´d
> > benefit from it.
> > 
> > Of course it may increase read speeds and in case of nicely
> > compressible data also write speeds, I am not sure whether it even
> > matters. Also it uses up some CPU cycles on a dual core (+
> > hyperthreading) Sandybridge mobile i5. While I am not sure about
> > it, I bet also having larger possible extent sizes may help a bit.
> > As well as no compression may also help a bit with fragmentation.
> 
> It generally does actually. Less data physically on the device means
> lower chances of fragmentation.  In your case, it may not improve

I thought "no compression" may help with fragmentation, but I think you 
think that "compression" helps with fragmentation and misunderstood what 
I wrote.

> speed much though (your i5 _probably_ can't compress data much faster
> than it can access your SSD's, which means you likely won't see much
> performance benefit other than reducing fragmentation).
> 
> > Well putting this to a (non-scientific) test:
> > 
> > […]/.local/share/akonadi/db_data/akonadi> du -sh * | sort -rh | head
> > -5 3,1Gparttable.ibd
> > 
> > […]/.local/share/akonadi/db_data/akonadi> filefrag parttable.ibd
> > parttable.ibd: 11583 extents found
> > 
> > Hmmm, already quite many extents after just about one week with the
> > new filesystem. On the old filesystem I had somewhat around
>

Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Martin Steigerwald
Hi Roman.

Now with proper CC.

Roman Mamedov - 17.08.18, 14:50:
> On Fri, 17 Aug 2018 14:28:25 +0200
> 
> Martin Steigerwald  wrote:
> > > First off, keep in mind that the SSD firmware doing compression
> > > only
> > > really helps with wear-leveling.  Doing it in the filesystem will
> > > help not only with that, but will also give you more space to
> > > work with.> 
> > While also reducing the ability of the SSD to wear-level. The more
> > data I fit on the SSD, the less it can wear-level. And the better I
> > compress that data, the less it can wear-level.
> 
> Do not consider SSD "compression" as a factor in any of your
> calculations or planning. Modern controllers do not do it anymore,
> the last ones that did are SandForce, and that's 2010 era stuff. You
> can check for yourself by comparing write speeds of compressible vs
> incompressible data, it should be the same. At most, the modern ones
> know to recognize a stream of binary zeroes and have a special case
> for that.

Interesting. Do you have any backup for your claim?

> As for general comment on this thread, always try to save the exact
> messages you get when troubleshooting or getting failures from your
> system. Saying just "was not able to add" or "btrfs replace not
> working" without any exact details isn't really helpful as a bug
> report or even as a general "experiences" story, as we don't know
> what was the exact cause of those, could that have been avoided or
> worked around, not to mention what was your FS state at the time (as
> in "btrfs fi show" and "fi df").

I had a screen.log, but I put it on the filesystem after the 
backup was made, so it was lost.

Anyway, the reason for not being able to add the device was the read 
only state of the BTRFS, as I wrote. Same goes for replace. I was able 
to read the error message just fine. AFAIR the exact wording was "read 
only filesystem".

In any case: It was an experience report, not a request for help, so I don´t 
see why exact error messages are absolutely needed. If it had been a support 
inquiry, that would be different, I agree.

Thanks,
-- 
Martin




Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Martin Steigerwald
Austin S. Hemmelgarn - 17.08.18, 15:01:
> On 2018-08-17 08:50, Roman Mamedov wrote:
> > On Fri, 17 Aug 2018 14:28:25 +0200
> > 
> > Martin Steigerwald  wrote:
> >>> First off, keep in mind that the SSD firmware doing compression
> >>> only
> >>> really helps with wear-leveling.  Doing it in the filesystem will
> >>> help not only with that, but will also give you more space to
> >>> work with.>> 
> >> While also reducing the ability of the SSD to wear-level. The more
> >> data I fit on the SSD, the less it can wear-level. And the better
> >> I compress that data, the less it can wear-level.
> > 
> > Do not consider SSD "compression" as a factor in any of your
> > calculations or planning. Modern controllers do not do it anymore,
> > the last ones that did are SandForce, and that's 2010 era stuff.
> > You can check for yourself by comparing write speeds of
> > compressible vs incompressible data, it should be the same. At
> > most, the modern ones know to recognize a stream of binary zeroes
> > and have a special case for that.
> 
> All that testing write speeds for compressible versus incompressible
> data tells you is if the SSD is doing real-time compression of data,
> not if they are doing any compression at all.  Also, this test only
> works if you turn the write-cache on the device off.

As the data still needs to be transferred to the SSD, at least when the 
SATA connection is maxed out I bet you won´t see any difference in write 
speed whether the SSD compresses in real time or not.
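For what it is worth, a sketch of such a test with the write cache disabled 
(device and paths are examples; pre-generating the random data avoids 
measuring /dev/urandom speed):

% head -c 1G /dev/urandom > /dev/shm/rand.bin
% hdparm -W0 /dev/sdb
% dd if=/dev/zero of=/mnt/test/zero.bin bs=1M count=1024 oflag=direct
% dd if=/dev/shm/rand.bin of=/mnt/test/rand.bin bs=1M oflag=direct
% hdparm -W1 /dev/sdb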

> Besides, you can't prove 100% for certain that any manufacturer who
> does not sell their controller chips isn't doing this, which means
> there are a few manufacturers that may still be doing it.

Who really knows what SSD controller manufacturers are doing? I have not 
seen any Open Channel SSD stuff for laptops so far.

Thanks,
-- 
Martin




Hang after growing file system (4.14.48)

2018-08-17 Thread Martin Raiber
Hi,

after growing a single btrfs file system (on a loop device, with btrfs
fi resize max /fs ) it hangs later (sometimes much later). Symptoms:

* An unkillable btrfs process using 100% (of one) CPU in R state (no
kernel trace, cannot attach with strace or gdb, or run Linux perf)
* Other processes with the following stack trace:

[Fri Aug 17 16:21:06 2018] INFO: task python3:46794 blocked for more than
120 seconds.
[Fri Aug 17 16:21:06 2018]   Not tainted 4.14.48 #2
[Fri Aug 17 16:21:06 2018] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Aug 17 16:21:06 2018] python3 D    0 46794  46702 0x
[Fri Aug 17 16:21:06 2018] Call Trace:
[Fri Aug 17 16:21:06 2018]  ? __schedule+0x2de/0x7b0
[Fri Aug 17 16:21:06 2018]  schedule+0x32/0x80
[Fri Aug 17 16:21:06 2018]  schedule_preempt_disabled+0xa/0x10
[Fri Aug 17 16:21:06 2018]  __mutex_lock.isra.1+0x295/0x4c0
[Fri Aug 17 16:21:06 2018]  ? btrfs_show_devname+0x25/0xd0
[Fri Aug 17 16:21:06 2018]  btrfs_show_devname+0x25/0xd0
[Fri Aug 17 16:21:06 2018]  show_vfsmnt+0x44/0x150
[Fri Aug 17 16:21:06 2018]  seq_read+0x314/0x3d0
[Fri Aug 17 16:21:06 2018]  __vfs_read+0x26/0x130
[Fri Aug 17 16:21:06 2018]  vfs_read+0x91/0x130
[Fri Aug 17 16:21:06 2018]  SyS_read+0x42/0x90
[Fri Aug 17 16:21:06 2018]  do_syscall_64+0x6e/0x120
[Fri Aug 17 16:21:06 2018]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Fri Aug 17 16:21:06 2018] RIP: 0033:0x7f67fd41b6d0
[Fri Aug 17 16:21:06 2018] RSP: 002b:7ffd80be2678 EFLAGS: 0246
ORIG_RAX: 
[Fri Aug 17 16:21:06 2018] RAX: ffda RBX: 56521bf7bb00
RCX: 7f67fd41b6d0
[Fri Aug 17 16:21:06 2018] RDX: 0400 RSI: 56521bf7bd30
RDI: 0004
[Fri Aug 17 16:21:06 2018] RBP: 0d68 R08: 7f67fe655700
R09: 0101
[Fri Aug 17 16:21:06 2018] R10: 56521bf7c0cc R11: 0246
R12: 7f67fd6d6440
[Fri Aug 17 16:21:06 2018] R13: 7f67fd6d5900 R14: 0064
R15: 0000
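For the record, what could be captured the next time it hangs (a sketch; 
assumes sysrq is enabled, and <pid> stands for the spinning btrfs process):

% echo w > /proc/sysrq-trigger    # dump blocked (D state) tasks to dmesg
% echo l > /proc/sysrq-trigger    # backtraces of all active CPUs, may catch the spinning task
% cat /proc/<pid>/stack           # kernel-side stack, often empty while the task is in R state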

Regards,
Martin Raiber



Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Martin Steigerwald
Thanks for your detailed answer.  

Austin S. Hemmelgarn - 17.08.18, 13:58:
> On 2018-08-17 05:08, Martin Steigerwald wrote:
[…]
> > I have seen a discussion about the limitation in point 2. That
> > allowing to add a device and make it into RAID 1 again might be
> > dangerous, cause of system chunk and probably other reasons. I did
> > not completely read and understand it though.
> > 
> > So I still don´t get it, cause:
> > 
> > Either it is a RAID 1, then, one disk may fail and I still have
> > *all*
> > data. Also for the system chunk, which according to btrfs fi df /
> > btrfs fi sh was indeed RAID 1. If so, then period. Then I don´t see
> > why it would need to disallow me to make it into an RAID 1 again
> > after one device has been lost.
> > 
> > Or it is no RAID 1 and then what is the point to begin with? As I
> > was able to copy off all data of the degraded mount, I´d say it was a
> > RAID 1.
> > 
> > (I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just
> > does two copies regardless of how many drives you use.)
> 
> So, what's happening here is a bit complicated.  The issue is entirely
> with older kernels that are missing a couple of specific patches, but
> it appears that not all distributions have their kernels updated to
> include those patches yet.
> 
> In short, when you have a volume consisting of _exactly_ two devices
> using raid1 profiles that is missing one device, and you mount it
> writable and degraded on such a kernel, newly created chunks will be
> single-profile chunks instead of raid1 chunks with one half missing.
> Any write has the potential to trigger allocation of a new chunk, and
> more importantly any _read_ has the potential to trigger allocation of
> a new chunk if you don't use the `noatime` mount option (because a
> read will trigger an atime update, which results in a write).
> 
> When older kernels then go and try to mount that volume a second time,
> they see that there are single-profile chunks (which can't tolerate
> _any_ device failures), and refuse to mount at all (because they
> can't guarantee that metadata is intact).  Newer kernels fix this
> part by checking per-chunk if a chunk is degraded/complete/missing,
> which avoids this because all the single chunks are on the remaining
> device.

How new does the kernel need to be for that to happen?

Do I get this right that it would be the kernel used for recovery, i.e. 
the one on the live distro, that needs to be new enough? The one on this 
laptop is meanwhile already at 4.18.1.

I used the latest GRML stable release 2017.05, which has a 4.9 kernel.
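A sketch of how this could be checked and repaired afterwards, with /mnt as 
a placeholder (the "soft" filter only touches chunks that are not already 
in the target profile):

% btrfs filesystem usage /mnt     # look for Data,single / Metadata,single sections
% btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt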

> As far as avoiding this in the future:

I hope that with the new Samsung 860 Pro together with the existing 
Crucial m500 I am spared from this for years to come. According to its 
SMART lifetime-used status, that Crucial SSD still has quite some time 
to go.

> * If you're just pulling data off the device, mark the device
> read-only in the _block layer_, not the filesystem, before you mount
> it.  If you're using LVM, just mark the LV read-only using LVM
> commands.  This will make 100% certain that nothing gets written to
> the device, and thus makes sure that you won't accidentally cause
> issues like this.

> * If you're going to convert to a single device,
> just do it and don't stop it part way through.  In particular, make
> sure that your system will not lose power.

> * Otherwise, don't mount the volume unless you know you're going to
> repair it.

Thanks for those. Good to keep in mind.
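For the first suggestion, marking the device read-only at the block layer, 
a minimal sketch (device, volume group and mount point names are placeholders):

% blockdev --setro /dev/sdb            # whole device read-only below the filesystem
% lvchange -p r vg0/home               # or just the one LVM logical volume
% mount -o ro,degraded /dev/vg0/home /mnt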

> > For this laptop it was not all that important but I wonder about
> > BTRFS RAID 1 in enterprise environment, cause restoring from backup
> > adds a significantly higher downtime.
> > 
> > Anyway, creating a new filesystem may have been better here anyway,
> > cause it replaced an BTRFS that aged over several years with a new
> > one. Due to the increased capacity and due to me thinking that
> > Samsung 860 Pro compresses itself, I removed LZO compression. This
> > would also give larger extents on files that are not fragmented or
> > only slightly fragmented. I think that Intel SSD 320 did not
> > compress, but Crucial m500 mSATA SSD does. That has been the
> > secondary SSD that still had all the data after the outage of the
> > Intel SSD 320.
> 
> First off, keep in mind that the SSD firmware doing compression only
> really helps with wear-leveling.  Doing it in the filesystem will help
> not only with that, but will also give you more space to work with.

While also reducing the ability of the SSD to wear-level. The more data 
I fit on the SSD, the less it can wear-level. And the better I compress 
that data, the less it can wear-level.

Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Martin Steigerwald
Hi!

This happened about two weeks ago. I already dealt with it and all is 
well.

Linux hung on suspend so I switched off this ThinkPad T520 forcefully. 
After that it did not boot the operating system anymore. The Intel SSD 320, 
with the latest firmware which should patch this bug but apparently does 
not, reported a size of only 8 MiB. Those 8 MiB just contain zeros.

Access via GRML and "mount -fo degraded" worked. I initially was even 
able to write onto this degraded filesystem. First I copied all data to 
a backup drive.

I even started a balance to "single" so that it would work with one SSD.

But later I learned that a secure erase may recover the Intel SSD 320 and, 
since I had no other SSD at hand, I did that. And yes, it did recover it. 
So I canceled the balance.

I partitioned the Intel SSD 320 and put LVM on it, just as I had it before. 
But at that time I was not able to mount the degraded BTRFS on the other SSD 
as writable anymore, not even with "-f" for "I know what I am doing". Thus I 
was not able to add a device to it and btrfs balance it to RAID 1. Even 
"btrfs replace" was not working.

I thus formatted a new BTRFS RAID 1 and restored.

A week later I migrated the Intel SSD 320 to a Samsung 860 Pro. Again 
via one full backup and restore cycle. However, this time I was able to 
copy most of the data off the Intel SSD 320 with "mount -fo degraded" via 
eSATA and thus the copy operation was way faster.

So conclusion:

1. Pro: BTRFS RAID 1 really protected my data against a complete SSD 
outage.

2. Con:  It does not allow me to add a device and balance to RAID 1 or 
replace one device that is already missing at this time.

3. I keep using BTRFS RAID 1 on two SSDs for often changed, critical 
data.

4. And yes, I know it does not replace a backup. As it was holidays and 
I was lazy, the backup was two weeks old already, so I was happy to have 
all my data still on the other SSD.

5. The error messages in kernel when mounting without "-o degraded" are 
less than helpful. They indicate a corrupted filesystem instead of just 
telling that one device is missing and "-o degraded" would help here.


I have seen a discussion about the limitation in point 2. That allowing 
to add a device and make it into RAID 1 again might be dangerous, cause 
of system chunk and probably other reasons. I did not completely read 
and understand it though.

So I still don´t get it, cause:

Either it is a RAID 1, then, one disk may fail and I still have *all* 
data. Also for the system chunk, which according to btrfs fi df / btrfs 
fi sh was indeed RAID 1. If so, then period. Then I don´t see why it 
would need to disallow me to make it into a RAID 1 again after one 
device has been lost.

Or it is no RAID 1 and then what is the point to begin with? As I was 
able to copy off all data of the degraded mount, I´d say it was a RAID 1.

(I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just does 
two copies regardless of how many drives you use.)


For this laptop it was not all that important but I wonder about BTRFS 
RAID 1 in enterprise environment, cause restoring from backup adds a 
significantly higher downtime.

Anyway, creating a new filesystem may have been better here anyway, 
cause it replaced a BTRFS that had aged over several years with a new one. 
Due to the increased capacity and due to me thinking that Samsung 860 
Pro compresses itself, I removed LZO compression. This would also give 
larger extents on files that are not fragmented or only slightly 
fragmented. I think that Intel SSD 320 did not compress, but Crucial 
m500 mSATA SSD does. That has been the secondary SSD that still had all 
the data after the outage of the Intel SSD 320.


Overall I am happy, cause BTRFS RAID 1 gave me access to the data after 
the SSD outage. That is the most important thing about it for me.

Thanks,
-- 
Martin




Re: BTRFS and databases

2018-08-02 Thread Martin Raiber
On 02.08.2018 14:27 Austin S. Hemmelgarn wrote:
> On 2018-08-02 06:56, Qu Wenruo wrote:
>>
>> On 2018-08-02 18:45, Andrei Borzenkov wrote:
>>>
>>> Sent from my iPhone
>>>
 On 2 Aug 2018, at 10:02, Qu Wenruo 
 wrote:

> On 2018-08-01 11:45, MegaBrutal wrote:
> Hi all,
>
> I know it's a decade-old question, but I'd like to hear your thoughts
> of today. By now, I became a heavy BTRFS user. Almost everywhere I
> use
> BTRFS, except in situations when it is obvious there is no benefit
> (e.g. /var/log, /boot). At home, all my desktop, laptop and server
> computers are mainly running on BTRFS with only a few file systems on
> ext4. I even installed BTRFS in corporate productive systems (in
> those
> cases, the systems were mainly on ext4; but there were some specific
> file systems those exploited BTRFS features).
>
> But there is still one question that I can't get over: if you store a
> database (e.g. MySQL), would you prefer having a BTRFS volume mounted
> with nodatacow, or would you just simply use ext4?
>
> I know that with nodatacow, I take away most of the benefits of BTRFS
> (those are actually hurting database performance – the exact CoW
> nature that is elsewhere a blessing, with databases it's a drawback).
> But are there any advantages of still sticking to BTRFS for a
> database
> albeit CoW is disabled, or should I just return to the old and
> reliable ext4 for those applications?

 Since I'm not a expert in database, so I can totally be wrong, but
 what
 about completely disabling database write-ahead-log (WAL), and let
 btrfs' data CoW to handle data consistency completely?

>>>
>>> This would make content of database after crash completely
>>> unpredictable, thus making it impossible to reliably roll back
>>> transaction.
>>
>> Btrfs itself (with datacow) can ensure the fs is updated completely.
>>
>> That's to say, even a crash happens, the content of the fs will be the
>> same state as previous btrfs transaction (btrfs sync).
>>
>> Thus there is no need to rollback database transaction though.
>> (Unless database transaction is not sync to btrfs transaction)
>>
> Two issues with this statement:
>
> 1. Not all database software properly groups logically related
> operations that need to be atomic as a unit into transactions.
> 2. Even aside from point 1 and the possibility of database corruption,
> there are other legitimate reasons that you might need to roll-back a
> transaction (for example, the rather obvious case of a transaction
> that should not have happened in the first place).

I have thought about a database transaction scheme based on btrfs
features before. It has practical issues, though.
One would put a b-tree database file into a subvolume (e.g. trans_0).
When changing the b-tree database one would create a snapshot (trans_1),
then change the file in the snapshot. On commit, sync trans_1, then
delete trans_0. On rollback, delete trans_1.

Problems:
* Large overhead for small transactions (OLTP) -- problem in general for
copy-on-write b-tree databases
* Only root can create or destroy snapshots
* By default the Linux memory subsystem starts write-back pretty much
immediately, so pages that get overwritten more than once within a
transaction get written out multiple times (instead of staying in RAM),
unless Linux is tuned not to do this.

I have used this method, albeit by reflinking the database and then
modifying the reflink, but I think reflinking is slower than creating a
snapshot?
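A sketch of the snapshot variant with btrfs commands (paths are examples):

% btrfs subvolume snapshot /db/trans_0 /db/trans_1   # begin: snapshot the current state
# ... modify the b-tree database file inside /db/trans_1 ...
% btrfs filesystem sync /db                          # commit: make sure it is on disk
% btrfs subvolume delete /db/trans_0                 # then drop the previous state
# a rollback would instead be: btrfs subvolume delete /db/trans_1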


Re: BTRFS and databases

2018-08-02 Thread Martin Steigerwald
Andrei Borzenkov - 02.08.18, 12:35:
> Sent from my iPhone
> 
> > On 2 Aug 2018, at 12:16, Martin Steigerwald 
> > wrote:> 
> > Hugo Mills - 01.08.18, 10:56:
> >>> On Wed, Aug 01, 2018 at 05:45:15AM +0200, MegaBrutal wrote:
> >>> I know it's a decade-old question, but I'd like to hear your
> >>> thoughts
> >>> of today. By now, I became a heavy BTRFS user. Almost everywhere I
> >>> use BTRFS, except in situations when it is obvious there is no
> >>> benefit (e.g. /var/log, /boot). At home, all my desktop, laptop
> >>> and
> >>> server computers are mainly running on BTRFS with only a few file
> >>> systems on ext4. I even installed BTRFS in corporate productive
> >>> systems (in those cases, the systems were mainly on ext4; but
> >>> there
> >>> were some specific file systems those exploited BTRFS features).
> >>> 
> >>> But there is still one question that I can't get over: if you
> >>> store
> >>> a
> >>> database (e.g. MySQL), would you prefer having a BTRFS volume
> >>> mounted
> >>> with nodatacow, or would you just simply use ext4?
> >>> 
> >>   Personally, I'd start with btrfs with autodefrag. It has some
> >> 
> >> degree of I/O overhead, but if the database isn't
> >> performance-critical and already near the limits of the hardware,
> >> it's unlikely to make much difference. Autodefrag should keep the
> >> fragmentation down to a minimum.
> > 
> > I read that autodefrag would only help with small databases.
> 
> I wonder if anyone actually
> 
> a) quantified performance impact
> b) analyzed the cause
> 
> I work with NetApp for a long time and I can say from first hand
> experience that fragmentation had zero impact on OLTP workload. It
> did affect backup performance as was expected, but this could be
> fixed by periodic reallocation (defragmentation).
> 
> And even that needed quite some time to observe (years) on pretty high
>  load database with regular backup and replication snapshots.
> 
> If btrfs is so susceptible to fragmentation, what is the reason for
> it?

At the end of my original mail I mentioned a blog article that also had 
some performance graphs. Did you actually read it?

Thanks,
-- 
Martin




Re: BTRFS and databases

2018-08-02 Thread Martin Steigerwald
Hugo Mills - 01.08.18, 10:56:
> On Wed, Aug 01, 2018 at 05:45:15AM +0200, MegaBrutal wrote:
> > I know it's a decade-old question, but I'd like to hear your
> > thoughts
> > of today. By now, I became a heavy BTRFS user. Almost everywhere I
> > use BTRFS, except in situations when it is obvious there is no
> > benefit (e.g. /var/log, /boot). At home, all my desktop, laptop and
> > server computers are mainly running on BTRFS with only a few file
> > systems on ext4. I even installed BTRFS in corporate productive
> > systems (in those cases, the systems were mainly on ext4; but there
> > were some specific file systems those exploited BTRFS features).
> > 
> > But there is still one question that I can't get over: if you store
> > a
> > database (e.g. MySQL), would you prefer having a BTRFS volume
> > mounted
> > with nodatacow, or would you just simply use ext4?
> 
>Personally, I'd start with btrfs with autodefrag. It has some
> degree of I/O overhead, but if the database isn't performance-critical
> and already near the limits of the hardware, it's unlikely to make
> much difference. Autodefrag should keep the fragmentation down to a
> minimum.

I read that autodefrag would only help with small databases.

I also read that even on SSDs there is a notable performance penalty. A 
4.2 GiB akonadi database for tons of mails appears to work okayish on 
dual SSD BTRFS RAID 1 with LZO compression here. However I have no 
comparison, for example how it would run on XFS. And it is fragmented 
quite a bit, see the example for the largest file of 3 GiB – I know this 
in part is also due to LZO compression.

[…].local/share/akonadi/db_data/akonadi> time /usr/sbin/filefrag 
parttable.ibd
parttable.ibd: 45380 extents found
/usr/sbin/filefrag parttable.ibd  0,00s user 0,86s system 41% cpu 2,054 
total

However it digs out those extents quite fast.

I would not feel comfortable with setting this file to nodatacow.
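For reference, nodatacow is usually applied per directory so that newly 
created files inherit it. A sketch with example paths; it only takes effect 
for files that are created empty inside such a directory:

% mkdir /path/to/db_data_nocow
% chattr +C /path/to/db_data_nocow    # new files created inside inherit nodatacow
% lsattr -d /path/to/db_data_nocow    # shows a 'C' flag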


However I wonder: Is this it? Is there nothing that can be improved in 
BTRFS to handle database and VM files in a better way, without altering 
any default settings?

Is it also an issue on ZFS? ZFS also does copy on write. How does ZFS 
handle this? Can anything be learned from it? I never heard people 
complain about poor database performance on ZFS, but… I don´t use it and 
I am not subscribed to any ZFS mailing lists, so they may have similar 
issues and I just do not know it.

Well there seems to be a performance penalty at least when compared to 
XFS:

About ZFS Performance
Yves Trudeau, May 15, 2018

https://www.percona.com/blog/2018/05/15/about-zfs-performance/

The article described how you can use NVMe devices as cache to mitigate 
the performance impact. That would hint that BTRFS with VFS Hot Data 
Tracking and relocating data to SSD or NVMe devices could be a way to 
set this up.


But as said, I read about bad database performance with BTRFS even on 
SSDs. I cannot find the original reference at the moment, but I found 
this for example, although it is from 2015 (on kernel 4.0, which is a bit 
old):

Friends don't let friends use BTRFS for OLTP
2015/09/16 by Tomas Vondra

https://blog.pgaddict.com/posts/friends-dont-let-friends-use-btrfs-for-oltp

Interestingly it also compares with ZFS which is doing much better. So 
maybe there is really something to be learned from ZFS.

It was not entirely clear to me whether the benchmark was run on an SSD; 
as Tomas notes the "ssd" mount option, it might have been.

Thanks,
-- 
Martin




Re: Healthy amount of free space?

2018-07-17 Thread Martin Steigerwald
Nikolay Borisov - 17.07.18, 10:16:
> On 17.07.2018 11:02, Martin Steigerwald wrote:
> > Nikolay Borisov - 17.07.18, 09:20:
> >> On 16.07.2018 23:58, Wolf wrote:
> >>> Greetings,
> >>> I would like to ask what is a healthy amount of free space to
> >>> keep on each device for btrfs to be happy?
> >>> 
> >>> This is how my disk array currently looks like
> >>> 
> >>> [root@dennas ~]# btrfs fi usage /raid
> >>> 
> >>> Overall:
> >>> Device size:  29.11TiB
> >>> Device allocated: 21.26TiB
> >>> Device unallocated:7.85TiB
> >>> Device missing:  0.00B
> >>> Used: 21.18TiB
> >>> Free (estimated):  3.96TiB  (min: 3.96TiB)
> >>> Data ratio:   2.00
> >>> Metadata ratio:   2.00
> >>> Global reserve:  512.00MiB  (used: 0.00B)
> > 
> > […]
> > 
> >>> Btrfs does quite good job of evenly using space on all devices.
> >>> No,
> >>> how low can I let that go? In other words, with how much space
> >>> free/unallocated remaining space should I consider adding new
> >>> disk?
> >> 
> >> Btrfs will start running into problems when you run out of
> >> unallocated space. So the best advice will be monitor your device
> >> unallocated, once it gets really low - like 2-3 gb I will suggest
> >> you run balance which will try to free up unallocated space by
> >> rewriting data more compactly into sparsely populated block
> >> groups. If after running balance you haven't really freed any
> >> space then you should consider adding a new drive and running
> >> balance to even out the spread of data/metadata.
> > 
> > What are these issues exactly?
> 
> For example if you have plenty of data space but your metadata is full
> then you will be getting ENOSPC.

Of that one I am aware.

This just did not happen so far.

I did not yet add it explicitly to the training slides, but I just make 
myself a note to do that.

Anything else?

> > I have
> > 
> > % btrfs fi us -T /home
> > 
> > Overall:
> > Device size: 340.00GiB
> > Device allocated:340.00GiB
> > Device unallocated:2.00MiB
> > Device missing:  0.00B
> > Used:308.37GiB
> > Free (estimated): 14.65GiB  (min: 14.65GiB)
> > Data ratio:   2.00
> > Metadata ratio:   2.00
> > Global reserve:  512.00MiB  (used: 0.00B)
> > 
> >   Data  Metadata System
> > 
> > Id Path   RAID1 RAID1RAID1Unallocated
> > -- -- -   ---
> > 
> >  1 /dev/mapper/msata-home 165.89GiB  4.08GiB 32.00MiB 1.00MiB
> >  2 /dev/mapper/sata-home  165.89GiB  4.08GiB 32.00MiB 1.00MiB
> > 
> > -- -- -   ---
> > 
> >Total  165.89GiB  4.08GiB 32.00MiB 2.00MiB
> >Used   151.24GiB  2.95GiB 48.00KiB
>
> You already have only 33% of your metadata full so if your workload
> turned out to actually be making more metadata-heavy changed i.e
> snapshots you could exhaust this and get ENOSPC, despite having around
> 14gb of free data space. Furthermore this data space is spread around
> multiple data chunks, depending on how populated they are a balance
> could be able to free up unallocated space which later could be
> re-purposed for metadata (again, depending on what you are doing).

The filesystem above IMO is not fit for snapshots. It would fill up 
rather quickly, I think even when I balance metadata. Actually I tried 
this, and as I remember it took at most a day until it was full.

If I read the above figures correctly, at maximum I could currently gain 
one additional GiB by balancing metadata. That would not make a huge 
difference.

I bet I am already running this filesystem beyond recommendation, as 
many would argue it is too full already for regular usage… I do not 
see the benefit of squeezing the last free space out of it just to fit 
in another GiB.

So I still do not get the point why it would make sense to balance it at 
this point in time. Especially as this 1 GiB I could regain is not even 
needed. And I do not s

Re: Healthy amount of free space?

2018-07-17 Thread Martin Steigerwald
Hi Nikolay.

Nikolay Borisov - 17.07.18, 09:20:
> On 16.07.2018 23:58, Wolf wrote:
> > Greetings,
> > I would like to ask what is a healthy amount of free space to
> > keep on each device for btrfs to be happy?
> > 
> > This is how my disk array currently looks like
> > 
> > [root@dennas ~]# btrfs fi usage /raid
> > 
> > Overall:
> > Device size:  29.11TiB
> > Device allocated: 21.26TiB
> > Device unallocated:7.85TiB
> > Device missing:  0.00B
> > Used: 21.18TiB
> > Free (estimated):  3.96TiB  (min: 3.96TiB)
> > Data ratio:   2.00
> > Metadata ratio:   2.00
> > Global reserve:  512.00MiB  (used: 0.00B)
[…]
> > Btrfs does quite good job of evenly using space on all devices. No,
> > how low can I let that go? In other words, with how much space
> > free/unallocated remaining space should I consider adding new disk?
> 
> Btrfs will start running into problems when you run out of unallocated
> space. So the best advice will be monitor your device unallocated,
> once it gets really low - like 2-3 gb I will suggest you run balance
> which will try to free up unallocated space by rewriting data more
> compactly into sparsely populated block groups. If after running
> balance you haven't really freed any space then you should consider
> adding a new drive and running balance to even out the spread of
> data/metadata.

What are these issues exactly?

I have

% btrfs fi us -T /home
Overall:
Device size: 340.00GiB
Device allocated:340.00GiB
Device unallocated:2.00MiB
Device missing:  0.00B
Used:308.37GiB
Free (estimated): 14.65GiB  (min: 14.65GiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)

  Data  Metadata System  
Id Path   RAID1 RAID1RAID1Unallocated
-- -- -   ---
 1 /dev/mapper/msata-home 165.89GiB  4.08GiB 32.00MiB 1.00MiB
 2 /dev/mapper/sata-home  165.89GiB  4.08GiB 32.00MiB 1.00MiB
-- -- -   ---
   Total  165.89GiB  4.08GiB 32.00MiB 2.00MiB
   Used   151.24GiB  2.95GiB 48.00KiB

on a RAID-1 filesystem to which one, part of the time two, Plasma desktops + 
KDEPIM and Akonadi + Baloo desktop search + you name it write like 
mad.

Since kernel 4.5 or 4.6 this simply works. Before that, sometimes BTRFS 
crawled to a halt when searching for free blocks, and I had to switch off 
the laptop uncleanly. If that happened, a balance helped for a while. 
But since 4.5 or 4.6 this did not happen anymore.

I found that with SLES 12 SP 3 or so there is btrfsmaintenance running a 
balance weekly. That created an issue on our Proxmox + Ceph on Intel 
NUC based open source demo lab. This is for sure no recommended 
configuration for Ceph, and Ceph is quite slow on these 2.5 inch 
harddisks and a 1 GBit network link, despite the (albeit somewhat 
minimal) 5 GiB of m.2 SSD caching. What happened is that the VM crawled 
to a halt and the kernel gave "task hung for more than 120 seconds" 
messages. The VM was basically unusable during the balance. Sure, that 
should not happen with a "proper" setup, but it also did not happen 
without the automatic balance.

Also, what would happen on a hypervisor setup with several thousand 
BTRFS VMs, when several hundred of them decide to start a balance at 
a similar time? It could probably bring the underlying I/O system to a halt, 
as many enterprise storage systems are designed to sustain burst I/O 
loads, but not maximum utilization over an extended period of time.

I am really wondering what to recommend in my Linux performance tuning 
and analysis courses. On my own laptop I do not do regular balances so 
far. Due to my thinking: If it is not broken, do not fix it.
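For reference, the weekly btrfsmaintenance job boils down to filtered 
balances along these lines (thresholds and mount point are examples):

% btrfs balance start -dusage=25 /mountpoint    # rewrite only data block groups at most 25% used
% btrfs balance start -musage=25 /mountpoint    # same idea for metadata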

My personal opinion here also is: If the filesystem degrades so much 
that it becomes unusable without regular maintenance from user space, 
the filesystem needs to be fixed. Ideally I would not have to worry 
about whether to regularly balance a BTRFS or not. In other words: I should 
not have to visit a performance analysis and tuning course in order to 
use a computer with a BTRFS filesystem.

Thanks,
-- 
Martin




Re: Transaction aborted (error -28) btrfs_run_delayed_refs*0x163/0x190

2018-07-10 Thread Martin Raiber
On 10.07.2018 09:04 Pete wrote:
> I've just had the error in the subject which caused the file system to
> go read-only.
>
> Further part of error message:
> WARNING: CPU: 14 PID: 1351 at fs/btrfs/extent-tree.c:3076
> btrfs_run_delayed_refs*0x163/0x190
>
> 'Screenshot' here:
> https://drive.google.com/file/d/1qw7TE1bec8BKcmffrOmg2LS15IOq8Jwc/view?usp=sharing
>
> The kernel is 4.17.4.  There are three hard drives in the file system.
> dmcrypt (luks) is used between btrfs and the disks.
This is probably a known issue. See
https://www.spinics.net/lists/linux-btrfs/msg75647.html
You could apply the patch in this thread and mount with enospc_debug to
confirm it is the same issue.
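A sketch of that debugging step (the mount point is a placeholder):

% mount -o remount,enospc_debug /mountpoint    # print extra space_info details when ENOSPC hits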


Re: [PATCH v2 1/2] btrfs: Check each block group has corresponding chunk at mount time

2018-07-03 Thread Martin Steigerwald
Nikolay Borisov - 03.07.18, 11:08:
> On  3.07.2018 11:47, Qu Wenruo wrote:
> > On 2018-07-03 16:33, Nikolay Borisov wrote:
> >> On  3.07.2018 11:08, Qu Wenruo wrote:
> >>> Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199837, if
> >>> a
> >>> crafted btrfs with incorrect chunk<->block group mapping, it could
> >>> leads to a lot of unexpected behavior.
> >>> 
> >>> Although the crafted image can be catched by block group item
> >>> checker
> >>> added in "[PATCH] btrfs: tree-checker: Verify block_group_item",
> >>> if one crafted a valid enough block group item which can pass
> >>> above check but still mismatch with existing chunk, it could
> >>> cause a lot of undefined behavior.
> >>> 
> >>> This patch will add extra block group -> chunk mapping check, to
> >>> ensure we have a completely matching (start, len, flags) chunk
> >>> for each block group at mount time.
> >>> 
> >>> Reported-by: Xu Wen 
> >>> Signed-off-by: Qu Wenruo 
> >>> ---
> >>> changelog:
> >>> 
> >>> v2:
> >>>   Add better error message for each mismatch case.
> >>>   Rename function name, to co-operate with later patch.
> >>>   Add flags mismatch check.
> >>> 
> >>> ---
> >> 
> >> It's getting really hard to keep track of the various validation
> >> patches you sent with multiple versions + new checks. Please batch
> >> everything in a topic series i.e "Making checks stricter" or some
> >> such and send everything again nicely packed, otherwise the risk
> >> of mis-merging is increased.
> > 
> > Indeed, I'll send the branch and push it to github.
> > 
> >> I now see that Gu Jinxiang from fujitsu also started sending
> >> validation fixes.
> > 
> > No need to worry, that will be the only patch related to that thread
> > of bugzilla from Fujitsu.
> > As all the other cases can be addressed by my patches, sorry Fujitsu
> > guys :)> 
> >> Also for evry patch which fixes a specific issue from one of the
> >> reported on bugzilla.kernel.org just use the Link: tag to point to
> >> the original report on bugzilla that will make it easier to relate
> >> the fixes to the original report.
> > 
> > Never heard of "Link:" tag.
> > Maybe it's a good idea to added it to "submitting-patches.rst"?
> 
> I guess it's not officially documented but if you do git log --grep
> "Link:" you'd see quite a lot of patches actually have a Link pointing
> to the original thread if it has sparked some pertinent discussion.
> In this case those patches are a direct result of a bugzilla
> bugreport so having a Link: tag makes sense.

For Bugzilla reports I saw something like

Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=43511

in a patch I was Cc´d on.

Of course that only applies if the patch in question fixes the 
reported bug.

> In the example of the qgroup patch I sent yesterday resulting from
> Misono's report there was also an involved discussion hence I added a
> link to the original thread.
[…]
-- 
Martin




Re: "decompress failed" in 1-2 files always causes kernel oops, check/scrub pass

2018-05-12 Thread Martin Steigerwald
Hey James.

james harvey - 12.05.18, 07:08:
> 100% reproducible, booting from disk, or even Arch installation ISO.
> Kernel 4.16.7.  btrfs-progs v4.16.
> 
> Reading one of two journalctl files causes a kernel oops.  Initially
> ran into it from "journalctl --list-boots", but cat'ing the file does
> it too.  I believe this shows there's compressed data that is invalid,
> but its btrfs checksum is valid.  I've cat'ed every file on the
> disk, and luckily have the problems narrowed down to only these 2
> files in /var/log/journal.
> 
> This volume has always been mounted with lzo compression.
> 
> scrub has never found anything, and have ran it since the oops.
> 
> Found a user a few years ago who also ran into this, without
> resolution, at:
> https://www.spinics.net/lists/linux-btrfs/msg52218.html
> 
> 1. Cat'ing a (non-essential) file shouldn't be able to bring down the
> system.
> 
> 2. If this is infact invalid compressed data, there should be a way to
> check for that.  Btrfs check and scrub pass.

I think systemd-journald sets those files to nocow on BTRFS in order to 
reduce fragmentation: that means no checksums, no snapshots, no nothing. 
I just removed /var/log/journal and thus disabled journalling to disk. 
It's sufficient for me to have the recent state in /run/journal.

Can you confirm nocow being set via lsattr on those files?
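For reference, a quick way to check; the path pattern assumes the usual 
persistent journal layout:

% lsattr /var/log/journal/*/*.journal
# a 'C' in the attribute column means nodatacow, which also disables data checksums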

Still they should be decompressible just fine.

> Hardware is fine.  Passes memtest86+ in SMP mode.  Works fine on all
> other files.
> 
> 
> 
> [  381.869940] BUG: unable to handle kernel paging request at
> 00390e50 [  381.870881] BTRFS: decompress failed
[…]
-- 
Martin




Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs cont. [Was: Re: metadata_ratio mount option?]

2018-05-08 Thread Martin Svec
Hello Chris,

Dne 7.5.2018 v 18:37 Chris Mason napsal(a):
>
>
> On 7 May 2018, at 12:16, Martin Svec wrote:
>
>> Hello Chris,
>>
>> Dne 7.5.2018 v 16:49 Chris Mason napsal(a):
>>> On 7 May 2018, at 7:40, Martin Svec wrote:
>>>
>>>> Hi,
>>>>
>>>> According to man btrfs [1], I assume that metadata_ratio=1 mount option 
>>>> should
>>>> force allocation of one metadata chunk after every allocated data chunk. 
>>>> However,
>>>> when I set this option and start filling btrfs with "dd if=/dev/zero 
>>>> of=dummyfile.dat",
>>>> only data chunks are allocated but no metadata ones. So, how does the 
>>>> metadata_ratio
>>>> option really work?
>>>>
>>>> Note that I'm trying to use this option as a workaround of the bug 
>>>> reported here:
>>>>
>>>
>>> [ urls that FB email server eats, sorry ]
>>
>> It's link to "Btrfs remounted read-only due to ENOSPC in 
>> btrfs_run_delayed_refs" thread :)
>
> Oh yeah, the link worked fine, it just goes through this url defense monster 
> that munges it in
> replies.
>
>>
>>>
>>>>
>>>> i.e. I want to manually preallocate metadata chunks to avoid nightly 
>>>> ENOSPC errors.
>>>
>>>
>>> metadata_ratio is almost but not quite what you want.  It sets a flag on 
>>> the space_info to force a
>>> chunk allocation the next time we decide to call should_alloc_chunk().  
>>> Thanks to the overcommit
>>> code, we usually don't call that until the metadata we think we're going to 
>>> need is bigger than
>>> the metadata space available.  In other words, by the time we're into the 
>>> code that honors the
>>> force flag, reservations are already high enough to make us allocate the 
>>> chunk anyway.
>>
>> Yeah, that's how I understood the code. So I think metadata_ratio man 
>> section is quite confusing
>> because it implies that btrfs guarantees given metadata to data chunk space 
>> ratio, which isn't true.
>>
>>>
>>> I tried to use metadata_ratio to experiment with forcing more metadata slop 
>>> space, but really I
>>> have to tweak the overcommit code first.
>>> Omar beat me to a better solution, tracking down our transient ENOSPC 
>>> problems here at FB to
>>> reservations done for orphans.  Do you have a lot of deleted files still 
>>> being held open?  lsof
>>> /mntpoint | grep deleted will list them.
>>
>> I'll take a look during backup window. The initial bug report describes our 
>> rsync workload in
>> detail, for your reference. 

No, there are no trailing deleted files during backup. However, I noticed 
something interesting in the strace output: rsync does an ftruncate() on every 
transferred file before closing it. In 99.9% of cases the file is truncated to 
its own size, so it should be a no-op. But these ftruncates are by far the 
slowest syscalls according to strace timing, and btrfs_truncate() describes 
itself in a comment as "indeed ugly". Could it be the root cause of the global 
reservation pressure?
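For reference, the timing observation came from strace; something along 
these lines would reproduce it (the rsync invocation and paths are just 
examples, not our actual backup job):

% strace -f -T -e trace=ftruncate rsync -a /source/ /backup/ 2>&1 | tail -n 20   # per-call timing
% strace -c -f -e trace=ftruncate rsync -a /source/ /backup/                     # aggregate summary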

I've found this patch from Filipe (Cc'd): 
https://patchwork.kernel.org/patch/10205013/. Should I
apply it to our 4.14.y kernel and try the impact on intensive rsync workloads?

Thank you
Martin




Re: metadata_ratio mount option?

2018-05-07 Thread Martin Svec
Hello Chris,

Dne 7.5.2018 v 16:49 Chris Mason napsal(a):
> On 7 May 2018, at 7:40, Martin Svec wrote:
>
>> Hi,
>>
>> According to man btrfs [1], I assume that metadata_ratio=1 mount option 
>> should
>> force allocation of one metadata chunk after every allocated data chunk. 
>> However,
>> when I set this option and start filling btrfs with "dd if=/dev/zero 
>> of=dummyfile.dat",
>> only data chunks are allocated but no metadata ones. So, how does the 
>> metadata_ratio
>> option really work?
>>
>> Note that I'm trying to use this option as a workaround of the bug reported 
>> here:
>>
>
> [ urls that FB email server eats, sorry ]

It's link to "Btrfs remounted read-only due to ENOSPC in 
btrfs_run_delayed_refs" thread :)

>
>>
>> i.e. I want to manually preallocate metadata chunks to avoid nightly ENOSPC 
>> errors.
>
>
> metadata_ratio is almost but not quite what you want.  It sets a flag on the 
> space_info to force a
> chunk allocation the next time we decide to call should_alloc_chunk().  
> Thanks to the overcommit
> code, we usually don't call that until the metadata we think we're going to 
> need is bigger than
> the metadata space available.  In other words, by the time we're into the 
> code that honors the
> force flag, reservations are already high enough to make us allocate the 
> chunk anyway.

Yeah, that's how I understood the code. So I think the metadata_ratio man 
section is quite confusing because it implies that btrfs guarantees a given 
metadata-to-data chunk space ratio, which isn't true.

>
> I tried to use metadata_ratio to experiment with forcing more metadata slop 
> space, but really I
> have to tweak the overcommit code first.
> Omar beat me to a better solution, tracking down our transient ENOSPC 
> problems here at FB to
> reservations done for orphans.  Do you have a lot of deleted files still 
> being held open?  lsof
> /mntpoint | grep deleted will list them.

I'll take a look during backup window. The initial bug report describes our 
rsync workload in
detail, for your reference.

>
> We're working through a patch for the orphans here.  You've got a ton of 
> bytes pinned, which isn't
> a great match for the symptoms we see:
>
> [285169.096630] BTRFS info (device sdb): space_info 4 has 
> 18446744072120172544 free, is not full
> [285169.096633] BTRFS info (device sdb): space_info total=273804165120, 
> used=269218267136,
> pinned=3459629056, reserved=52396032, may_use=2663120896, readonly=131072
>
> But, your may_use count is high enough that you might be hitting this 
> problem.  Otherwise I'll
> work out a patch to make some more metadata chunks while Josef is perfecting 
> his great delayed ref
> update.

As mentioned in the bug report, we have a custom patch that dedicates SSDs for 
metadata chunks and
HDDs for data chunks. So, all we need is to preallocate metadata chunks to 
occupy all of the SSD
space and our issues will be gone.
Note that btrfs with SSD-backed metadata works absolutely great for rsync 
backups, even if there're
billions of files and thousands of snapshots. The global reservation ENOSPC is 
the last issue we're
struggling with.

Thank you

Martin




metadata_ratio mount option?

2018-05-07 Thread Martin Svec
Hi,

According to man btrfs [1], I assume that metadata_ratio=1 mount option should
force allocation of one metadata chunk after every allocated data chunk. 
However,
when I set this option and start filling btrfs with "dd if=/dev/zero 
of=dummyfile.dat",
only data chunks are allocated but no metadata ones. So, how does the 
metadata_ratio
option really work?
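For reference, the test was essentially the following (a sketch; device, 
mount point and size are examples):

% mount -o metadata_ratio=1 /dev/sdX /mnt
% dd if=/dev/zero of=/mnt/dummyfile.dat bs=1M count=20480
% btrfs filesystem usage /mnt    # watch whether metadata chunks get allocated alongside data chunks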

Note that I'm trying to use this option as a workaround of the bug reported 
here: 

https://www.spinics.net/lists/linux-btrfs/msg75104.html

i.e. I want to manually preallocate metadata chunks to avoid nightly ENOSPC 
errors.

Best regards.

Martin


[1] https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs(5)#MOUNT_OPTIONS





Re: extent-tree.c no space left (4.9.77 + 4.16.2)

2018-04-21 Thread Martin Svec
Hi David,

this looks like the bug that I already reported two times:

https://www.spinics.net/lists/linux-btrfs/msg54394.html
https://www.spinics.net/lists/linux-btrfs/msg75104.html

The second thread contains Nikolay's debug patch that can confirm if you run 
out of global metadata
reservations too.

Martin

Dne 21.4.2018 v 9:38 David Goodwin napsal(a):
> Hi,
>
> I'm running a 3TiB EBS based (2+1TiB devices) volume in EC2 which contains 
> about 500 read-only
> snapshots.
>
> btrfs-progs v4.7.3
>
> There are two dmesg trace things below. The first one from a 4.9.77 kernel -
>
> [ cut here ]
> BTRFS: error (device xvdg) in btrfs_run_delayed_refs:2967: errno=-28 No space 
> left
> BTRFS info (device xvdg): forced readonlyApr 19 11:44:40 gateway1 kernel: 
> [7648104.300115]
> WARNING: CPU: 2 PID: 963 at fs/btrfs/extent-tree.c:2967 
> btrfs_run_delayed_refs+0x27e/0x2b0
> [btrfs]Apr 19 11:44:40 gateway1 kernel: [7648104.313268] BTRFS: Transaction 
> aborted (error -28)
> Modules linked in: dm_mod nfsv3 ipt_REJECT nf_reject_ipv4 ipt_MASQUERADE 
> nf_nat_masquerade_ipv4
> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat_ftp 
> nf_conntrack_ftp nf_nat
> nf_conntrack xt_mu
> nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc evdev 
> intel_rapl
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_pcsp snd_pcm 
> aesni_intel aes_x86_64 lrw
> gf128mul glue_helper snd_timer ablk_helper snd cryptd soundcore ext4 crc16 
> jbd2 mbcache btrfs xor
> raid6_pq xen_netfront xen_blkfront crc32c_intel
> CPU: 2 PID: 963 Comm: btrfs-transacti Not tainted 4.9.77-dg1 #1Apr 19 
> 11:44:40 gateway1 kernel:
> [7648104.408561]   812f17a4 c90043203d08 
> 
>  8107389e a0157d5a c90043203d58 8802ccfd7170
>  880394684800 880394684800 0007315c 8107390f
> Call Trace:
>  [] ? dump_stack+0x5c/0x78
>  [] ? __warn+0xbe/0xe0
>  [] ? warn_slowpath_fmt+0x4f/0x60
>  [] ? btrfs_run_delayed_refs+0x27e/0x2b0 [btrfs]
>  [] ? btrfs_release_path+0x13/0x80 [btrfs]
>  [] ? btrfs_start_dirty_block_groups+0x2c2/0x450 [btrfs]
>  [] ? btrfs_commit_transaction+0x14c/0xa30 [btrfs]
>  [] ? start_transaction+0x96/0x480 [btrfs]
>  [] ? transaction_kthread+0x1dc/0x200 [btrfs]
>  [] ? btrfs_cleanup_transaction+0x550/0x550 [btrfs]
>  [] ? kthread+0xc7/0xe0
>  [] ? kthread_park+0x60/0x60
>  [] ? ret_from_fork+0x54/0x60
> ---[ end trace 69ca1332d91b4310 ]---
> BTRFS: error (device xvdg) in btrfs_run_delayed_refs:2967: errno=-28 No space 
> left
> BTRFS error (device xvdg): parent transid verify failed on 5400398217216 
> wanted 1893543 found 1893366
>
>
>
> On checking btrfs fi us there was plenty of unallocated space left.
>
> % btrfs fi us /broken/
>
> Overall:
>     Device size:   3.06TiB
>     Device allocated:   2.43TiB
>     Device unallocated: 643.09GiB
>     Device missing: 0.00B
>     Used:   2.43TiB
>     Free (estimated): 646.41GiB    (min: 646.41GiB)
>     Data ratio:  1.00
>     Metadata ratio:  1.00
>     Global reserve: 512.00MiB    (used: 0.00B)
>
> 
>
> The VM was then rebooted with a 4.16.2 kernel, which encountered what I 
> assume is the same problem:
>
>
> [ cut here ]
> BTRFS: Transaction aborted (error -28)
> WARNING: CPU: 2 PID: 981 at fs/btrfs/extent-tree.c:6990 
> __btrfs_free_extent.isra.63+0x3d2/0xd20
> [btrfs]
> Modules linked in: nfsv3 ipt_REJECT nf_reject_ipv4 ipt_MASQUERADE 
> nf_nat_masquerade_ipv4
> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat_ftp 
> nf_conntrack_ftp nf_nat
> nf_conntrack libcrc32c crc32c_generic xt_multiport iptable_filter ip_tables 
> x_tables autofs4 nfsd
> auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc intel_rapl 
> crct10dif_pclmul crc32_pclmul
> ghash_clmulni_intel evdev pcbc snd_pcsp aesni_intel snd_pcm aes_x86_64 
> snd_timer crypto_simd
> glue_helper snd cryptd soundcore ext4 crc16 mbcache jbd2 btrfs xor 
> zstd_decompress zstd_compress
> xxhash raid6_pq xen_netfront xen_blkfront crc32c_intel
> CPU: 2 PID: 981 Comm: btrfs-transacti Not tainted 4.16.2-dg1 #1
> RIP: e030:__btrfs_free_extent.isra.63+0x3d2/0xd20 [btrfs]
> RSP: e02b:c900428d7c68 EFLAGS: 00010292
> RAX: 0026 RBX: 01fb8031c000 RCX: 0006
> RDX: 0007 RSI: 0001 RDI: 88039a916650
> RBP: ffe4 R08: 0001 R09: 010a
> R10: 0001 R11: 010a R12: 8803957e6000
> R13: 88036f5a9e70 R14:  R15:

Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-13 Thread Martin Svec
Dne 10.3.2018 v 15:51 Martin Svec napsal(a):
> Dne 10.3.2018 v 13:13 Nikolay Borisov napsal(a):
>> 
>>
>>>>> And then report back on the output of the extra debug 
>>>>> statements. 
>>>>>
>>>>> Your global rsv is essentially unused, this means 
>>>>> in the worst case the code should fallback to using the global rsv
>>>>> for satisfying the memory allocation for delayed refs. So we should
>>>>> figure out why this isn't' happening. 
>>>> Patch applied. Thank you very much, Nikolay. I'll let you know as soon as 
>>>> we hit ENOSPC again.
>>> There is the output:
>>>
>>> [24672.573075] BTRFS info (device sdb): space_info 4 has 
>>> 18446744072971649024 free, is not full
>>> [24672.573077] BTRFS info (device sdb): space_info total=308163903488, 
>>> used=304593289216, pinned=2321940480, reserved=174800896, 
>>> may_use=1811644416, readonly=131072
>>> [24672.573079] use_block_rsv: Not using global blockrsv! Current 
>>> blockrsv->type = 1 blockrsv->space_info = 999a57db7000 
>>> global_rsv->space_info = 999a57db7000
>>> [24672.573083] BTRFS: Transaction aborted (error -28)
>> Bummer, so you are indeed running out of global space reservations in
>> context which can't really use any other reservation type, thus the
>> ENOSPC. Was the stacktrace again during processing of running delayed refs?
> Yes, the stacktrace is below.
>
> [24672.573132] WARNING: CPU: 3 PID: 808 at fs/btrfs/extent-tree.c:3089 
> btrfs_run_delayed_refs+0x259/0x270 [btrfs]
> [24672.573132] Modules linked in: binfmt_misc xt_comment xt_tcpudp 
> iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw 
> ip6table_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
> nf_nat nf_conntrack ip6table_mangle ip6table_raw ip6_tables iptable_mangle 
> intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel pcbc aesni_intel snd_pcm aes_x86_64 snd_timer crypto_simd 
> glue_helper snd cryptd soundcore iTCO_wdt intel_cstate joydev 
> iTCO_vendor_support pcspkr dcdbas intel_uncore sg serio_raw evdev lpc_ich 
> mgag200 ttm drm_kms_helper drm i2c_algo_bit shpchp mfd_core i7core_edac 
> ipmi_si ipmi_devintf acpi_power_meter ipmi_msghandler button acpi_cpufreq 
> ip_tables x_tables autofs4 xfs libcrc32c crc32c_generic btrfs xor 
> zstd_decompress zstd_compress
> [24672.573161]  xxhash hid_generic usbhid hid raid6_pq sd_mod crc32c_intel 
> psmouse uhci_hcd ehci_pci ehci_hcd megaraid_sas usbcore scsi_mod bnx2
> [24672.573170] CPU: 3 PID: 808 Comm: btrfs-transacti Tainted: GW I
>  4.14.23-znr8+ #73
> [24672.573171] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.6.3 
> 02/01/2011
> [24672.573172] task: 999a23229140 task.stack: a85642094000
> [24672.573186] RIP: 0010:btrfs_run_delayed_refs+0x259/0x270 [btrfs]
> [24672.573187] RSP: 0018:a85642097de0 EFLAGS: 00010282
> [24672.573188] RAX: 0026 RBX: 99975c75c3c0 RCX: 
> 0006
> [24672.573189] RDX:  RSI: 0082 RDI: 
> 999a6fcd66f0
> [24672.573190] RBP: 95c24d68 R08: 0001 R09: 
> 0479
> [24672.573190] R10: 99974b1960e0 R11: 0479 R12: 
> 999a5a65
> [24672.573191] R13: 999a5a6511f0 R14:  R15: 
> 
> [24672.573192] FS:  () GS:999a6fcc() 
> knlGS:
> [24672.573193] CS:  0010 DS:  ES:  CR0: 80050033
> [24672.573194] CR2: 558bfd56dfd0 CR3: 00030a60a005 CR4: 
> 000206e0
> [24672.573195] Call Trace:
> [24672.573215]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
> [24672.573231]  ? start_transaction+0x89/0x410 [btrfs]
> [24672.573246]  transaction_kthread+0x195/0x1b0 [btrfs]
> [24672.573249]  kthread+0xfc/0x130
> [24672.573265]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
> [24672.573266]  ? kthread_create_on_node+0x70/0x70
> [24672.573269]  ret_from_fork+0x35/0x40
> [24672.573270] Code: c7 c6 20 e8 37 c0 48 89 df 44 89 04 24 e8 59 bc 09 00 44 
> 8b 04 24 eb 86 44 89 c6 48 c7 c7 30 58 38 c0 44 89 04 24 e8 82 30 3f cf <0f> 
> 0b 44 8b 04 24 eb c4 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00
> [24672.573292] ---[ end trace b17d927a946cb02e ]---
>
>

Again, another ENOSPC due to lack of global rsv space in the context of d

Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-10 Thread Martin Svec
Dne 10.3.2018 v 13:13 Nikolay Borisov napsal(a):
>
> 
>
 And then report back on the output of the extra debug 
 statements. 

 Your global rsv is essentially unused, this means 
 in the worst case the code should fall back to using the global rsv
 for satisfying the memory allocation for delayed refs. So we should
 figure out why this isn't happening. 
>>> Patch applied. Thank you very much, Nikolay. I'll let you know as soon as 
>>> we hit ENOSPC again.
>> There is the output:
>>
>> [24672.573075] BTRFS info (device sdb): space_info 4 has 
>> 18446744072971649024 free, is not full
>> [24672.573077] BTRFS info (device sdb): space_info total=308163903488, 
>> used=304593289216, pinned=2321940480, reserved=174800896, 
>> may_use=1811644416, readonly=131072
>> [24672.573079] use_block_rsv: Not using global blockrsv! Current 
>> blockrsv->type = 1 blockrsv->space_info = 999a57db7000 
>> global_rsv->space_info = 999a57db7000
>> [24672.573083] BTRFS: Transaction aborted (error -28)
> Bummer, so you are indeed running out of global space reservations in
> context which can't really use any other reservation type, thus the
> ENOSPC. Was the stacktrace again during processing of running delayed refs?

Yes, the stacktrace is below.

[24672.573132] WARNING: CPU: 3 PID: 808 at fs/btrfs/extent-tree.c:3089 
btrfs_run_delayed_refs+0x259/0x270 [btrfs]
[24672.573132] Modules linked in: binfmt_misc xt_comment xt_tcpudp 
iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw 
ip6table_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntrack ip6table_mangle ip6table_raw ip6_tables iptable_mangle 
intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel pcbc aesni_intel snd_pcm aes_x86_64 snd_timer crypto_simd 
glue_helper snd cryptd soundcore iTCO_wdt intel_cstate joydev 
iTCO_vendor_support pcspkr dcdbas intel_uncore sg serio_raw evdev lpc_ich 
mgag200 ttm drm_kms_helper drm i2c_algo_bit shpchp mfd_core i7core_edac ipmi_si 
ipmi_devintf acpi_power_meter ipmi_msghandler button acpi_cpufreq ip_tables 
x_tables autofs4 xfs libcrc32c crc32c_generic btrfs xor zstd_decompress 
zstd_compress
[24672.573161]  xxhash hid_generic usbhid hid raid6_pq sd_mod crc32c_intel 
psmouse uhci_hcd ehci_pci ehci_hcd megaraid_sas usbcore scsi_mod bnx2
[24672.573170] CPU: 3 PID: 808 Comm: btrfs-transacti Tainted: GW I 
4.14.23-znr8+ #73
[24672.573171] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.6.3 
02/01/2011
[24672.573172] task: 999a23229140 task.stack: a85642094000
[24672.573186] RIP: 0010:btrfs_run_delayed_refs+0x259/0x270 [btrfs]
[24672.573187] RSP: 0018:a85642097de0 EFLAGS: 00010282
[24672.573188] RAX: 0026 RBX: 99975c75c3c0 RCX: 0006
[24672.573189] RDX:  RSI: 0082 RDI: 999a6fcd66f0
[24672.573190] RBP: 95c24d68 R08: 0001 R09: 0479
[24672.573190] R10: 99974b1960e0 R11: 0479 R12: 999a5a65
[24672.573191] R13: 999a5a6511f0 R14:  R15: 
[24672.573192] FS:  () GS:999a6fcc() 
knlGS:
[24672.573193] CS:  0010 DS:  ES:  CR0: 80050033
[24672.573194] CR2: 558bfd56dfd0 CR3: 00030a60a005 CR4: 000206e0
[24672.573195] Call Trace:
[24672.573215]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
[24672.573231]  ? start_transaction+0x89/0x410 [btrfs]
[24672.573246]  transaction_kthread+0x195/0x1b0 [btrfs]
[24672.573249]  kthread+0xfc/0x130
[24672.573265]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[24672.573266]  ? kthread_create_on_node+0x70/0x70
[24672.573269]  ret_from_fork+0x35/0x40
[24672.573270] Code: c7 c6 20 e8 37 c0 48 89 df 44 89 04 24 e8 59 bc 09 00 44 
8b 04 24 eb 86 44 89 c6 48 c7 c7 30 58 38 c0 44 89 04 24 e8 82 30 3f cf <0f> 0b 
44 8b 04 24 eb c4 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00
[24672.573292] ---[ end trace b17d927a946cb02e ]---




Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-10 Thread Martin Svec
Dne 9.3.2018 v 20:03 Martin Svec napsal(a):
> Dne 9.3.2018 v 17:36 Nikolay Borisov napsal(a):
>> On 23.02.2018 16:28, Martin Svec wrote:
>>> Hello,
>>>
>>> we have a btrfs-based backup system using btrfs snapshots and rsync. 
>>> Sometimes,
>>> we hit ENOSPC bug and the filesystem is remounted read-only. However, 
>>> there's 
>>> still plenty of unallocated space according to "btrfs fi usage". So I think 
>>> this
>>> isn't another edge condition when btrfs runs out of space due to fragmented 
>>> chunks,
>>> but a bug in disk space allocation code. It suffices to umount the 
>>> filesystem and
>>> remount it back and it works fine again. The frequency of ENOSPC seems to be
>>> dependent on metadata chunks usage. When there's a lot of free space in 
>>> existing
>>> metadata chunks, the bug doesn't happen for months. If most metadata chunks 
>>> are
>>> above ~98%, we hit the bug every few days. Below are details regarding the 
>>> backup
>>> server and btrfs.
>>>
>>> The backup works as follows: 
>>>
>>>   * Every night, we create a btrfs snapshot on the backup server and rsync 
>>> data
>>> from a production server into it. This snapshot is then marked 
>>> read-only and
>>> will be used as a base subvolume for the next backup snapshot.
>>>   * Every day, expired snapshots are removed and their space is freed. 
>>> Cleanup
>>> is scheduled in such a way that it doesn't interfere with the backup 
>>> window.
>>>   * Multiple production servers are backed up in parallel to one backup 
>>> server.
>>>   * The backed up servers are mostly webhosting servers and mail servers, 
>>> i.e.
>>> hundreds of billions of small files. (Yes, we push btrfs to the limits 
>>> :-))
>>>   * Backup server contains ~1080 snapshots, Zlib compression is enabled.
>>>   * Rsync is configured to use whole file copying.
>>>
>>> System configuration:
>>>
>>> Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch 
>>> (see below) and
>>> Nikolay's patch 1b816c23e9 (btrfs: Add enospc_debug printing in 
>>> metadata_reserve_bytes)
>>>
>>> btrfs mount options: 
>>> noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15
>>>
>>> $ btrfs fi df /backup:
>>>
>>> Data, single: total=28.05TiB, used=26.37TiB
>>> System, single: total=32.00MiB, used=3.53MiB
>>> Metadata, single: total=255.00GiB, used=250.73GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>
>>> $ btrfs fi show /backup:
>>>
>>> Label: none  uuid: a52501a9-651c-4712-a76b-7b4238cfff63
>>> Total devices 2 FS bytes used 26.62TiB
>>> devid1 size 416.62GiB used 255.03GiB path /dev/sdb
>>> devid2 size 36.38TiB used 28.05TiB path /dev/sdc
>>>
>>> $ btrfs fi usage /backup:
>>>
>>> Overall:
>>> Device size:  36.79TiB
>>> Device allocated: 28.30TiB
>>> Device unallocated:8.49TiB
>>> Device missing:  0.00B
>>> Used: 26.62TiB
>>> Free (estimated): 10.17TiB  (min: 10.17TiB)
>>> Data ratio:   1.00
>>> Metadata ratio:   1.00
>>> Global reserve:  512.00MiB  (used: 0.00B)
>>>
>>> Data,single: Size:28.05TiB, Used:26.37TiB
>>>/dev/sdc   28.05TiB
>>>
>>> Metadata,single: Size:255.00GiB, Used:250.73GiB
>>>/dev/sdb  255.00GiB
>>>
>>> System,single: Size:32.00MiB, Used:3.53MiB
>>>/dev/sdb   32.00MiB
>>>
>>> Unallocated:
>>>/dev/sdb  161.59GiB
>>>/dev/sdc8.33TiB
>>>
>>> Btrfs filesystem uses two logical drives in single mode, backed by
>>> hardware RAID controller PERC H710; /dev/sdb is HW RAID1 consisting
>>> of two SATA SSDs and /dev/sdc is HW RAID6 SATA volume.
>>>
>>> Please note that we have a simple custom patch in btrfs which ensures
>>> that metadata chunks are allocated preferably on SSD volume and data
>>> chunks are allocated only on SATA volume. The patch slightly modifies
>>> __btrfs_alloc_chunk() so that its loop over devices ignores rotating
>>> dev

Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-09 Thread Martin Svec
Dne 9.3.2018 v 17:36 Nikolay Borisov napsal(a):
>
> On 23.02.2018 16:28, Martin Svec wrote:
>> Hello,
>>
>> we have a btrfs-based backup system using btrfs snapshots and rsync. 
>> Sometimes,
>> we hit ENOSPC bug and the filesystem is remounted read-only. However, 
>> there's 
>> still plenty of unallocated space according to "btrfs fi usage". So I think 
>> this
>> isn't another edge condition when btrfs runs out of space due to fragmented 
>> chunks,
>> but a bug in disk space allocation code. It suffices to umount the 
>> filesystem and
>> remount it back and it works fine again. The frequency of ENOSPC seems to be
>> dependent on metadata chunks usage. When there's a lot of free space in 
>> existing
>> metadata chunks, the bug doesn't happen for months. If most metadata chunks 
>> are
>> above ~98%, we hit the bug every few days. Below are details regarding the 
>> backup
>> server and btrfs.
>>
>> The backup works as follows: 
>>
>>   * Every night, we create a btrfs snapshot on the backup server and rsync 
>> data
>> from a production server into it. This snapshot is then marked read-only 
>> and
>> will be used as a base subvolume for the next backup snapshot.
>>   * Every day, expired snapshots are removed and their space is freed. 
>> Cleanup
>> is scheduled in such a way that it doesn't interfere with the backup 
>> window.
>>   * Multiple production servers are backed up in parallel to one backup 
>> server.
>>   * The backed up servers are mostly webhosting servers and mail servers, 
>> i.e.
>> hundreds of billions of small files. (Yes, we push btrfs to the limits 
>> :-))
>>   * Backup server contains ~1080 snapshots, Zlib compression is enabled.
>>   * Rsync is configured to use whole file copying.
>>
>> System configuration:
>>
>> Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch 
>> (see below) and
>> Nikolay's patch 1b816c23e9 (btrfs: Add enospc_debug printing in 
>> metadata_reserve_bytes)
>>
>> btrfs mount options: 
>> noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15
>>
>> $ btrfs fi df /backup:
>>
>> Data, single: total=28.05TiB, used=26.37TiB
>> System, single: total=32.00MiB, used=3.53MiB
>> Metadata, single: total=255.00GiB, used=250.73GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> $ btrfs fi show /backup:
>>
>> Label: none  uuid: a52501a9-651c-4712-a76b-7b4238cfff63
>> Total devices 2 FS bytes used 26.62TiB
>> devid1 size 416.62GiB used 255.03GiB path /dev/sdb
>> devid2 size 36.38TiB used 28.05TiB path /dev/sdc
>>
>> $ btrfs fi usage /backup:
>>
>> Overall:
>> Device size:  36.79TiB
>> Device allocated: 28.30TiB
>> Device unallocated:8.49TiB
>> Device missing:  0.00B
>> Used: 26.62TiB
>> Free (estimated): 10.17TiB  (min: 10.17TiB)
>> Data ratio:   1.00
>> Metadata ratio:   1.00
>> Global reserve:  512.00MiB  (used: 0.00B)
>>
>> Data,single: Size:28.05TiB, Used:26.37TiB
>>/dev/sdc   28.05TiB
>>
>> Metadata,single: Size:255.00GiB, Used:250.73GiB
>>/dev/sdb  255.00GiB
>>
>> System,single: Size:32.00MiB, Used:3.53MiB
>>/dev/sdb   32.00MiB
>>
>> Unallocated:
>>/dev/sdb  161.59GiB
>>/dev/sdc8.33TiB
>>
>> Btrfs filesystem uses two logical drives in single mode, backed by
>> hardware RAID controller PERC H710; /dev/sdb is HW RAID1 consisting
>> of two SATA SSDs and /dev/sdc is HW RAID6 SATA volume.
>>
>> Please note that we have a simple custom patch in btrfs which ensures
>> that metadata chunks are allocated preferably on SSD volume and data
>> chunks are allocated only on SATA volume. The patch slightly modifies
>> __btrfs_alloc_chunk() so that its loop over devices ignores rotating
>> devices when a metadata chunk is requested and vice versa. However, 
>> I'm quite sure that this patch doesn't cause the reported bug because
>> we log every call of the modified code and there're no __btrfs_alloc_chunk()
>> calls when ENOSPC is triggered. Moreover, we observed the same bug before
>> we developed the patch. (IIRC, Chris Mason mentioned that they work on
>> a similar feature in faceb

Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-09 Thread Martin Svec
Nobody knows?

I'm particularly interested in why the debug output shows a negative value
(unsigned 18446744072120172544) as free metadata space for space_info 4, please see
the original report. Is it a bug in dump_space_info(), can metadata reservations
temporarily exceed the total space, or is it an indication of a damaged
filesystem? Also note that rebuilding the free space cache doesn't help.
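
For reference, the cache rebuild referred to here is typically done in one of
two ways; a minimal sketch, assuming a btrfs-progs build that supports
--clear-space-cache and using this filesystem's /backup and /dev/sdb:

  # offline: clear the on-disk cache so it is rebuilt from scratch
  umount /backup
  btrfs check --clear-space-cache v1 /dev/sdb   # "v2" when space_cache=v2 is in use
  mount /dev/sdb /backup

  # or: a one-time mount with clear_cache also rebuilds the v1 cache
  mount -o clear_cache /dev/sdb /backup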

Thank you.

Martin

Dne 23.2.2018 v 15:28 Martin Svec napsal(a):
> Hello,
>
> we have a btrfs-based backup system using btrfs snapshots and rsync. 
> Sometimes,
> we hit ENOSPC bug and the filesystem is remounted read-only. However, there's 
> still plenty of unallocated space according to "btrfs fi usage". So I think 
> this
> isn't another edge condition when btrfs runs out of space due to fragmented 
> chunks,
> but a bug in disk space allocation code. It suffices to umount the filesystem 
> and
> remount it back and it works fine again. The frequency of ENOSPC seems to be
> dependent on metadata chunks usage. When there's a lot of free space in 
> existing
> metadata chunks, the bug doesn't happen for months. If most metadata chunks 
> are
> above ~98%, we hit the bug every few days. Below are details regarding the 
> backup
> server and btrfs.
>
> The backup works as follows: 
>
>   * Every night, we create a btrfs snapshot on the backup server and rsync 
> data
> from a production server into it. This snapshot is then marked read-only 
> and
> will be used as a base subvolume for the next backup snapshot.
>   * Every day, expired snapshots are removed and their space is freed. Cleanup
> is scheduled in such a way that it doesn't interfere with the backup 
> window.
>   * Multiple production servers are backed up in parallel to one backup 
> server.
>   * The backed up servers are mostly webhosting servers and mail servers, i.e.
> hundreds of billions of small files. (Yes, we push btrfs to the limits 
> :-))
>   * Backup server contains ~1080 snapshots, Zlib compression is enabled.
>   * Rsync is configured to use whole file copying.
>
> System configuration:
>
> Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch 
> (see below) and
> Nikolay's patch 1b816c23e9 (btrfs: Add enospc_debug printing in 
> metadata_reserve_bytes)
>
> btrfs mount options: 
> noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15
>
> $ btrfs fi df /backup:
>
> Data, single: total=28.05TiB, used=26.37TiB
> System, single: total=32.00MiB, used=3.53MiB
> Metadata, single: total=255.00GiB, used=250.73GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> $ btrfs fi show /backup:
>
> Label: none  uuid: a52501a9-651c-4712-a76b-7b4238cfff63
> Total devices 2 FS bytes used 26.62TiB
> devid1 size 416.62GiB used 255.03GiB path /dev/sdb
> devid2 size 36.38TiB used 28.05TiB path /dev/sdc
>
> $ btrfs fi usage /backup:
>
> Overall:
> Device size:  36.79TiB
> Device allocated: 28.30TiB
> Device unallocated:8.49TiB
> Device missing:  0.00B
> Used: 26.62TiB
> Free (estimated): 10.17TiB  (min: 10.17TiB)
> Data ratio:   1.00
> Metadata ratio:   1.00
> Global reserve:  512.00MiB  (used: 0.00B)
>
> Data,single: Size:28.05TiB, Used:26.37TiB
>/dev/sdc   28.05TiB
>
> Metadata,single: Size:255.00GiB, Used:250.73GiB
>/dev/sdb  255.00GiB
>
> System,single: Size:32.00MiB, Used:3.53MiB
>/dev/sdb   32.00MiB
>
> Unallocated:
>/dev/sdb  161.59GiB
>/dev/sdc8.33TiB
>
> Btrfs filesystem uses two logical drives in single mode, backed by
> hardware RAID controller PERC H710; /dev/sdb is HW RAID1 consisting
> of two SATA SSDs and /dev/sdc is HW RAID6 SATA volume.
>
> Please note that we have a simple custom patch in btrfs which ensures
> that metadata chunks are allocated preferably on SSD volume and data
> chunks are allocated only on SATA volume. The patch slightly modifies
> __btrfs_alloc_chunk() so that its loop over devices ignores rotating
> devices when a metadata chunk is requested and vice versa. However, 
> I'm quite sure that this patch doesn't cause the reported bug because
> we log every call of the modified code and there're no __btrfs_alloc_chunk()
> calls when ENOSPC is triggered. Moreover, we observed the same bug before
> we developed the patch. (IIRC, Chris Mason mentioned that they work on
> a similar feature in facebook, but I've found no official patches yet.)
>
> Dmesg dump:
>
>

Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-02-23 Thread Martin Svec
[285167.750879] RSP: 0018:ba48c1ecf958 EFLAGS: 00010282
[285167.750880] RAX: 001d RBX: 9c4a1c2ce128 RCX: 
0006
[285167.750881] RDX:  RSI: 0096 RDI: 
9c4a2fd566f0
[285167.750882] RBP: 4000 R08: 0001 R09: 
03dc
[285167.750883] R10: 0001 R11: 03dc R12: 
9c4a1c2ce000
[285167.750883] R13: 9c4a17692800 R14: 0001 R15: 
ffe4
[285167.750885] FS:  () GS:9c4a2fd4() 
knlGS:
[285167.750885] CS:  0010 DS:  ES:  CR0: 80050033
[285167.750886] CR2: 56250e55bfd0 CR3: 0ee0a003 CR4: 
000206e0
[285167.750887] Call Trace:
[285167.750903]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
[285167.750917]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
[285167.750929]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
[285167.750943]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
[285167.750961]  ? set_extent_bit+0x19/0x20 [btrfs]
[285167.750974]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
[285167.750992]  ? btrfs_merge_delayed_refs+0x63/0x560 [btrfs]
[285167.751006]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
[285167.751021]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
[285167.751037]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
[285167.751053]  ? start_transaction+0x89/0x410 [btrfs]
[285167.751068]  transaction_kthread+0x195/0x1b0 [btrfs]
[285167.751071]  kthread+0xfc/0x130
[285167.751087]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[285167.751088]  ? kthread_create_on_node+0x70/0x70
[285167.751091]  ret_from_fork+0x35/0x40
[285167.751092] Code: ff 48 c7 c6 28 d7 44 c0 48 c7 c7 a0 21 4a c0 e8 3c a5 4b 
cb 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 c0 4c 45 c0 e8 80 fd f1 ca <0f> ff 
e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 4e eb
[285167.751114] ---[ end trace 8721883b5af677ec ]---
[285169.096630] BTRFS info (device sdb): space_info 4 has 18446744072120172544 
free, is not full
[285169.096633] BTRFS info (device sdb): space_info total=273804165120, 
used=269218267136, pinned=3459629056, reserved=52396032, may_use=2663120896, 
readonly=131072
[285169.096638] BTRFS: Transaction aborted (error -28)
[285169.096664] [ cut here ]
[285169.096691] WARNING: CPU: 7 PID: 443 at fs/btrfs/extent-tree.c:3089 
btrfs_run_delayed_refs+0x259/0x270 [btrfs]
[285169.096692] Modules linked in: binfmt_misc xt_comment xt_tcpudp 
iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw 
ip6table_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntr
[285169.096722]  zstd_compress xxhash raid6_pq sd_mod crc32c_intel psmouse 
uhci_hcd ehci_pci ehci_hcd megaraid_sas usbcore scsi_mod bnx2
[285169.096729] CPU: 7 PID: 443 Comm: btrfs-transacti Tainted: GW I 
4.14.20-znr1+ #69
[285169.096730] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.6.3 
02/01/2011
[285169.096731] task: 9c4a1740e280 task.stack: ba48c1ecc000
[285169.096745] RIP: 0010:btrfs_run_delayed_refs+0x259/0x270 [btrfs]
[285169.096746] RSP: 0018:ba48c1ecfde0 EFLAGS: 00010282
[285169.096747] RAX: 0026 RBX: 9c47990c0780 RCX: 
0006
[285169.096748] RDX:  RSI: 0082 RDI: 
9c4a2fdd66f0
[285169.096749] RBP: 9c493d509b68 R08: 0001 R09: 
0403
[285169.096749] R10: 9c49731d6620 R11: 0403 R12: 
9c4a1c2ce000
[285169.096750] R13: 9c4a1c2cf1f0 R14:  R15: 

[285169.096751] FS:  () GS:9c4a2fdc() 
knlGS:
[285169.096752] CS:  0010 DS:  ES:  CR0: 80050033
[285169.096753] CR2: 55e70555bfe0 CR3: 0ee0a005 CR4: 
000206e0
[285169.096754] Call Trace:
[285169.096774]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
[285169.096790]  ? start_transaction+0x89/0x410 [btrfs]
[285169.096806]  transaction_kthread+0x195/0x1b0 [btrfs]
[285169.096809]  kthread+0xfc/0x130
[285169.096825]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[285169.096826]  ? kthread_create_on_node+0x70/0x70
[285169.096828]  ret_from_fork+0x35/0x40
[285169.096830] Code: c7 c6 20 d8 44 c0 48 89 df 44 89 04 24 e8 19 bb 09 00 44 
8b 04 24 eb 86 44 89 c6 48 c7 c7 30 48 45 c0 44 89 04 24 e8 d2 40 f2 ca <0f> ff 
44 8b 04 24 eb c4 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00
[285169.096852] ---[ end trace 8721883b5af677ed ]---
[285169.096918] BTRFS: error (device sdb) in btrfs_run_delayed_refs:3089: 
errno=-28 No space left
[285169.096976] BTRFS info (device sdb): forced readonly
[285169.096979] BTRFS warning (device sdb): Skipping commit of aborted 
transaction.
[285169.096981] BTRFS: error (device sdb) in cleanup_transaction:1873: 
errno=-28 No space left


How can I help you to fix this issue?
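
A minimal sketch of the output that is usually asked for alongside such a
report (paths are this filesystem's; enospc_debug is already in the mount
options listed above):

  btrfs fi usage /backup
  btrfs fi df /backup
  dmesg | grep -iE 'btrfs|space_info'   # includes the space_info dump printed at abort time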

Regards,

Martin Svec





Re: Recommendations for balancing as part of regular maintenance?

2018-01-08 Thread Martin Raiber
On 08.01.2018 19:34 Austin S. Hemmelgarn wrote:
> On 2018-01-08 13:17, Graham Cobb wrote:
>> On 08/01/18 16:34, Austin S. Hemmelgarn wrote:
>>> Ideally, I think it should be as generic as reasonably possible,
>>> possibly something along the lines of:
>>>
>>> A: While not strictly necessary, running regular filtered balances (for
>>> example `btrfs balance start -dusage=50 -dlimit=2 -musage=50
>>> -mlimit=4`,
>>> see `man btrfs-balance` for more info on what the options mean) can
>>> help
>>> keep a volume healthy by mitigating the things that typically cause
>>> ENOSPC errors.  Full balances by contrast are long and expensive
>>> operations, and should be done only as a last resort.
>>
>> That recommendation is similar to what I do and it works well for my use
>> case. I would recommend it to anyone with my usage, but cannot say how
>> well it would work for other uses. In my case, I run balances like that
>> once a week: some weeks nothing happens, other weeks 5 or 10 blocks may
>> get moved.
>
> In my own usage I've got a pretty varied mix of other stuff going on.
> All my systems are Gentoo, so system updates mean that I'm building
> software regularly (though on most of the systems that happens on
> tmpfs in RAM), I run a home server with a dozen low use QEMU VM's and
> a bunch of transient test VM's, all of which I'm currently storing
> disk images for raw on top of BTRFS (which is actually handling all of
> it pretty well, though that may be thanks to all the VM's using
> PV-SCSI for their disks), I run a BOINC client system that sees pretty
> heavy filesystem usage, and have a lot of personal files that get
> synced regularly across systems, and all of this is on raid1 with
> essentially no snapshots.  For me the balance command I mentioned
> above run daily seems to help, even if the balance doesn't move much
> most of the time on most filesystems, and the actual balance
> operations take at most a few seconds most of the time (I've got
> reasonably nice SSD's in everything).

There have been reports of (rare) corruption caused by balance (won't be
detected by a scrub) here on the mailing list. So I would stay away
from btrfs balance unless it is absolutely needed (ENOSPC), and while it
is running I would try not to do anything else wrt. writes simultaneously.
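
For readers who do want the filtered balance quoted earlier in the thread, a
minimal cron sketch (same filters as above; the script path and mount point
are placeholders, and the caveat about avoiding concurrent heavy writes still
applies):

  #!/bin/sh
  # /etc/cron.weekly/btrfs-balance -- compact mostly-empty chunks only
  MNT=/srv
  btrfs balance start -dusage=50 -dlimit=2 -musage=50 -mlimit=4 "$MNT"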



Btrfs blocked by too many delayed refs

2017-12-21 Thread Martin Raiber
Hi,

I have the problem that too many delayed refs block a btrfs storage. I
have one thread that does work:

[] io_schedule+0x16/0x40
[] wait_on_page_bit+0x116/0x150
[] read_extent_buffer_pages+0x1c5/0x290
[] btree_read_extent_buffer_pages+0x9d/0x100
[] read_tree_block+0x32/0x50
[] read_block_for_search.isra.30+0x120/0x2e0
[] btrfs_search_slot+0x385/0x990
[] btrfs_insert_empty_items+0x71/0xc0
[] insert_extent_data_ref.isra.49+0x11b/0x2a0
[] __btrfs_inc_extent_ref.isra.59+0x1ee/0x220
[] __btrfs_run_delayed_refs+0x924/0x12c0
[] btrfs_run_delayed_refs+0x7a/0x260
[] create_pending_snapshot+0x5e4/0xf00
[] create_pending_snapshots+0x97/0xc0
[] btrfs_commit_transaction+0x395/0x930
[] btrfs_mksubvol+0x4a6/0x4f0
[] btrfs_ioctl_snap_create_transid+0x185/0x190
[] btrfs_ioctl_snap_create_v2+0x104/0x150
[] btrfs_ioctl+0x5e1/0x23b0
[] do_vfs_ioctl+0x92/0x5a0
[] SyS_ioctl+0x79/0x9

the others are in 'D' state e.g. with

[] call_rwsem_down_write_failed+0x17/0x30
[] filename_create+0x6b/0x150
[] SyS_mkdir+0x44/0xe0

Slabtop shows 2423910 btrfs_delayed_ref_head structs, slowly decreasing.

What I think is happening is that delayed refs are added without being
throttled via btrfs_should_throttle_delayed_refs. Maybe by creating a
snapshot of a file and then modifying it (some action that creates
delayed refs, is not a truncate, which is already throttled, and does
not commit a transaction, which is also throttled).
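
One way to watch that backlog drain is to sample the delayed-ref slabs
directly; a minimal sketch (reading /proc/slabinfo usually requires root):

  watch -n 5 'grep btrfs_delayed /proc/slabinfo'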

Regards,
Martin Raiber



Re: again "out of space" and remount read only, with 4.14

2017-12-18 Thread Martin Raiber
On 03.12.2017 16:39 Martin Raiber wrote:
> Am 26.11.2017 um 17:02 schrieb Tomasz Chmielewski:
>> On 2017-11-27 00:37, Martin Raiber wrote:
>>> On 26.11.2017 08:46 Tomasz Chmielewski wrote:
>>>> Got this one on a 4.14-rc7 filesystem with some 400 GB left:
>>> I guess it is too late now, but I guess the "btrfs fi usage" output of
>>> the file system (especially after it went ro) would be useful.
>> It was more or less similar as it went ro:
>>
>> # btrfs fi usage /srv
>> Overall:
>>     Device size:   5.25TiB
>>     Device allocated:  4.45TiB
>>     Device unallocated:  823.97GiB
>>     Device missing:  0.00B
>>     Used:  4.33TiB
>>     Free (estimated):    471.91GiB  (min: 471.91GiB)
>>     Data ratio:   2.00
>>     Metadata ratio:   2.00
>>     Global reserve:  512.00MiB  (used: 0.00B)
>>
>> Unallocated:
>>    /dev/sda4 411.99GiB
>>    /dev/sdb4 411.99GiB
> I wanted to check if is the same issue I have, e.g. with 4.14.1
> space_cache=v2:
>
> [153245.341823] BTRFS: error (device loop0) in
> btrfs_run_delayed_refs:3089: errno=-28 No space left
> [153245.341845] BTRFS: error (device loop0) in btrfs_drop_snapshot:9317:
> errno=-28 No space left
> [153245.341848] BTRFS info (device loop0): forced readonly
> [153245.341972] BTRFS warning (device loop0): Skipping commit of aborted
> transaction.
> [153245.341975] BTRFS: error (device loop0) in cleanup_transaction:1873:
> errno=-28 No space left
> # btrfs fi usage /media/backup
> Overall:
>     Device size:  49.60TiB
>     Device allocated: 38.10TiB
>     Device unallocated:   11.50TiB
>     Device missing:  0.00B
>     Used: 36.98TiB
>     Free (estimated): 12.59TiB  (min: 12.59TiB)
>     Data ratio:   1.00
>     Metadata ratio:   1.00
>     Global reserve:    2.00GiB  (used: 1.99GiB)
>
> Data,single: Size:37.70TiB, Used:36.61TiB
>    /dev/loop0 37.70TiB
>
> Metadata,single: Size:411.01GiB, Used:380.98GiB
>    /dev/loop0    411.01GiB
>
> System,single: Size:36.00MiB, Used:4.00MiB
>    /dev/loop0 36.00MiB
>
> Unallocated:
>    /dev/loop0 11.50TiB
>
> Note the global reserve being at maximum. I already increased that in
> the code to 2G and that seems to make this issue appear more rarely.

This time with enospc_debug mount option:

With Linux 4.14.3. Single large device.

[15179.739038] [ cut here ]
[15179.739059] WARNING: CPU: 0 PID: 28694 at fs/btrfs/extent-tree.c:8458
btrfs_alloc_tree_block+0x38f/0x4a0
[15179.739060] Modules linked in: bcache loop dm_crypt algif_skcipher
af_alg st sr_mod cdrom xfs libcrc32c zbud intel_rapl sb_edac
x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt kvm_intel kvm
iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel pcbc raid1 mgag200 snd_pcm aesni_intel ttm snd_timer
drm_kms_helper snd soundcore aes_x86_64 crypto_simd glue_helper cryptd
pcspkr i2c_i801 joydev drm mei_me evdev lpc_ich mei mfd_core ipmi_si
ipmi_devintf ipmi_msghandler tpm_tis tpm_tis_core tpm wmi ioatdma button
shpchp fuse autofs4 hid_generic usbhid hid sg sd_mod dm_mod dax md_mod
crc32c_intel isci ahci mpt3sas libsas libahci igb raid_class ehci_pci
i2c_algo_bit libata dca ehci_hcd scsi_transport_sas ptp nvme pps_core
scsi_mod usbcore nvme_core
[15179.739133] CPU: 0 PID: 28694 Comm: btrfs Not tainted 4.14.3 #2
[15179.739134] Hardware name: Supermicro
X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
[15179.739136] task: 8813e4f02ac0 task.stack: c9000aea
[15179.739140] RIP: 0010:btrfs_alloc_tree_block+0x38f/0x4a0
[15179.739141] RSP: 0018:c9000aea3558 EFLAGS: 00010292
[15179.739144] RAX: 001d RBX: 4000 RCX:

[15179.739146] RDX: 880c4fa15b38 RSI: 880c4fa0de58 RDI:
880c4fa0de58
[15179.739147] RBP: c9000aea35d0 R08: 0001 R09:
0662
[15179.739149] R10: 1600 R11: 0662 R12:
880c0a454000
[15179.739151] R13: 880c4ba33800 R14: 0001 R15:
880c0a454128
[15179.739153] FS:  7f0d699128c0() GS:880c4fa0()
knlGS:
[15179.739155] CS:  0010 DS:  ES:  CR0: 80050033
[15179.739156] CR2: 7bbfcdf2c6e8 CR3: 00151da91003 CR4:
000606f0
[15179.739158] Call Trace:
[15179.739166]  __btrfs_cow_block+0x117/0x580
[15179.739169]  btrfs_cow_block+0xdf/0x200
[15179.739171]  btrfs_search_slot+0x1ea/0x990
[15179.739174]  lookup_inline_extent_backref+0x

Re: again "out of space" and remount read only, with 4.14

2017-12-03 Thread Martin Raiber
Am 26.11.2017 um 17:02 schrieb Tomasz Chmielewski:
> On 2017-11-27 00:37, Martin Raiber wrote:
>> On 26.11.2017 08:46 Tomasz Chmielewski wrote:
>>> Got this one on a 4.14-rc7 filesystem with some 400 GB left:
>> I guess it is too late now, but I guess the "btrfs fi usage" output of
>> the file system (especially after it went ro) would be useful.
> It was more or less similar as it went ro:
>
> # btrfs fi usage /srv
> Overall:
>     Device size:   5.25TiB
>     Device allocated:  4.45TiB
>     Device unallocated:  823.97GiB
>     Device missing:  0.00B
>     Used:  4.33TiB
>     Free (estimated):    471.91GiB  (min: 471.91GiB)
>     Data ratio:   2.00
>     Metadata ratio:   2.00
>     Global reserve:  512.00MiB  (used: 0.00B)
>
> Unallocated:
>    /dev/sda4 411.99GiB
>    /dev/sdb4 411.99GiB

I wanted to check if is the same issue I have, e.g. with 4.14.1
space_cache=v2:

[153245.341823] BTRFS: error (device loop0) in
btrfs_run_delayed_refs:3089: errno=-28 No space left
[153245.341845] BTRFS: error (device loop0) in btrfs_drop_snapshot:9317:
errno=-28 No space left
[153245.341848] BTRFS info (device loop0): forced readonly
[153245.341972] BTRFS warning (device loop0): Skipping commit of aborted
transaction.
[153245.341975] BTRFS: error (device loop0) in cleanup_transaction:1873:
errno=-28 No space left
# btrfs fi usage /media/backup
Overall:
    Device size:  49.60TiB
    Device allocated: 38.10TiB
    Device unallocated:   11.50TiB
    Device missing:  0.00B
    Used: 36.98TiB
    Free (estimated): 12.59TiB  (min: 12.59TiB)
    Data ratio:   1.00
    Metadata ratio:   1.00
    Global reserve:    2.00GiB  (used: 1.99GiB)

Data,single: Size:37.70TiB, Used:36.61TiB
   /dev/loop0 37.70TiB

Metadata,single: Size:411.01GiB, Used:380.98GiB
   /dev/loop0    411.01GiB

System,single: Size:36.00MiB, Used:4.00MiB
   /dev/loop0 36.00MiB

Unallocated:
   /dev/loop0 11.50TiB

Note the global reserve being at maximum. I already increased that in
the code to 2G and that seems to make this issue appear more rarely.

Regards,
Martin Raiber




Re: Read before you deploy btrfs + zstd

2017-11-15 Thread Martin Steigerwald
David Sterba - 15.11.17, 15:39:
> On Tue, Nov 14, 2017 at 07:53:31PM +0100, David Sterba wrote:
> > On Mon, Nov 13, 2017 at 11:50:46PM +0100, David Sterba wrote:
> > > Up to now, there are no bootloaders supporting ZSTD.
> > 
> > I've tried to implement the support to GRUB, still incomplete and hacky
> > but most of the code is there.  The ZSTD implementation is copied from
> > kernel. The allocators need to be properly set up, as it needs to use
> > grub_malloc/grub_free for the workspace thats called from some ZSTD_*
> > functions.
> > 
> > https://github.com/kdave/grub/tree/btrfs-zstd
> 
> The branch is now in a state that can be tested. Turns out the memory
> requirements are too much for grub, so the boot fails with "not enough
> memory". The calculated value
> 
> ZSTD_BTRFS_MAX_INPUT: 131072
> ZSTD_DStreamWorkspaceBound with ZSTD_BTRFS_MAX_INPUT: 549424
> 
> This is not something I could fix easily, we'd probably need a tuned
> version of ZSTD for grub constraints. Adding Nick to CC.

Somehow I am happy that I still have a plain Ext4 for /boot. :)

Thanks for looking into Grub support anyway.

Thanks,
-- 
Martin


Re: Read before you deploy btrfs + zstd

2017-11-14 Thread Martin Steigerwald
David Sterba - 14.11.17, 19:49:
> On Tue, Nov 14, 2017 at 08:34:37AM +0100, Martin Steigerwald wrote:
> > Hello David.
> > 
> > David Sterba - 13.11.17, 23:50:
> > > while 4.14 is still fresh, let me address some concerns I've seen on
> > > linux
> > > forums already.
> > > 
> > > The newly added ZSTD support is a feature that has broader impact than
> > > just the runtime compression. The btrfs-progs understand filesystem with
> > > ZSTD since 4.13. The remaining key part is the bootloader.
> > > 
> > > Up to now, there are no bootloaders supporting ZSTD. This could lead to
> > > an
> > > unmountable filesystem if the critical files under /boot get
> > > accidentally
> > > or intentionally compressed by ZSTD.
> > 
> > But otherwise ZSTD is safe to use? Are you aware of any other issues?
> 
> No issues from my own testing or reported by other users.

Thanks to you and the others. I think I try this soon.

Thanks,
-- 
Martin


Re: Read before you deploy btrfs + zstd

2017-11-13 Thread Martin Steigerwald
Hello David.

David Sterba - 13.11.17, 23:50:
> while 4.14 is still fresh, let me address some concerns I've seen on linux
> forums already.
> 
> The newly added ZSTD support is a feature that has broader impact than
> just the runtime compression. The btrfs-progs understand filesystem with
> ZSTD since 4.13. The remaining key part is the bootloader.
> 
> Up to now, there are no bootloaders supporting ZSTD. This could lead to an
> unmountable filesystem if the critical files under /boot get accidentally
> or intentionally compressed by ZSTD.

But otherwise ZSTD is safe to use? Are you aware of any other issues?

I consider switching from LZO to ZSTD on this ThinkPad T520 with Sandybridge.
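
A minimal sketch of such a switch, assuming kernel 4.14+ and a btrfs-progs
that accepts -czstd; the mount point is a placeholder, and per the warning
above /boot should be left alone:

  mount -o remount,compress=zstd /data
  # optionally re-compress existing files (defragment breaks reflinks/snapshot sharing)
  btrfs filesystem defragment -r -czstd /data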

Thank you,
-- 
Martin


Re: how to run balance successfully (No space left on device)?

2017-11-10 Thread Martin Raiber
On 10.11.2017 22:51 Chris Murphy wrote:
>> Combined with evidence that "No space left on device" during balance can
>> lead to various file corruption (we've witnessed it with MySQL), I'd day
>> btrfs balance is a dangerous operation and decision to use it should be
>> considered very thoroughly.
> I've never heard of this. Balance is COW at the chunk level. The old
> chunk is not dereferenced until it's written in the new location
> correctly. Corruption during balance shouldn't be possible so if you
> have a reproducer, the devs need to know about it.

I didn't say anything before, because I could not reproduce the problem.
I had (I guess) a corruption caused by balance as well. It had ENOSPC in
spite of enough free space (4.9.x), which made me balance it regularly
to keep unallocated space around. Corruption occurred probably after or
shortly before a power reset during a balance -- no skip_balance was specified
so it continued directly after mount -- data was moved relatively fast
after the mount operation (copy file then delete old file). I think
space_cache=v2 was active at the time. I'm of course not completely sure
it was btrfs's fault and as usual not all the conditions may be
relevant. It could also instead be an upper-layer error (Hyper-V storage),
a memory issue or an application error.
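
One mitigation for the interrupted-balance scenario described above is the
skip_balance mount option, so that a balance does not resume on its own after
a crash; a minimal sketch (the fstab entry and mount point are placeholders):

  # /etc/fstab
  UUID=<fs-uuid>  /data  btrfs  defaults,skip_balance  0  0

  # resume the paused balance deliberately once the system looks healthy
  btrfs balance resume /data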

Regards,
Martin Raiber



Re: Multiple btrfs-cleaner threads per volume

2017-11-02 Thread Martin Raiber
On 02.11.2017 16:10 Hans van Kranenburg wrote:
> On 11/02/2017 04:02 PM, Martin Raiber wrote:
>> snapshot cleanup is a little slow in my case (50TB volume). Would it
>> help to have multiple btrfs-cleaner threads? The block layer underneath
>> would have higher throughput with more simultaneous read/write requests.
> Just curious:
> * How many subvolumes/snapshots are you removing, and what's the
> complexity level (like, how many other subvolumes/snapshots reference
> the same data extents?)
> * Do you see a lot of cpu usage, or mainly a lot of disk I/O? If it's
> disk IO, is it mainly random read IO, or is it a lot of write traffic?
> * What mount options are you running with (from /proc/mounts)?

It is a single block device, not a multi-device btrfs, so
optimizations in that area wouldn't help.
about 200 snapshots per client. 20009 snapshots total. UrBackup reflinks
files between them, but btrfs-cleaner doesn't use much CPU (so it
doesn't seem like the backref walking is the problem). btrfs-cleaner is
probably limited mainly by random read/write IO. The device has a cache,
so parallel accesses would help, as some of them may hit the cache.
Looking at the code it seems easy enough to do. Question is if there are
any obvious reasons why this wouldn't work (like some lock etc.).


Multiple btrfs-cleaner threads per volume

2017-11-02 Thread Martin Raiber
Hi,

snapshot cleanup is a little slow in my case (50TB volume). Would it
help to have multiple btrfs-cleaner threads? The block layer underneath
would have higher throughput with more simultaneous read/write requests.

Regards,
Martin Raiber



Re: Data and metadata extent allocators [1/2]: Recap: The data story

2017-10-27 Thread Martin Steigerwald
virtual address space)

I see a difference in behavior but I do not yet fully understand what I am 
looking at.
 
> Q: But what if all my chunks have badly fragmented free space right now?
> A: If your situation allows for it, the simplest way is running a full
> balance of the data, as some sort of big reset button. If you only want
> to clean up chunks with excessive free space fragmentation, then you can
> use the helper I used to identify them, which is
> show_free_space_fragmentation.py in [8]. Just feed the chunks to balance
> starting with the one with the highest score. The script requires the
> free space tree to be used, which is a good idea anyway.

Okay, if I understand this correctly I don't need to use "nossd" with kernel 
4.14, but it would be good to do a full "btrfs filesystem balance" run on all 
the SSD BTRFS filesystems, or any others with rotational=0.
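
For reference, which block devices the kernel reports as non-rotational (and
hence where btrfs auto-enables ssd mode) can be checked like this:

  for f in /sys/block/*/queue/rotational; do echo "$f: $(cat "$f")"; done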

What would be the benefit of that? Would the filesystem run faster again? My 
subjective impression is that performance got worse over time. *However*, all 
my previous full balance attempts made performance even worse. So… is a full 
balance safe for filesystem performance by now?

I still have the issue that fstrim on /home only works with a patch from Lutz 
Euler from 2014, which is still not in mainline BTRFS. Maybe it would be a 
good idea to recreate /home in order to get rid of that special "anomaly" of 
the BTRFS filesystem where fstrim doesn't work without this patch.

Maybe at least a part of this should go into the BTRFS kernel wiki, as it would 
be easier for users to find there.

I wonder about an "upgrade notes for users" / "BTRFS maintenance" page that 
gives recommendations for steps to take after a major kernel update, together 
with general recommendations for maintenance. Ideally most of this would 
be integrated into BTRFS or a userspace daemon for it and be handled 
transparently and automatically. Yet a full balance is an expensive operation 
time-wise and probably should not be started without user consent.

I do wonder about the ton of tools here and there, and I would love some btrfsd 
or… maybe an even more generic fsd filesystem maintenance daemon which would do 
regular scrubs and whatever else makes sense. It could use some configuration 
in the root directory of a filesystem and work for BTRFS and other filesystems 
that have beneficial online / background operations, like XFS, which also has 
online scrubbing by now (at least for metadata).

> [0] https://www.spinics.net/lists/linux-btrfs/msg64446.html
> [1] https://www.spinics.net/lists/linux-btrfs/msg64771.html
> [2] https://github.com/knorrie/btrfs-heatmap/
> [3]
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-10-27-shotgunblast.png
> [4]
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2016-12-18-heatmap-scripting/
> fsid_ed10a358-c846-4e76-a071-3821d423a99d_startat_320029589504_at_1482095269
> .png [5] https://www.spinics.net/lists/linux-btrfs/msg64418.html
> [6]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i
> d=583b723151794e2ff1691f1510b4e43710293875 [7]
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-ssd-to-nossd.mp4 [8]
> https://github.com/knorrie/python-btrfs/tree/develop/examples

Thanks,
-- 
Martin


Re: 4.13: "error in btrfs_run_delayed_refs:3009: errno=-28 No space left" with 1.3TB unallocated / 737G free?

2017-10-19 Thread Martin Raiber
On 19.10.2017 10:16 Vladimir Panteleev wrote:
> On Tue, 17 Oct 2017 16:21:04 -0700, Duncan wrote:
>> * try the balance on 4.14-rc5+, where the known bug should be fixed
>
> Thanks! However, I'm getting the same error on
> 4.14.0-rc5-g9aa0d2dde6eb. The stack trace is different, though:
>
> Aside from rebuilding the filesystem, what are my options? Should I
> try to temporarily add a file from another volume as a device and
> retry the balance? If so, what would be a good size for the temporary
> device?
>
Hi,

for me a work-around for something like this has been to reduce the
amount of dirty memory via e.g.

sysctl vm.dirty_background_bytes=$((100*1024*1024))
sysctl vm.dirty_bytes=$((400*1024*1024))

this reduces performance, however. You could also mount with
"enospc_debug" to give the devs more info about this issue.
I am having more ENOSPC issues with 4.9.x than with the latest 4.14.
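
To keep that work-around across reboots, a minimal sketch (same values as
above; the file name is arbitrary):

  # /etc/sysctl.d/90-dirty-limits.conf -- 100 MiB / 400 MiB as above
  vm.dirty_background_bytes = 104857600
  vm.dirty_bytes = 419430400

  # apply without rebooting
  sysctl --system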

Regards,
Martin



Something like ZFS Channel Programs for BTRFS & probably XFS or even VFS?

2017-10-03 Thread Martin Steigerwald
[repost. I didn´t notice autocompletion gave me wrong address for fsdevel, 
blacklisted now]

Hello.

What do you think of

http://open-zfs.org/wiki/Projects/ZFS_Channel_Programs

?

There are quite some BTRFS maintenance programs like the deduplication stuff. 
Also regular scrubs… and in certain circumstances probably balances can make 
sense.

In addition to this XFS got scrub functionality as well.

Now putting the foundation for such a functionality in the kernel I think 
would only be reasonable if it cannot be done purely within user space, so I 
wonder about the safety from other concurrent ZFS modification and atomicity 
that are mentioned on the wiki page. The second set of slides, those the 
OpenZFS Developer Commit 2014, which are linked to on the wiki page explain 
this more. (I didn´t look the first ones, as I am no fan of slideshare.net and 
prefer a simple PDF to download and view locally anytime, not for privacy 
reasons alone, but also to avoid using a crappy webpage over a wonderfully 
functional PDF viewer fat client like Okular)

Also I wonder about putting a lua interpreter into the kernel, but it seems at 
least NetBSD developers added one to their kernel with version 7.0¹.

I also ask this cause I wondered about a kind of fsmaintd or volmaintd for 
quite a while, and thought… it would be nice to do this in a generic way, as 
BTRFS is not the only filesystem which supports maintenance operations. However 
if it can all just nicely be done in userspace, I am all for it.

[1] http://www.netbsd.org/releases/formal-7/NetBSD-7.0.html
(tons of presentation PDFs on their site as well)

Thanks,
-- 
Martin



Something like ZFS Channel Programs for BTRFS & probably XFS or even VFS?

2017-10-03 Thread Martin Steigerwald
Hello.

What do you think of

http://open-zfs.org/wiki/Projects/ZFS_Channel_Programs

?

There are quite some BTRFS maintenance programs like the deduplication stuff. 
Also regular scrubs… and in certain circumstances probably balances can make 
sense.

In addition to this XFS got scrub functionality as well.

Now putting the foundation for such a functionality in the kernel I think 
would only be reasonable if it cannot be done purely within user space, so I 
wonder about the safety from other concurrent ZFS modification and atomicity 
that are mentioned on the wiki page. The second set of slides, those the 
OpenZFS Developer Commit 2014, which are linked to on the wiki page explain 
this more. (I didn´t look the first ones, as I am no fan of slideshare.net and 
prefer a simple PDF to download and view locally anytime, not for privacy 
reasons alone, but also to avoid using a crappy webpage over a wonderfully 
functional PDF viewer fat client like Okular)

Also I wonder about putting a lua interpreter into the kernel, but it seems at 
least NetBSD developers added one to their kernel with version 7.0¹.

I also ask this cause I wondered about a kind of fsmaintd or volmaintd for 
quite a while, and thought… it would be nice to do this in a generic way, as 
BTRFS is not the only filesystem which supports maintenance operations. However 
if it can all just nicely be done in userspace, I am all for it.

[1] http://www.netbsd.org/releases/formal-7/NetBSD-7.0.html
(tons of presentation PDFs on their site as well)

Thanks,
-- 
Martin


Re: Regarding handling of file renames in Btrfs

2017-09-16 Thread Martin Raiber
Hi,

On 16.09.2017 14:27 Hans van Kranenburg wrote:
> On 09/10/2017 01:50 AM, Rohan Kadekodi wrote:
>> I was trying to understand how file renames are handled in Btrfs. I
>> read the code documentation, but had a problem understanding a few
>> things.
>>
>> During a file rename, btrfs_commit_transaction() is called which is
>> because Btrfs has to commit the whole FS before storing the
>> information related to the new renamed file.
> Can you point to which lines of code you're looking at?
>
>> It has to commit the FS
>> because a rename first does an unlink, which is not recorded in the
>> btrfs_rename() transaction and so is not logged in the log tree. Is my
>> understanding correct? [...]
> Can you also point to where exactly you see this happening? I'd also
> like to understand more about this.
>
> The whole mail thread following this message continues about what a
> transaction commit is and does etc, but the above question is never
> answered I think.
>
> And I think it's an interesting question. Is a rename a "heavier"
> operation relative to other file operations?
>
as far as I can see it only uses the log tree in some cases where the
log tree was already used for the file or the parent directory. The
cases are documented here
https://github.com/torvalds/linux/blob/master/fs/btrfs/tree-log.c#L45 .
So rename isn't much heavier than unlink+create.

Regards,
Martin Raiber



Re: qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-13 Thread Martin Raiber
Hi,

On 12.09.2017 23:13 Adam Borowski wrote:
> On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-09-12 16:00, Adam Borowski wrote:
>>> Noted.  Both Marat's and my use cases, though, involve VMs that are off most
>>> of the time, and at least for me, turned on only to test something.
>>> Touching mtime makes rsync run again, and it's freaking _slow_: worse than
>>> 40 minutes for a 40GB VM (source:SSD target:deduped HDD).
>> 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
>> you're going direct to a hard drive.  I get better performance than that on
>> my somewhat pathetic NUC based storage cluster (I get roughly 20 MB/s there,
>> but it's for archival storage so I don't really care).  I'm actually curious
>> what the exact rsync command you are using is (you can obviously redact
>> paths as you see fit), as the only way I can think of that it should be that
>> slow is if you're using both --checksum (but if you're using this, you can
>> tell rsync to skip the mtime check, and that issue goes away) and --inplace,
>> _and_ your HDD is slow to begin with.
> rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ mordor:$BASE/qemu
> The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
> with nothing notable on SMART, in a Qnap 253a, kernel 4.9.
>
> Both source and target are btrfs, but here switching to send|receive
> wouldn't give much as this particular guest is Win10 Insider Edition --
> a thingy that shows what the folks from Redmond have cooked up, with roughly
> weekly updates to the tune of ~10GB of writes and 10GB of deletions (even if
> they do incremental transfers, the installation still rewrites the whole system).
>
> Lemme look a bit more, rsync performance is indeed really abysmal compared
> to what it should be.

Self promo, but consider using UrBackup (OSS software, too) instead? For
Windows VMs I would install the client in the VM. It excludes unnecessary
stuff, e.g. page files or the shadow storage area, from the image
backups, and has a mode to store image backups as raw btrfs files.
Linux VMs I'd back up as files, either from the hypervisor or from inside the VM.
If you want to back up big btrfs image files it can do that too, and
faster than rsync, plus it can do incremental backups with sparse files.

Regards,
Martin Raiber



Re: Regarding handling of file renames in Btrfs

2017-09-10 Thread Martin Raiber
Hi,

On 10.09.2017 08:45 Qu Wenruo wrote:
>
>
> On 2017年09月10日 14:41, Qu Wenruo wrote:
>>
>>
>> On 2017年09月10日 07:50, Rohan Kadekodi wrote:
>>> Hello,
>>>
>>> I was trying to understand how file renames are handled in Btrfs. I
>>> read the code documentation, but had a problem understanding a few
>>> things.
>>>
>>> During a file rename, btrfs_commit_transaction() is called which is
>>> because Btrfs has to commit the whole FS before storing the
>>> information related to the new renamed file. It has to commit the FS
>>> because a rename first does an unlink, which is not recorded in the
>>> btrfs_rename() transaction and so is not logged in the log tree. Is my
>>> understanding correct? If yes, my questions are as follows:
>>
>> Not familiar with the rename kernel code, so not much help for the rename
>> operation.
>>
>>>
>>> 1. What does committing the whole FS mean?
>>
>> Committing the whole fs means a lot of things, but generally
>> speaking, it makes the on-disk data consistent with each
>> other.
>
>> For obvious part, it writes modified fs/subvolume trees to disk (with
>> handling of tree operations so no half modified trees).
>>
>> Also other trees like extent tree (very hot since every CoW will
>> update it, and the most complicated one), csum tree if modified.
>>
>> After transaction is committed, the on-disk btrfs will represent the
>> states when commit trans is called, and every tree should match each
>> other.
>>
>> Besides this, after a transaction is committed, the generation of the
>> fs gets increased and modified tree blocks will have the same
>> generation number.
>>
>>> Blktrace shows that there
>>> are 2x 256KB writes, which are essentially writes to the data of
>>> the root directory of the file system (which I found out through
>>> btrfs-debug-tree).
>>
>> I'd say you didn't check btrfs-debug-tree output carefully enough.
>> I strongly recommend to do vimdiff to get what tree is modified.
>>
>> At least the following trees are modified:
>>
>> 1) fs/subvolume tree
>>     Rename modified the DIR_INDEX/DIR_ITEM/INODE_REF at least, and
>>     updated inode time.
>>     So fs/subvolume tree must be CoWed.
>>
>> 2) extent tree
>>     CoW of above metadata operation will definitely cause extent
>>     allocation and freeing, extent tree will also get updated.
>>
>> 3) root tree
>>     Both extent tree and fs/subvolume tree modified, their root bytenr
>>     needs to be updated and root tree must be updated.
>>
>> And finally superblocks.
>>
>> I just verified the behavior with empty btrfs created on a 1G file,
>> only one file to do the rename.
>>
>> In that case (with 4K sectorsize and 16K nodesize), the total IO
>> should be (3 * 16K) * 2 + 4K * 2 = 104K.
>>
>> "3" = number of tree blocks get modified
>> "16K" = nodesize
>> 1st "*2" = DUP profile for metadata
>> "4K" = superblock size
>> 2nd "*2" = 2 superblocks for 1G fs.
>>
>> If your extent/root/fs trees have higher level, then more tree blocks
>> needs to be updated.
>> And if your fs is very large, you may have 3 superblocks.
>>
>>> Is this equivalent to doing a shell sync, as the
>>> same block groups are written during a shell sync too?
>>
>> For shell "sync" the difference is that, "sync" will write all dirty
>> data pages to disk, and then commit transaction.
>> While only calling btrfs_commit_transacation() doesn't trigger dirty
>> page writeback.
>>
>> So there is a difference.
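
(For reference, the 1G-image rename experiment described above can be
reproduced roughly like this; the image path, loop device handling and mount
point are just examples:)

  dd if=/dev/zero of=/tmp/btrfs-test.img bs=1M count=1024
  DEV=$(losetup -f --show /tmp/btrfs-test.img)
  mkfs.btrfs -f "$DEV"
  mkdir -p /mnt/btrfs-test && mount "$DEV" /mnt/btrfs-test
  touch /mnt/btrfs-test/a && sync
  grep "${DEV##*/} " /proc/diskstats    # note the sectors-written field
  mv /mnt/btrfs-test/a /mnt/btrfs-test/b && sync
  grep "${DEV##*/} " /proc/diskstats    # difference = IO from the rename plus commit

The difference between the two readings (in 512-byte sectors) should be in
the ballpark of the 104K calculated above, more if the trees are taller.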

This conversation made me realize why btrfs has sub-optimal metadata
performance. CoW b-trees are not the best data structure for such small
changes. In my application I have multiple operations (e.g. renames)
which can be bundled up, and (mostly) one writer.
I guess using BTRFS_IOC_TRANS_START and BTRFS_IOC_TRANS_END would be one
way to reduce the CoW overhead, but those are dangerous w.r.t. ENOSPC
and there have been discussions about removing them.
Best would be if there were delayed metadata, where metadata is handled
the same as delayed allocations and data changes, i.e. commit on fsync,
commit interval or fssync. I assumed this was already the case...

Please correct me if I got this wrong.

Regards,
Martin Raiber


Re: 4.11.6 / more corruption / root 15455 has a root item with a more recent gen (33682) compared to the found root node (0)

2017-07-09 Thread Martin Steigerwald
Hello Duncan.

Duncan - 09.07.17, 11:17:
> Paul Jones posted on Sun, 09 Jul 2017 09:16:36 + as excerpted:
> >> Marc MERLIN - 08.07.17, 21:34:
> >> > This is now the 3rd filesystem I have (on 3 different machines) that
> >> > is getting corruption of some kind (on 4.11.6).
> >> 
> >> Anyone else getting corruptions with 4.11?
> >> 
> >> I happily switch back to 4.10.17 or even 4.9 if that is the case. I may
> >> even do so just from your reports. Well, yes, I will do exactly that. I
> >> just switch back for 4.10 for now. Better be safe, than sorry.
> > 
> > No corruption for me - I've been on 4.11 since about .2 and everything
> > seems fine. Currently on 4.11.8
> 
> No corruptions here either. 4.12.0 now, previously 4.12-rc5(ish, git),
> before that 4.11.0.
> 
> I have however just upgraded to new ssds then wiped and setup the old
[…]
> Also, all my btrfs are raid1 or dup for checksummed redundancy, and
> relatively small, the largest now 80 GiB per device, after the upgrade.
> And my use-case doesn't involve snapshots or subvolumes.
> 
> So any bug that is most likely on older filesystems, say those without
> the no-holes feature, for instance, or that doesn't tend to hit raid1 or
> dup mode, or that is less likely on small filesystems on fast ssds, or
> that triggers most often with reflinks and thus on filesystems with
> snapshots, is unlikely to hit me.

Hmmm, the BTRFS filesystems on my laptop are 3 to 5 or even more years old. I'll stick 
with 4.10 for now, I think.

The older ones are RAID 1 across two SSDs, the newer one is single device, on 
one SSD.

These filesystems haven't failed me in years, and since 4.5 or 4.6 even the "I 
search for free space" kernel hang (hung tasks and all that) is gone as well.

Thanks,
-- 
Martin


Re: 4.11.6 / more corruption / root 15455 has a root item with a more recent gen (33682) compared to the found root node (0)

2017-07-09 Thread Martin Steigerwald
Hello Marc.

Marc MERLIN - 08.07.17, 21:34:
> Sigh,
> 
> This is now the 3rd filesystem I have (on 3 different machines) that is
> getting corruption of some kind (on 4.11.6).

Anyone else getting corruptions with 4.11?

I'll happily switch back to 4.10.17 or even 4.9 if that is the case. I may even 
do so just from your reports. Well, yes, I will do exactly that. I'll just switch 
back to 4.10 for now. Better safe than sorry.

I know how you feel, Marc. I posted about a corruption on one of my backup 
harddisks here some time ago that btrfs check --repair wasn't able to handle. 
I redid that disk from scratch and it took a long, long time.

I agree with you that this has to stop. Until it does I will never *ever* 
recommend this to a customer. Ideally there would be no corruptions in stable 
kernels, especially when there's a .6 at the end of the version number. But if 
they do happen… then they should be fixable. Other filesystems like Ext4 and 
XFS can do it… so this should be possible with BTRFS as well.

Thanks,
-- 
Martin


Re: [PATCH 02/13] scsi/osd: don't save block errors into req_results

2017-05-26 Thread Martin K. Petersen

Christoph,

> We will only have sense data if the command exectured and got a SCSI
> result, so this is pointless.

"executed"

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [dm-devel] [PATCH 08/15] dm mpath: merge do_end_io_bio into multipath_end_io_bio

2017-05-22 Thread Martin Wilck
On Thu, 2017-05-18 at 15:18 +0200, Christoph Hellwig wrote:
> This simplifies the code and especially the error passing a bit and
> will help with the next patch.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/md/dm-mpath.c | 42 ++++++++++++++++--------------------------
>  1 file changed, 16 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 3df056b73b66..b1cb0273b081 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -1510,24 +1510,26 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
>   return r;
>  }
>  
> -static int do_end_io_bio(struct multipath *m, struct bio *clone,
> -  int error, struct dm_mpath_io *mpio)
> +static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone, int error)
>  {
> + struct multipath *m = ti->private;
> + struct dm_mpath_io *mpio = get_mpio_from_bio(clone);
> + struct pgpath *pgpath = mpio->pgpath;
>   unsigned long flags;
>  
> - if (!error)
> - return 0;   /* I/O complete */
> + BUG_ON(!mpio);

You dereferenced mpio already above.

Regards,
Martin

>  
> - if (noretry_error(error))
> - return error;
> + if (!error || noretry_error(error))
> + goto done;
>  
> - if (mpio->pgpath)
> - fail_path(mpio->pgpath);
> + if (pgpath)
> + fail_path(pgpath);
>  
>   if (atomic_read(&m->nr_valid_paths) == 0 &&
>   !test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) {
>   dm_report_EIO(m);
> - return -EIO;
> + error = -EIO;
> + goto done;
>   }
>  
>   /* Queue for the daemon to resubmit */
> @@ -1539,28 +1541,16 @@ static int do_end_io_bio(struct multipath *m, struct bio *clone,
>   if (!test_bit(MPATHF_QUEUE_IO, &m->flags))
>   queue_work(kmultipathd, &m->process_queued_bios);
>  
> - return DM_ENDIO_INCOMPLETE;
> -}
> -
> -static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone, int error)
> -{
> - struct multipath *m = ti->private;
> - struct dm_mpath_io *mpio = get_mpio_from_bio(clone);
> - struct pgpath *pgpath;
> - struct path_selector *ps;
> - int r;
> -
> - BUG_ON(!mpio);
> -
> - r = do_end_io_bio(m, clone, error, mpio);
> - pgpath = mpio->pgpath;
> + error = DM_ENDIO_INCOMPLETE;
> +done:
>   if (pgpath) {
> - ps = &pgpath->pg->ps;
> + struct path_selector *ps = &pgpath->pg->ps;
> +
>   if (ps->type->end_io)
>   ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes);
>   }
>  
> - return r;
> + return error;
>  }
>  
>  /*

-- 
Dr. Martin Wilck , Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



Re: runtime btrfsck

2017-05-10 Thread Martin Steigerwald
Stefan Priebe - Profihost AG - 10.05.17, 09:02:
> I'm now trying btrfs progs 4.10.2. Is anybody out there who can tell me
> something about the expected runtime or how to fix bad key ordering?

I had a similar issue which remained unresolved.

But I clearly saw that btrfs check was running in a loop, see thread:

[4.9] btrfs check --repair looping over file extent discount errors

So it would be interesting to see the exact output of btrfs check, maybe there 
is something like repeated numbers that also indicate a loop.
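
(One low-tech way to spot such a loop, with the device name adjusted to the
filesystem in question: capture the read-only check output and count repeated
lines.)

  btrfs check /dev/sdXN 2>&1 | tee check.log
  sort check.log | uniq -c | sort -rn | head    # identical lines with huge counts hint at a loop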

I was about to say that BTRFS is production ready before this issue happened. 
I still think that for a lot of setups it mostly is, as at least the "I get stuck on 
the CPU while searching for free space" issue seems to be gone since roughly the 
4.5/4.6 kernels. I also think so regarding absence of data loss: I was able to 
copy all of the data I needed off the broken filesystem.

Yet, when it comes to btrfs check? It's still quite rudimentary if you ask me.  
So unless someone has a clever idea here and shares it with you, it may be 
necessary to back up anything you can from this filesystem and then start over from 
scratch. In my past experience something like xfs_repair surpasses btrfs 
check in the ability to actually fix a broken filesystem by a great margin.

Ciao,
-- 
Martin


Re: [4.9] btrfs check --repair looping over file extent discount errors

2017-04-22 Thread Martin Steigerwald
Martin Steigerwald - 22.04.17, 20:01:
> Chris Murphy - 22.04.17, 09:31:
> > Is the file system created with no-holes?
> 
> I have how to find out about it and while doing accidentally set that

I didn´t find out how to find out about it and…

> feature on another filesystem (btrfstune only seems to be able to enable
> the feature, not show the current state of it).
> 
> But as there is no notice of the feature being set as standard in manpage of
> mkfs.btrfs as of BTRFS tools 4.9.1 and as I didn´t set it myself, I best
> bet is that the feature is not enable on the filesystem.
> 
> Now I wonder… how to disable the feature on that other filesystem again.
-- 
Martin


Re: [4.9] btrfs check --repair looping over file extent discount errors

2017-04-22 Thread Martin Steigerwald
Hello Chris.

Chris Murphy - 22.04.17, 09:31:
> Is the file system created with no-holes?

I have tried to find out how to check it, and while doing so accidentally set that feature 
on another filesystem (btrfstune only seems to be able to enable the feature, 
not show the current state of it).
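
(If I'm not mistaken, the incompat flags can be read from the superblock, which
should show whether NO_HOLES is set; the device name is a placeholder:)

  btrfs inspect-internal dump-super /dev/sdXN | grep -i incompat    # older progs: btrfs-show-super /dev/sdXN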

But as there is no notice of the feature being set by default in the manpage of 
mkfs.btrfs as of btrfs-progs 4.9.1, and as I didn't set it myself, my best bet 
is that the feature is not enabled on the filesystem.

Now I wonder… how to disable the feature on that other filesystem again.

Thanks,


Re: [4.9] btrfs check --repair looping over file extent discount errors

2017-04-22 Thread Martin Steigerwald
Hello.

I am planning to copy of important data on the disk with the broken filesystem 
to the disk with the good filesystem and then reformatitting the disk with the 
broken filesystem soon, probably in the course of the day… so in case you want 
any debug information before that, let me know ASAP.

Thanks,
Martin

Martin Steigerwald - 14.04.17, 21:35:
> Hello,
> 
> backup harddisk connected via eSATA. Hard kernel hang, mouse pointer
> freezing two times seemingly after finishing /home backup and creating new
> snapshot on source BTRFS SSD RAID 1 for / in order to backup it. I did
> scrubbed / and it appears to be okay, but I didn´t run btrfs check on it.
> Anyway deleting that subvolume works and I as I suspected an issue with the
> backup disk I started with that one.
> 
> I got
> 
> merkaba:~> btrfs --version
> btrfs-progs v4.9.1
> 
> merkaba:~> cat /proc/version
> Linux version 4.9.20-tp520-btrfstrim+ (martin@merkaba) (gcc version 6.3.0
> 20170321 (Debian 6.3.0-11) ) #6 SMP PREEMPT Mon Apr 3 11:42:17 CEST 2017
> 
> merkaba:~> btrfs fi sh feenwald
> Label: 'feenwald'  uuid: […]
> Total devices 1 FS bytes used 1.26TiB
> devid1 size 2.73TiB used 1.27TiB path /dev/sdc1
> 
> on Debian unstable on ThinkPad T520 connected via eSATA port on Minidock.
> 
> 
> I am now running btrfs check --repair on it after without --repair the
> command reported file extent discount errors and it appears to loop on the
> same file extent discount errors for ages. Any advice?
> 
> I do have another backup harddisk with BTRFS that worked fine today, so I do
> not need to recover that drive immediately. I may let it run for a little
> more time, but then will abort the repair process as I really think its
> looping just over and over and over the same issues again. At some time I
> may just copy all the stuff that is on that harddisk, but not on the other
> one over to the other one and mkfs.btrfs the filesystem again, but I´d
> rather like to know whats happening here.
> 
> Here is output:
> 
> merkaba:~> btrfs check --repair /dev/sdc1
> enabling repair mode
> Checking filesystem on /dev/sdc1
> [… UUID ommited …]
> checking extents
> Fixed 0 roots.
> checking free space cache
> cache and super generation don't match, space cache will be invalidated
> checking fs roots
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4227072
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4227072
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4227072
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4227072
> [… hours later …]
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4227072
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 143360
> root 257 inode 4980214 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 4227072
> root 257 inode 4979842 errors 100, file extent discount
> Found file extent holes:
> start: 0, len: 78798848
> root 257 inode 4980212 errors 100, file extent discount
> Found fil

[4.9] btrfs check --repair looping over file extent discount errors

2017-04-14 Thread Martin Steigerwald
Hello,

Backup harddisk connected via eSATA. Hard kernel hang, mouse pointer freezing, 
two times, seemingly after finishing the /home backup and creating a new snapshot on 
the source BTRFS SSD RAID 1 for / in order to back it up. I did scrub / and it 
appears to be okay, but I didn't run btrfs check on it. Anyway, deleting that 
subvolume works, and as I suspected an issue with the backup disk, I started 
with that one.

I got

merkaba:~> btrfs --version
btrfs-progs v4.9.1

merkaba:~> cat /proc/version
Linux version 4.9.20-tp520-btrfstrim+ (martin@merkaba) (gcc version 6.3.0 
20170321 (Debian 6.3.0-11) ) #6 SMP PREEMPT Mon Apr 3 11:42:17 CEST 2017

merkaba:~> btrfs fi sh feenwald
Label: 'feenwald'  uuid: […]
Total devices 1 FS bytes used 1.26TiB
devid1 size 2.73TiB used 1.27TiB path /dev/sdc1

on Debian unstable on ThinkPad T520 connected via eSATA port on Minidock.


I am now running btrfs check --repair on it, after the command without --repair 
reported file extent discount errors, and it appears to loop over the same file 
extent discount errors for ages. Any advice?

I do have another backup harddisk with BTRFS that worked fine today, so I do 
not need to recover that drive immediately. I may let it run for a little more 
time, but then I will abort the repair process, as I really think it's looping 
over and over the same issues again. At some point I may just 
copy all the stuff that is on that harddisk but not on the other one over to 
the other one and mkfs.btrfs the filesystem again, but I'd rather like to know 
what's happening here.

Here is output:

merkaba:~> btrfs check --repair /dev/sdc1
enabling repair mode
Checking filesystem on /dev/sdc1
[… UUID omitted …]
checking extents
Fixed 0 roots.
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4227072
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4227072
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4227072
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4227072
[… hours later …]
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4227072
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4227072
root 257 inode 4979842 errors 100, file extent discount
Found file extent holes:
start: 0, len: 78798848
root 257 inode 4980212 errors 100, file extent discount
Found file extent holes:
start: 0, len: 143360
root 257 inode 4980214 errors 100, file extent discount
Found file extent holes:
start: 0, len: 4227072

This basically seems to go on like this forever.

Thanks,
-- 
Martin


Re: Do different btrfs volumes compete for CPU?

2017-04-06 Thread Martin
On 05/04/17 08:04, Marat Khalili wrote:
> On 04/04/17 20:36, Peter Grandi wrote:
>> SATA works for external use, eSATA works well, but what really
>> matters is the chipset of the adapter card.
> eSATA might be sound electrically, but mechanically it is awful. Try to
> run it for months in a crowded server room, and inevitably you'll get
> disconnections and data corruption. Tried different cables, brackets --
> same result. If you ever used eSATA connector, you'd feel it.

Been using eSATA here for multiple disk packs continuously connected for
a few years now for 48TB of data (not enough room in the host for the
disks).

Never suffered an eSATA disconnect.

Had the usual cooling fan failures and HDD failures due to old age.


All just a case of ensuring undisturbed clean cabling and a good UPS?...

(BTRFS spanning four disks per external pack has worked well also.)

Good luck,
Martin




Re: Root volume (ID 5) in deleting state

2017-02-14 Thread Martin Mlynář



It looks like you're right!

On a different machine:

# btrfs sub list / | grep -v lxc
ID 327 gen 1959587 top level 5 path mnt/reaver
ID 498 gen 593655 top level 5 path var/lib/machines

# btrfs sub list / -d | wc -l
0

Ok, apparently it's a regression in one of the latest versions then.
But, it seems quite harmless.

I'm glad my data are safe :)






# uname -a
Linux interceptor 4.9.6-1-ARCH #1 SMP PREEMPT Thu Jan 26 09:22:26 CET
2017 x86_64 GNU/Linux

# btrfs fi show  /
Label: none  uuid: 859dec5c-850c-4660-ad99-bc87456aa309
  Total devices 1 FS bytes used 132.89GiB
  devid1 size 200.00GiB used 200.00GiB path
/dev/mapper/vg0-btrfsroot

As a side note, all of your disk space is allocated (200GiB of 200GiB).

Even while there's still 70GiB of free space scattered around inside,
this might lead to out-of-space issues, depending on how badly
fragmented that free space is.

I have not noticed this at all!

# btrfs fi show /
Label: none  uuid: 859dec5c-850c-4660-ad99-bc87456aa309
 Total devices 1 FS bytes used 134.23GiB
 devid1 size 200.00GiB used 200.00GiB path /dev/mapper/vg0-btrfsroot

# btrfs fi df /
Data, single: total=195.96GiB, used=131.58GiB
System, single: total=3.00MiB, used=48.00KiB
Metadata, single: total=4.03GiB, used=2.64GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

After btrfs defrag there is no difference. btrfs fi show says still
200/200. I'll try to play with it.


[ ... ]

So, to get the numbers of total raw disk space allocation down, you need
to defragment free space (compact the data), not defrag used space.

You can even create pictures of space utilization in your btrfs
filesystem, which might help understanding what it looks like right now: \o/

https://github.com/knorrie/btrfs-heatmap/
I ran into your tool yesterday while googling around this - thanks, 
it's a really nice tool. Now a rebalance is running and it seems to work well.
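
(For reference, the kind of filtered rebalance meant here looks roughly like
this; the usage threshold is just an example value:)

  btrfs balance start -dusage=50 /    # rewrite data chunks that are at most 50% used, returning freed chunks to unallocated space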


Thank you for excellent responses and help!





Re: Root volume (ID 5) in deleting state

2017-02-13 Thread Martin Mlynář

On 13.2.2017 21:03, Hans van Kranenburg wrote:

On 02/13/2017 12:26 PM, Martin Mlynář wrote:

I've currently run into strange problem with BTRFS. I'm using it as my
daily driver as root FS. Nothing complicated, just few subvolumes and
incremental backups using btrbk.

Now I've noticed that my btrfs root volume (absolute top, ID 5) is in
"deleting" state. As I've done some testing and googling it seems that
this should not be possible.

[...]

# btrfs sub list -ad /mnt/btrfs_root/
ID 5 gen 257505 top level 0 path /DELETED

I have heard rumours that this is actually a bug in the output of sub
list itself.

What's the version of your btrfs-progs? (output of `btrfs version`)

Sorry, I've lost this part:

$ btrfs version
btrfs-progs v4.9




# mount | grep btr
/dev/mapper/vg0-btrfsroot on / type btrfs
(rw,noatime,nodatasum,nodatacow,ssd,discard,space_cache,subvolid=1339,subvol=/rootfs)

/dev/mapper/vg0-btrfsroot on /mnt/btrfs_root type btrfs
(rw,noatime,nodatasum,nodatacow,ssd,discard,space_cache,subvolid=5,subvol=/)

The rumour was that it had something to do with using space_cache=v2,
which this example does not confirm.

It looks like you're right!

On a different machine:

# btrfs sub list / | grep -v lxc
ID 327 gen 1959587 top level 5 path mnt/reaver
ID 498 gen 593655 top level 5 path var/lib/machines

# btrfs sub list / -d | wc -l
0

# btrfs version
btrfs-progs v4.8.2

# uname -a
Linux nxserver 4.8.6-1-ARCH #1 SMP PREEMPT Mon Oct 31 18:51:30 CET 2016 
x86_64 GNU/Linux


# mount | grep btrfs
/dev/vda1 on / type btrfs 
(rw,relatime,nodatasum,nodatacow,space_cache,subvolid=5,subvol=/)


Then I've upgraded this machine and:

# btrfs sub list / | grep -v lxc
ID 327 gen 1959587 top level 5 path mnt/reaver
ID 498 gen 593655 top level 5 path var/lib/machines

# btrfs sub list / -d | wc -l
1

# btrfs sub list / -d
ID 5 gen 2186037 top level 0 path DELETED<==

1

# btrfs version
btrfs-progs v4.9

# uname -a
Linux nxserver 4.9.8-1-ARCH #1 SMP PREEMPT Mon Feb 6 12:59:40 CET 2017 
x86_64 GNU/Linux


# mount | grep btrfs
/dev/vda1 on / type btrfs 
(rw,relatime,nodatasum,nodatacow,space_cache,subvolid=5,subvol=/)






# uname -a
Linux interceptor 4.9.6-1-ARCH #1 SMP PREEMPT Thu Jan 26 09:22:26 CET
2017 x86_64 GNU/Linux

# btrfs fi show  /
Label: none  uuid: 859dec5c-850c-4660-ad99-bc87456aa309
 Total devices 1 FS bytes used 132.89GiB
 devid1 size 200.00GiB used 200.00GiB path /dev/mapper/vg0-btrfsroot

As a side note, all of your disk space is allocated (200GiB of 200GiB).

Even while there's still 70GiB of free space scattered around inside,
this might lead to out-of-space issues, depending on how badly
fragmented that free space is.

I have not noticed this at all!

# btrfs fi show /
Label: none  uuid: 859dec5c-850c-4660-ad99-bc87456aa309
Total devices 1 FS bytes used 134.23GiB
devid1 size 200.00GiB used 200.00GiB path /dev/mapper/vg0-btrfsroot

# btrfs fi df /
Data, single: total=195.96GiB, used=131.58GiB
System, single: total=3.00MiB, used=48.00KiB
Metadata, single: total=4.03GiB, used=2.64GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

After btrfs defrag there is no difference. btrfs fi show says still 
200/200. I'll try to play with it.



--
Martin Mlynář


Root volume (ID 5) in deleting state

2017-02-13 Thread Martin Mlynář

Hello,


I've currently run into a strange problem with BTRFS. I'm using it as my 
daily driver as the root FS. Nothing complicated, just a few subvolumes and 
incremental backups using btrbk.


Now I've noticed that my btrfs root volume (absolute top, ID 5) is in the 
"deleting" state. From the testing and googling I've done, it seems that 
this should not be possible.


I've tried scrubbing and checking, but nothing changed. Volume is not 
being deleted in reality. It just sits there in this state.


Is there anything I can do to fix this?

# btrfs sub list -a /mnt/btrfs_root/
ID 1339 gen 262150 top level 5 path rootfs
ID 1340 gen 262101 top level 5 path .btrbk
ID 1987 gen 262149 top level 5 path no_backup
ID 4206 gen 255869 top level 1340 path /.btrbk/rootfs.20170121T1829
ID 4272 gen 257460 top level 1340 path /.btrbk/rootfs.20170123T0933
ID 4468 gen 259194 top level 1340 path /.btrbk/rootfs.20170131T1132
ID 4474 gen 260911 top level 1340 path /.btrbk/rootfs.20170207T0927
ID 4476 gen 261712 top level 1340 path /.btrbk/rootfs.20170211T
ID 4477 gen 261970 top level 1340 path /.btrbk/rootfs.20170212T1331
ID 4478 gen 262102 top level 1340 path /.btrbk/rootfs.20170213T

# btrfs sub list -ad /mnt/btrfs_root/
ID 5 gen 257505 top level 0 path /DELETED

# mount | grep btr
/dev/mapper/vg0-btrfsroot on / type btrfs 
(rw,noatime,nodatasum,nodatacow,ssd,discard,space_cache,subvolid=1339,subvol=/rootfs)
/dev/mapper/vg0-btrfsroot on /mnt/btrfs_root type btrfs 
(rw,noatime,nodatasum,nodatacow,ssd,discard,space_cache,subvolid=5,subvol=/)


# uname -a
Linux interceptor 4.9.6-1-ARCH #1 SMP PREEMPT Thu Jan 26 09:22:26 CET 
2017 x86_64 GNU/Linux


# btrfs fi show  /
Label: none  uuid: 859dec5c-850c-4660-ad99-bc87456aa309
Total devices 1 FS bytes used 132.89GiB
devid1 size 200.00GiB used 200.00GiB path /dev/mapper/vg0-btrfsroot


Thank you for your time,


Best regards

--

Martin Mlynář



Re: BTRFS for OLTP Databases

2017-02-08 Thread Martin Raiber
On 08.02.2017 14:08 Austin S. Hemmelgarn wrote:
> On 2017-02-08 07:14, Martin Raiber wrote:
>> Hi,
>>
>> On 08.02.2017 03:11 Peter Zaitsev wrote:
>>> Out of curiosity, I see one problem here:
>>> If you're doing snapshots of the live database, each snapshot leaves
>>> the database files like killing the database in-flight. Like shutting
>>> the system down in the middle of writing data.
>>>
>>> This is because I think there's no API for user space to subscribe to
>>> events like a snapshot - unlike e.g. the VSS API (volume snapshot
>>> service) in Windows. You should put the database into frozen state to
>>> prepare it for a hotcopy before creating the snapshot, then ensure all
>>> data is flushed before continuing.
>>>
>>> I think I've read that btrfs snapshots do not guarantee single point in
>>> time snapshots - the snapshot may be smeared across a longer period of
>>> time while the kernel is still writing data. So parts of your writes
>>> may still end up in the snapshot after issuing the snapshot command,
>>> instead of in the working copy as expected.
>>>
>>> How is this going to be addressed? Is there some snapshot aware API to
>>> let user space subscribe to such events and do proper preparation? Is
>>> this planned? LVM could be a user of such an API, too. I think this
>>> could have nice enterprise-grade value for Linux.
>>>
>>> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But
>>> still, also this needs to be integrated with MySQL to properly work. I
>>> once (years ago) researched on this but gave up on my plans when I
>>> planned database backups for our web server infrastructure. We moved to
>>> creating SQL dumps instead, although there're binlogs which can be used
>>> to recover to a clean and stable transactional state after taking
>>> snapshots. But I simply didn't want to fiddle around with properly
>>> cleaning up binlogs which accumulate horribly much space usage over
>>> time. The cleanup process requires to create a cold copy or dump of the
>>> complete database from time to time, only then it's safe to remove all
>>> binlogs up to that point in time.
>>
>> little bit off topic, but I for one would be on board with such an
>> effort. It "just" needs coordination between the backup
>> software/snapshot tools, the backed up software and the various snapshot
>> providers. If you look at the Windows VSS API, this would be a
>> relatively large undertaking if all the corner cases are taken into
>> account, like e.g. a database having the database log on a separate
>> volume from the data, dependencies between different components etc.
>>
>> You'll know more about this, but databases usually fsync quite often in
>> their default configuration, so btrfs snapshots shouldn't be much behind
>> the properly snapshotted state, so I see the advantages more with
>> usability and taking care of corner cases automatically.
> Just my perspective, but BTRFS (and XFS, and OCFS2) already provide
> reflinking to userspace, and therefore it's fully possible to
> implement this in userspace.  Having a version of the fsfreeze (the
> generic form of xfs_freeze) stuff that worked on individual sub-trees
> would be nice from a practical perspective, but implementing it would
> not be easy by any means, and would be essentially necessary for a
> VSS-like API.  In the meantime though, it is fully possible for the
> application software to implement this itself without needing anything
> more from the kernel.

VSS snapshots whole volumes, not individual files (so comparable to an
LVM snapshot). A sub-folder freeze would be useful in some
situations, but duplicating the files+extents might also take too long
in a lot of situations. You are correct that the kernel features are
there and what is missing is a user-space daemon, plus a protocol that
facilitates/coordinates the backups/snapshots.

Sending a FIFREEZE ioctl, taking a snapshot and then thawing does not
really help in some situations, as e.g. MySQL InnoDB uses O_DIRECT and
manages its own buffer pool, which won't see the FIFREEZE and flush; but
as said, the default configuration is to flush/fsync on every commit.
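
(For completeness, the freeze/snapshot/thaw sequence being discussed looks
roughly like this when the snapshot is taken one layer below the filesystem,
e.g. with LVM; the mount point, VG/LV names and size are made up, and as noted
above a frozen fs still does not flush InnoDB's O_DIRECT buffer pool:)

  fsfreeze -f /var/lib/mysql                        # block new writes, flush dirty pages
  lvcreate -s -n mysql-snap -L 5G /dev/vg0/mysql    # storage-level snapshot of the frozen volume
  fsfreeze -u /var/lib/mysql                        # thaw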







Re: BTRFS for OLTP Databases

2017-02-08 Thread Martin Raiber
Hi,

On 08.02.2017 03:11 Peter Zaitsev wrote:
> Out of curiosity, I see one problem here:
> If you're doing snapshots of the live database, each snapshot leaves
> the database files like killing the database in-flight. Like shutting
> the system down in the middle of writing data.
>
> This is because I think there's no API for user space to subscribe to
> events like a snapshot - unlike e.g. the VSS API (volume snapshot
> service) in Windows. You should put the database into frozen state to
> prepare it for a hotcopy before creating the snapshot, then ensure all
> data is flushed before continuing.
>
> I think I've read that btrfs snapshots do not guarantee single point in
> time snapshots - the snapshot may be smeared across a longer period of
> time while the kernel is still writing data. So parts of your writes
> may still end up in the snapshot after issuing the snapshot command,
> instead of in the working copy as expected.
>
> How is this going to be addressed? Is there some snapshot aware API to
> let user space subscribe to such events and do proper preparation? Is
> this planned? LVM could be a user of such an API, too. I think this
> could have nice enterprise-grade value for Linux.
>
> XFS has xfs_freeze and xfs_thaw for this, to prepare LVM snapshots. But
> still, also this needs to be integrated with MySQL to properly work. I
> once (years ago) researched on this but gave up on my plans when I
> planned database backups for our web server infrastructure. We moved to
> creating SQL dumps instead, although there're binlogs which can be used
> to recover to a clean and stable transactional state after taking
> snapshots. But I simply didn't want to fiddle around with properly
> cleaning up binlogs which accumulate horribly much space usage over
> time. The cleanup process requires to create a cold copy or dump of the
> complete database from time to time, only then it's safe to remove all
> binlogs up to that point in time.

A little bit off topic, but I for one would be on board with such an
effort. It "just" needs coordination between the backup
software/snapshot tools, the backed-up software and the various snapshot
providers. If you look at the Windows VSS API, this would be a
relatively large undertaking if all the corner cases are taken into
account, like e.g. a database having the database log on a separate
volume from the data, dependencies between different components, etc.

You'll know more about this, but databases usually fsync quite often in
their default configuration, so btrfs snapshots shouldn't be much behind
a properly snapshotted state; hence I see the advantages more in
usability and in taking care of corner cases automatically.

Regards,
Martin Raiber





Re: [markfasheh/duperemove] Why blocksize is limit to 1MB?

2017-01-03 Thread Martin Raiber
On 04.01.2017 00:43 Hans van Kranenburg wrote:
> On 01/04/2017 12:12 AM, Peter Becker wrote:
>> Good hint, this would be an option and i will try this.
>>
>> Regardless of this the curiosity has packed me and I will try to
>> figure out where the problem with the low transfer rate is.
>>
>> 2017-01-04 0:07 GMT+01:00 Hans van Kranenburg 
>> :
>>> On 01/03/2017 08:24 PM, Peter Becker wrote:
>>>> All invocations are justified, but not relevant in (offline) backup
>>>> and archive scenarios.
>>>>
>>>> For example you have multiple versions of append-only log-files or
>>>> append-only db-files (each more than 100GB in size), like this:
>>>>
>>>>> Snapshot_01_01_2017
>>>> -> file1.log .. 201 GB
>>>>
>>>>> Snapshot_02_01_2017
>>>> -> file1.log .. 205 GB
>>>>
>>>>> Snapshot_03_01_2017
>>>> -> file1.log .. 221 GB
>>>>
>>>> The first 201 GB would be the same every time.
>>>> Files are copied at night from Windows, Linux or BSD systems and
>>>> snapshotted after the copy.
>>> XY problem?
>>>
>>> Why not use rsync --inplace in combination with btrfs snapshots? Even if
>>> the remote does not support rsync and you need to pull the full file
>>> first, you could again use rsync locally.
> please don't toppost
>
> Also, there is a rather huge difference in the two approaches, given the
> way how btrfs works internally.
>
> Say, I have a subvolume with thousands of directories and millions of
> files with random data in it, and I want to have a second deduped copy
> of it.
>
> Approach 1:
>
> Create a full copy of everything (compare: retrieving remote file again)
> (now 200% of data storage is used), and after that do deduplication, so
> that again only 100% of data storage is used.
>
> Approach 2:
>
> cp -av --reflink original/ copy/
>
> By doing this, you end up with the same as doing approach 1 if your
> deduper is the most ideal in the world (and the files are so random they
> don't contain duplicate blocks inside them).
>
> Approach 3:
>
> btrfs sub snap original copy
>
> W00t, that was fast, and the only thing that happened was writing a few
> 16kB metadata pages again. (1 for the toplevel tree page that got cloned
> into a new filesystem tree, and a few for the blocks one level lower to
> add backreferences to the new root).
>
> So:
>
> The big difference in the end result between approach 1,2 and otoh 3 is
> that while deduplicating your data, you're actually duplicating all your
> metadata at the same time.
>
> In your situation, if possible, doing an rsync --inplace from the remote,
> so that only changed/appended data gets stored, and then using native
> btrfs snapshotting would seem the most effective.
>
Or use UrBackup as backup software. It uses the snapshot-then-modify
approach with btrfs, plus you get file-level deduplication between
clients using reflinks.
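
(A minimal sketch of the rsync-plus-snapshot route suggested above; the host
and paths are made up:)

  rsync --inplace --partial backuphost:/data/file1.log /backup/current/file1.log
  btrfs subvolume snapshot -r /backup/current /backup/snapshots/$(date +%Y%m%d)

With --inplace only the changed/appended blocks get rewritten in the existing
file, and the read-only snapshot then shares all untouched extents with the
previous days' snapshots.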






  1   2   3   4   5   6   7   8   9   >