On Mar 9, 2018, at 3:54 AM, Nikolay Borisov <nbori...@suse.com> wrote:
> <snip>
>> Sorry, I clearly missed that one. I have applied the patch you referenced 
>> and rebooted the VM in question. This morning we had another FS failure on 
>> the same machine that caused it to go into readonly mode. This happened 
>> after that device was experiencing 100% I/O utilization for some time. No 
>> balance was running at the time; last balance finished about 6 hours prior 
>> to the error.
>> 
>> Kernel messages:
>> [211238.262683] use_block_rsv: 163 callbacks suppressed
>> [211238.262683] BTRFS: block rsv returned -28
>> [211238.266718] ------------[ cut here ]------------
>> [211238.270462] WARNING: CPU: 0 PID: 391 at fs/btrfs/extent-tree.c:8463 btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
>> [211238.277203] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE 
>> nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
>> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype 
>> iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c crc32c_generic 
>> br_netfilter bridge stp llc intel_rapl sb_edac crct10dif_pclmul crc32_pclmul 
>> ghash_clmulni_intel ppdev parport_pc intel_rapl_perf parport serio_raw evdev 
>> ip_tables x_tables autofs4 btrfs xor zstd_decompress zstd_compress xxhash 
>> raid6_pq ata_generic crc32c_intel ata_piix libata xen_blkfront cirrus ttm 
>> drm_kms_helper aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse 
>> drm ena scsi_mod i2c_piix4 button
>> [211238.319618] CPU: 0 PID: 391 Comm: btrfs-transacti Tainted: G        W       4.14.13 #3
>> [211238.325479] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
>> [211238.330742] task: ffff9cb43abb70c0 task.stack: ffffb234c3b58000
>> [211238.335575] RIP: 0010:btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
>> [211238.340454] RSP: 0018:ffffb234c3b5b958 EFLAGS: 00010282
>> [211238.344782] RAX: 000000000000001d RBX: ffff9cb43bdea128 RCX: 0000000000000000
>> [211238.350562] RDX: 0000000000000000 RSI: ffff9cb440a166f8 RDI: ffff9cb440a166f8
>> [211238.356066] RBP: 0000000000004000 R08: 0000000000000001 R09: 0000000000007d81
>> [211238.361649] R10: 0000000000000001 R11: 0000000000007d81 R12: ffff9cb43bdea000
>> [211238.367304] R13: ffff9cb437f2c800 R14: 0000000000000001 R15: 00000000ffffffe4
>> [211238.372658] FS:  0000000000000000(0000) GS:ffff9cb440a00000(0000) knlGS:0000000000000000
>> [211238.379048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [211238.384681] CR2: 00007f90a6677000 CR3: 00000003cea0a006 CR4: 00000000001606f0
>> [211238.391380] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [211238.398050] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [211238.404730] Call Trace:
>> [211238.407880]  __btrfs_cow_block+0x125/0x5c0 [btrfs]
>> [211238.412455]  btrfs_cow_block+0xcb/0x1b0 [btrfs]
>> [211238.416292]  btrfs_search_slot+0x1fd/0x9e0 [btrfs]
>> [211238.420630]  lookup_inline_extent_backref+0x105/0x610 [btrfs]
>> [211238.425215]  __btrfs_free_extent.isra.61+0xf5/0xd30 [btrfs]
>> [211238.429663]  __btrfs_run_delayed_refs+0x516/0x12a0 [btrfs]
>> [211238.434077]  btrfs_run_delayed_refs+0x7a/0x270 [btrfs]
>> [211238.438541]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
>> [211238.442899]  ? remove_wait_queue+0x60/0x60
>> [211238.446503]  transaction_kthread+0x195/0x1b0 [btrfs]
>> [211238.450578]  kthread+0xfc/0x130
>> [211238.453924]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
>> [211238.458381]  ? kthread_create_on_node+0x70/0x70
>> [211238.462225]  ? do_group_exit+0x3a/0xa0
>> [211238.465586]  ret_from_fork+0x1f/0x30
>> [211238.468814] Code: ff 48 c7 c6 28 97 58 c0 48 c7 c7 a0 e1 5d c0 e8 0c d0 f7 d5 85 c0 0f 84 1c fd ff ff 44 89 fe 48 c7 c7 58 0c 59 c0 e8 70 2f 9e d5 <0f> ff e9 06 fd ff ff 4c 63 e8 31 d2 48 89 ee 48 89 df e8 4e eb
>> [211238.482366] ---[ end trace 48dd1ab4e2e46f6e ]---
>> [211238.486524] BTRFS info (device xvdc): space_info 4 has 18446744073258958848 free, is not full
>> [211238.493014] BTRFS info (device xvdc): space_info total=10737418240, used=7828127744, pinned=2128166912, reserved=243367936, may_use=988282880, readonly=65536
> 
> Ok so the numbers here are helpful, they show that we have enough space
> to allocate a chunk. I've also looked at the logic in 4.14.13 and all
> the necessary patches are there. Unfortunately none of this matters due
> to the fact that reserve_metadata_bytes is being called with
> BTRFS_RESERVE_NO_FLUSH from use_block_rsv, meaning the code won't make
> any effort to flush anything at all.
> 
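A side note on the space_info numbers quoted above: the enormous "free" value is a u64 counter that went negative and wrapped around, not 16 EiB of free space. A quick check in Python, using the field values copied verbatim from the log (the interpretation as a wrapped counter is my own reading, not something stated in the thread):

```python
import errno

# Field values copied verbatim from the space_info dump in the log above.
free_u64 = 18446744073258958848
total    = 10737418240
used     = 7828127744
pinned   = 2128166912
reserved = 243367936
may_use  = 988282880
readonly = 65536

# Reinterpret the printed u64 as a signed 64-bit number: the counter
# wrapped below zero.
free_signed = free_u64 - (1 << 64)
print(free_signed)  # -450592768, i.e. roughly -430 MiB

# Cross-check: total minus every accounted category yields the same number.
print(total - (used + pinned + reserved + may_use + readonly))  # -450592768

# And the -28 in "block rsv returned -28" is -ENOSPC.
print(errno.errorcode[28])  # ENOSPC
```

The two computations agreeing confirms it is a plain underflow of the free-space accounting rather than corrupted numbers.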
> Can you tell again what the workload is - is it some sort of a database,
> constantly writing to its files?

Yes, we have PostgreSQL databases running on these VMs that put a heavy I/O load 
on them. We also have snapshots being deleted and created every 15 minutes. 
Looking at historical atop data for the two most recent crashes:

1. Right before last week's crash, in the last three minutes before the crash, 
the Postgres databases were averaging a total of 45 MB/s in reads and 29 MB/s 
in writes. A btrfs balance was also running at that time, reading at 1 MB/s and 
writing at 30 MB/s. The btrfs-transacti kernel processes were doing only a 
small amount of disk I/O until a minute before the crash, at which point they 
ramped up to 0.4 MB/s in reads and 6 MB/s in writes. For cumulative I/O on the 
entire VM in the three minutes prior to the crash, we saw an average of 95% I/O 
utilization, 2,200 read ops/sec, 47 MB/s read, 2,700 write ops/sec, and 77 MB/s 
written.

2. Right before yesterday's crash (which had no balance running), the Postgres 
databases were averaging a total of 81 MB/s in reads and 14 MB/s in writes. The 
btrfs-cleaner and btrfs-transacti kernel processes weren't doing much I/O until 
the minute before the crash, at which point btrfs-cleaner was reading at 2 MB/s 
and writing at 16 MB/s, and btrfs-transacti was reading at 1 MB/s and writing 
at 5 MB/s. For cumulative I/O on the entire VM in the three minutes prior to 
the crash, we saw an average of 103% I/O utilization, 2,600 read ops/sec, 86 
MB/s read, 1,300 write ops/sec, and 35 MB/s written.

I just realized both crashes happened within 5 minutes of snapshot creation + 
deletion. Not sure if that's just a coincidence.


> If so, btrfs is not really suited for
> rewrite-heavy workloads since this causes excessive CoW. In such cases
> you really ought to set nodatacow on the specified files. For more
> information:
> 
> https://btrfs.wiki.kernel.org/index.php/FAQ#Can_copy-on-write_be_turned_off_for_data_blocks.3F

Am I correct in understanding that nodatacow doesn't really avoid CoW when 
you're using snapshots? On a filesystem that's snapshotted every 15 minutes, is 
there a difference between normal CoW and nodatacow when (as in the case of 
Postgres) you update a small portion of a 1 GB file many times per minute? Do 
you anticipate us seeing a benefit in stability and performance if we set 
nodatacow for the entire FS while retaining snapshots? Does nodatacow increase 
the chance of corruption in a database like Postgres, i.e. are writes still 
properly ordered/synced when flushed to disk?


> The other thing that comes to mind is to try and tune the default commit
> interval. Currently this is 30 seconds, meaning a transaction will
> happen roughly every 30 seconds (unless there is enough data batched
> that it should happen "NOW", where "NOW" is defined as "during the life
> cycle of some arbitrary operation"). Perhaps in your case you could
> potentially lower it to, say 15s. This will likely have performance
> impact but should reduce the premature ENOSPC since there should be a
> lot less data in the transaction.
> 
> In the meantime I will investigate what can be done in case we get
> ENOSPC from NO_FLUSH contexts, my feeling is that we should be able to
> force allocate a new chunk. But this will take some time to verify.
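The commit-interval tuning suggested above would look roughly like this; the /data mount point and the fstab device are assumptions about our setup, not taken from the thread:

```shell
# Lower the btrfs transaction commit interval from the default 30s to 15s.
mount -o remount,commit=15 /data

# Persistent variant in /etc/fstab (device and mount point assumed):
# /dev/xvdc  /data  btrfs  defaults,commit=15  0 0
```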

Thanks for your help!

Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
