Hello everyone, I have encountered a weird out of memory problem using BTRFS, snapshots and duperemove. The workload is described as: - Lots of static (400G/1T) data which was deduplicated using duperemove which saved about 50GB. - Backups are saved to the BTRFS every 2 days, backup take about 2 hours. - Backups are deduped every weeks. - Snapshots are taken every hour for 10 days.
After about 10 days worth of snapshot and 8 days worth of dedupe, the system started slowing down after each snapshot removal. I decided to wipe all snapshots and stop de-duping data, I proceeded and removed snapshots 8 at a time, waiting for btrfs-transaction to stop. After some time, PC locked up and I could not do anything else but restart, reboot was caused because of the free memory and cache being all used up. It seems the kernel module could not use any swap and out of memory killer had no victim. I was able to start the operation again by mounting using thread_pool=1 and stopping all services, it went well for the first ~100 snapshots but then the condition appeared again, after multiple attemps I tried mounting R/O and it worked. I currently have a VM up on this machine which has 3GB RAM, the VM has now 4GB RAM and the host has 4GB additional swap. It is currently making progress but I think this might be considered as a bug. I have some dmesg logs showing the problem. 16:53:17 srv kernel: [ 68.040380] BTRFS info (device dm-6): disk space caching is enabled 16:53:17 srv kernel: [ 68.040385] BTRFS: has skinny extents 16:56:09 srv kernel: [ 240.112032] INFO: task kworker/u8:6:162 blocked for more than 120 seconds. 16:56:09 srv kernel: [ 240.112086] Not tainted 4.2.0-27-generic #32-Ubuntu 16:56:09 srv kernel: [ 240.112124] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 16:56:09 srv kernel: [ 240.112177] kworker/u8:6 D ffff8800bfa16640 0 162 2 0x00000000 16:56:09 srv kernel: [ 240.112213] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] 16:56:09 srv kernel: [ 240.112217] ffff880034f33c88 0000000000000046 ffffffff81c13500 ffff880000145780 16:56:09 srv kernel: [ 240.112222] 0000000000000246 ffff880034f34000 ffff8800945b79f0 ffff8800945b79f0 16:56:09 srv kernel: [ 240.112226] ffff88005ddd1450 0000000000000001 ffff880034f33ca8 ffffffff817edc37 16:56:09 srv kernel: [ 240.112230] Call Trace: 16:56:09 srv kernel: [ 240.112239] [<ffffffff817edc37>] schedule+0x37/0x80 16:56:09 srv kernel: [ 240.112260] [<ffffffffc03adc13>] wait_current_trans.isra.21+0xd3/0x120 [btrfs] 16:56:09 srv kernel: [ 240.112265] [<ffffffff810bd760>] ? wake_atomic_t_function+0x60/0x60 16:56:09 srv kernel: [ 240.112286] [<ffffffffc03af5ec>] start_transaction+0x3ac/0x570 [btrfs] 16:56:09 srv kernel: [ 240.112307] [<ffffffffc03af807>] btrfs_join_transaction+0x17/0x20 [btrfs] 16:56:09 srv kernel: [ 240.112324] [<ffffffffc0399d78>] delayed_ref_async_start+0x18/0x90 [btrfs] 16:56:09 srv kernel: [ 240.112346] [<ffffffffc03e04aa>] btrfs_scrubparity_helper+0xca/0x290 [btrfs] 16:56:09 srv kernel: [ 240.112369] [<ffffffffc03e06ae>] btrfs_extent_refs_helper+0xe/0x10 [btrfs] 16:56:09 srv kernel: [ 240.112374] [<ffffffff810947ba>] process_one_work+0x1aa/0x440 16:56:09 srv kernel: [ 240.112378] [<ffffffff81094a9b>] worker_thread+0x4b/0x4c0 16:56:09 srv kernel: [ 240.112382] [<ffffffff81094a50>] ? process_one_work+0x440/0x440 16:56:09 srv kernel: [ 240.112386] [<ffffffff8109ae38>] kthread+0xd8/0xf0 16:56:09 srv kernel: [ 240.112390] [<ffffffff8109ad60>] ? kthread_create_on_node+0x1f0/0x1f0 16:56:09 srv kernel: [ 240.112394] [<ffffffff817f209f>] ret_from_fork+0x3f/0x70 16:56:09 srv kernel: [ 240.112398] [<ffffffff8109ad60>] ? kthread_create_on_node+0x1f0/0x1f0 16:56:09 srv kernel: [ 240.112422] INFO: task mount:2323 blocked for more than 120 seconds. 16:56:09 srv kernel: [ 240.112466] Not tainted 4.2.0-27-generic #32-Ubuntu 16:56:09 srv kernel: [ 240.112503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 16:56:09 srv kernel: [ 240.112555] mount D 0000000000000000 0 2323 2076 0x00000000 16:56:09 srv kernel: [ 240.112560] ffff8800b709b928 0000000000000086 ffff880000145780 ffff8800b8dd3e80 16:56:09 srv kernel: [ 240.112564] 0000000000000246 ffff8800b709c000 ffff8800945b79f0 ffff8800945b79f0 16:56:09 srv kernel: [ 240.112568] ffff88005ddd1a10 0000000000000001 ffff8800b709b948 ffffffff817edc37 16:56:09 srv kernel: [ 240.112573] Call Trace: 16:56:09 srv kernel: [ 240.112577] [<ffffffff817edc37>] schedule+0x37/0x80 16:56:09 srv kernel: [ 240.112597] [<ffffffffc03adc13>] wait_current_trans.isra.21+0xd3/0x120 [btrfs] 16:56:09 srv kernel: [ 240.112600] [<ffffffff810bd760>] ? wake_atomic_t_function+0x60/0x60 16:56:09 srv kernel: [ 240.112620] [<ffffffffc03af63c>] start_transaction+0x3fc/0x570 [btrfs] 16:56:09 srv kernel: [ 240.112639] [<ffffffffc039fd50>] ? btrfs_update_root+0xf0/0x340 [btrfs] 16:56:09 srv kernel: [ 240.112658] [<ffffffffc03af7cb>] btrfs_start_transaction+0x1b/0x20 [btrfs] 16:56:09 srv kernel: [ 240.112676] [<ffffffffc03983b5>] btrfs_drop_snapshot+0x555/0x850 [btrfs] 16:56:09 srv kernel: [ 240.112698] [<ffffffffc0404691>] merge_reloc_roots+0xe1/0x260 [btrfs] 16:56:09 srv kernel: [ 240.112720] [<ffffffffc040550c>] btrfs_recover_relocation+0x38c/0x440 [btrfs] 16:56:09 srv kernel: [ 240.112740] [<ffffffffc03abd4a>] open_ctree+0x1bba/0x23e0 [btrfs] 16:56:09 srv kernel: [ 240.112756] [<ffffffffc0380fae>] btrfs_mount+0x94e/0xa70 [btrfs] 16:56:09 srv kernel: [ 240.112762] [<ffffffff813daf15>] ? find_next_bit+0x15/0x20 16:56:09 srv kernel: [ 240.112767] [<ffffffff811a6435>] ? pcpu_alloc+0x385/0x670 16:56:09 srv kernel: [ 240.112772] [<ffffffff812017c8>] mount_fs+0x38/0x160 16:56:09 srv kernel: [ 240.112776] [<ffffffff811a6755>] ? __alloc_percpu+0x15/0x20 16:56:09 srv kernel: [ 240.112781] [<ffffffff8121d5bb>] vfs_kern_mount+0x6b/0x120 16:56:09 srv kernel: [ 240.112785] [<ffffffff8121fdc6>] do_mount+0x246/0xd10 16:56:09 srv kernel: [ 240.112788] [<ffffffff81220bbf>] SyS_mount+0x9f/0x100 16:56:09 srv kernel: [ 240.112791] [<ffffffff817f1c72>] entry_SYSCALL_64_fastpath+0x16/0x75 Also 17:40:12 srv kernel: [ 277.205013] Mem-Info: 17:40:12 srv kernel: [ 277.205018] active_anon:0 inactive_anon:0 isolated_anon:0 17:40:12 srv kernel: [ 277.205018] active_file:27295 inactive_file:27755 isolated_file:33 17:40:12 srv kernel: [ 277.205018] unevictable:0 dirty:30888 writeback:0 unstable:0 17:40:12 srv kernel: [ 277.205018] slab_reclaimable:9787 slab_unreclaimable:659576 17:40:12 srv kernel: [ 277.205018] mapped:1 shmem:0 pagetables:1343 bounce:0 17:40:12 srv kernel: [ 277.205018] free:0 free_pcp:0 free_cma:0 17:40:12 srv kernel: [ 277.205022] Node 0 DMA free:0kB min:36kB low:44kB high:52kB active_anon:0kB inactive_anon:0kB active_file:4kB inactive_file:12kB unevictable:0kB isolated(anon):0kB isolate d(file):4kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:48kB slab_unreclaimable:15424kB kernel_stack:64kB pagetables:32kB unstable:0kB bounce:0kB fr ee_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8 all_unreclaimable? no 17:40:12 srv kernel: [ 277.205029] lowmem_reserve[]: 0 2862 2862 2862 17:40:12 srv kernel: [ 277.205033] Node 0 DMA32 free:0kB min:6744kB low:8428kB high:10116kB active_anon:0kB inactive_anon:0kB active_file:109176kB inactive_file:111008kB unevictable:0kB isolated (anon):0kB isolated(file):128kB present:3128768kB managed:2933204kB mlocked:0kB dirty:123552kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:39100kB slab_unreclaimable:2622880kB kernel_stack:3568kB pageta bles:5340kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2027488 all_unreclaimable? yes 17:40:12 srv kernel: [ 277.205039] lowmem_reserve[]: 0 0 0 0 17:40:12 srv kernel: [ 277.205042] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB 17:40:12 srv kernel: [ 277.205051] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB 17:40:12 srv kernel: [ 277.205061] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB 17:40:12 srv kernel: [ 277.205063] 55020 total pagecache pages 17:40:12 srv kernel: [ 277.205065] 0 pages in swap cache 17:40:12 srv kernel: [ 277.205067] Swap cache stats: add 19683, delete 19683, find 1489/2940 17:40:12 srv kernel: [ 277.205069] Free swap = 2059728kB 17:40:12 srv kernel: [ 277.205070] Total swap = 2097088kB 17:40:12 srv kernel: [ 277.205071] 786189 pages RAM 17:40:12 srv kernel: [ 277.205072] 0 pages HighMem/MovableOnly 17:40:12 srv kernel: [ 277.205074] 48912 pages reserved 17:40:12 srv kernel: [ 277.205075] 0 pages cma reserved 17:40:12 srv kernel: [ 277.205076] 0 pages hwpoisoned 17:40:12 srv kernel: [ 277.205078] SLUB: Unable to allocate memory on node -1 (gfp=0x50) 17:40:12 srv kernel: [ 277.205080] cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0 17:40:12 srv kernel: [ 277.205082] node 0: slabs: 650523, objs: 41633472, free: 0 17:40:12 srv kernel: [ 277.205096] BTRFS warning (device dm-6): Skipping commit of aborted transaction. 17:40:12 srv kernel: [ 277.205098] ------------[ cut here ]------------ 17:40:12 srv kernel: [ 277.205112] WARNING: CPU: 2 PID: 2369 at /build/linux-NgsOGa/linux-4.2.0/fs/btrfs/transaction.c:1710 cleanup_transaction+0x92/0x2d0 [btrfs]() 17:40:12 srv kernel: [ 277.205113] BTRFS: Transaction aborted (error -12) 17:40:12 srv kernel: [ 277.205115] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables softdog bridge stp llc dm_cache_smq dm_cache dm_persistent_data dm_bio_prison dm_bufio libcrc32c snd_hda_codec_via snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm coretemp snd_seq_midi snd_seq_midi_event snd_rawmidi joydev snd_seq input_leds snd_seq_device snd_timer snd serio_raw lpc_ich soundcore mac_hid kvm_intel shpchp 8250_fintek asus_atk0110 kvm parport_pc ppdev lp parport autofs4 btrfs raid10 raid0 multipath linear raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 pata_acpi hid_generic usbhid hid amdkfd amd_iommu_v2 radeon psmouse ahci i2c_algo_bit libahci via_rhine ttm r8169 drm_kms_helper mii pata_jmicron drm 17:40:12 srv kernel: [ 277.205181] CPU: 2 PID: 2369 Comm: mount Not tainted 4.2.0-27-generic #32-Ubuntu 17:40:12 srv kernel: [ 277.205183] Hardware name: System manufacturer System Product Name/P5Q SE PLUS, BIOS 2204 08/28/2009 17:40:12 srv kernel: [ 277.205184] 0000000000000000 0000000060da56dc ffff8800b2bf3978 ffffffff817eae99 17:40:12 srv kernel: [ 277.205187] 0000000000000000 ffff8800b2bf39d0 ffff8800b2bf39b8 ffffffff8107b9c6 17:40:12 srv kernel: [ 277.205190] 0000000000000246 ffff880079c6e000 ffff8800b43e3000 ffff880079eb4a10 17:40:12 srv kernel: [ 277.205193] Call Trace: 17:40:12 srv kernel: [ 277.205195] [<ffffffff817eae99>] dump_stack+0x45/0x57 17:40:12 srv kernel: [ 277.205199] [<ffffffff8107b9c6>] warn_slowpath_common+0x86/0xc0 17:40:12 srv kernel: [ 277.205202] [<ffffffff8107ba55>] warn_slowpath_fmt+0x55/0x70 17:40:12 srv kernel: [ 277.205215] [<ffffffffc03adcf2>] cleanup_transaction+0x92/0x2d0 [btrfs] 17:40:12 srv kernel: [ 277.205218] [<ffffffff810bd760>] ? wake_atomic_t_function+0x60/0x60 17:40:12 srv kernel: [ 277.205232] [<ffffffffc03aea7c>] btrfs_commit_transaction+0x35c/0xb20 [btrfs] 17:40:12 srv kernel: [ 277.205247] [<ffffffffc040555e>] btrfs_recover_relocation+0x3de/0x440 [btrfs] 17:40:12 srv kernel: [ 277.205260] [<ffffffffc03abd4a>] open_ctree+0x1bba/0x23e0 [btrfs] 17:40:12 srv kernel: [ 277.205270] [<ffffffffc0380fae>] btrfs_mount+0x94e/0xa70 [btrfs] 17:40:12 srv kernel: [ 277.205273] [<ffffffff813daf15>] ? find_next_bit+0x15/0x20 17:40:12 srv kernel: [ 277.205276] [<ffffffff811a6435>] ? pcpu_alloc+0x385/0x670 17:40:12 srv kernel: [ 277.205279] [<ffffffff812017c8>] mount_fs+0x38/0x160 17:40:12 srv kernel: [ 277.205281] [<ffffffff811a6755>] ? __alloc_percpu+0x15/0x20 17:40:12 srv kernel: [ 277.205284] [<ffffffff8121d5bb>] vfs_kern_mount+0x6b/0x120 17:40:12 srv kernel: [ 277.205286] [<ffffffff8121fdc6>] do_mount+0x246/0xd10 17:40:12 srv kernel: [ 277.205288] [<ffffffff81220bbf>] SyS_mount+0x9f/0x100 17:40:12 srv kernel: [ 277.205290] [<ffffffff817f1c72>] entry_SYSCALL_64_fastpath+0x16/0x75 17:40:12 srv kernel: [ 277.205292] ---[ end trace 619835a5df93183b ]--- 17:40:12 srv kernel: [ 277.205298] BTRFS: error (device dm-6) in cleanup_transaction:1710: errno=-12 Out of memory 17:40:12 srv kernel: [ 277.205343] BTRFS info (device dm-6): delayed_refs has NO entry 17:40:12 srv kernel: [ 277.230676] BTRFS: failed to recover relocation 17:40:12 srv kernel: [ 277.231071] BTRFS error (device dm-6): cleaner transaction attach returned -30 17:40:12 srv kernel: [ 277.356087] BTRFS: open_ctree failed -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html