Hi all,

we run a > 500 TiB backup system on iSCSI targets using 19 btrfs filesystems (the biggest of which is 110 TiB) on Ubuntu 14.04 LTS with various kernel versions and btrfs-progs v3.17.1. The hardware is a 24-core Xeon E5-2620 on an Intel S2600GZ board with 128 GiB RAM.
Since btrfs switched to kworkers (I think in 3.15), the frontend server has been crashing with soft lockups (see attachment). The system is rock solid with the 3.14.22 kernel. The lockups happen during the nightly cron-controlled rsync backups, at random points in the run.

We are fully aware that this tends to be one of those "it doesn't work" bug reports, but it's really hard to pin down the source of the problem beyond the fact that it seems to be related to the kworkers. We'd love to provide any feedback we can; please let us know what you need.

Regards
Patrick

--
Patrick Schmid <sch...@phys.ethz.ch>    support: +41 44 633 2668
IT Services Group, HPT H 8              voice:   +41 44 633 3997
Departement Physik, ETH Zurich
CH-8093 Zurich, Switzerland

Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207104] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u481:26:108963]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207147] Modules linked in: btrfs(E) xor(E) raid6_pq(E) tcp_diag(E) inet_diag(E) autofs4(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) ib_sa(E) ib_mad(E) ib_core(E) ib_addr(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) mousedev(E) cryptd(E) ioatdma(E) sb_edac(E) microcode(E) ipmi_si(E) edac_core(E) lpc_ich(E) mei_me(E) ipmi_msghandler(E) tpm_tis(E) mei(E) wmi(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) sunrpc(E) fscache(E) lp(E) parport(E) hid_generic(E) usbhid(E) hid(E) igb(E) ixgbe(E) i2c_algo_bit(E) dca(E) isci(E) ptp(E) ahci(E) libsas(E) scsi_transport_sas(E) libahci(E) mdio(E) arcmsr(E) pps_core(E)
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207152] CPU: 0 PID: 108963 Comm: kworker/u481:26 Tainted: G EL 3.17.2-stable.slub #6
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207154] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.03.0003.041920141333 04/19/2014
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207185] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207186] task: ffff8802e34a8000 ti: ffff88070a5a8000 task.ti: ffff88070a5a8000
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207194] RIP: 0010:[<ffffffff810b0b35>] [<ffffffff810b0b35>] queue_read_lock_slowpath+0xb5/0xd0
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207195] RSP: 0018:ffff88070a5aba00 EFLAGS: 00000206
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207196] RAX: 00000000000041b8 RBX: ffff8806bdac3a18 RCX: 0000000000003bcc
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207197] RDX: ffff8800a2c4f350 RSI: 0000000000003bcc RDI: ffff8800a2c4f354
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207198] RBP: ffff88070a5aba08 R08: 0000000000003bc6 R09: 0000000000000000
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207199] R10: 00000000ffffffff R11: 0000000000000001 R12: ffff88081ee14300
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207200] R13: ffff88100e6e0000 R14: ffffffff810946ac R15: ffff88070a5ab9a8
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207202] FS: 0000000000000000(0000) GS:ffff88081ee00000(0000) knlGS:0000000000000000
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207203] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207204] CR2: 0000000002b97fc8 CR3: 0000000001c16000 CR4: 00000000000407f0
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207205] Stack:
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207207] ffffffff8173b07c ffff88070a5aba68 ffffffffa04d8a3b 0000000000000000
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207209] ffff88070a5aba78 ffffffffa04757af 00003f66a0497f6e ffff88061c29af68
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207211] ffff8800a2c4f2e0 ffff88100f36d800 ffff880000000000 0000160000000000
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207212] Call Trace:
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207218] [<ffffffff8173b07c>] ? _raw_read_lock+0x1c/0x30
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207233] [<ffffffffa04d8a3b>] btrfs_tree_read_lock+0x5b/0x120 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207241] [<ffffffffa04757af>] ? leaf_space_used+0xcf/0x110 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207249] [<ffffffffa0477d6b>] btrfs_read_lock_root_node+0x3b/0x50 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207258] [<ffffffffa047cbee>] btrfs_search_slot+0x50e/0xa10 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207269] [<ffffffffa0494257>] btrfs_lookup_file_extent+0x37/0x40 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207282] [<ffffffffa04b35da>] __btrfs_drop_extents+0x16a/0xd90 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207285] [<ffffffff810946ac>] ? try_to_wake_up+0x1fc/0x340
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207299] [<ffffffffa04bc65b>] ? __set_extent_bit+0x15b/0x540 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207302] [<ffffffff811b0a12>] ? kmem_cache_alloc+0x122/0x130
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207311] [<ffffffffa0477aea>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207323] [<ffffffffa04a36ce>] insert_reserved_file_extent.constprop.59+0x9e/0x2f0 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207335] [<ffffffffa04a94c5>] btrfs_finish_ordered_io+0x2e5/0x5f0 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207345] [<ffffffffa04a9ad5>] finish_ordered_fn+0x15/0x20 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207358] [<ffffffffa04cf3e2>] normal_work_helper+0xc2/0x2b0 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207362] [<ffffffff8107fe09>] ? pwq_activate_delayed_work+0x39/0x80
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207374] [<ffffffffa04cf742>] btrfs_endio_write_helper+0x12/0x20 [btrfs]
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207377] [<ffffffff81082000>] process_one_work+0x150/0x3f0
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207379] [<ffffffff810826f1>] worker_thread+0x121/0x520
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207381] [<ffffffff810825d0>] ? rescuer_thread+0x330/0x330
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207385] [<ffffffff81087992>] kthread+0xd2/0xf0
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207388] [<ffffffff810878c0>] ? kthread_create_on_node+0x180/0x180
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207390] [<ffffffff8173b6bc>] ret_from_fork+0x7c/0xb0
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207393] [<ffffffff810878c0>] ? kthread_create_on_node+0x180/0x180
Nov 12 23:25:16 phd-bkp-gw kernel: [29411.207413] Code: 8b 02 3c ff 74 f8 f3 c3 55 48 89 e5 e8 a8 df 67 00 5d c3 83 e1 fe 0f b7 f1 b8 00 80 00 00 44 0f b7 42 04 66 44 39 c1 74 83 f3 90 <83> e8 01 75 ee 66 66 66 90 66 66 90 eb e0 66 2e 0f 1f 84 00 00