Hi, OK. Here are my kernel traces for the 4.15.18-5-pve kernel, which I use with both bionic and xenial Ubuntu systems. The out-of-tree mlnx_en driver (version 4.3-1.0.1) is also installed. I use VFIO devices connected via bonding for DRBD communication. As I said before, the problem occurs only with the 9.0.15-1 kernel module; there are no problems with version 9.0.14-1.
I'm not sure, but maybe it is connected with the version mismatch between 9.0.14-1 and 9.0.15-1 while replication is running?

-------------------------------------------------------------------------------
[ 673.805377] NMI watchdog: Watchdog detected hard LOCKUP on cpu 7
[ 683.134748] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
[ 688.597202] watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [kworker/4:2:3483]
[ 689.539292] Kernel panic - not syncing: softlockup: hung tasks
[ 689.822358] CPU: 4 PID: 3483 Comm: kworker/4:2 Tainted: P O L 4.15.18-5-pve #1
[ 690.223400] Hardware name: HP ProLiant m710x Server Cartridge/ProLiant m710x Server Cartridge, BIOS H07 05/10/2018
[ 690.726310] Workqueue: events wait_rcu_exp_gp
[ 690.937639] Call Trace:
[ 691.055597] <IRQ>
[ 691.152632] dump_stack+0x63/0x8b
[ 691.314120] panic+0xe4/0x244
[ 691.457890] watchdog_timer_fn+0x21c/0x230
[ 691.656303] ? watchdog+0x30/0x30
[ 691.816636] __hrtimer_run_queues+0xe7/0x220
[ 692.023631] hrtimer_interrupt+0xa3/0x1e0
[ 692.218525] smp_apic_timer_interrupt+0x6f/0x130
[ 692.442133] apic_timer_interrupt+0x84/0x90
[ 692.644699] </IRQ>
[ 692.746094] RIP: 0010:smp_call_function_single+0x8e/0x100
[ 693.007742] RSP: 0018:ffffbb8a07443d80 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
[ 693.375128] RAX: 0000000000000001 RBX: ffffffffbb4a9200 RCX: 0000000000000000
[ 693.721604] RDX: ffffffffbb4a9200 RSI: ffffffffba0feac0 RDI: 0000000000000005
[ 694.068179] RBP: ffffbb8a07443dd0 R08: ffff96bf81523880 R09: 0000000000000005
[ 694.414734] R10: ffffbb8a07443df8 R11: 0000000000000280 R12: 000000000000005e
[ 694.760804] R13: ffffffffbb4a9200 R14: 00000000000000a0 R15: 0000000000000001
[ 695.107153] ? rcu_barrier_func+0x50/0x50
[ 695.301204] sync_rcu_exp_select_cpus+0x2ad/0x420
[ 695.529610] ? cpumask_next+0x1b/0x20
[ 695.707302] ? sync_rcu_exp_select_cpus+0x2ad/0x420
[ 695.944146] ? rcu_barrier_func+0x50/0x50
[ 696.138698] wait_rcu_exp_gp+0x20/0x30
[ 696.320144] process_one_work+0x1e0/0x400
[ 696.514658] worker_thread+0x4b/0x420
[ 696.692225] kthread+0x105/0x140
[ 696.848570] ? process_one_work+0x400/0x400
[ 697.051278] ? kthread_create_worker_on_cpu+0x70/0x70
[ 697.296928] ? do_syscall_64+0x73/0x130
[ 697.483045] ? SyS_exit_group+0x14/0x20
[ 697.669355] ret_from_fork+0x35/0x40
[ 698.924648] Shutting down cpus with NMI
[ 699.110150] Kernel Offset: 0x39000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 699.638057] Rebooting in 10 seconds..
[ 709.736369] ACPI MEMORY or I/O RESET_REG.
-------------------------------------------------------------------------------
[ 9596.201090] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [systemd:12398]
[ 9596.548566] Kernel panic - not syncing: softlockup: hung tasks
[ 9596.831632] CPU: 5 PID: 12398 Comm: systemd Tainted: P O L 4.15.18-5-pve #1
[ 9597.220358] Hardware name: HP ProLiant m710x Server Cartridge/ProLiant m710x Server Cartridge, BIOS H07 05/10/2018
[ 9597.724072] Call Trace:
[ 9597.842902] <IRQ>
[ 9597.940248] dump_stack+0x63/0x8b
[ 9598.100587] panic+0xe4/0x244
[ 9598.244267] watchdog_timer_fn+0x21c/0x230
[ 9598.442988] ? watchdog+0x30/0x30
[ 9598.603634] __hrtimer_run_queues+0xe7/0x220
[ 9598.811388] hrtimer_interrupt+0xa3/0x1e0
[ 9599.006050] smp_apic_timer_interrupt+0x6f/0x130
[ 9599.230283] apic_timer_interrupt+0x84/0x90
[ 9599.433052] </IRQ>
[ 9599.534382] RIP: 0010:smp_call_function_many+0x1f9/0x260
[ 9599.791986] RSP: 0018:ffffab5ca6027b40 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
[ 9600.155228] RAX: 0000000000000000 RBX: ffffa00801563900 RCX: ffffa008014288e0
[ 9600.501585] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffa007c0844070
[ 9600.847949] RBP: ffffab5ca6027b78 R08: ffffffffffffffff R09: 00000000000000df
[ 9601.194242] R10: ffffd877bfaf9c00 R11: 0000000000000095 R12: ffffffff8767b7d0
[ 9601.541932] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000008
[ 9601.888206] ? load_new_mm_cr3+0xf0/0xf0
[ 9602.078392] ? load_new_mm_cr3+0xf0/0xf0
[ 9602.268282] on_each_cpu+0x2d/0x60
[ 9602.433017] flush_tlb_kernel_range+0x4b/0x80
[ 9602.644043] ? purge_fragmented_blocks_allcpus+0x53/0x1f0
[ 9602.906196] __purge_vmap_area_lazy+0x52/0xc0
[ 9603.117778] vm_unmap_aliases+0xfa/0x130
[ 9603.307708] change_page_attr_set_clr+0xea/0x370
[ 9603.531861] set_memory_ro+0x29/0x30
[ 9603.705108] bpf_int_jit_compile+0x1e0/0x2d0
[ 9603.912190] bpf_prog_select_runtime+0xdb/0x100
[ 9604.132496] bpf_prepare_filter+0x387/0x400
[ 9604.335457] __get_filter+0xc3/0x100
[ 9604.508597] sk_attach_filter+0x18/0x60
[ 9604.694663] sock_setsockopt+0x43e/0x990
[ 9604.884914] ? apparmor_socket_setsockopt+0x22/0x30
[ 9605.121579] SyS_setsockopt+0xd3/0xf0
[ 9605.298778] do_syscall_64+0x73/0x130
[ 9605.476152] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 9605.721497] RIP: 0033:0x7efd49ed5e6a
[ 9605.894823] RSP: 002b:00007ffdefb136e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
[ 9606.263666] RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007efd49ed5e6a
[ 9606.610963] RDX: 000000000000001a RSI: 0000000000000001 RDI: 0000000000000009
[ 9606.959240] RBP: 00007ffdefb13710 R08: 0000000000000010 R09: 0000000000000006
[ 9607.306903] R10: 00007ffdefb13700 R11: 0000000000000246 R12: 000055b51a1a03c8
[ 9607.655192] R13: 000000000000000a R14: 00007ffdefb13760 R15: 000055b51a1a0230
[ 9609.124819] Shutting down cpus with NMI
[ 9609.311447] Kernel Offset: 0x6600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 9609.840620] Rebooting in 10 seconds..
-------------------------------------------------------------------------------
[ 384.679007] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
[ 384.781477] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
[ 396.205982] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [sensors:12097]
[ 397.136095] Kernel panic - not syncing: softlockup: hung tasks
[ 397.418904] CPU: 2 PID: 12097 Comm: sensors Tainted: P O L 4.15.18-5-pve #1
[ 397.807613] Hardware name: HP ProLiant m710x Server Cartridge/ProLiant m710x Server Cartridge, BIOS H07 05/10/2018
[ 398.311096] Call Trace:
[ 398.429564] <IRQ>
[ 398.526903] dump_stack+0x63/0x8b
[ 398.687306] panic+0xe4/0x244
[ 398.830885] watchdog_timer_fn+0x21c/0x230
[ 399.029325] ? watchdog+0x30/0x30
[ 399.189827] __hrtimer_run_queues+0xe7/0x220
[ 399.397121] hrtimer_interrupt+0xa3/0x1e0
[ 399.591510] smp_apic_timer_interrupt+0x6f/0x130
[ 399.815278] apic_timer_interrupt+0x84/0x90
[ 400.018834] </IRQ>
[ 400.120069] RIP: 0010:smp_call_function_single+0xdd/0x100
[ 400.381752] RSP: 0018:ffffb5862411fca0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
[ 400.744894] RAX: 0000000000000000 RBX: ffffb5862411fd6c RCX: 0000000000000830
[ 401.091501] RDX: 0000000000000001 RSI: 00000000000008fb RDI: 0000000000000830
[ 401.433770] RBP: ffffb5862411fcf0 R08: ffff9d389560cc18 R09: ffff9d388a75e180
[ 401.775868] R10: ffffb5862411fd20 R11: 0000000000000000 R12: ffffb5862411fd68
[ 402.124117] R13: ffff9d388a47b000 R14: 0000000000000001 R15: ffff9d3873594080
[ 402.470882] ? sbitmap_queue_init_node+0x1a0/0x1a0
[ 402.703603] ? vsnprintf+0xf1/0x4e0
[ 402.872647] rdmsr_on_cpu+0x5d/0x90
[ 403.042378] ? rdmsr_on_cpu+0x5d/0x90
[ 403.219880] show_temp+0xa9/0xe0 [coretemp]
[ 403.422975] dev_attr_show+0x23/0x60
[ 403.596290] ? mutex_lock+0x12/0x40
[ 403.765190] sysfs_kf_seq_show+0xa2/0x130
[ 403.959449] kernfs_seq_show+0x27/0x30
[ 404.141147] seq_read+0x12a/0x430
[ 404.301845] kernfs_fop_read+0x116/0x180
[ 404.492090] ? security_file_permission+0xa1/0xc0
[ 404.720388] __vfs_read+0x1b/0x40
[ 404.881143] vfs_read+0x96/0x130
[ 405.037746] SyS_read+0x55/0xc0
[ 405.189772] do_syscall_64+0x73/0x130
[ 405.367127] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 405.612262] RIP: 0033:0x7efc19c3e260
[ 405.785308] RSP: 002b:00007fff2685b538 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 406.153341] RAX: ffffffffffffffda RBX: 0000000002129880 RCX: 00007efc19c3e260
[ 406.495101] RDX: 0000000000001000 RSI: 0000000002129ab0 RDI: 0000000000000003
[ 406.842407] RBP: 00007fff2685bc50 R08: 00007efc19f0bb98 R09: 0000000000000000
[ 407.189117] R10: 00007efc19f0bb78 R11: 0000000000000246 R12: 0000000000000000
[ 407.536464] R13: ffffffffffffff98 R14: 00007efc19f0c420 R15: 0000000002129880
[ 407.853825] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 407.853829] 0-...0: (1 GPs behind) idle=ca2/140000000000001/0 softirq=23498/23499 fqs=7290
[ 407.853831] 1-...0: (3 GPs behind) idle=346/140000000000000/0 softirq=22199/22200 fqs=7290
[ 407.853831] (detected by 4, t=15002 jiffies, g=15073, c=15072, q=46784)
[ 407.853834] Sending NMI from CPU 4 to CPUs 0:
[ 407.853837] NMI backtrace for cpu 0
[ 407.854048] CPU: 0 PID: 8730 Comm: drbd_r_tgt112 Tainted: P O L 4.15.18-5-pve #1
[ 407.854049] Hardware name: HP ProLiant m710x Server Cartridge/ProLiant m710x Server Cartridge, BIOS H07 05/10/2018
[ 407.854053] RIP: 0010:native_queued_spin_lock_slowpath+0x179/0x1a0
[ 407.854054] RSP: 0018:ffff9d3901403bd8 EFLAGS: 00000002
[ 407.854055] RAX: 0000000000080101 RBX: 0000000000000086 RCX: 0000000000000001
[ 407.854056] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff9d386e6202b8
[ 407.854057] RBP: ffff9d3901403bd8 R08: 0000000000000101 R09: 0000000000000000
[ 407.854057] R10: 0000000000000356 R11: 0000000000000046 R12: ffff9d386e620000
[ 407.854058] R13: 0000000000767600 R14: ffff9d386e699000 R15: 000000000000001d
[ 407.854059] FS: 0000000000000000(0000) GS:ffff9d3901400000(0000) knlGS:0000000000000000
[ 407.854060] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 407.854061] CR2: 00007ffcf041e090 CR3: 00000008a160a003 CR4: 00000000003606f0
[ 407.854273] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 407.854275] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 407.854275] Call Trace:
[ 407.854276] <IRQ>
[ 407.854280] _raw_spin_lock_irqsave+0x37/0x40
[ 407.854291] drbd_rs_complete_io+0x3e/0x150 [drbd]
[ 407.854298] drbd_endio_write_sec_final+0x3c5/0x450 [drbd]
[ 407.854304] drbd_peer_request_endio+0x160/0x3c0 [drbd]
[ 407.854305] bio_endio+0xa2/0x120
[ 407.854308] dec_pending+0x124/0x250
[ 407.854310] clone_endio+0x88/0x130
[ 407.854311] bio_endio+0xa2/0x120
[ 407.854312] blk_update_request+0x98/0x2d0
[ 407.854314] blk_mq_end_request+0x1e/0x90
[ 407.854316] nvme_complete_rq+0x23/0x120
[ 407.854318] nvme_pci_complete_rq+0x7a/0x130
[ 407.854320] __blk_mq_complete_request+0x99/0x150
[ 407.854321] blk_mq_complete_request+0x17/0x20
[ 407.854323] nvme_process_cq+0xe1/0x1b0
[ 407.854325] nvme_irq+0x23/0x50
[ 407.854326] __handle_irq_event_percpu+0x84/0x1a0
[ 407.854328] handle_irq_event_percpu+0x32/0x80
[ 407.854330] handle_irq_event+0x3b/0x60
[ 407.854331] handle_edge_irq+0x78/0x1a0
[ 407.854333] handle_irq+0x20/0x30
[ 407.854335] do_IRQ+0x4e/0xd0
[ 407.854336] common_interrupt+0x84/0x84
[ 407.854337] </IRQ>
[ 407.854339] RIP: 0010:_raw_spin_lock+0x10/0x30
[ 407.854339] RSP: 0018:ffffb586081c7d70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdc
[ 407.854341] RAX: 0000000000000000 RBX: ffff9d3856c82000 RCX: 0000000000000000
[ 407.854341] RDX: 0000000000000001 RSI: 00000000000122e7 RDI: ffff9d386e6202b8
[ 407.854342] RBP: ffffb586081c7e50 R08: 0000000000001000 R09: ffff9d386e620000
[ 407.854343] R10: ffff9d388a6a97e8 R11: ffff9d386e699000 R12: ffff9d386e699000
[ 407.854343] R13: ffff9d386e620000 R14: ffff9d386e6202b8 R15: ffff9d388a6a97e8
[ 407.854351] ? receive_Data+0x6d3/0x15b0 [drbd]
[ 407.854353] ? dtt_recv+0xd8/0x180 [drbd_transport_tcp]
[ 407.854360] ? receive_RSDataReply+0x370/0x370 [drbd]
[ 407.854365] ? receive_RSDataReply+0x370/0x370 [drbd]
[ 407.854371] drbd_receiver+0x45c/0x6a0 [drbd]
[ 407.854379] drbd_thread_setup+0x79/0x190 [drbd]
[ 407.854381] kthread+0x105/0x140
[ 407.854387] ? __drbd_next_peer_device_ref+0x170/0x170 [drbd]
[ 407.854389] ? kthread_create_worker_on_cpu+0x70/0x70
[ 407.854391] ? do_syscall_64+0x73/0x130
[ 407.854392] ? SyS_exit_group+0x14/0x20
[ 407.854394] ret_from_fork+0x35/0x40
-------------------------------------------------------------------------------
[ 524.920614] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
[ 531.795413] NMI watchdog: Watchdog detected hard LOCKUP on cpu 2
[ 588.202242] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [sensors:13479]
[ 589.130264] Kernel panic - not syncing: softlockup: hung tasks
[ 589.412765] CPU: 1 PID: 13479 Comm: sensors Tainted: P O L 4.15.18-5-pve #1
[ 589.800666] Hardware name: HP ProLiant m710x Server Cartridge/ProLiant m710x Server Cartridge, BIOS H07 05/10/2018
[ 590.302717] Call Trace:
[ 590.420675] <IRQ>
[ 590.517554] dump_stack+0x63/0x8b
[ 590.677690] panic+0xe4/0x244
[ 590.821054] watchdog_timer_fn+0x21c/0x230
[ 591.019107] ? watchdog+0x30/0x30
[ 591.179362] __hrtimer_run_queues+0xe7/0x220
[ 591.385849] hrtimer_interrupt+0xa3/0x1e0
[ 591.579709] smp_apic_timer_interrupt+0x6f/0x130
[ 591.803262] apic_timer_interrupt+0x84/0x90
[ 592.005528] </IRQ>
[ 592.107069] RIP: 0010:smp_call_function_single+0xdd/0x100
[ 592.368500] RSP: 0018:ffffb7b2a482bca0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
[ 592.735159] RAX: 0000000000000000 RBX: ffffb7b2a482bd6c RCX: 0000000000000830
[ 593.081059] RDX: 0000000000000001 RSI: 00000000000008fb RDI: 0000000000000830
[ 593.427696] RBP: ffffb7b2a482bcf0 R08: ffff8935adb2fc18 R09: ffff8935a388f9c0
[ 593.773919] R10: ffffb7b2a482bd20 R11: 0000000000000000 R12: ffffb7b2a482bd68
[ 594.119842] R13: ffff8935c1a63000 R14: 0000000000000001 R15: ffff8935c0254100
[ 594.461694] ? sbitmap_queue_init_node+0x1a0/0x1a0
[ 594.693704] rdmsr_on_cpu+0x5d/0x90
[ 594.862401] ? rdmsr_on_cpu+0x5d/0x90
[ 595.039732] show_crit_alarm+0x59/0xa0 [coretemp]
[ 595.267472] dev_attr_show+0x23/0x60
[ 595.440312] ? mutex_lock+0x12/0x40
[ 595.608969] sysfs_kf_seq_show+0xa2/0x130
[ 595.802897] kernfs_seq_show+0x27/0x30
[ 595.984857] seq_read+0x12a/0x430
[ 596.145125] kernfs_fop_read+0x116/0x180
[ 596.334749] ? security_file_permission+0xa1/0xc0
[ 596.562326] __vfs_read+0x1b/0x40
[ 596.722505] vfs_read+0x96/0x130
[ 596.874230] SyS_read+0x55/0xc0
[ 597.025935] do_syscall_64+0x73/0x130
[ 597.203036] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 597.447495] RIP: 0033:0x7f71ed58d260
[ 597.620432] RSP: 002b:00007ffe54e32e48 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 597.986899] RAX: ffffffffffffffda RBX: 0000000000be9880 RCX: 00007f71ed58d260
[ 598.332523] RDX: 0000000000001000 RSI: 0000000000be9ab0 RDI: 0000000000000003
[ 598.678088] RBP: 00007ffe54e33560 R08: 00007f71ed85b278 R09: 00007f71edc8d700
[ 599.024604] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 599.370916] R13: ffffffffffffff98 R14: 00007f71ed85b420 R15: 0000000000be9880
[ 600.818687] Shutting down cpus with NMI
[ 601.004760] Kernel Offset: 0x1dc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 601.533984] Rebooting in 10 seconds..

- kvaps

On Fri, Sep 28, 2018 at 4:06 PM Lars Ellenberg <lars.ellenb...@linbit.com> wrote:
>
> On Thu, Sep 27, 2018 at 05:49:43PM +0200, kvaps wrote:
> > Hi, I found out that when I using 9.0.15-1 version of drbd kernel
> > module, I have softlockup problems quite often.
> >
> > For now I have many nodes with 9.0.14-1 and they are working fine.
> >
> > When I add new nodes with 9.0.15-1, then distribute old resources on
> > them, after a while I start having random reboots of the new nodes,
> > during synchronization.
> >
> > New nodes isn't using resources at all, kernel log shows, that it was
> > tainted on some random processes like systemd, sensors, etc.
> >
> > After downgrade to the version to 9.0.14-1, synchronization is finishing
> > fine.
> >
> > I use 4.15.18-5-pve kernel, and I can provide my kernel stack traces
> > if you want to.
>
> Sure.
> otherwise this is a non-actionable "$something does not work" report.
>
> --
> : Lars Ellenberg
> : LINBIT | Keeping the Digital World Running
> : DRBD -- Heartbeat -- Corosync -- Pacemaker
>
> DRBD® and LINBIT® are registered trademarks of LINBIT
> __
> please don't Cc me, but send to list -- I'm subscribed
> _______________________________________________
> drbd-user mailing list
> drbd-user@lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user