On 24.01.2018 11:50, Pavel Tikhomirov wrote: > We have a situation when node worked for ~8 days and came to The Point > where jiffies > fence_wdog_jiffies64 (what ever updated > /sys/kernel/watchdog_timer stopped doing these) but node lived after > The Point for 17 more hours, these means that nobody called > fence_wdog_check_timer() for 17 hours, else crash or reboot would've > happened earlier. > > When fence_wdog_check_timer() was called the first time after The Point > we started reboot action. As jiffies were even bigger than > (fence_wdog_jiffies64 + 30 * HZ) and we got to fence_wdog_check_timer() > again we crashed (we assume reboot action took to long if it does not > fit in 30 seconds after fence_wdog_jiffies64). > > So it seem to be quiet a valid situation, except that > fence_wdog_check_timer() was not called for 17 hours. That means(should > mean) that there were no network activity on the node for 17 hours - no > packets send/received. > > note: I mentioned vmxnet3 NIC on the node. > > So to be sure that we never got to fence_wdog_check_timer make it print > alive messages to syslog. If no messages in syslog we might need to > understand why vmxnet3 does not get to dev_hard_start_xmit() and > net_rx_action(). > > Crashdump backtrace: > [ 84.606838] vmxnet3 0000:03:00.0 ens160: NIC Link is Up 10000 Mbps > ... > [ 198.756251] arp_tables: (C) 2002 David S. Miller > [855063.554579] INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected > by 0, t=61572840 jiffies, g=55111116, c=55111115, q=15) > [855063.554659] Task dump for CPU 1: > [855063.554662] MDNS-resolver R ffff88023250a020 0 1705 1 0 > 0x00000008 > [855063.554665] ffff88023250a020 ffff88023250a7c0 ffff88022d613da0 > ffffffff8109d2b1 > [855063.554667] ffff88022d613e28 ffff88022d613e70 0000000000000293 > 0000000000001d4c > [855063.554669] 0000000000000004 ffff880035a35188 24e96a4dd65d19eb > 00000000000c1b96 > [855063.554670] Call Trace: > [855063.554677] [<ffffffff8109d2b1>] ? dequeue_signal+0xd1/0x170 > [855063.554681] [<ffffffff810ef910>] ? ktime_get_ts64+0xa0/0xf0 > [855063.554683] [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0 > [855063.554687] [<ffffffff812303c0>] ? select_estimate_accuracy+0x40/0x120 > [855063.554688] [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0 > [855063.554693] [<ffffffff81267291>] ? ep_poll+0xa1/0x360 > [855063.554695] [<ffffffff8109da81>] ? __set_task_blocked+0x41/0xa0 > [855063.554697] [<ffffffff8126877d>] ? SyS_epoll_wait+0xed/0x120 > [855063.554702] [<ffffffff816b15c9>] ? system_call_fastpath+0x16/0x1b > [855063.555345] Kernel panic - not syncing: fence-watchdog: reboot > > [855063.555375] CPU: 1 PID: 1705 Comm: MDNS-resolver ve: 0 Not tainted > 3.10.0-693.1.1.vz7.37.18 #1 37.18 > [855063.555404] Hardware name: VMware, Inc. VMware Virtual Platform/440BX > Desktop Reference Platform, BIOS 6.00 09/17/2015 > [855063.555436] 000000000000000e 00000000f2e82b7f ffff88023fd03aa8 > ffffffff816a0288 > [855063.555462] ffff88023fd03b28 ffffffff8169b4a7 ffffffff00000010 > ffff88023fd03b38 > [855063.555487] ffff88023fd03ad8 00000000f2e82b7f 0000000000000000 > ffffffff81905aad > [855063.555513] Call Trace: > [855063.556123] <IRQ> [<ffffffff816a0288>] dump_stack+0x19/0x1b > [855063.556555] [<ffffffff8169b4a7>] panic+0xe8/0x20d > [855063.556998] [<ffffffffc0569020>] ? iptable_nat_ipv4_fn+0x20/0x20 > [iptable_nat] > [855063.557434] [<ffffffff8119130f>] fence_wdog_do_fence+0x4f/0x50 > [855063.557880] [<ffffffff81191382>] fence_wdog_check_timer+0x72/0x80 > [855063.558310] [<ffffffff8158b458>] dev_hard_start_xmit+0x38/0x3b0 > [855063.558745] [<ffffffff8158e5b8>] __dev_queue_xmit+0x448/0x550 > [855063.559173] [<ffffffff8158e6d0>] dev_queue_xmit+0x10/0x20 > [855063.559599] [<ffffffff815d548e>] ip_finish_output+0x52e/0x780 > [855063.560041] [<ffffffff815d5753>] ip_output+0x73/0xe0 > [855063.560468] [<ffffffff815d4f60>] ? __ip_append_data.isra.51+0xa50/0xa50 > [855063.560924] [<ffffffff815d3601>] ip_local_out_sk+0x31/0x40 > [855063.561355] [<ffffffff815d3973>] ip_queue_xmit+0x143/0x390 > [855063.561797] [<ffffffff815ed4ff>] tcp_transmit_skb+0x4cf/0x9b0 > [855063.562234] [<ffffffff815f0cf0>] ? tcp_init_xmit_timers+0x30/0x30 > [855063.562697] [<ffffffff815ef64b>] tcp_xmit_probe_skb+0xdb/0xf0 > [855063.563147] [<ffffffff815f09a9>] tcp_write_wakeup+0x59/0x170 > [855063.563584] [<ffffffff815f0e9e>] tcp_keepalive_timer+0x1ae/0x260 > [855063.564030] [<ffffffff81098da6>] call_timer_fn+0x36/0x110 > [855063.564465] [<ffffffff815f0cf0>] ? tcp_init_xmit_timers+0x30/0x30 > [855063.564913] [<ffffffff8109b2bd>] run_timer_softirq+0x22d/0x310 > [855063.565365] [<ffffffff816b3e9f>] __do_softirq+0x10f/0x2aa > [855063.565862] [<ffffffff816b255c>] call_softirq+0x1c/0x30 > [855063.566323] [<ffffffff8102d2e5>] do_softirq+0x65/0xa0 > [855063.566814] [<ffffffff81092425>] irq_exit+0x105/0x110 > [855063.567265] [<ffffffff816b3cc2>] smp_apic_timer_interrupt+0x42/0x50 > [855063.567722] [<ffffffff816b310d>] apic_timer_interrupt+0x6d/0x80 > [855063.568152] <EOI> [<ffffffff810ef910>] ? ktime_get_ts64+0xa0/0xf0 > [855063.568589] [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0 > [855063.569033] [<ffffffff812303c0>] select_estimate_accuracy+0x40/0x120 > [855063.569454] [<ffffffff810ef8bc>] ? ktime_get_ts64+0x4c/0xf0 > [855063.569880] [<ffffffff81267291>] ep_poll+0xa1/0x360 > [855063.570271] [<ffffffff8109da81>] ? __set_task_blocked+0x41/0xa0 > [855063.570670] [<ffffffff8126877d>] SyS_epoll_wait+0xed/0x120 > [855063.571049] [<ffffffff816b15c9>] system_call_fastpath+0x16/0x1b > > https://pmc.acronis.com/browse/TTASK-22056 > Signed-off-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com> > --- > kernel/fence-watchdog.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/kernel/fence-watchdog.c b/kernel/fence-watchdog.c > index 40d258a68d7c..3f65dbc69216 100644 > --- a/kernel/fence-watchdog.c > +++ b/kernel/fence-watchdog.c > @@ -151,6 +151,12 @@ void fence_wdog_do_fence(void) > > inline int fence_wdog_check_timer(void) > { > + static unsigned long print_alive_time; > + > + if (fence_wdog_jiffies64 != MAX_U64) > + if (printk_timed_ratelimit(&print_alive_time, 30*60*HZ)) > + printk("fence-watchdog: alive\n");
Except looks good for me except that printk() has no loglevel prefix. Is this OK for readers of this message, we print it with default/last loglevel? Thanks, Kirill _______________________________________________ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel