On 2018-02-21 20:55, Guillaume Nault wrote:
On Wed, Feb 21, 2018 at 12:26:51PM +0200, Denys Fedoryshchenko wrote:
It seems even rebuilding seemingly stable version triggering crashes too
(but different ones)
Different ones? The trace following your message looks very similar to
your first KASAN report. Or are you refering to the lockup you posted
on Sun, 18 Feb 2018?

Also, which stable versions are you refering to?
Trace i sent in previous email - is latest kernel, vanilla, just more debug options and few options disabled. One of disabled was spitting some errors (it is obviously bug), CONFIG_XFRM, in nf_xfrm_me_harder (i reported about it).
And i disabled namespaces, as they are often source of trouble.

Today i will try to revert just:
drivers, net, ppp: convert asyncppp.refcnt from atomic_t to refcount_t
drivers, net, ppp: convert syncppp.refcnt from atomic_t to refcount_t
drivers, net, ppp: convert ppp_file.refcnt from atomic_t to  refcount_t

Because i suspect previously, after reverting this patches i got different kernel panic (and i didn't noticed that, now too late to identify between other crashes),
seems it was not KASAN.
I will report results after testing, unfortunately i can't test it more than once per day.

"Stable" for me was 4.14.2 - but it looks like on that kernel i am getting different issue now.
I will paste it below.

Another observation, just hour ago, i noticed on another server, where i am testing 4.15, and 4.14.20 (at moment of testing 4.14.20, but no debug at that moment), when i killed accel-pppd (pppoe server software), with 8k sessions online, i got some weird behaviour, accel-pppd process got stuck, same as ifconfig and "ip link", and even kexec -e didn't worked(got stuck too), unless i did kexec -e -x
(so it wont try to make interfaces down on kexec).
I will try to reproduce this bug as well, with debug enabled (lockdep and so) i hope it is not related.


I'm interested in the ppp_generic.o file that produced the following
trace. Just to be sure that the differences come from the new debugging
options.
Also kernel config:
https://nuclearcat.com/bughunting/config.txt
https://nuclearcat.com/bughunting/ppp_generic.o

This is in 4.14.2, was seemingly stable before:

[50401.388670] NETDEV WATCHDOG: eth1 (ixgbe): transmit queue 1 timed out
[50401.389014] ------------[ cut here ]------------
[50401.389340] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:320 dev_watchdog+0x15c/0x1b9 [50401.389925] Modules linked in: pppoe pppox ppp_generic slhc netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4 x t_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_na t nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc ixgbe dca [50401.391869] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.14.2-build-0134 #4 [50401.392191] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015
[50401.392513] task: ffff880434d72640 task.stack: ffffc90001914000
[50401.392836] RIP: 0010:dev_watchdog+0x15c/0x1b9
[50401.393155] RSP: 0018:ffff8804364c3e90 EFLAGS: 00010286
[50401.393470] RAX: 0000000000000039 RBX: ffff88042f6e0000 RCX: 0000000000000000 [50401.393787] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffffffff828dbc64 [50401.394103] RBP: ffff8804364c3eb0 R08: 0000000000000001 R09: 0000000000000000 [50401.394420] R10: 0000000000000002 R11: ffff8803fa075c00 R12: 0000000000000001 [50401.394739] R13: 0000000000000040 R14: 0000000000000003 R15: ffffffff81e05108 [50401.395064] FS: 0000000000000000(0000) GS:ffff8804364c0000(0000) knlGS:0000000000000000
[50401.395645] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50401.395970] CR2: 00007fff25fc20a8 CR3: 0000000001e09005 CR4: 00000000001606e0
[50401.396294] Call Trace:
[50401.396613]  <IRQ>
[50401.396934]  ? qdisc_rcu_free+0x3f/0x3f
[50401.397255]  call_timer_fn.isra.4+0x17/0x7b
[50401.397576]  expire_timers+0x6f/0x7e
[50401.397899]  run_timer_softirq+0x6d/0x8f
[50401.398219]  ? ktime_get+0x3b/0x8c
[50401.398540]  ? lapic_next_event+0x18/0x1c
[50401.398862]  ? clockevents_program_event+0xa3/0xbb
[50401.399186]  __do_softirq+0xbc/0x1ab
[50401.399510]  irq_exit+0x4d/0x8e
[50401.399832]  smp_apic_timer_interrupt+0x73/0x80
[50401.400157]  apic_timer_interrupt+0x8d/0xa0
[50401.400480]  </IRQ>
[50401.400801] RIP: 0010:mwait_idle+0x4e/0x61
[50401.401123] RSP: 0018:ffffc90001917ec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 [50401.401714] RAX: 0000000000000000 RBX: ffff880434d72640 RCX: 0000000000000000 [50401.402037] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [50401.402362] RBP: ffffc90001917ec0 R08: 0000000000000000 R09: 0000000000000001 [50401.402685] R10: ffffc90001917e58 R11: 000000000000037a R12: 0000000000000000 [50401.403008] R13: 0000000000000000 R14: ffff880434d72640 R15: ffff880434d72640
[50401.403336]  arch_cpu_idle+0xa/0xc
[50401.403660]  default_idle_call+0x20/0x22
[50401.403983]  do_idle+0xb3/0x13b
[50401.404303]  cpu_startup_entry+0x1a/0x1c
[50401.404627]  start_secondary+0x102/0x105
[50401.404949]  secondary_startup_64+0xa5/0xa5
[50401.405271] Code: d9 54 58 00 00 75 38 48 89 df c6 05 cd 54 58 00 01 e8 dc 24 fe ff 44 89 e1 48 89 c2 48 89 de 48 c7 c7 51 df d6 81 e8 db f6 7c ff <0f> ff eb 10 41 ff c4 48 05 40 01 0
0 00 41 39 cc 75 9a eb 0d 48
[50401.406154] ---[ end trace 433f9a49b32a3b4f ]---
[50401.406489] ixgbe 0000:04:00.1 eth1: initiating reset due to tx timeout
[50401.406828] ixgbe 0000:04:00.1 eth1: Reset adapter

[50416.686709] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [tc:24389] [50416.687042] Modules linked in: pppoe pppox ppp_generic slhc netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4 x t_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_na t nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc ixgbe dca [50416.687707] watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [accel-pppd:24377] [50416.687708] Modules linked in: pppoe pppox ppp_generic slhc netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4 x t_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_na t nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc ixgbe dca [50416.687720] CPU: 7 PID: 24377 Comm: accel-pppd Tainted: G W 4.14.2-build-0134 #4 [50416.687721] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015
 [50416.687722] task: ffff8803f278d940 task.stack: ffffc90011a38000
 [50416.687727] RIP: 0010:queued_spin_lock_slowpath+0x125/0x16e
[50416.687728] RSP: 0018:ffffc90011a3bd20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 [50416.687729] RAX: 0000000000000000 RBX: ffff8804035abfb8 RCX: ffff8804365d9a40 [50416.687730] RDX: ffff880436419a60 RSI: 0000000000200000 RDI: ffff8804035abfa0 [50416.687730] RBP: ffffc90011a3bd20 R08: 0000000000000001 R09: 00000000ffffffff [50416.687731] R10: ffffc90011a3bda8 R11: ffff8803f278d940 R12: ffff88041a5fde00 [50416.687731] R13: 0000000000008021 R14: ffff8803fc5778c0 R15: ffff8803fc577840 [50416.687732] FS: 00007f7b5726e700(0000) GS:ffff8804365c0000(0000) knlGS:0000000000000000
 [50416.687733] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50416.687733] CR2: 0000000001d5f408 CR3: 000000042bf7c006 CR4: 00000000001606e0
 [50416.687734] Call Trace:
 [50416.687738]  _raw_spin_lock+0x1a/0x1c
 [50416.687741]  ppp_push+0x56/0x49b [ppp_generic]
 [50416.687743]  ? __kmalloc_node_track_caller+0xba/0xdf
 [50416.687746]  __ppp_xmit_process+0x38/0x4c9 [ppp_generic]
 [50416.687747]  ppp_xmit_process+0x39/0x8d [ppp_generic]
 [50416.687749]  ppp_write+0xbb/0xc7 [ppp_generic]
 [50416.687750]  __vfs_write+0x21/0x111
 [50416.687752]  ? handle_mm_fault+0xb1/0xc0
 [50416.687753]  vfs_write+0xcd/0x176
 [50416.687754]  SyS_write+0x49/0x83
 [50416.687755]  entry_SYSCALL_64_fastpath+0x17/0x98
 [50416.687756] RIP: 0033:0x7f7b5ba4e14d
[50416.687756] RSP: 002b:00007f7b5726dbd0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [50416.687757] RAX: ffffffffffffffda RBX: 00007f7b5244aae8 RCX: 00007f7b5ba4e14d [50416.687758] RDX: 000000000000000c RSI: 00007f7b4c7e6180 RDI: 0000000000004e9e [50416.687758] RBP: 00007f7b5726dba0 R08: 0000000000000000 R09: 0000000000633e00 [50416.687758] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000 [50416.687759] R13: 00007f7b5716c4ff R14: 0000000000000003 R15: 00007f7b5716c590 [50416.687759] Code: 89 c2 c1 e8 12 48 c1 ea 0c ff c8 83 e2 30 48 98 48 81 c2 40 9a 01 00 48 03 14 c5 a0 73 d7 81 48 89 0a 8b 41 08 85 c0 75 04 f3 90 <eb> f5 48 8b 11 48 85 d2 74 03 0f 1
8 0a 44 8b 07 66 45 85 c0 0f
 [50416.687771] Kernel panic - not syncing: softlockup: hung tasks
[50416.687772] CPU: 7 PID: 24377 Comm: accel-pppd Tainted: G W L 4.14.2-build-0134 #4 [50416.687773] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015
 [50416.687773] Call Trace:
 [50416.687774]  <IRQ>
 [50416.687776]  dump_stack+0x4d/0x63
 [50416.687778]  panic+0xcd/0x215
 [50416.687780]  watchdog_timer_fn+0x1a4/0x1c6
 [50416.687782]  __hrtimer_run_queues+0x76/0xc0
 [50416.687783]  hrtimer_interrupt+0xa0/0x162
 [50416.687785]  smp_apic_timer_interrupt+0x6e/0x80
 [50416.687786]  apic_timer_interrupt+0x8d/0xa0
 [50416.687786]  </IRQ>
 [50416.687787] RIP: 0010:queued_spin_lock_slowpath+0x125/0x16e
[50416.687788] RSP: 0018:ffffc90011a3bd20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10 [50416.687789] RAX: 0000000000000000 RBX: ffff8804035abfb8 RCX: ffff8804365d9a40 [50416.687789] RDX: ffff880436419a60 RSI: 0000000000200000 RDI: ffff8804035abfa0 [50416.687789] RBP: ffffc90011a3bd20 R08: 0000000000000001 R09: 00000000ffffffff [50416.687790] R10: ffffc90011a3bda8 R11: ffff8803f278d940 R12: ffff88041a5fde00 [50416.687790] R13: 0000000000008021 R14: ffff8803fc5778c0 R15: ffff8803fc577840
 [50416.687792]  _raw_spin_lock+0x1a/0x1c
 [50416.687793]  ppp_push+0x56/0x49b [ppp_generic]
 [50416.687794]  ? __kmalloc_node_track_caller+0xba/0xdf
 [50416.687795]  __ppp_xmit_process+0x38/0x4c9 [ppp_generic]
 [50416.687797]  ppp_xmit_process+0x39/0x8d [ppp_generic]
 [50416.687798]  ppp_write+0xbb/0xc7 [ppp_generic]
 [50416.687798]  __vfs_write+0x21/0x111
 [50416.687799]  ? handle_mm_fault+0xb1/0xc0
 [50416.687800]  vfs_write+0xcd/0x176
 [50416.687801]  SyS_write+0x49/0x83
 [50416.687802]  entry_SYSCALL_64_fastpath+0x17/0x98
 [50416.687802] RIP: 0033:0x7f7b5ba4e14d
[50416.687803] RSP: 002b:00007f7b5726dbd0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [50416.687803] RAX: ffffffffffffffda RBX: 00007f7b5244aae8 RCX: 00007f7b5ba4e14d [50416.687804] RDX: 000000000000000c RSI: 00007f7b4c7e6180 RDI: 0000000000004e9e [50416.687804] RBP: 00007f7b5726dba0 R08: 0000000000000000 R09: 0000000000633e00 [50416.687805] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000 [50416.687805] R13: 00007f7b5716c4ff R14: 0000000000000003 R15: 00007f7b5716c590 [50416.723251] CPU: 5 PID: 24389 Comm: tc Tainted: G W L 4.14.2-build-0134 #4 [50416.723831] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015
 [50416.724151] task: ffff8803f0f19980 task.stack: ffffc90011a70000
 [50416.724477] RIP: 0010:queued_spin_lock_slowpath+0x54/0x16e
[50416.724799] RSP: 0018:ffffc90011a73940 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10 [50416.725384] RAX: 0000000000000101 RBX: ffff8803fc577000 RCX: 0000000000000101 [50416.725709] RDX: 0000000000000100 RSI: 0000000000000001 RDI: ffff8803fc5778c0 [50416.726034] RBP: ffffc90011a73940 R08: 0000000000000001 R09: 0000000000002e88 [50416.726355] R10: 0000000000000024 R11: 0000000000000000 R12: ffff8803eea1d060 [50416.726678] R13: ffff8803fc5778c0 R14: ffffc90011a739f0 R15: 0000000000000000 [50416.727001] FS: 00007f72228f2700(0000) GS:ffff880436540000(0000) knlGS:0000000000000000
 [50416.727582] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[50416.727904] CR2: 0000000001aea038 CR3: 00000003fb09a001 CR4: 00000000001606e0
 [50416.728230] Call Trace:
 [50416.728550]  _raw_spin_lock_bh+0x25/0x27
 [50416.728872]  ppp_get_stats64+0x4c/0xaf [ppp_generic]
 [50416.729195]  dev_get_stats+0x33/0x84
 [50416.729513]  rtnl_fill_stats+0x41/0x121
 [50416.729831]  rtnl_fill_ifinfo+0x567/0xdf0
 [50416.730148]  rtnl_dump_ifinfo+0x232/0x2e0
 [50416.730470]  ? __alloc_pages_nodemask+0xd2/0xa99
 [50416.730790]  ? handle_mm_fault+0xb1/0xc0
 [50416.731109]  ? __do_page_fault+0x31c/0x3b6
 [50416.731431]  netlink_dump+0xfc/0x271
 [50416.731749]  netlink_recvmsg+0x1af/0x2d9
 [50416.732068]  sock_recvmsg+0x14/0x16
 [50416.732388]  ___sys_recvmsg+0xea/0x1a1
 [50416.732708]  ? lru_deactivate_file_fn+0x29f/0x29f
 [50416.733028]  ? pagevec_lru_move_fn+0xad/0xc1
 [50416.733348]  ? alloc_pages_vma+0x152/0x19c
 [50416.733669]  ? page_add_new_anon_rmap+0xa0/0xa9
 [50416.733988]  ? __handle_mm_fault+0xda4/0xf40
 [50416.734310]  ? sock_sendmsg+0x12/0x1d
 [50416.734628]  __sys_recvmsg+0x3d/0x5e
 [50416.734945]  ? __sys_recvmsg+0x3d/0x5e
 [50416.735264]  SyS_recvmsg+0xd/0x17
 [50416.735576]  entry_SYSCALL_64_fastpath+0x17/0x98
 [50416.735890] RIP: 0033:0x7f7221afdac0
[50416.736202] RSP: 002b:00007ffcbbc1f328 EFLAGS: 00000246 ORIG_RAX: 000000000000002f [50416.736772] RAX: ffffffffffffffda RBX: 00007f7221da8b98 RCX: 00007f7221afdac0 [50416.737092] RDX: 0000000000000000 RSI: 00007ffcbbc1f390 RDI: 0000000000000003 [50416.737411] RBP: 00007f7221da8b98 R08: 0000000000007d7c R09: 0000000000000000 [50416.737734] R10: 00007ffcbbc1f3d0 R11: 0000000000000246 R12: 000000000066a520 [50416.738055] R13: 0000000000000040 R14: 00007f7221da8b40 R15: 0000000000002710 [50416.738377] Code: ff ff 75 33 83 fe 01 89 ca 89 f0 41 0f 45 d0 f0 0f b1 17 39 c6 74 04 89 c6 eb e1 ff ca 0f 84 20 01 00 00 8b 07 84 c0 74 04 f3 90 <eb> f6 66 c7 07 01 00 e9 0c 01 00 0
0 48 c7 c0 40 9a 01 00 65 48
Feb 21 00:04:31 10.0.252.9 [50417.724420] Shutting down cpus with NMI
Feb 21 00:04:31 10.0.252.9 [50417.724748] Kernel Offset: disabled
Feb 21 00:04:31 10.0.252.9 [50417.728891] Rebooting in 5 seconds..

Reply via email to