A preliminary analysis of the problem, based on a collected crash dump. From dmesg, we have:

[28663.018356] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
[28663.026266] IP: [<ffffffffc00ddb21>] ixgbe_xmit_frame_ring+0x81/0xf50 [ixgbe]

Using addr2line to validate the line in the ixgbe code, we got:

# nm ixgbe.ko |grep "ixgbe_xmit_frame_ring"
000000000000aaa0 T ixgbe_xmit_frame_ring

# printf "%0x\n" $((0xaaa0+0x81))
ab21

# addr2line -fip -e ixgbe.ko -j .text ab21
ixgbe_xmit_frame_ring at [...]/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:7403

Checking the code, that line corresponds to the inlined function ixgbe_maybe_stop_tx(), called from ixgbe_xmit_frame_ring():

static inline int ixgbe_maybe_stop_tx(struct ixgbe_ring *tx_ring, u16 size)
{
	if (likely(ixgbe_desc_unused(tx_ring) >= size))
	[...]
}

Checking now the inlined function ixgbe_desc_unused():

static inline u16 ixgbe_desc_unused(struct ixgbe_ring *ring)
{
	u16 ntc = ring->next_to_clean;
	u16 ntu = ring->next_to_use;
	[...]
}

Using crash, we can validate the offset 0x58 in struct ixgbe_ring (from the NULL dereference at 0000000000000058):

crash> struct -ox ixgbe_ring|grep -A1 58
  [0x58] u16 next_to_use;
  [0x5a] u16 next_to_clean;

This matches what is expected given the ixgbe_desc_unused() code: the struct ixgbe_ring pointer was NULL and the function tried to read next_to_use through it. Although the C code suggests that reading ring->next_to_clean should trigger the crash first, the compiler reordered the instructions, as shown by the crash disassembly:

crash> disassemble ixgbe_xmit_frame_ring
[...]
0xffffffffc00ddab8 <+24>: mov %rdx,%rbx
[...]
0xffffffffc00ddb21 <+129>: movzwl 0x58(%rbx),%eax
0xffffffffc00ddb25 <+133>: movzwl 0x5a(%rbx),%esi
[...]
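
The two pieces of arithmetic used above (symbol start plus offset, and NULL base plus member offset) can be sketched in a few lines of C. This is only an illustration under stated assumptions: struct mock_ring, member_address() and symbol_plus_offset() are hypothetical names, and the padding array merely stands in for the real leading members of struct ixgbe_ring, which are not reproduced here. Only the offsets 0x58 and 0x5a come from the crash output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for struct ixgbe_ring. Only the two offsets
 * reported by crash (0x58 and 0x5a) are taken from the analysis; the
 * padding array replaces the real leading members.
 */
struct mock_ring {
	unsigned char leading_members[0x58];	/* placeholder, not the real fields */
	uint16_t next_to_use;			/* offset 0x58 */
	uint16_t next_to_clean;			/* offset 0x5a */
};

/* Compile-time check that the mock layout matches the crash output. */
static_assert(offsetof(struct mock_ring, next_to_use) == 0x58,
	      "next_to_use must sit at offset 0x58");
static_assert(offsetof(struct mock_ring, next_to_clean) == 0x5a,
	      "next_to_clean must sit at offset 0x5a");

/* Address touched when loading a member at member_offset from base. */
static inline uintptr_t member_address(uintptr_t base, size_t member_offset)
{
	return base + member_offset;
}

/* Address of an instruction located offset bytes after symbol_start. */
static inline uint64_t symbol_plus_offset(uint64_t symbol_start, uint64_t offset)
{
	return symbol_start + offset;
}
```

With a base of 0, member_address(0, offsetof(struct mock_ring, next_to_use)) yields 0x58, the faulting address reported in the oops, and symbol_plus_offset(0xaaa0, 0x81) yields 0xab21, the value passed to addr2line.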

Finally, from the stack frame information in crash, we can confirm that the ixgbe_ring pointer is NULL:

crash> bt -f |grep ixgbe_xmit_frame_ring -A7
    [exception RIP: ixgbe_xmit_frame_ring+129]
    RIP: ffffffffc00ddb21  RSP: ffff88103f283d20  RFLAGS: 00010246
    RAX: 00000000000000c2  RBX: 0000000000000000  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: ffff8800538c0840  RDI: ffff881034167ec0
    [...]

Since the x86-64 ABI calling convention specifies that the parameters are passed in registers RDI, RSI, RDX (in that order), the 3rd parameter (ixgbe_ring) is in RDX, which is NULL. The disassembly also shows RDX being copied into RBX at <+24>, and RBX, the base register of the faulting load, is NULL as well.

I'll continue the investigation now to understand why this value was NULL at this point.

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1794877

Title:
  Crash in ixgbe, during tx packet xmit (while potentially changing queues number)

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  It was reported that the ixgbe driver may crash with the following stack trace while changing the interrupt/queue configuration (probably using ethtool --set-channels):

[28661.949147] init: irqbalance main process (19397) killed by TERM signal
[28662.381154] ixgbe 0000:04:00.0: removed PHC on eth4
[28662.502142] ixgbe 0000:04:00.0: Multiqueue Enabled: Rx Queue count = 18, Tx Queue count = 18
[28662.588634] ixgbe 0000:04:00.0: registered PHC device on eth4
[28662.689789] br-iscsi-left: port 1(eth4.4011) entered disabled state
[28662.689951] br-sio-bel: port 1(eth4.4015) entered disabled state
[28662.690039] br-sio-fel: port 1(eth4.4017) entered disabled state
[28662.694227] ixgbe 0000:04:00.0 eth4: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[28662.694506] br-iscsi-left: port 1(eth4.4011) entered forwarding state
[28662.694519] br-iscsi-left: port 1(eth4.4011) entered forwarding state
[28662.694596] br-sio-bel: port 1(eth4.4015) entered forwarding state
[28662.694604] br-sio-bel: port 1(eth4.4015) entered forwarding state
[28662.694651] br-sio-fel: port 1(eth4.4017) entered forwarding state
[28662.694658] br-sio-fel: port 1(eth4.4017) entered forwarding state
[28662.709921] ixgbe 0000:04:00.1: removed PHC on eth5
[28662.834289] ixgbe 0000:04:00.1: Multiqueue Enabled: Rx Queue count = 18, Tx Queue count = 18
[28662.915121] ixgbe 0000:04:00.1: registered PHC device on eth5
[28663.018209] ixgbe 0000:04:00.1 eth5: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[28663.018356] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
[28663.026266] IP: [<ffffffffc00ddb21>] ixgbe_xmit_frame_ring+0x81/0xf50 [ixgbe]
[28663.033491] PGD 8000000046bcc067 PUD 46bcd067 PMD 0
[28663.038562] Oops: 0000 [#1] SMP
[28663.328921] Call Trace:
[28663.334598] <IRQ>
[28663.336551] [<ffffffffc00dea32>] ixgbe_xmit_frame+0x42/0x90 [ixgbe]
[28663.349627] [<ffffffff8171532d>] dev_hard_start_xmit+0x23d/0x400
[28663.358854] [<ffffffff81739d44>] sch_direct_xmit+0xe4/0x1f0
[28663.367602] [<ffffffff81739eeb>] __qdisc_run+0x9b/0x1c0
[28663.376110] [<ffffffff8171220e>] net_tx_action+0x15e/0x240
[28663.384673] [<ffffffff81084fb6>] __do_softirq+0xe6/0x2a0
[28663.392944] [<ffffffff81085395>] irq_exit+0x95/0xa0
[28663.400720] [<ffffffff8181d0f6>] do_IRQ+0x56/0xe0
[28663.408338] [<ffffffff8181a77f>] common_interrupt+0xbf/0xbf
[28663.416733] <EOI>
[28663.418680] [<ffffffff810998dc>] ? worker_thread+0x18c/0x480
[28663.430363] [<ffffffff81099750>] ? rescuer_thread+0x310/0x310
[28663.438870] [<ffffffff8109f138>] kthread+0xd8/0xf0
[28663.446368] [<ffffffff8109f060>] ? kthread_park+0x60/0x60
[28663.454385] [<ffffffff81819ff5>] ret_from_fork+0x55/0x80
[28663.462286] [<ffffffff8109f060>] ? kthread_park+0x60/0x60
[28663.470488] Code: 2a 41 83 e8 01 31 c0 45 0f b7 c0 49 83 c0 01 49 c1 e0 04 8b 74 07 3c 48 83 c0 10 8d 96 ff 3f 00 00 c1 ea 0e 01 d1 4c 39 c0 75 e8 <0f> b7 43 58 0f b7 73 5a 83 c1 03 31 d2 66 39 f0 66 0f 43 53 54
[28663.498992] RIP [<ffffffffc00ddb21>] ixgbe_xmit_frame_ring+0x81/0xf50 [ixgbe]
[28663.512112] RSP <ffff88103f283d20>
[28663.518217] CR2: 0000000000000058

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1794877/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp