I was reading the assembly and comparing it with the code to check the accuracy of the registers in the dump, and also to find the points at which it could have failed.

In the ixgbe transmit function, ixgbe_xmit_frame(), we have the following:

    tx_ring = ring ? ring : adapter->tx_ring[skb->queue_mapping];

Checking the assembly, it really ignores ring at this point since it comes in as NULL, so we are getting a NULL tx_ring because adapter->tx_ring[skb->queue_mapping] is NULL.

The struct sk_buff passed in %rdi is odd; it seems to contain no valid data. Even so, queue_mapping is 0x0, and checking ixgbe_adapter at the moment of the crash, adapter->tx_ring[0x0] is valid and shouldn't cause the NULL pointer dereference.

I think a race may be happening: at the moment tx_ring is assigned in ixgbe_xmit_frame() the array entry is NULL, but it is filled with a valid pointer right after by another function running concurrently. I've noticed that a queue allocation function assigns this pointer. Also, one interesting thing I've observed in dmesg is a series of what seem to be interface/queue re-initializations:

    [ 6.628974] ixgbe 0000:04:00.1: Multiqueue Enabled: Rx Queue count = 20, Tx Queue count = 20
    [...]
    [ 1493.198280] ixgbe 0000:04:00.1 eth5: NIC Link is Up 10 Gbps, Flow Control: RX/TX
    [...]
    [ 4113.173315] ixgbe 0000:04:00.1: Multiqueue Enabled: Rx Queue count = 19, Tx Queue count = 19
    [ 4113.365528] ixgbe 0000:04:00.1 eth5: NIC Link is Up 10 Gbps, Flow Control: RX/TX
    [...]
    [28662.834289] ixgbe 0000:04:00.1: Multiqueue Enabled: Rx Queue count = 18, Tx Queue count = 18
    [28663.018209] ixgbe 0000:04:00.1 eth5: NIC Link is Up 10 Gbps, Flow Control: RX/TX
    [28663.018356] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058

So, the number of queues is reduced by 1 each time we see these messages in dmesg. It seems to be triggered by "ethtool --set-channels" changing the number of tx/rx queues for the interface.

Also, an oddity from the dump:

    crash> ixgbe_adapter -x ffff8800538c0840
    struct ixgbe_adapter {
      active_vlans = {0x1, 0x0, [...] 0x0, 0x5500000000000, 0x0},
    [...]

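The set bits in active_vlans can be mapped back to VLAN IDs with a small userspace helper. This is only a sketch under assumptions that are not confirmed by the dump: that active_vlans is a bitmap covering 4096 VLAN IDs stored in 64-bit words (64 words in total), and that the nonzero word shown second-to-last is therefore word index 62. The [...] in the crash output hides the real position, so that index is a guess. The names demo_vlan_ids_in_word() and DEMO_BITS_PER_WORD are made up for illustration and are not kernel or crash APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-bit longs, so each bitmap word covers 64 VLAN IDs. */
#define DEMO_BITS_PER_WORD 64u

/*
 * Hypothetical helper: list the VLAN IDs encoded by one word of a VLAN
 * bitmap. Stores at most max_ids IDs in ids[] (lowest bit first) and
 * returns the total number of bits set in the word.
 */
size_t demo_vlan_ids_in_word(uint64_t word, unsigned int word_index,
                             unsigned int *ids, size_t max_ids)
{
    size_t n = 0;

    assert(ids != NULL || max_ids == 0);

    for (unsigned int bit = 0; bit < DEMO_BITS_PER_WORD; bit++) {
        if (!(word & (UINT64_C(1) << bit)))
            continue;               /* bit not set, no VLAN here */
        if (n < max_ids)
            ids[n] = word_index * DEMO_BITS_PER_WORD + bit;
        n++;
    }
    return n;
}
```

Under those assumptions, calling demo_vlan_ids_in_word(0x5500000000000, 62, ids, 8) reports four bits set, corresponding to VLAN IDs 4012, 4014, 4016 and 4018. If the word actually sits at a different index, the IDs shift by 64 per word.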
So, besides VLAN 0, there are more bits set in this bit field. I don't know why; it doesn't seem expected. I will study the code more.

--
https://bugs.launchpad.net/bugs/1794877

Title:
  Crash in ixgbe, during tx packet xmit (while potentially changing queues number)

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  It was reported that the ixgbe driver may crash with the following stack trace while changing the interrupt/queue configuration (probably using ethtool --set-channels):

  [28661.949147] init: irqbalance main process (19397) killed by TERM signal
  [28662.381154] ixgbe 0000:04:00.0: removed PHC on eth4
  [28662.502142] ixgbe 0000:04:00.0: Multiqueue Enabled: Rx Queue count = 18, Tx Queue count = 18
  [28662.588634] ixgbe 0000:04:00.0: registered PHC device on eth4
  [28662.689789] br-iscsi-left: port 1(eth4.4011) entered disabled state
  [28662.689951] br-sio-bel: port 1(eth4.4015) entered disabled state
  [28662.690039] br-sio-fel: port 1(eth4.4017) entered disabled state
  [28662.694227] ixgbe 0000:04:00.0 eth4: NIC Link is Up 10 Gbps, Flow Control: RX/TX
  [28662.694506] br-iscsi-left: port 1(eth4.4011) entered forwarding state
  [28662.694519] br-iscsi-left: port 1(eth4.4011) entered forwarding state
  [28662.694596] br-sio-bel: port 1(eth4.4015) entered forwarding state
  [28662.694604] br-sio-bel: port 1(eth4.4015) entered forwarding state
  [28662.694651] br-sio-fel: port 1(eth4.4017) entered forwarding state
  [28662.694658] br-sio-fel: port 1(eth4.4017) entered forwarding state
  [28662.709921] ixgbe 0000:04:00.1: removed PHC on eth5
  [28662.834289] ixgbe 0000:04:00.1: Multiqueue Enabled: Rx Queue count = 18, Tx Queue count = 18
  [28662.915121] ixgbe 0000:04:00.1: registered PHC device on eth5
  [28663.018209] ixgbe 0000:04:00.1 eth5: NIC Link is Up 10 Gbps, Flow Control: RX/TX
  [28663.018356] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
  [28663.026266] IP: [<ffffffffc00ddb21>] ixgbe_xmit_frame_ring+0x81/0xf50 [ixgbe]
  [28663.033491] PGD 8000000046bcc067 PUD 46bcd067 PMD 0
  [28663.038562] Oops: 0000 [#1] SMP
  [28663.328921] Call Trace:
  [28663.334598] <IRQ>
  [28663.336551] [<ffffffffc00dea32>] ixgbe_xmit_frame+0x42/0x90 [ixgbe]
  [28663.349627] [<ffffffff8171532d>] dev_hard_start_xmit+0x23d/0x400
  [28663.358854] [<ffffffff81739d44>] sch_direct_xmit+0xe4/0x1f0
  [28663.367602] [<ffffffff81739eeb>] __qdisc_run+0x9b/0x1c0
  [28663.376110] [<ffffffff8171220e>] net_tx_action+0x15e/0x240
  [28663.384673] [<ffffffff81084fb6>] __do_softirq+0xe6/0x2a0
  [28663.392944] [<ffffffff81085395>] irq_exit+0x95/0xa0
  [28663.400720] [<ffffffff8181d0f6>] do_IRQ+0x56/0xe0
  [28663.408338] [<ffffffff8181a77f>] common_interrupt+0xbf/0xbf
  [28663.416733] <EOI>
  [28663.418680] [<ffffffff810998dc>] ? worker_thread+0x18c/0x480
  [28663.430363] [<ffffffff81099750>] ? rescuer_thread+0x310/0x310
  [28663.438870] [<ffffffff8109f138>] kthread+0xd8/0xf0
  [28663.446368] [<ffffffff8109f060>] ? kthread_park+0x60/0x60
  [28663.454385] [<ffffffff81819ff5>] ret_from_fork+0x55/0x80
  [28663.462286] [<ffffffff8109f060>] ? kthread_park+0x60/0x60
  [28663.470488] Code: 2a 41 83 e8 01 31 c0 45 0f b7 c0 49 83 c0 01 49 c1 e0 04 8b 74 07 3c 48 83 c0 10 8d 96 ff 3f 00 00 c1 ea 0e 01 d1 4c 39 c0 75 e8 <0f> b7 43 58 0f b7 73 5a 83 c1 03 31 d2 66 39 f0 66 0f 43 53 54
  [28663.498992] RIP [<ffffffffc00ddb21>] ixgbe_xmit_frame_ring+0x81/0xf50 [ixgbe]
  [28663.512112] RSP <ffff88103f283d20>
  [28663.518217] CR2: 0000000000000058
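
A note on why CR2 is 0x58 rather than 0: as far as I can tell, the marked bytes in the Code line (0f b7 43 58) decode to a 16-bit load from 0x58(%rbx), so a NULL ring pointer in %rbx would fault at exactly that address. The userspace sketch below models the ring selection and that address computation. The structs are simplified stand-ins, not the real ixgbe layout: the padding is chosen only so that a field lands at offset 0x58, the field names are illustrative, and the bounds check and every demo_* name are my own additions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_TX_QUEUES 64u   /* assumption, for the sketch only */

/* Hypothetical stand-in for the ring struct; the real layout differs. */
struct demo_ring {
    uint8_t  pad[0x58];          /* padding so the next field is at 0x58 */
    uint16_t next_to_use;        /* illustrative field name */
    uint16_t next_to_clean;      /* illustrative field name */
};

struct demo_adapter {
    struct demo_ring *tx_ring[DEMO_MAX_TX_QUEUES];
};

/* Mirrors: tx_ring = ring ? ring : adapter->tx_ring[queue_mapping]; */
struct demo_ring *demo_select_tx_ring(struct demo_adapter *adapter,
                                      struct demo_ring *ring,
                                      uint16_t queue_mapping)
{
    assert(adapter != NULL);
    if (ring)
        return ring;
    if (queue_mapping >= DEMO_MAX_TX_QUEUES)
        return NULL;             /* defensive check added for the sketch */
    /* May be NULL if the slot is read while queues are being rebuilt. */
    return adapter->tx_ring[queue_mapping];
}

/* Address a load of ring->next_to_use would touch; nothing is dereferenced. */
uintptr_t demo_fault_address(const struct demo_ring *ring)
{
    return (uintptr_t)ring + offsetof(struct demo_ring, next_to_use);
}
```

With an empty slot, demo_select_tx_ring(&adapter, NULL, 0) returns NULL, and demo_fault_address(NULL) evaluates to 0x58, matching the faulting address in the oops. The sketch does not reproduce the race itself; it only shows that a slot observed as NULL at the moment of the read is enough to produce this fault address, even if the slot holds a valid pointer by the time the dump is taken.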