On Thu, Dec 19, 2019 at 10:35 AM adam radford <aradf...@gmail.com> wrote:
>
> e1000-devel list,
>
> Problem:
>
> With kernel 4.19.29 and igb 5.4.0-k on Intel E5-2618Lv4 and E5-2648Lv4
> servers:

<snip>

> Using the default 8 MSI-X vectors (combined TX/RX queues), after 2
> days to 30 days of runtime we suddenly see unhandled interrupts, back
> to back TX timeouts / Reset Adapter sequences, or both at the same
> time on igb, i.e.:
>
> 2019-12-01T17:57:33.797+0000 controller- user.emerg kernel:
> [919635.664612] do_IRQ: 13.53 No irq handler for vector
> 2019-12-01T17:57:35.845+0000 controller- user.emerg kernel:
> [919637.712587] do_IRQ: 13.53 No irq handler for vector
> 2019-12-01T17:57:37.829+0000 controller- user.emerg kernel:
> [919639.696569] do_IRQ: 13.53 No irq handler for vector
> 2019-12-01T17:57:39.237+0000 controller- user.err kernel:
> [919641.103011] igb 0000:01:00.1 eth1: Reset adapter
> 2019-12-01T17:57:39.237+0000 controller- user.emerg kernel:
> [919641.103021] do_IRQ: 13.53 No irq handler for vector
> 2019-12-01T17:57:39.266+0000 controller- user.err kernel:
> [919641.132142] igb 0000:01:00.2 eth2: Reset adapter
> 2019-12-01T17:57:39.268+0000 controller- user.info kernel:
> [919641.139341] igb 0000:01:00.2 eth2: igb: eth2 NIC Link is Up 1000
> Mbps Full Duplex, Flow Control: RX
> 2019-12-01T17:57:39.346+0000 controller- user.emerg kernel:
> [919641.212614] do_IRQ: 13.53 No irq handler for vector
> 2019-12-01T17:57:39.363+0000 controller- user.info kernel:
> [919641.234249] igb 0000:01:00.2 eth2: igb: eth2 NIC Link is Down
> 2019-12-01T17:57:41.796+0000 controller- user.emerg kernel:
> [919643.663562] do_IRQ: 13.53 No irq handler for vector
> 2019-12-01T17:57:42.790+0000 controller- user.info kernel:
> [919644.661800] igb 0000:01:00.2 eth2: igb: eth2 NIC Link is Up 1000
> Mbps Full Duplex, Flow Control: RX
> 2019-12-01T17:57:43.841+0000 controller- user.info kernel:
> [919645.712796] igb 0000:01:00.1 eth1: igb: eth1 NIC Link is Up 1000
> Mbps Full Duplex, Flow Control: RX
> 2019-12-01T17:57:43.849+0000 controller- user.emerg kernel:
> [919645.714693] do_IRQ: 13.53 No irq handler for vector
> 2019-12-01T17:57:45.831+0000 controller- user.emerg kernel:
> [919647.698208] do_IRQ: 13.53 No irq handler for vector

So all of this points to an issue with the IRQ handler somehow not
being associated with the vector on this CPU. I suspect not much has
changed in the driver code itself. My thought is that something is
causing the CPU to lose track of the IRQ handler for the vector.

The messages above point to IRQ vector 53 on CPU 13. That makes me
wonder whether irqbalance is bouncing the interrupts between CPUs and,
in the process of that migration, the IRQ handler is somehow being
lost.

One thing you might try, to determine whether something like this is
the issue, is to disable irqbalance and then manually assign the IRQ
affinity of the interrupts for the NIC ports. As long as that works
you wouldn't need to worry about the interrupts being moved between
CPUs, and if the migration is the issue then the interface is unlikely
to lose the interrupt handler for its vectors.

Hope that helps.

- Alex
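
For reference, the manual pinning suggested above could be sketched roughly as follows. This is only a sketch under a few assumptions: irqbalance has already been stopped (for example with `systemctl stop irqbalance` on a systemd-based system), the igb vectors appear in /proc/interrupts under names such as `eth1` or `eth1-TxRx-0`, and the CPU list is just an example. The function names (`find_irqs`, `plan_affinity`, `apply_affinity`) are made up for illustration; only /proc/interrupts and /proc/irq/<N>/smp_affinity_list are real kernel interfaces.

```python
import re


def find_irqs(interrupts_text, ifname):
    """Return the IRQ numbers in /proc/interrupts text that belong to ifname.

    Assumption: the last field of each line is the handler name and the
    driver names its vectors "<ifname>" or "<ifname>-<something>".
    """
    irqs = []
    for line in interrupts_text.splitlines():
        m = re.match(r"\s*(\d+):", line)
        if not m:
            continue  # header line or named entries such as NMI/LOC
        name = line.split()[-1]
        if name == ifname or name.startswith(ifname + "-"):
            irqs.append(int(m.group(1)))
    return irqs


def plan_affinity(irqs, cpus):
    """Spread the IRQs round-robin over the given CPUs; returns {irq: cpu}."""
    return {irq: cpus[i % len(cpus)] for i, irq in enumerate(irqs)}


def apply_affinity(plan, dry_run=True):
    """Pin each IRQ to its CPU via smp_affinity_list (needs root).

    With dry_run=True only the equivalent shell commands are printed.
    """
    for irq, cpu in sorted(plan.items()):
        path = f"/proc/irq/{irq}/smp_affinity_list"
        if dry_run:
            print(f"echo {cpu} > {path}")
        else:
            with open(path, "w") as f:
                f.write(str(cpu))
```

To try it, read /proc/interrupts, call `find_irqs(text, "eth1")`, pass the result to `plan_affinity(irqs, [0, 1, 2, 3])`, and hand the plan to `apply_affinity`. Leaving `dry_run=True` lets you review the writes before changing anything. Watching whether the "No irq handler for vector" messages come back with the affinities fixed should help confirm or rule out the migration theory.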