On Thu, Dec 19, 2019 at 10:35 AM adam radford <aradf...@gmail.com> wrote:
>
> e1000-devel list,
>
> Problem:
>
> With kernel 4.19.29 and igb 5.4.0-k on Intel E5-2618Lv4 and  E5-2648Lv4 
> servers:

<snip>

> Using the default 8 MSI-X vectors (combined TX/RX queues), after 2
> days to 30 days of runtime we suddenly see unhandled interrupts, back
> to back TX timeouts / Reset Adapter sequences, or both at the same
> time on igb, i.e.:
>
> 2019-12-01T17:57:33.797+0000 controller- user.emerg kernel:
> [919635.664612] do_IRQ: 13.53 No irq handler for vector
> 2019-12-01T17:57:35.845+0000 controller- user.emerg kernel:
> [919637.712587] do_IRQ: 13.53 No irq handler for vector
> 2019-12-01T17:57:37.829+0000 controller- user.emerg kernel:
> [919639.696569] do_IRQ: 13.53 No irq handler for vector
> 2019-12-01T17:57:39.237+0000 controller- user.err kernel:
> [919641.103011] igb 0000:01:00.1 eth1: Reset adapter
> 2019-12-01T17:57:39.237+0000 controller- user.emerg kernel:
> [919641.103021] do_IRQ: 13.53 No irq handler for vector
> 2019-12-01T17:57:39.266+0000 controller- user.err kernel:
> [919641.132142] igb 0000:01:00.2 eth2: Reset adapter
> 2019-12-01T17:57:39.268+0000 controller- user.info kernel:
> [919641.139341] igb 0000:01:00.2 eth2: igb: eth2 NIC Link is Up 1000
> Mbps Full Duplex, Flow Control: RX
> 2019-12-01T17:57:39.346+0000 controller- user.emerg kernel:
> [919641.212614] do_IRQ: 13.53 No irq handler for vector
> 2019-12-01T17:57:39.363+0000 controller- user.info kernel:
> [919641.234249] igb 0000:01:00.2 eth2: igb: eth2 NIC Link is Down
> 2019-12-01T17:57:41.796+0000 controller- user.emerg kernel:
> [919643.663562] do_IRQ: 13.53 No irq handler for vector
> 2019-12-01T17:57:42.790+0000 controller- user.info kernel:
> [919644.661800] igb 0000:01:00.2 eth2: igb: eth2 NIC Link is Up 1000
> Mbps Full Duplex, Flow Control: RX
> 2019-12-01T17:57:43.841+0000 controller- user.info kernel:
> [919645.712796] igb 0000:01:00.1 eth1: igb: eth1 NIC Link is Up 1000
> Mbps Full Duplex, Flow Control: RX
> 2019-12-01T17:57:43.849+0000 controller- user.emerg kernel:
> [919645.714693] do_IRQ: 13.53 No irq handler for vector
> 2019-12-01T17:57:45.831+0000 controller- user.emerg kernel:
> [919647.698208] do_IRQ: 13.53 No irq handler for vector

So all of this points to an issue with the IRQ handler somehow not
being associated with the vector on this CPU.

I suspect not much has changed in the driver code itself. My thought
is that something is causing the CPU to lose track of the IRQ handler
for the vector. The "13.53" in the messages above means vector 53 on
CPU 13. It makes me wonder if irqbalance is bouncing the interrupts
between CPUs, and somewhere in that migration the IRQ handler for the
vector is getting lost.
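
To check whether every failure lands on the same CPU and vector, the
do_IRQ lines can be tallied from a saved log. A minimal sketch in
Python, assuming the "do_IRQ: <cpu>.<vector>" layout seen in the
messages above; the function name and the shortened sample lines are
only for illustration:

```python
import re
from collections import Counter

# Matches the "do_IRQ: <cpu>.<vector> No irq handler for vector" kernel message.
DO_IRQ_RE = re.compile(r"do_IRQ: (\d+)\.(\d+) No irq handler for vector")


def count_unhandled(log_text):
    """Return a Counter keyed by (cpu, vector), one count per matching message."""
    return Counter((int(cpu), int(vec)) for cpu, vec in DO_IRQ_RE.findall(log_text))


# Shortened copies of lines from the report, used only as sample input.
sample = """\
[919635.664612] do_IRQ: 13.53 No irq handler for vector
[919641.103011] igb 0000:01:00.1 eth1: Reset adapter
[919641.103021] do_IRQ: 13.53 No irq handler for vector
"""

for (cpu, vector), hits in count_unhandled(sample).items():
    print(f"CPU {cpu}, vector {vector}: {hits} unhandled interrupt(s)")
```

If more than one (CPU, vector) pair shows up over time, that would fit
the idea that the interrupts are being moved around.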

One way to test this is to disable irqbalance and then manually assign
the IRQ affinity of the interrupts for the NIC ports. With the
affinity pinned, nothing should be moving the interrupts between CPUs,
so if migration is the issue then the interface is unlikely to lose
the interrupt handler for its vectors.
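
The manual pinning could be scripted roughly as follows. This is only
a sketch under assumptions: irqbalance has already been stopped (for
example through the init system), the script runs as root, the
interface name and CPU list are placeholders to adjust, and the sample
/proc/interrupts text is made up for illustration. The helper names
are hypothetical; the only kernel interfaces used are /proc/interrupts
and /proc/irq/<N>/smp_affinity_list.

```python
import re


def nic_irqs(interrupts_text, ifname):
    """Return the IRQ numbers in /proc/interrupts text whose line names ifname.

    Queue vectors typically appear as "<ifname>-TxRx-<n>"; the word
    boundaries keep "eth1" from also matching "eth10".
    """
    pattern = re.compile(rf"\b{re.escape(ifname)}\b")
    irqs = []
    for line in interrupts_text.splitlines():
        m = re.match(r"\s*(\d+):", line)
        if m and pattern.search(line):
            irqs.append(int(m.group(1)))
    return irqs


def plan_affinity(irqs, cpus):
    """Map each IRQ to one CPU, round-robin over the given CPU list."""
    return {irq: cpus[i % len(cpus)] for i, irq in enumerate(irqs)}


def apply_plan(plan):
    """Write the plan to /proc/irq/<N>/smp_affinity_list (needs root)."""
    for irq, cpu in plan.items():
        with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
            f.write(str(cpu))


# Made-up sample in the usual /proc/interrupts layout, not from the report.
SAMPLE = """\
            CPU0       CPU1
  45:       1000          0   PCI-MSI 524289-edge      eth1
  46:        500        200   PCI-MSI 524290-edge      eth1-TxRx-0
  47:        300        100   PCI-MSI 524291-edge      eth1-TxRx-1
  60:         10          0   PCI-MSI 526336-edge      eth2-TxRx-0
"""

plan = plan_affinity(nic_irqs(SAMPLE, "eth1"), cpus=[2, 3])
print(plan)  # apply_plan(plan) would write it out; it is not called here
```

On the real system the input would come from reading /proc/interrupts
instead of SAMPLE, and apply_plan() would be called once the plan looks
right. If the do_IRQ messages stop with the affinity fixed, that points
at the interrupt migration.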

Hope that helps.

- Alex


_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel Ethernet, visit 
https://forums.intel.com/s/topic/0TO0P00000018NbWAI/intel-ethernet
