On Wed, Apr 23, 2025 at 04:20:07PM +0200, Marcus Wichelmann wrote:
> On 17.04.25 at 16:47, Maciej Fijalkowski wrote:
> > On Fri, Apr 11, 2025 at 10:14:57AM +0200, Michal Kubiak wrote:
> >> On Thu, Apr 10, 2025 at 04:54:35PM +0200, Marcus Wichelmann wrote:
> >>> On 10.04.25 at 16:30, Michal Kubiak wrote:
> >>>> On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
> >>>>> Hi,
> >>>>>
> >>>>> in a setup where I use native XDP to redirect packets to a bonding interface
> >>>>> that's backed by two ixgbe slaves, I noticed that the ixgbe driver constantly
> >>>>> resets the NIC with the following kernel output:
> >>>>>
> >>>>> ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
> >>>>>   Tx Queue <4>
> >>>>>   TDH, TDT <17e>, <17e>
> >>>>>   next_to_use <181>
> >>>>>   next_to_clean <17e>
> >>>>> tx_buffer_info[next_to_clean]
> >>>>>   time_stamp <0>
> >>>>>   jiffies <10025c380>
> >>>>> ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
> >>>>> ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
> >>>>> ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
> >>>>>
> >>>>> This only occurs in combination with a bonding interface and XDP, so I don't
> >>>>> know if this is an issue with ixgbe or the bonding driver.
> >>>>> I first discovered this with Linux 6.8.0-57, but kernel 6.14.0 and 6.15.0-rc1
> >>>>> show the same issue.
> >>>>>
> >>>>>
> >>>>> I managed to reproduce this bug in a lab environment. Here are some details
> >>>>> about my setup and the steps to reproduce the bug:
> >>>>>
> >>>>> [...]
> >>>>>
> >>>>> Do you have any ideas what may be causing this issue or what I can do to
> >>>>> diagnose this further?
> >>>>>
> >>>>> Please let me know when I should provide any more information.
> >>>>>
> >>>>>
> >>>>> Thanks!
> >>>>> Marcus
> >>>>>
> >>>>
> >> [...]
> >>
> >> Hi Marcus,
> >>
> >>> thank you for looking into it.
> >>> And not even 24 hours after my report, I'm very impressed! ;)
> >>
> >> Thanks! :-)
> >>
> >>> Interesting. I just tried again but had no luck yet with reproducing it
> >>> without a bonding interface. May I ask how your setup looks like?
> >>
> >> For now, I've just grabbed the first available system with the HW
> >> controlled by the "ixgbe" driver. In my case it was:
> >>
> >> Ethernet controller: Intel Corporation Ethernet Controller X550
> >>
> >> Also, for my first attempt, I didn't use the upstream kernel - I just tried
> >> the kernel installed on that system. It was the Fedora kernel:
> >>
> >> 6.12.8-200.fc41.x86_64
> >>
> >>
> >> I think that may be the "beauty" of timing issues - sometimes you can change
> >> just one piece in your system and get a completely different replication ratio.
> >> Anyway, the higher the repro probability, the easier it is to debug
> >> the timing problem. :-)
> >
> > Hi Marcus, to break the silence could you try to apply the diff below on
> > your side?
>
> Hi, thank you for the patch. We've tried it and with your changes we can no
> longer trigger the error and the NIC is no longer being reset.
>
> > We see several issues around XDP queues in ixgbe, but before we
> > proceed let's this small change on your side.
>
> How confident are you that this patch is sufficient to make things stable enough
> for production use? Was it just the Tx hang detection that was misbehaving for
> the XDP case, or is there an underlying issue with the XDP queues that is not
> solved by disabling the detection for it?

I believe the correct way to approach this is to move the Tx hang detection
into ixgbe_tx_timeout(), as that is where this logic belongs. By doing so
I suppose we would kill two birds with one stone, since the mentioned ndo is
called from the netdev watchdog, which does not cover XDP Tx queues.

>
> With our current setup we cannot verify accurately, that we have no packet loss
> or stuck queues. We can do additional tests to verify that.
>
> > Additional question, do you have enabled pause frames on your setup?
>
> Pause frames were enabled, but we can also reproduce it after disabling them,
> without your patch.

Please give your setup a go with pause frames enabled and with the patch that
I shared previously applied, and let us see the results. As said above, I do
not think it is correct to check for hung queues in the Tx descriptor cleaning
routine. That is the job of the ndo_tx_timeout callback.

>
> Thanks!

Thanks for the feedback and testing. I'll provide a proper fix tomorrow and
CC you so you can take it for a spin.

> Marcus
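
For readers following the thread, the argument about where the hang check
lives can be illustrated with a toy userspace model. This is only a sketch
under stated assumptions, not driver code and not the announced fix:
toy_ring, toy_watchdog_scan() and toy_clean_path_check() are hypothetical
names, and the "work pending with no progress for longer than a timeout" rule
is a deliberate simplification of what the netdev watchdog and the ixgbe
cleaning routine really check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the ixgbe driver. */
struct toy_ring {
	bool is_xdp;                 /* XDP Tx ring, not a netdev Tx queue */
	unsigned int pending;        /* descriptors queued but not completed */
	unsigned long last_progress; /* tick at which a descriptor last completed */
};

/*
 * Watchdog-style check: only stack-visible queues are scanned, so a ring
 * flagged is_xdp can never be reported as hung from here.
 * Returns the index of the first hung ring, or -1 if there is none.
 */
int toy_watchdog_scan(const struct toy_ring *rings, size_t n,
		      unsigned long now, unsigned long timeout)
{
	assert(rings != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (rings[i].is_xdp)
			continue; /* not covered by the watchdog */
		if (rings[i].pending && now - rings[i].last_progress > timeout)
			return (int)i;
	}
	return -1;
}

/*
 * Cleaning-path-style check: it runs for whichever ring is being cleaned,
 * XDP rings included, so an XDP ring can be flagged as hung here.
 */
bool toy_clean_path_check(const struct toy_ring *ring,
			  unsigned long now, unsigned long timeout)
{
	return ring->pending && now - ring->last_progress > timeout;
}
```

In this model an XDP ring with pending descriptors and no recent progress is
flagged by toy_clean_path_check() but is skipped by toy_watchdog_scan(),
which is the "two birds with one stone" effect described above: a check tied
to the watchdog path simply never sees XDP Tx rings.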
