On 5/12/25 19:52, Kiyanovski, Arthur wrote:
---------- Forwarded message ---------
From: Pete Wright <[email protected]>
Date: Mon, 12 May 2025 at 12:30
Subject: Re: ena(4) tx timeout messages in dmesg
To: Colin Percival <[email protected]>, <[email protected]>
Cc: Arthur Kiyanovski <[email protected]>
On 5/12/25 11:56, Colin Percival wrote:
On 5/12/25 11:25, Pete Wright wrote:
On 5/12/25 11:17, Colin Percival wrote:
On 5/12/25 11:04, Pete Wright wrote:
hey there - I have an EC2 instance that I'm using as an NFS server
and have noticed the following messages in my dmesg buffer:
[...]
ena0: Found a Tx that wasn't completed on time, qid 3, index 998. 1
msecs have passed since last cleanup. Missing Tx timeout value 5000
msecs.
I've heard that this can be caused by a thread being starved for
CPU, possibly due to FreeBSD kernel scheduler issues, but that was
on a far more heavily loaded system. What instance type are you
running on?
Oh of course, I forgot to provide the useful info:
# uname -ar
FreeBSD airflow-nfs.q0.ringdna.net 14.2-RELEASE-p1 FreeBSD 14.2-RELEASE-p1 GENERIC amd64
Instance type:
t3a.xlarge
I also verified I have plenty of "burstable credit" available, since this
is a T-class system (the current balance is steady at 2,300 credits).
Ah, this won't necessarily help you -- T family instances are on
shared hardware so even if you have burstable credits it's possible
that you'll be unlucky with "noisy neighbours" and the sibling
instances will all want CPU at the same time as you. But I think
there's probably something else going on as well.
Oh that's a good point. Since this is a pre-prod system, that is less of a
concern, as we want to limit spend when possible. I'll be spinning up
production systems in the coming week or so that will be on a "c" class
system, and I'll keep an eye out for similar messages in that environment.
-pete
--
Pete Wright
[email protected]
Hi Colin, Pete,
Your analysis regarding the CPU being occupied is the classic explanation for
this kind of message.
The messages are consistent with a CPU not being available for the interrupt
handler to run.
Although you say you have burstable credits available, the fact that you are
using T instance types does make you more susceptible to such issues.
Also, when you say you have 25% CPU usage, how did you check that?
Are you using tools that give you an average over some time period? If so, you
may have 0% CPU usage 75% of the time and 100% CPU usage 25% of the time.
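One way to check at finer granularity (nothing ena-specific, just the stock
FreeBSD tools; the one-second interval is only an example) is to watch the
per-CPU load at per-second resolution:

  top -P -s 1        # per-CPU state lines, refreshed every second
  systat -vmstat 1   # per-second overview including interrupt rates

A spike of even a few seconds on the CPU that handles the ena interrupts can be
invisible in a one-minute average yet still long enough to exceed the 5000 msec
missing Tx timeout from the log above.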
As you already suggested, the first thing we would like to eliminate is the T
instance type.
If all works - great!
If not, you may want to look into spreading the interrupts over the different
CPUs using
https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#io-irq-affinity
and also making sure that the CPU-heavy processes you have run on different
CPUs than the ones that handle the interrupts.
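For illustration only (the IRQ numbers, CPU numbers and pid below are made up;
vmstat -i shows the real ena vectors on your instance), cpuset(1) can do both:

  # list the interrupt vectors registered by the ena driver
  vmstat -i | grep ena

  # pin each ena queue interrupt to its own CPU (here irq 26 -> CPU 0, irq 27 -> CPU 1)
  cpuset -l 0 -x 26
  cpuset -l 1 -x 27

  # keep a CPU-heavy process (hypothetical pid 1234) on the remaining CPUs
  cpuset -l 2-3 -p 1234

The README linked above walks through the IRQ affinity part in more detail.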
Hope this helps,
Arthur
Thanks for the context Arthur, I'll take a look at that sysctl knob. As
I said, the box is only serving a Python virtual environment to a pool of
EC2 compute nodes, and the dataset resides in memory, so nothing too
crazy. The load does have spikes, but they are pretty brief and rarely
over 70%. I'm collecting metrics via Telegraf, and also observe load
via the usual suspects like top, systat, etc.
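One thing I may also try is pulling the ena driver's own counters out of the
dev.ena sysctl tree so they land in the same Telegraf pipeline as the CPU
metrics (the exact counter names differ between driver versions, so the grep
pattern below is just a guess):

  sysctl dev.ena.0 | grep -iE 'missing|timeout|keep_alive'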
It sounds like ena(4) is particularly sensitive to CPU spikes
though - at least with this VM configuration. If I continue to see
these messages in dmesg I'll test out distributing the IRQs; otherwise I
think I can chalk this up to a noisy neighbor or something similar.
thanks!
-pete
--
Pete Wright
[email protected]