On 5/12/25 19:52, Kiyanovski, Arthur wrote:
---------- Forwarded message ---------
From: Pete Wright <[email protected]>
Date: Mon, 12 May 2025 at 12:30
Subject: Re: ena(4) tx timeout messages in dmesg
To: Colin Percival <[email protected]>, <[email protected]>
Cc: Arthur Kiyanovski <[email protected]>
On 5/12/25 11:56, Colin Percival wrote:
On 5/12/25 11:25, Pete Wright wrote:
On 5/12/25 11:17, Colin Percival wrote:
On 5/12/25 11:04, Pete Wright wrote:
hey there - I have an EC2 instance that I'm using as an NFS server
and have noticed the following messages in my dmesg buffer:
[...]
ena0: Found a Tx that wasn't completed on time, qid 3, index 998. 1
msecs have passed since last cleanup. Missing Tx timeout value 5000
msecs.
I've heard that this can be caused by a thread being starved for
CPU, possibly due to FreeBSD kernel scheduler issues, but that was
on a far more heavily loaded system. What instance type are you
running on?
Oh of course, I forgot to provide the useful info:
# uname -ar
FreeBSD airflow-nfs.q0.ringdna.net 14.2-RELEASE-p1 FreeBSD 14.2-RELEASE-p1 GENERIC amd64
Instance type:
t3a.xlarge
I also verified I have plenty of "burstable credit" available, since this
is a T-class system (the current balance is steady at 2,300 credits).
Ah, this won't necessarily help you -- T family instances are on
shared hardware so even if you have burstable credits it's possible
that you'll be unlucky with "noisy neighbours" and the sibling
instances will all want CPU at the same time as you. But I think
there's probably something else going on as well.
Oh that's a good point. Since this is a pre-prod system, that is less of a
concern, as we want to limit spend when possible. I'll be spinning up
production systems in the coming week or so that will be on a "c" class
system, and I'll keep an eye out for similar messages in that environment.
-pete
--
Pete Wright
[email protected]
Hi Colin, Pete,
Your analysis regarding the CPU being occupied is the classic explanation for
this kind of message.
The messages are consistent with a CPU not being available for the interrupt
handler to run.
Although you say you have burstable credits available, the fact that you are
using T instance types does make you more susceptible to such issues.
Also, when you say you have 25% CPU usage, how did you check that?
Are you using tools that give you an average over some time period? If so, you
may have 0% CPU usage 75% of the time and 100% CPU usage 25% of the time.
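One way to check at finer granularity (nothing ena-specific, just the stock
FreeBSD tools; the one-second interval is only an example) is to watch the
per-CPU load at per-second resolution:

  top -P -s 1        # per-CPU state lines, refreshed every second
  systat -vmstat 1   # per-second overview including interrupt rates

A spike of even a few seconds on the CPU that handles the ena interrupts can be
invisible in a one-minute average yet still long enough to exceed the 5000 msec
missing Tx timeout from the log above.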
As you already suggested, the first thing we would like to eliminate is the T
instance type.
If all works - great!
If not, you may want to look into spreading the interrupts over the different
CPUs using
https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena#io-irq-affinity
and also making sure that the CPU-heavy processes you have run on different
CPUs than the ones that handle the interrupts.
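For illustration only (the IRQ numbers, CPU numbers and pid below are made up;
vmstat -i shows the real ena vectors on your instance), cpuset(1) can do both:

  # list the interrupt vectors registered by the ena driver
  vmstat -i | grep ena

  # pin each ena queue interrupt to its own CPU (here irq 26 -> CPU 0, irq 27 -> CPU 1)
  cpuset -l 0 -x 26
  cpuset -l 1 -x 27

  # keep a CPU-heavy process (hypothetical pid 1234) on the remaining CPUs
  cpuset -l 2-3 -p 1234

The README linked above walks through the IRQ affinity part in more detail.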
Hope this helps,
Arthur
Thanks for the context Arthur, I'll take a look at that sysctl knob. As
I said, the box is only serving a Python virtual environment to a pool of
EC2 compute nodes, and the dataset resides in memory, so nothing too
crazy. The load does have spikes, but they are pretty brief and rarely
over 70%. I'm collecting metrics via Telegraf, and also observe load
via the usual suspects like top, systat, etc.
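One thing I may also try is pulling the ena driver's own counters out of the
dev.ena sysctl tree so they land in the same Telegraf pipeline as the CPU
metrics (the exact counter names differ between driver versions, so the grep
pattern below is just a guess):

  sysctl dev.ena.0 | grep -iE 'missing|timeout|keep_alive'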
It sounds like ena(4) is particularly sensitive to CPU spikes
though - at least with this VM configuration. If I continue to see
these messages in dmesg I'll test out distributing the IRQs; otherwise I
think I can chalk this up to a noisy neighbor or something similar.
thanks!
-pete
--
Pete Wright
[email protected]