I reached out to phausman on #canonical-support
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1855409
Title:
qede driver causes 100% CPU load
Status in linux package in Ubuntu:
Fix Released
Status in linux source package in Xenial:
Invalid
Status in linux source package in Bionic:
Fix Committed
Status in linux source package in Disco:
Fix Committed
Status in linux source package in Eoan:
Fix Released
Status in linux source package in Focal:
Fix Released
Bug description:
[Impact]
* The PTP feature in the qede driver is implemented in such a way that,
if the NIC firmware takes some time to perform the timestamping, the
PTP worker function reschedules itself indefinitely until the value
read from a device register is meaningful. With that behavior, if a
userspace tool requests a badly configured TX/RX filter (or if the NIC
firmware has any other timestamping issue), the function
qede_ptp_task() reschedules itself forever and causes unbounded
resource consumption. This manifests as a kworker thread consuming
100% of CPU.
* The dmesg log will show a message like this:
"qede_ptp_tx_ts:533(eno3)]Timestamping in progress"
Also, using perf, the user can observe a stack like the following:
- 44.76% 0.00% kworker/16:5 [kernel.kallsyms]
   ret_from_fork
   - kthread
      - 44.74% worker_thread
         - 44.57% process_one_work
            - 42.67% qede_ptp_task
               - 38.86% qed_ptp_hw_read_tx_ts
                    qed_rd
               - 3.03% queue_work_on
                  - 2.06% __queue_work
                     - 0.68% get_work_pool
                        - 0.61% radix_tree_lookup
                             __radix_tree_lookup
              0.50% set_work_pool_and_clear_pending
* The patch proposed in this SRU request refactors the PTP worker in
qede by adding a time limit, after which the task no longer reschedules
itself and the timestamp procedure fails: 9adebac37e7d ("qede:
Handle infinite driver spinning for Tx timestamp.")
http://git.kernel.org/linus/9adebac37e7d
Besides fixing the issue, it also adds an ethtool statistic for
accounting the PTP errors.
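
The idea of a time-limited, self-rescheduling worker can be sketched in
plain userspace C. This is only an illustrative model, not the driver
code: every name below (ptp_tx_state, ptp_poll_once,
ptp_run_until_settled), the millisecond clock, and the timeout value
used in the example are assumptions made for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of a time-limited polling worker. */

enum ptp_poll_result {
    PTP_POLL_DONE,       /* a meaningful timestamp was read */
    PTP_POLL_RESCHEDULE, /* not ready yet, still within the time limit */
    PTP_POLL_TIMED_OUT   /* limit exceeded: give up and count an error */
};

struct ptp_tx_state {
    uint64_t start_ms;   /* when the Tx timestamp request was issued */
    uint64_t timeout_ms; /* assumed time limit for the hardware to answer */
    unsigned int errors; /* stand-in for an error statistic */
};

/*
 * One invocation of the worker. ts_ready models whether the register
 * read returned a meaningful value on this attempt.
 */
enum ptp_poll_result ptp_poll_once(struct ptp_tx_state *st,
                                   uint64_t now_ms, bool ts_ready)
{
    if (ts_ready)
        return PTP_POLL_DONE;

    if (now_ms - st->start_ms >= st->timeout_ms) {
        st->errors++;              /* account the failed timestamp */
        return PTP_POLL_TIMED_OUT; /* do not reschedule any more */
    }

    return PTP_POLL_RESCHEDULE;
}

/*
 * Drive the worker with a fake clock that advances step_ms per
 * invocation while the "hardware" never answers. Returns how many
 * times the worker ran before it stopped rescheduling itself.
 */
unsigned int ptp_run_until_settled(struct ptp_tx_state *st, uint64_t step_ms)
{
    uint64_t now = st->start_ms;
    unsigned int runs = 0;

    if (step_ms == 0)
        step_ms = 1; /* keep the fake clock moving so the loop ends */

    for (;;) {
        runs++;
        if (ptp_poll_once(st, now, false) != PTP_POLL_RESCHEDULE)
            break;
        now += step_ms;
    }
    return runs;
}
```

Without the elapsed-time check in ptp_poll_once(), the loop in
ptp_run_until_settled() would never end when the hardware never
answers, which corresponds to the kworker spinning described above.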
[Test case]
By using chrony in Bionic, the following steps will reproduce the
issue:
a) Install chrony on Bionic on a system with a working NIC managed by qede;
b) Edit the chrony configuration and add "hwtimestamp *" to the top of its
conf file;
c) Restart the chrony service.
Check dmesg for the "[...]Timestamping in progress" message and the
overall CPU workload using a tool like "top" to observe a kthread
consuming 100% of CPU.
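
When checking captured kernel log text programmatically, a small helper
along these lines could flag the relevant lines. It is a sketch only:
the function name is hypothetical, and it assumes the message keeps the
"qede_ptp_tx_ts" prefix and the "Timestamping in progress" text quoted
in this report.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if a kernel log line looks like
 * the qede PTP message quoted in the bug description.
 */
bool is_qede_ptp_stall_line(const char *line)
{
    return line != NULL
        && strstr(line, "qede_ptp_tx_ts") != NULL
        && strstr(line, "Timestamping in progress") != NULL;
}
```

A line matching on every pass of a log scan, together with a kworker
thread at 100% CPU in "top", would be consistent with the issue
described here.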
[Regression potential]
The patch scope is restricted to the qede PTP handler, and the patch has
been upstream for more than 7 months. If there is any possibility of
regression, the worst case would be an issue affecting packet
timestamping, without touching the regular xmit path of the driver.