On 01/03/2023 14:36, Ilya Maximets wrote:
On 3/1/23 15:24, Brendan Doyle wrote:

On 01/03/2023 13:13, Ilya Maximets wrote:
On 3/1/23 12:42, Brendan Doyle via discuss wrote:
Hi Folks,

Whilst launching a lot (150+) of VMs on a hypervisor we occasionally
see errors like
"libvirt: error : internal error: Unable to add port vnet114 to OVS bridge 
br-int"

in /var/log/messages, for example:


Feb 25 04:13:15 pcacn001 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl
--timeout=5 -- --may-exist add-port br-int vnet87 -- set Interface vnet87
"external-ids:attached-mac=\"00:13:97:42:ff:5d\"" -- set Interface vnet87
"external-ids:iface-id=\"b652aef0-b4c2-11ed-9fba-a8698c05eb74\"" -- set
Interface vnet87
"external-ids:vm-id=\"ba70d0b3-9b47-48e8-9ed4-a8bf04bae901\"" -- set
Interface vnet87 external-ids:iface-status=active

Feb 25 04:13:20 pcacn001 journal: internal error: Child process (ovs-vsctl
--timeout=5 -- --may-exist add-port br-int vnet87 -- set Interface vnet87
'external-ids:attached-mac="00:13:97:42:ff:5d"' -- set Interface vnet87
'external-ids:iface-id="b652aef0-b4c2-11ed-9fba-a8698c05eb74"' -- set
Interface vnet87 'external-ids:vm-id="ba70d0b3-9b47-48e8-9ed4-a8bf04bae901"'
-- set Interface vnet87 external-ids:iface-status=active) unexpected fatal
signal 14: 2023-02-25T04:13:20Z|00002|fatal_signal|WARN|terminating with
signal 14 (Alarm clock)
Feb 25 04:13:20 pcacn001 journal: internal error: Unable to add port vnet87
to OVS bridge br-int


There is nothing suspicious in the ovn-controller log, but the ovsdb-server and
ovs-vswitchd logs for this time period do contain WARNs:

ovsdb-server.log
================
2023-02-25T04:13:08.896Z|108756|jsonrpc|WARN|unix#84848: receive error: 
Connection reset by peer
2023-02-25T04:13:08.896Z|108757|reconnect|WARN|unix#84848: connection dropped 
(Connection reset by peer)
2023-02-25T04:13:08.910Z|108758|reconnect|WARN|unix#84850: connection dropped 
(Connection reset by peer)
2023-02-25T04:13:08.918Z|108759|reconnect|WARN|unix#84851: connection dropped 
(Connection reset by peer)
2023-02-25T04:13:08.925Z|108760|reconnect|WARN|unix#84852: connection dropped 
(Connection reset by peer)
2023-02-25T04:13:08.932Z|108761|reconnect|WARN|unix#84853: connection dropped 
(Connection reset by peer)
2023-02-25T04:13:08.938Z|108762|reconnect|WARN|unix#84854: connection dropped 
(Connection reset by peer)
2023-02-25T04:13:09.238Z|108763|reconnect|WARN|unix#84855: connection dropped 
(Connection reset by peer)
2023-02-25T04:13:09.277Z|108764|reconnect|WARN|unix#84856: connection dropped 
(Connection reset by peer)
2023-02-25T04:13:09.284Z|108765|reconnect|WARN|unix#84857: connection dropped 
(Connection reset by peer)
2023-02-25T04:13:09.300Z|108766|reconnect|WARN|unix#84859: connection dropped 
(Connection reset by peer)
2023-02-25T04:13:09.308Z|108767|reconnect|WARN|unix#84860: connection dropped 
(Connection reset by peer)
2023-02-25T04:13:09.323Z|108768|reconnect|WARN|unix#84862: connection dropped 
(Connection reset by peer)
2023-02-25T04:14:46.220Z|108769|jsonrpc|WARN|Dropped 11 log messages in last 97 
seconds (most recently, 97 seconds ago) due to excessive rate
2023-02-25T04:14:46.220Z|108770|jsonrpc|WARN|unix#84870: receive error: 
Connection reset by peer

ovs-vswitchd.log
=================
2023-02-25T04:13:00.849Z|00258|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting for 
main to quiesce
2023-02-25T04:13:01.694Z|04014|timeval|WARN|Unreasonably long 1910ms poll 
interval (902ms user, 1007ms system)
2023-02-25T04:13:01.694Z|04015|timeval|WARN|faults: 21 minor, 0 major
2023-02-25T04:13:01.694Z|04016|timeval|WARN|context switches: 0 voluntary, 2 
involuntary
2023-02-25T04:13:01.694Z|04017|coverage|INFO|Dropped 12 log messages in last 65 
seconds (most recently, 40 seconds ago) due to excessive rate
2023-02-25T04:13:01.694Z|04018|coverage|INFO|Skipping details of duplicate 
event coverage for hash=ce7b9f83
2023-02-25T04:13:01.722Z|04019|bridge|INFO|bridge br-int: added interface 
vnet86 on port 204
2023-02-25T04:13:02.695Z|00259|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting for 
main to quiesce
2023-02-25T04:13:03.676Z|04020|timeval|WARN|Unreasonably long 1982ms poll 
interval (895ms user, 1084ms system)
2023-02-25T04:13:03.676Z|04021|timeval|WARN|faults: 1 minor, 0 major
2023-02-25T04:13:03.676Z|04022|timeval|WARN|context switches: 0 voluntary, 5 
involuntary
2023-02-25T04:13:04.676Z|00260|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting for 
main to quiesce
2023-02-25T04:13:05.658Z|04023|timeval|WARN|Unreasonably long 1982ms poll 
interval (922ms user, 1059ms system)
2023-02-25T04:13:05.658Z|04024|timeval|WARN|faults: 18 minor, 0 major
2023-02-25T04:13:05.658Z|04025|timeval|WARN|context switches: 2 voluntary, 5 
involuntary
2023-02-25T04:13:05.658Z|04026|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 
(<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (101% CPU usage)
2023-02-25T04:13:06.672Z|00261|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting for 
main to quiesce
2023-02-25T04:13:07.616Z|04027|timeval|WARN|Unreasonably long 1958ms poll 
interval (916ms user, 1025ms system)
2023-02-25T04:13:07.616Z|04028|timeval|WARN|context switches: 3 voluntary, 6 
involuntary
2023-02-25T04:13:07.616Z|04029|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 
(<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (101% CPU usage)
2023-02-25T04:13:08.616Z|00262|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting for 
main to quiesce
2023-02-25T04:13:09.536Z|04030|timeval|WARN|Unreasonably long 1920ms poll 
interval (919ms user, 999ms system)
2023-02-25T04:13:09.536Z|04031|timeval|WARN|faults: 2 minor, 0 major
2023-02-25T04:13:09.536Z|04032|timeval|WARN|context switches: 4 voluntary, 3 
involuntary
2023-02-25T04:13:09.536Z|04033|poll_loop|INFO|wakeup due to [POLLIN] on fd 14 
(NETLINK_ROUTE<->NETLINK_ROUTE) at lib/netlink-socket.c:1401 (101% CPU usage)
2023-02-25T04:13:09.537Z|04034|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 
(FIFO pipe:[52276]) at vswitchd/bridge.c:423 (101% CPU usage)
2023-02-25T04:13:09.537Z|04035|poll_loop|INFO|wakeup due to [POLLIN] on fd 529 
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (101% CPU usage)
2023-02-25T04:13:10.538Z|00263|ovs_rcu(urcu9)|WARN|blocked 1001 ms waiting for 
main to quiesce
2023-02-25T04:13:11.512Z|04036|timeval|WARN|Unreasonably long 1976ms poll 
interval (918ms user, 1023ms system)
2023-02-25T04:13:11.512Z|04037|timeval|WARN|faults: 3 minor, 0 major
2023-02-25T04:13:11.512Z|04038|timeval|WARN|context switches: 8 voluntary, 5 
involuntary
2023-02-25T04:13:11.512Z|04039|poll_loop|INFO|wakeup due to [POLLIN] on fd 529 
(/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (101% CPU usage)
2023-02-25T04:13:11.522Z|04040|poll_loop|INFO|wakeup due to 0-ms timeout at 
ofproto/ofproto-dpif.c:1976 (101% CPU usage)
2023-02-25T04:13:11.972Z|04041|poll_loop|INFO|Dropped 1 log messages in last 1 
seconds (most recently, 1 seconds ago) due to excessive rate
2023-02-25T04:13:11.972Z|04042|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 
(FIFO pipe:[52276]) at vswitchd/bridge.c:423 (101% CPU usage)
2023-02-25T04:13:16.472Z|00264|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting for 
main to quiesce
2023-02-25T04:13:17.472Z|00265|ovs_rcu(urcu9)|WARN|blocked 2000 ms waiting for 
main to quiesce
2023-02-25T04:13:18.405Z|04043|connmgr|INFO|br-int<->unix#1: 60 flow_mods in 
the 16 s starting 25 s ago (58 adds, 2 deletes)
2023-02-25T04:13:18.487Z|04044|timeval|WARN|Unreasonably long 3303ms poll 
interval (1569ms user, 1731ms system)
2023-02-25T04:13:18.487Z|04045|timeval|WARN|faults: 11 minor, 0 major
2023-02-25T04:13:18.487Z|04046|timeval|WARN|context switches: 1 voluntary, 8 
involuntary
2023-02-25T04:13:18.487Z|04047|poll_loop|INFO|Dropped 10 log messages in last 6 
seconds (most recently, 3 seconds ago) due to excessive rate
2023-02-25T04:13:18.487Z|04048|poll_loop|INFO|wakeup due to [POLLIN] on fd 14 
(NETLINK_ROUTE<->NETLINK_ROUTE) at lib/netlink-socket.c:1401 (57% CPU usage)
2023-02-25T04:13:18.518Z|04049|bridge|INFO|bridge br-int: added interface 
vnet87 on port 205
2023-02-25T04:13:19.487Z|00266|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting for 
main to quiesce
2023-02-25T04:13:20.475Z|04050|ofproto|WARN|br-int: cannot get STP status on 
nonexistent port 205
2023-02-25T04:13:20.475Z|04051|ofproto|WARN|br-int: cannot get RSTP status on 
nonexistent port 205
2023-02-25T04:13:20.547Z|04052|timeval|WARN|Unreasonably long 2060ms poll 
interval (910ms user, 1092ms system)
2023-02-25T04:13:20.548Z|04053|timeval|WARN|faults: 9 minor, 0 major
2023-02-25T04:13:20.548Z|04054|timeval|WARN|context switches: 8 voluntary, 4 
involuntary
2023-02-25T04:13:20.578Z|04055|bridge|INFO|bridge br-int: deleted interface 
vnet87 on port 205
2023-02-25T04:13:20.581Z|04056|bridge|WARN|could not open network device vnet87 
(No such device)
2023-02-25T04:13:21.550Z|00267|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting for 
main to quiesce
2023-02-25T04:13:22.509Z|04057|timeval|WARN|Unreasonably long 1961ms poll 
interval (908ms user, 1048ms system)
2023-02-25T04:13:22.509Z|04058|timeval|WARN|faults: 1 minor, 0 major
2023-02-25T04:13:22.509Z|04059|timeval|WARN|context switches: 6 voluntary, 7 
involuntary
2023-02-25T04:13:24.026Z|04060|poll_loop|INFO|Dropped 12 log messages in last 6 
seconds (most recently, 1 seconds ago) due to excessive rate



Any idea what's going on here? Should the ovs-vsctl timeout be longer?
Who is sending signal 14 to it, and why?
ovs-vsctl sends SIGALRM to itself after 5 seconds, because it is
started with --timeout=5.
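
For reference, that alarm is purely a client-side deadline; the same mechanism
can be reproduced by hand with a larger value (a sketch only, reusing br-int and
vnet87 from the log above; the 30-second figure is arbitrary, not a recommendation):

    # ovs-vsctl commits to the database, then waits up to 30 s for
    # ovs-vswitchd to apply the change; on expiry it raises SIGALRM (14)
    ovs-vsctl --timeout=30 -- --may-exist add-port br-int vnet87

Raising the timeout only hides the symptom, though; the real question is why
ovs-vswitchd takes more than 5 seconds to react.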

What does this ovs-vswitchd WARN mean:
2023-02-25T04:13:02.695Z|00259|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting for 
main to quiesce
That means that the main thread was busy doing some work for at least
one second without entering a quiescent state.  Basically, it means that
one iteration of the main loop took a lot of time.
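
If it helps to confirm which thread is the busy one, a quick, hedged check
(assuming top and pidof are available on the hypervisor) is a per-thread CPU view:

    # per-thread view of ovs-vswitchd; the thread still named 'ovs-vswitchd'
    # is the main thread, handler*/revalidator*/urcu* are the helper threads
    top -H -p $(pidof ovs-vswitchd)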

Looking at this:
    Unreasonably long 2060ms poll interval (910ms user, 1092ms system)

ovs-vswitchd is likely stuck somewhere inside a syscall in the
kernel for 1092ms.  That is a lot of time.
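
If you manage to catch it during one of those stalls, the kernel side can be
sampled directly; a sketch, assuming root access and a kernel that exposes
/proc/<pid>/stack:

    # kernel stack of the ovs-vswitchd main thread at this instant;
    # run it a few times in a row while the poll interval warnings fire
    cat /proc/$(pidof ovs-vswitchd)/stack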

It seems that the port is eventually added even though the ovs-vsctl command failed:
ovs-vsctl list-ports br-int | grep vnet87
vnet87
The database is fine, so the port was added to it without issues, but
ovs-vsctl waits for ovs-vswitchd to act on the database update.  That
didn't happen within 5 seconds, so ovs-vsctl reported an error.

'ovs-vsctl list-ports' is a pure database command; it only shows the
database content.
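
As an aside, the wait can be skipped with --no-wait, which makes ovs-vsctl
return as soon as the database commit succeeds; a sketch only, since you then
get no confirmation that ovs-vswitchd ever picked the port up:

    # database-only: do not wait for ovs-vswitchd to reconfigure
    ovs-vsctl --no-wait -- --may-exist add-port br-int vnet87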

It's also likely that ovs-vswitchd will actually add the port eventually,
once it gets out of being stuck in the kernel.  But by then it's too late
for ovs-vsctl.

You need to investigate what is going on with ovs-vswitchd, i.e.
what holds it inside the kernel for so long, or whether it makes way
too many syscalls for some reason.
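
A rough, hedged way to tell those two cases apart, assuming strace is installed
(attaching a tracer slows ovs-vswitchd down further, so keep it short):

    # count syscalls across all ovs-vswitchd threads; interrupt with
    # Ctrl-C after ~30 s to get the per-syscall time/count summary
    strace -c -f -p $(pidof ovs-vswitchd)

A huge count of a single syscall points at "too many syscalls"; a handful of
calls dominating the total time points at being stuck inside one of them.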

Not sure how many cores you have or what OVS version you're using,
but this might be one of the possible causes of being stuck in
syscalls for a long time on systems with a large number of cores and
a big number of ports:

https://urldefense.com/v3/__https://github.com/openvswitch/ovs/commit/c0e053f6d__;!!ACWV5N9M2RV99hQ!JinadxFbxISvtiYFYAlsGUE6tu3ydjDzfZMLdMyVVnE8TggGV20_M4vFCujhiZNfg72DM1ucWXQOxjfXmb4Z_w$

A lot:

echo "Threads/core: $(nproc --all)"
Threads/core: 384

OVS/OVN version:

[root@pcacn001 openvswitch]# ovs-vsctl -V
ovs-vsctl (Open vSwitch) 2.16.90
DB Schema 8.3.0
[root@pcacn001 openvswitch]# ovn-nbctl -V
ovn-nbctl 21.09.0
Open vSwitch Library 2.16.90
DB Schema 5.32.1

Thanks, I'll take a look.
Yeah, with that many cores and ports, you should definitely try that patch.
It was also backported to stable branches.  On current LTS branch it is
available since v2.17.3.

Yeah I just checked - thanks.


Best regards, Ilya Maximets.

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
