On 3/1/23 12:42, Brendan Doyle via discuss wrote:
> Hi Folks,
> 
> Whilst launching a lot (150+) of VMs on a hypervisor we occasionally
> see errors like
> "libvirt: error : internal error: Unable to add port vnet114 to OVS bridge 
> br-int"
> 
> in /var/log/messages, for example:
> 
> 
> Feb 25 04:13:15 pcacn001 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl
> --timeout=5 -- --may-exist add-port br-int vnet87 -- set Interface vnet87
> "external-ids:attached-mac=\"00:13:97:42:ff:5d\"" -- set Interface vnet87
> "external-ids:iface-id=\"b652aef0-b4c2-11ed-9fba-a8698c05eb74\"" -- set
> Interface vnet87
> "external-ids:vm-id=\"ba70d0b3-9b47-48e8-9ed4-a8bf04bae901\"" -- set
> Interface vnet87 external-ids:iface-status=active
> 
> Feb 25 04:13:20 pcacn001 journal: internal error: Child process (ovs-vsctl
> --timeout=5 -- --may-exist add-port br-int vnet87 -- set Interface vnet87
> 'external-ids:attached-mac="00:13:97:42:ff:5d"' -- set Interface vnet87
> 'external-ids:iface-id="b652aef0-b4c2-11ed-9fba-a8698c05eb74"' -- set
> Interface vnet87 'external-ids:vm-id="ba70d0b3-9b47-48e8-9ed4-a8bf04bae901"'
> -- set Interface vnet87 external-ids:iface-status=active) unexpected fatal
> signal 14: 2023-02-25T04:13:20Z|00002|fatal_signal|WARN|terminating with
> signal 14 (Alarm clock)
> Feb 25 04:13:20 pcacn001 journal: internal error: Unable to add port vnet87
> to OVS bridge br-int
> 
> 
> There is nothing suspicious in ovn-controller log, but the ovsdb-server and
> ovs-vswitchd logs for this time period do contain WARNs
> 
> ovsdb-server.log
> ================
> 2023-02-25T04:13:08.896Z|108756|jsonrpc|WARN|unix#84848: receive error: 
> Connection reset by peer
> 2023-02-25T04:13:08.896Z|108757|reconnect|WARN|unix#84848: connection dropped 
> (Connection reset by peer)
> 2023-02-25T04:13:08.910Z|108758|reconnect|WARN|unix#84850: connection dropped 
> (Connection reset by peer)
> 2023-02-25T04:13:08.918Z|108759|reconnect|WARN|unix#84851: connection dropped 
> (Connection reset by peer)
> 2023-02-25T04:13:08.925Z|108760|reconnect|WARN|unix#84852: connection dropped 
> (Connection reset by peer)
> 2023-02-25T04:13:08.932Z|108761|reconnect|WARN|unix#84853: connection dropped 
> (Connection reset by peer)
> 2023-02-25T04:13:08.938Z|108762|reconnect|WARN|unix#84854: connection dropped 
> (Connection reset by peer)
> 2023-02-25T04:13:09.238Z|108763|reconnect|WARN|unix#84855: connection dropped 
> (Connection reset by peer)
> 2023-02-25T04:13:09.277Z|108764|reconnect|WARN|unix#84856: connection dropped 
> (Connection reset by peer)
> 2023-02-25T04:13:09.284Z|108765|reconnect|WARN|unix#84857: connection dropped 
> (Connection reset by peer)
> 2023-02-25T04:13:09.300Z|108766|reconnect|WARN|unix#84859: connection dropped 
> (Connection reset by peer)
> 2023-02-25T04:13:09.308Z|108767|reconnect|WARN|unix#84860: connection dropped 
> (Connection reset by peer)
> 2023-02-25T04:13:09.323Z|108768|reconnect|WARN|unix#84862: connection dropped 
> (Connection reset by peer)
> 2023-02-25T04:14:46.220Z|108769|jsonrpc|WARN|Dropped 11 log messages in last 
> 97 seconds (most recently, 97 seconds ago) due to excessive rate
> 2023-02-25T04:14:46.220Z|108770|jsonrpc|WARN|unix#84870: receive error: 
> Connection reset by peer
> 
> ovs-vswitchd.log
> =================
> 2023-02-25T04:13:00.849Z|00258|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
> for main to quiesce
> 2023-02-25T04:13:01.694Z|04014|timeval|WARN|Unreasonably long 1910ms poll 
> interval (902ms user, 1007ms system)
> 2023-02-25T04:13:01.694Z|04015|timeval|WARN|faults: 21 minor, 0 major
> 2023-02-25T04:13:01.694Z|04016|timeval|WARN|context switches: 0 voluntary, 2 
> involuntary
> 2023-02-25T04:13:01.694Z|04017|coverage|INFO|Dropped 12 log messages in last 
> 65 seconds (most recently, 40 seconds ago) due to excessive rate
> 2023-02-25T04:13:01.694Z|04018|coverage|INFO|Skipping details of duplicate 
> event coverage for hash=ce7b9f83
> 2023-02-25T04:13:01.722Z|04019|bridge|INFO|bridge br-int: added interface 
> vnet86 on port 204
> 2023-02-25T04:13:02.695Z|00259|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
> for main to quiesce
> 2023-02-25T04:13:03.676Z|04020|timeval|WARN|Unreasonably long 1982ms poll 
> interval (895ms user, 1084ms system)
> 2023-02-25T04:13:03.676Z|04021|timeval|WARN|faults: 1 minor, 0 major
> 2023-02-25T04:13:03.676Z|04022|timeval|WARN|context switches: 0 voluntary, 5 
> involuntary
> 2023-02-25T04:13:04.676Z|00260|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
> for main to quiesce
> 2023-02-25T04:13:05.658Z|04023|timeval|WARN|Unreasonably long 1982ms poll 
> interval (922ms user, 1059ms system)
> 2023-02-25T04:13:05.658Z|04024|timeval|WARN|faults: 18 minor, 0 major
> 2023-02-25T04:13:05.658Z|04025|timeval|WARN|context switches: 2 voluntary, 5 
> involuntary
> 2023-02-25T04:13:05.658Z|04026|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 
> (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (101% CPU usage)
> 2023-02-25T04:13:06.672Z|00261|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
> for main to quiesce
> 2023-02-25T04:13:07.616Z|04027|timeval|WARN|Unreasonably long 1958ms poll 
> interval (916ms user, 1025ms system)
> 2023-02-25T04:13:07.616Z|04028|timeval|WARN|context switches: 3 voluntary, 6 
> involuntary
> 2023-02-25T04:13:07.616Z|04029|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 
> (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (101% CPU usage)
> 2023-02-25T04:13:08.616Z|00262|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
> for main to quiesce
> 2023-02-25T04:13:09.536Z|04030|timeval|WARN|Unreasonably long 1920ms poll 
> interval (919ms user, 999ms system)
> 2023-02-25T04:13:09.536Z|04031|timeval|WARN|faults: 2 minor, 0 major
> 2023-02-25T04:13:09.536Z|04032|timeval|WARN|context switches: 4 voluntary, 3 
> involuntary
> 2023-02-25T04:13:09.536Z|04033|poll_loop|INFO|wakeup due to [POLLIN] on fd 14 
> (NETLINK_ROUTE<->NETLINK_ROUTE) at lib/netlink-socket.c:1401 (101% CPU usage)
> 2023-02-25T04:13:09.537Z|04034|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 
> (FIFO pipe:[52276]) at vswitchd/bridge.c:423 (101% CPU usage)
> 2023-02-25T04:13:09.537Z|04035|poll_loop|INFO|wakeup due to [POLLIN] on fd 
> 529 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (101% CPU 
> usage)
> 2023-02-25T04:13:10.538Z|00263|ovs_rcu(urcu9)|WARN|blocked 1001 ms waiting 
> for main to quiesce
> 2023-02-25T04:13:11.512Z|04036|timeval|WARN|Unreasonably long 1976ms poll 
> interval (918ms user, 1023ms system)
> 2023-02-25T04:13:11.512Z|04037|timeval|WARN|faults: 3 minor, 0 major
> 2023-02-25T04:13:11.512Z|04038|timeval|WARN|context switches: 8 voluntary, 5 
> involuntary
> 2023-02-25T04:13:11.512Z|04039|poll_loop|INFO|wakeup due to [POLLIN] on fd 
> 529 (/var/run/openvswitch/br-int.mgmt<->) at lib/stream-fd.c:157 (101% CPU 
> usage)
> 2023-02-25T04:13:11.522Z|04040|poll_loop|INFO|wakeup due to 0-ms timeout at 
> ofproto/ofproto-dpif.c:1976 (101% CPU usage)
> 2023-02-25T04:13:11.972Z|04041|poll_loop|INFO|Dropped 1 log messages in last 
> 1 seconds (most recently, 1 seconds ago) due to excessive rate
> 2023-02-25T04:13:11.972Z|04042|poll_loop|INFO|wakeup due to [POLLIN] on fd 22 
> (FIFO pipe:[52276]) at vswitchd/bridge.c:423 (101% CPU usage)
> 2023-02-25T04:13:16.472Z|00264|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
> for main to quiesce
> 2023-02-25T04:13:17.472Z|00265|ovs_rcu(urcu9)|WARN|blocked 2000 ms waiting 
> for main to quiesce
> 2023-02-25T04:13:18.405Z|04043|connmgr|INFO|br-int<->unix#1: 60 flow_mods in 
> the 16 s starting 25 s ago (58 adds, 2 deletes)
> 2023-02-25T04:13:18.487Z|04044|timeval|WARN|Unreasonably long 3303ms poll 
> interval (1569ms user, 1731ms system)
> 2023-02-25T04:13:18.487Z|04045|timeval|WARN|faults: 11 minor, 0 major
> 2023-02-25T04:13:18.487Z|04046|timeval|WARN|context switches: 1 voluntary, 8 
> involuntary
> 2023-02-25T04:13:18.487Z|04047|poll_loop|INFO|Dropped 10 log messages in last 
> 6 seconds (most recently, 3 seconds ago) due to excessive rate
> 2023-02-25T04:13:18.487Z|04048|poll_loop|INFO|wakeup due to [POLLIN] on fd 14 
> (NETLINK_ROUTE<->NETLINK_ROUTE) at lib/netlink-socket.c:1401 (57% CPU usage)
> 2023-02-25T04:13:18.518Z|04049|bridge|INFO|bridge br-int: added interface 
> vnet87 on port 205
> 2023-02-25T04:13:19.487Z|00266|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
> for main to quiesce
> 2023-02-25T04:13:20.475Z|04050|ofproto|WARN|br-int: cannot get STP status on 
> nonexistent port 205
> 2023-02-25T04:13:20.475Z|04051|ofproto|WARN|br-int: cannot get RSTP status on 
> nonexistent port 205
> 2023-02-25T04:13:20.547Z|04052|timeval|WARN|Unreasonably long 2060ms poll 
> interval (910ms user, 1092ms system)
> 2023-02-25T04:13:20.548Z|04053|timeval|WARN|faults: 9 minor, 0 major
> 2023-02-25T04:13:20.548Z|04054|timeval|WARN|context switches: 8 voluntary, 4 
> involuntary
> 2023-02-25T04:13:20.578Z|04055|bridge|INFO|bridge br-int: deleted interface 
> vnet87 on port 205
> 2023-02-25T04:13:20.581Z|04056|bridge|WARN|could not open network device 
> vnet87 (No such device)
> 2023-02-25T04:13:21.550Z|00267|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
> for main to quiesce
> 2023-02-25T04:13:22.509Z|04057|timeval|WARN|Unreasonably long 1961ms poll 
> interval (908ms user, 1048ms system)
> 2023-02-25T04:13:22.509Z|04058|timeval|WARN|faults: 1 minor, 0 major
> 2023-02-25T04:13:22.509Z|04059|timeval|WARN|context switches: 6 voluntary, 7 
> involuntary
> 2023-02-25T04:13:24.026Z|04060|poll_loop|INFO|Dropped 12 log messages in last 
> 6 seconds (most recently, 1 seconds ago) due to excessive rate
> 
> 
> 
> Any idea what's going on here? Should the ovs-vsctl timeout be longer?
> Who is sending the SIG 14 to it and why?

ovs-vsctl sends SIGALRM to itself after 5 seconds, because it is
started with --timeout=5.
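
Signal 14 on Linux is SIGALRM; if in doubt, you can double-check the
mapping on the host, e.g. with a standard shell:

  # print the name of signal 14; should say ALRM
  kill -l 14

So nothing external is killing ovs-vsctl; its own alarm fires when
the 5-second timeout expires.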

> 
> What does this ovs-vswitchd WARN mean:
> 2023-02-25T04:13:02.695Z|00259|ovs_rcu(urcu9)|WARN|blocked 1000 ms waiting 
> for main to quiesce

That means that the main thread is busy doing some work for at least
one second without entering a quiescent state.  Basically, it means
that one iteration of the main loop took a lot of time.
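
A quick way to see that from the outside (just a sketch, assuming
pidof returns the single ovs-vswitchd pid) is to watch per-thread CPU
usage while reproducing the problem:

  # show per-thread CPU usage of ovs-vswitchd; the main thread
  # ('ovs-vswitchd') should be pegged near 100%
  top -H -p $(pidof ovs-vswitchd)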

Looking at this:
  Unreasonably long 2060ms poll interval (910ms user, 1092ms system)

ovs-vswitchd is likely stuck somewhere inside a syscall in the
kernel for 1092ms.  That is a lot of time.
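
To get an idea which syscalls are eating that time, a short
syscall-count sample during the VM-launch storm may help (only a
sketch; strace adds noticeable overhead, so keep the sample short):

  # attach to ovs-vswitchd and its threads, count syscalls and the
  # time spent in them; interrupt with Ctrl-C to print the summary
  strace -c -f -p "$(pidof ovs-vswitchd)"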

> 
> It seems that eventually the port is added even though the ovs-vsctl cmd failed:
> ovs-vsctl list-ports br-int | grep vnet87
> vnet87

The database is fine, so the port was added to it without issues, but
ovs-vsctl also waits for ovs-vswitchd to act on the database update.
That didn't happen within 5 seconds, so ovs-vsctl reported an error.

'ovs-vsctl list-ports' is a pure database command; it only shows the
database content.
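
To see whether ovs-vswitchd itself has caught up, you can compare the
database view with what the switch actually has (illustrative,
re-using vnet87 from your logs):

  # database view, i.e. what ovs-vsctl add-port wrote:
  ovs-vsctl list-ports br-int | grep vnet87
  # ovs-vswitchd view, i.e. what is actually attached to the bridge:
  ovs-ofctl show br-int | grep vnet87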

It's also likely that ovs-vswitchd will actually add the port
eventually, once it gets out of being stuck in the kernel.  But by
then it's too late for ovs-vsctl.
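
For reference, that waiting is ovs-vsctl's default behavior; with
--no-wait it returns as soon as the database commit succeeds and does
not wait for ovs-vswitchd at all.  Illustrative only, since here the
command line is built by libvirt, not by you:

  # commit to the database and return immediately
  ovs-vsctl --no-wait -- --may-exist add-port br-int vnet87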

You need to investigate what is going on with ovs-vswitchd, i.e.
what holds it inside the kernel for so long, or whether it makes way
too many syscalls for some reason.
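
If syscall counts are not conclusive, a short profile with kernel
stacks should show where the time goes (a sketch, assuming perf is
installed and matches your kernel):

  # profile ovs-vswitchd for 15 seconds with call graphs
  perf record -g -p $(pidof ovs-vswitchd) -- sleep 15
  perf report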

Not sure how many cores you have or what OVS version you're using,
but this might be one of the possible causes of being stuck in
syscalls for a long time on systems with a large number of cores and
a big number of ports:
  https://github.com/openvswitch/ovs/commit/c0e053f6d
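
To check whether that commit is relevant for your setup, the version,
core count and port count are easy to grab (illustrative):

  ovs-vswitchd --version               # running OVS version
  nproc                                # number of CPU cores
  ovs-vsctl list-ports br-int | wc -l  # number of ports on br-int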

Best regards, Ilya Maximets.
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
