Hi!
I work on the FRRouting project (https://frrouting.org) and have noticed
that when I have a full BGP feed on a system that is also running
ovs-vswitchd, ovs-vswitchd sits at 100% CPU:
top - 09:43:12 up 4 days, 22:53, 3 users, load average: 1.06, 1.08, 1.08
Tasks: 188 total, 3 running, 185 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.3 us, 14.7 sy, 0.0 ni, 72.8 id, 0.0 wa, 0.0 hi, 0.2 si,
0.0 st
MiB Mem : 7859.3 total, 2756.5 free, 2467.2 used, 2635.6 buff/cache
MiB Swap: 2048.0 total, 2048.0 free, 0.0 used. 5101.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
730 root 10 -10 146204 146048 11636 R 98.3 1.8 6998:13
ovs-vswitchd
169620 root 20 0 0 0 0 I 3.3 0.0 1:34.83
kworker/0:3-events
21 root 20 0 0 0 0 S 1.3 0.0 14:09.59
ksoftirqd/1
131734 frr 15 -5 2384292 609556 6612 S 1.0 7.6 21:57.51
zebra
131739 frr 15 -5 1301168 1.0g 7420 S 1.0 13.3 18:16.17
bgpd
When I turn off FRR (or turn off the BGP feed), ovs-vswitchd stops running
at 100%:
top - 09:48:12 up 4 days, 22:58, 3 users, load average: 0.08, 0.60, 0.89
Tasks: 169 total, 1 running, 168 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.2 us, 0.4 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.1 si,
0.0 st
MiB Mem : 7859.3 total, 4560.6 free, 663.1 used, 2635.6 buff/cache
MiB Swap: 2048.0 total, 2048.0 free, 0.0 used. 6906.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
179064 sharpd 20 0 11852 3816 3172 R 1.0 0.0 0:00.09
top
1037 zerotie+ 20 0 291852 113180 7408 S 0.7 1.4 19:09.17
zerotier-one
1043 Debian-+ 20 0 34356 21988 7588 S 0.3 0.3 22:04.42
snmpd
178480 root 20 0 0 0 0 I 0.3 0.0 0:01.21
kworker/1:2-events
178622 sharpd 20 0 14020 6364 4872 S 0.3 0.1 0:00.10
sshd
1 root 20 0 169872 13140 8272 S 0.0 0.2 2:33.26
systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.60
kthreadd
I do not have any particular OVS configuration on this box:
sharpd@janelle:~$ sudo ovs-vsctl show
c72d327c-61eb-4877-b4e7-dcf7e07e24fc
ovs_version: "2.13.8"
sharpd@janelle:~$ sudo ovs-vsctl list o .
_uuid : c72d327c-61eb-4877-b4e7-dcf7e07e24fc
bridges : []
cur_cfg : 0
datapath_types : [netdev, system]
datapaths : {}
db_version : "8.2.0"
dpdk_initialized : false
dpdk_version : none
external_ids : {hostname=janelle, rundir="/var/run/openvswitch",
system-id="a1031fcf-8acc-40a9-9fd6-521716b0faaa"}
iface_types : [erspan, geneve, gre, internal, ip6erspan, ip6gre,
lisp, patch, stt, system, tap, vxlan]
manager_options : []
next_cfg : 0
other_config : {}
ovs_version : "2.13.8"
ssl : []
statistics : {}
system_type : ubuntu
system_version : "20.04"
sharpd@janelle:~$ sudo ovs-appctl dpctl/dump-flows -m
ovs-vswitchd: no datapaths exist
ovs-vswitchd: datapath not found (Invalid argument)
ovs-appctl: ovs-vswitchd: server returned an error
Eli Britstein suggested I update Open vSwitch to the latest version, which
I did, and I saw the same behavior. When I pulled up the running code in a
debugger I saw that ovs-vswitchd is spending pretty much 100% of its time
in the loop below:
(gdb) f 4
#4 0x0000559498b4e476 in route_table_run () at lib/route-table.c:133
133 nln_run(nln);
(gdb) l
128 OVS_EXCLUDED(route_table_mutex)
129 {
130 ovs_mutex_lock(&route_table_mutex);
131 if (nln) {
132 rtnetlink_run();
133 nln_run(nln);
134
135 if (!route_table_valid) {
136 route_table_reset();
137 }
(gdb) l
138 }
139 ovs_mutex_unlock(&route_table_mutex);
140 }
I pulled up where route_table_valid is set:
298 static void
299 route_table_change(const struct route_table_msg *change
OVS_UNUSED,
300 void *aux OVS_UNUSED)
301 {
302 route_table_valid = false;
303 }
If I am reading the code correctly, every RTM_NEWROUTE netlink message
that ovs-vswitchd receives sets the route_table_valid global variable to
false, causing route_table_reset() to be run.
This makes sense in the context of what FRR is doing: a full BGP feed
*always* has churn. So ovs-vswitchd receives an RTM_NEWROUTE message,
parses it, decides in route_table_change() that the route table is no
longer valid, and calls route_table_reset(), which re-dumps the entire
routing table to ovs-vswitchd. In this case there are ~115k IPv6 routes in
the Linux FIB.
I hesitate to make any changes here since I really don't understand what
the end goal is: ovs-vswitchd receives a single route change from the
kernel but responds by re-dumping the entire routing table. What should
the correct behavior be from ovs-vswitchd's perspective here?
As a note, I recompiled with line 302 above set to true, and CPU usage of
ovs-vswitchd stays at essentially 0% once the initial table read has been
done.
thanks!
donald
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss