Hi Tiago,

Many thanks again for the prompt reply. We would have to check whether any
use case might be affected by the missing GARPs. If that turns out to be the
case (thinking of scenarios like allowed_address_pairs and keepalived running
on VMs on external/provider networks), would you be aware of a possible
workaround?
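
For reference, one way to spot the ports that define allowed_address_pairs on
the provider network might be something like the following rough sketch
(standard OpenStack CLI; <provider-net> is a placeholder for the external
network name):

# print the ports on the external network that define allowed address pairs
for p in $(openstack port list --network <provider-net> -f value -c ID); do
  pairs=$(openstack port show "$p" -c allowed_address_pairs -f value)
  # an empty value may render as "" or "[]" depending on the client version
  if [ -n "$pairs" ] && [ "$pairs" != "[]" ]; then
    echo "$p: $pairs"
  fi
done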

Best regards,
Cristi

On Tue 9 Dec 2025, 17:23 Tiago Pires, <[email protected]> wrote:

> Hi,
>
> On Tue, 9 Dec 2025 at 11:08 Cristian Contescu <[email protected]> wrote:
>
>> Hi Tiago,
>>
>> Thank you for your quick reply. We have actually just tested this in one
>> of our clusters and it seems that the load issue is gone.
>> Are you aware of any possible drawbacks of setting this?
>>
>
> Indeed, if you rely on GARPs from an upstream
> router for some kind of failover mechanism, this would no longer work.
>
>
> Tiago Pires
>
>>
>> Thank you again,
>> Cristi
>>
>> On Tue, Dec 9, 2025 at 1:07 PM Tiago Pires <[email protected]>
>> wrote:
>>
>>> Hi Cristian,
>>>
>>> On Tue, Dec 9, 2025 at 6:42 AM Cristian Contescu via discuss
>>> <[email protected]> wrote:
>>> >
>>> > Hello everyone,
>>> >
>>> > We wanted to check with the community about a strange issue that we saw
>>> happening in the following scenario.
>>> >
>>> > In order to scale out one of our environments, we decided to increase
>>> the IPv4 provider network from a /22 to a /20 in OpenStack. After doing so,
>>> > we noticed that OVS started using 100% CPU (as also seen in the logs):
>>> >
>>> >
>>> > 2025-12-04T14:18:07Z|26750|poll_loop|INFO|wakeup due to [POLLIN] on fd
>>> 425 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157 (101%
>>> CPU usage)
>>> >
>>> >
>>> >
>>> > When the CPU spiked to 100% (the ovs-vswitchd main process; the handler
>>> and revalidator threads were not as busy), we started seeing packet loss
>>> regardless of the traffic type (IPv4 / IPv6).
>>> >
>>> > After that, we reverted the change and saw that the same issue happens
>>> when we gradually increase the number of virtual routers in another
>>> environment (with less traffic) that also has a /20 provider network.
>>> >
>>> > Do you know of any recent fixes related to the following, or has
>>> anyone experienced a similar issue and can point us to some options to
>>> evaluate?
>>> >
>>> >
>>> > Our setup:
>>> >
>>> > - dual-stack external network (VLAN type) with a /20 IPv4 subnet
>>> (increased from /22 before) and a /64 IPv6 subnet
>>> > - virtual routers are connected to the external network
>>> > - dual-stack tenant networks are possible
>>> > - for IPv4 we use distributed floating IPs and SNAT (+DNAT)
>>> > - for IPv6, tenant networks are public and advertised via the
>>> ovn-bgp-agent to the physical routers, with the next hop being on the
>>> external network
>>> > - our OpenStack setup is based on openstack-helm deployed on physical
>>> nodes
>>> >
>>> >
>>> > So, as of now, our findings are:
>>> > - Correlation between number of virtual routers and CPU usage increase
>>> > - Potential correlation between provider network being /20 instead of
>>> /22 (increase in broadcast domain / traffic)
>>> >
>>> Did you set broadcast-arps-to-all-routers=false on the provider
>>> network's logical switch?
>>> Ex: ovn-nbctl --no-leader-only set logical_switch <ID-Logical-Switch>
>>> other_config:broadcast-arps-to-all-routers=false
>>>
>>> Looking at your scenario, I think it will probably reduce this load
>>> spike in ovs-vswitchd.
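>>>
>>> If you do set it, a quick way to find the right logical switch and to
>>> confirm the option took effect could be something like this (Neutron names
>>> its logical switches neutron-<network-uuid>; the UUID below is a
>>> placeholder):
>>>
>>> # find the logical switch backing the provider network
>>> ovn-nbctl --no-leader-only ls-list | grep neutron-<network-uuid>
>>> # check that the option is now present on it
>>> ovn-nbctl --no-leader-only get logical_switch <ID-Logical-Switch> other_config:broadcast-arps-to-all-routers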
>>>
>>> > - Potential IPv6 RA / multicast flood when the issue happens, observed
>>> while investigating tcpdumps
>>> >
>>> > What you did that made the problem appear.
>>> > In order to replicate this issue, we increased the provider network
>>> from /22 to /20 in OpenStack and on the physical routers connecting the
>>> external network.
>>> >
>>> > Another way to replicate the issue is to simply increase the number of
>>> virtual routers on an existing /20 provider network in a different
>>> OpenStack environment.
>>> >
>>> > What you expected to happen.
>>> > - OVS has the same load as before and does not reach 100% CPU usage,
>>> thus no packet loss
>>> > - OVS is able to sustain the /20 provider network from OpenStack
>>> >
>>> > What actually happened.
>>> > - As soon as the OVS main process reached 100% CPU usage (as reported in
>>> the log lines above), we started detecting packet loss
>>> >
>>> > - Other warnings detected include "|WARN|over 4096 resubmit actions on
>>> bridge":
>>> >
>>> > neutron openvswitch-zg86k openvswitch-vswitchd
>>> 2025-12-08T14:33:28Z|00016|ofproto_dpif_xlate(handler35)|WARN|over 4096
>>> resubmit actions on bridge br-int while processing
>>> icmp6,in_port=1,dl_vlan=101,dl_vlan_pcp=0,vlan_tci1=0x0000,dl_src=00:00:5e:00:02:65,dl_dst=33:33:00:00:00:01,ipv6_src=fe80::200:5eff:fe00:265,ipv6_dst=ff02::1,ipv6_label=0x00000,nw_tos=224,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=134,icmp_code=0
>>> >
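>>> > For what it is worth, tracing one of these packets through br-int should
>>> > show where the resubmits come from; a rough sketch that reuses the fields
>>> > from the warning above (other fields left at their defaults):
>>> >
>>> > ovs-appctl ofproto/trace br-int 'icmp6,in_port=1,dl_vlan=101,dl_src=00:00:5e:00:02:65,dl_dst=33:33:00:00:00:01,ipv6_src=fe80::200:5eff:fe00:265,ipv6_dst=ff02::1,icmp_type=134,icmp_code=0'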
>>> >
>>> > Versions of various components:
>>> >  ovs-vswitchd --version
>>> > ovs-vswitchd (Open vSwitch) 3.3.4
>>> >
>>> >  ovn-controller --version
>>> > ovn-controller 24.03.6
>>> > Open vSwitch Library 3.3.4
>>> > OpenFlow versions 0x6:0x6
>>> > SB DB Schema 20.33.0
>>> >
>>> > No local patches
>>> >
>>> > Kernel:
>>> > # cat /proc/version
>>> > Linux version 6.8.0-52-generic (buildd@lcy02-amd64-046)
>>> (x86_64-linux-gnu-gcc-13 (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0, GNU ld (GNU
>>> Binutils for Ubuntu) 2.42) #53-Ubuntu SMP PREEMPT_DYNAMIC Sat Jan 11
>>> 00:06:25 UTC 2025
>>> >
>>> > # ovs-dpctl show
>>> > system@ovs-system:
>>> >   lookups: hit:25791010986 missed:954723789 lost:3319115
>>> >   flows: 1185
>>> >   masks: hit:96181163550 total:35 hit/pkt:3.60
>>> >   cache: hit:19162151183 hit-rate:71.65%
>>> >   caches:
>>> >     masks-cache: size:256
>>> >   port 0: ovs-system (internal)
>>> >   port 1: br-ex (internal)
>>> >   port 2: bond0
>>> >   port 3: br-int (internal)
>>> >   port 4: genev_sys_6081 (geneve: packet_type=ptap)
>>> >   port 5: tap196a9595-b2
>>> >   port 6: tap72f307a7-37
>>> >   ..
>>> >   ..
>>> >
>>> > The only workaround seems to be decreasing the number of virtual routers,
>>> which is not sustainable in an environment that is in use.
>>> >
>>> > We checked the flows and it seems they grow from ~70k up to ~120-130k
>>> when the issue happens:
>>> > # ovs-ofctl dump-flows br-int | wc -l
>>> > 78413
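>>> >
>>> > A per-table breakdown might help narrow down which tables account for
>>> > the growth; a rough sketch:
>>> >
>>> > # count flows per OpenFlow table, busiest tables first
>>> > ovs-ofctl dump-flows br-int | grep -o 'table=[0-9]*' | sort | uniq -c | sort -rn | head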
>>> >
>>> >
>>> > Thank you for your help,
>>> >
>>> > Cristi
>>> >
>>> Regards,
>>>
>>> Tiago Pires
>>> >
>>>
>
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
