Hi Cristian,

On Tue, Dec 9, 2025 at 6:42 AM Cristian Contescu via discuss
<[email protected]> wrote:
>
> Hello everyone,
>
> We wanted to check with the community about a strange issue that we saw in 
> the following scenario.
>
> In order to scale out one of our environments, we decided to increase the 
> IPv4 provider network from a /22 to a /20 on OpenStack. After doing so, we 
> noticed that OVS started using 100% CPU (also visible in the logs):
>
>
> 2025-12-04T14:18:07Z|26750|poll_loop|INFO|wakeup due to [POLLIN] on fd 425 
> (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157 (101% CPU 
> usage)
>
>
>
> When the CPU spiked to 100% (the ovs-vswitchd main process; the handler and 
> revalidator threads were much less busy), we started seeing packet loss 
> regardless of the traffic type (IPv4 / IPv6).
>
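Just to double-check the thread breakdown you describe, a per-thread view of
ovs-vswitchd is usually enough; a rough check, assuming a standard Linux host:

# show per-thread CPU; the main ovs-vswitchd thread should sit near 100%
# while the handlerNN / revalidatorNN threads stay comparatively idle
# (pgrep -n picks the newest PID, i.e. the worker when running with --monitor)
top -H -p "$(pgrep -n ovs-vswitchd)"
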
> After that we reverted the change, and we saw that the same issue happens 
> when we gradually increase the number of virtual routers in another 
> environment (with less traffic) that also has a /20 provider network.
>
> Do you know of any recent fixes related to this, or has anyone experienced a 
> similar issue and can point us to some options to evaluate?
>
>
> Our setup:
>
> - dual-stack external network (VLAN type) with a /20 IPv4 subnet (increased 
> from /22) and a /64 IPv6 subnet
> - virtual routers are connected to the external network
> - dual-stack tenant networks are possible
> - for IPv4 we use distributed floating IPs and SNAT (+DNAT)
> - for IPv6, tenant networks are public and advertised via the ovn-bgp-agent 
> to the physical routers, with the next hop on the external network
> - our OpenStack setup is based on openstack-helm deployed on physical nodes
>
>
> Our findings so far are:
> - a correlation between the number of virtual routers and the increase in 
> CPU usage
> - a potential correlation with the provider network being a /20 instead of a 
> /22 (larger broadcast domain / more traffic)
>
Did you set broadcast-arps-to-all-routers=false on the provider
network's logical switch? For example:

ovn-nbctl --no-leader-only set logical_switch <ID-Logical-Switch> \
    other_config:broadcast-arps-to-all-routers=false

Looking at your scenario, I think this will probably decrease the load
spike in ovs-vswitchd.
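
If you are unsure which logical switch maps to the provider network, something
like this should help locate it and then confirm the setting took effect (a
rough sketch; the <ID-Logical-Switch> placeholder is the same as above):

# list logical switches with their UUIDs and names
ovn-nbctl --no-leader-only ls-list

# verify the option after setting it
ovn-nbctl --no-leader-only get logical_switch <ID-Logical-Switch> other_config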

> - a potential IPv6 RA / multicast flood when the issue happens, seen while 
> investigating tcpdumps
>
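On the IPv6 RA / multicast flood point: it might be worth quantifying it while
the issue is happening. A rough sketch, assuming the provider VLAN comes in on
bond0 (adjust the interface, and drop the 'vlan' keyword if frames arrive
untagged):

# count ICMPv6 router advertisements (type 134); ip6[40] is the ICMPv6 type
# byte when no extension headers are present, the normal case for RAs
tcpdump -i bond0 -nn -c 1000 'vlan and icmp6 and ip6[40] == 134'
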
> What you did that made the problem appear.
> In order to replicate this issue, we increased the provider network from /22 
> to /20 in OpenStack and on the physical routers connecting the external 
> network.
>
> Another way to replicate the issue is to simply increase the number of 
> virtual routers on an existing /20 provider network in a different OpenStack 
> environment.
>
> What you expected to happen.
> - OVS keeps the same load as before and doesn't reach 100% CPU usage, and 
> thus there is no packet loss
> - OVS is able to sustain the /20 provider network from OpenStack
>
> What actually happened.
> - As soon as the OVS main process reached 100% CPU usage (per the log 
> lines), we started detecting packet loss
>
> - Other errors detected include "|WARN|over 4096 resubmit actions on 
> bridge", for example:
>
> neutron openvswitch-zg86k openvswitch-vswitchd 
> 2025-12-08T14:33:28Z|00016|ofproto_dpif_xlate(handler35)|WARN|over 4096 
> resubmit actions on bridge br-int while processing 
> icmp6,in_port=1,dl_vlan=101,dl_vlan_pcp=0,vlan_tci1=0x0000,dl_src=00:00:5e:00:02:65,dl_dst=33:33:00:00:00:01,ipv6_src=fe80::200:5eff:fe00:265,ipv6_dst=ff02::1,ipv6_label=0x00000,nw_tos=224,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=134,icmp_code=0
>
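For the "over 4096 resubmit actions" warning, it can help to replay that flow
through the OpenFlow pipeline and see which tables keep resubmitting. A rough
sketch built from the flow printed in the log above (trimmed to the fields
that usually matter, so it may need adjusting):

ovs-appctl ofproto/trace br-int \
    'icmp6,in_port=1,dl_vlan=101,dl_src=00:00:5e:00:02:65,dl_dst=33:33:00:00:00:01,ipv6_src=fe80::200:5eff:fe00:265,ipv6_dst=ff02::1,icmp_type=134,icmp_code=0'
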
>
> Versions of various components:
> # ovs-vswitchd --version
> ovs-vswitchd (Open vSwitch) 3.3.4
>
> # ovn-controller --version
> ovn-controller 24.03.6
> Open vSwitch Library 3.3.4
> OpenFlow versions 0x6:0x6
> SB DB Schema 20.33.0
>
> No local patches
>
> Kernel:
> # cat /proc/version
> Linux version 6.8.0-52-generic (buildd@lcy02-amd64-046) 
> (x86_64-linux-gnu-gcc-13 (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0, GNU ld (GNU 
> Binutils for Ubuntu) 2.42) #53-Ubuntu SMP PREEMPT_DYNAMIC Sat Jan 11 00:06:25 
> UTC 2025
>
> # ovs-dpctl show
> system@ovs-system:
>   lookups: hit:25791010986 missed:954723789 lost:3319115
>   flows: 1185
>   masks: hit:96181163550 total:35 hit/pkt:3.60
>   cache: hit:19162151183 hit-rate:71.65%
>   caches:
>     masks-cache: size:256
>   port 0: ovs-system (internal)
>   port 1: br-ex (internal)
>   port 2: bond0
>   port 3: br-int (internal)
>   port 4: genev_sys_6081 (geneve: packet_type=ptap)
>   port 5: tap196a9595-b2
>   port 6: tap72f307a7-37
>   ..
>   ..
>
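The lost:3319115 counter above is the number of upcalls the datapath could not
deliver to userspace, which lines up with ovs-vswitchd being saturated and
with the packet loss you see. While the problem is active it may be worth
checking (both are standard ovs-appctl commands):

# per-revalidator flow counts, flow limit and dump duration
ovs-appctl upcall/show

# event counters, to see what vswitchd is spending its time on
ovs-appctl coverage/show
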
> The only workaround seems to be decreasing the number of virtual routers, 
> which is not sustainable in an environment that is in use.
>
> We checked the flows, and it seems they grow from ~70k up to ~120-130k when 
> the issue happens:
> # ovs-ofctl dump-flows br-int | wc -l
> 78413
>
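To see where the extra ~50k flows land when the issue happens, a per-table
breakdown is usually more telling than the total count; a rough sketch:

# count OpenFlow rules per table on br-int; the tables that grow the most map
# to the OVN logical pipeline stages generating the extra work
ovs-ofctl dump-flows br-int | grep -o 'table=[0-9]*' | sort | uniq -c | sort -rn | head
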
>
> Thank you for your help,
>
> Cristi
>
Regards,

Tiago Pires

_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
