Hi Cristian,

On Tue, Dec 9, 2025 at 6:42 AM Cristian Contescu via discuss
<[email protected]> wrote:
>
> Hello everyone,
>
> We wanted to check with the community a strange issue which we saw
> happening in the following scenario.
>
> In order to scale out one of our environments, we decided to increase the
> IPv4 provider network from a /22 to a /20 on OpenStack. After doing so, we
> noticed that OVS started using 100% of the CPU (also visible in the logs):
>
> 2025-12-04T14:18:07Z|26750|poll_loop|INFO|wakeup due to [POLLIN] on fd 425
> (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157 (101% CPU
> usage)
>
> When the CPU spiked to 100% (the ovs-vswitchd main process, while the
> handler and revalidator threads were not as busy), we started having
> packet loss regardless of traffic type (IPv4 / IPv6).
>
> After that we reverted the change, and saw that the same issue happens
> when we gradually increase the number of virtual routers on another
> environment (with less traffic) with another /20 provider network.
>
> Do you know of any recent fixes related to this, or has anyone experienced
> a similar issue and can point us to some options to evaluate?
>
> Our setup:
>
> - dual-stack external network (VLAN type) with a /20 (increased from /22)
>   IPv4 subnet and a /64 IPv6 subnet
> - virtual routers are connected to the external network; dual-stack
>   tenant networks are possible
> - for IPv4 we use distributed floating IPs and SNAT(+DNAT)
> - for IPv6, tenant networks are public and advertised via the
>   ovn-bgp-agent to the physical routers, with the next hop being on the
>   external network
> - our OpenStack setup is based on openstack-helm deployed on physical
>   nodes
>
> So as of now our current findings are:
> - Correlation between the number of virtual routers and the CPU usage
>   increase
> - Potential correlation with the provider network being a /20 instead of
>   a /22 (increase in broadcast domain / traffic)

Did you set broadcast-arps-to-all-routers=false in the provider network's
logical switch? Ex:

ovn-nbctl --no-leader-only set logical_switch <ID-Logical-Switch> other_config:broadcast-arps-to-all-routers=false
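If you want to double-check how it is currently set, you can read the option
back the same way (same <ID-Logical-Switch> placeholder; --if-exists avoids
an error when the key was never set, in which case the default, true as far
as I know, applies):

ovn-nbctl --no-leader-only --if-exists get logical_switch <ID-Logical-Switch> other_config:broadcast-arps-to-all-routers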
Checking your scenario, I think it will probably decrease this load spike in
ovs-vswitchd.

> - Potential RA IPv6 / multicast flood when the issue happens, found by
>   investigating tcpdumps
>
> What you did that made the problem appear:
> In order to replicate this issue, we increased the provider network from
> /22 to /20 in OpenStack and on the physical routers connecting the
> external network.
>
> Another way to replicate the issue is to just increase the number of
> virtual routers on an existing /20 provider network on a different
> OpenStack environment.
>
> What you expected to happen:
> - OVS has the same load as before and doesn't reach 100% CPU usage, thus
>   no packet loss
> - OVS is able to sustain the /20 provider network from OpenStack
>
> What actually happened:
> - As soon as the OVS main process reached 100% CPU usage per the log
>   lines, we started detecting packet loss
> - Other errors detected are "|WARN|over 4096 resubmit actions on bridge":
>
> neutron openvswitch-zg86k openvswitch-vswitchd
> 2025-12-08T14:33:28Z|00016|ofproto_dpif_xlate(handler35)|WARN|over 4096
> resubmit actions on bridge br-int while processing
> icmp6,in_port=1,dl_vlan=101,dl_vlan_pcp=0,vlan_tci1=0x0000,dl_src=00:00:5e:00:02:65,dl_dst=33:33:00:00:00:01,ipv6_src=fe80::200:5eff:fe00:265,ipv6_dst=ff02::1,ipv6_label=0x00000,nw_tos=224,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=134,icmp_code=0
>
> Versions of various components:
>
> ovs-vswitchd --version
> ovs-vswitchd (Open vSwitch) 3.3.4
>
> ovn-controller --version
> ovn-controller 24.03.6
> Open vSwitch Library 3.3.4
> OpenFlow versions 0x6:0x6
> SB DB Schema 20.33.0
>
> No local patches.
>
> Kernel:
> # cat /proc/version
> Linux version 6.8.0-52-generic (buildd@lcy02-amd64-046)
> (x86_64-linux-gnu-gcc-13 (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0, GNU ld
> (GNU Binutils for Ubuntu) 2.42) #53-Ubuntu SMP PREEMPT_DYNAMIC Sat Jan 11
> 00:06:25 UTC 2025
>
> # ovs-dpctl show
> system@ovs-system:
>   lookups: hit:25791010986 missed:954723789 lost:3319115
>   flows: 1185
>   masks: hit:96181163550 total:35 hit/pkt:3.60
>   cache: hit:19162151183 hit-rate:71.65%
>   caches:
>     masks-cache: size:256
>   port 0: ovs-system (internal)
>   port 1: br-ex (internal)
>   port 2: bond0
>   port 3: br-int (internal)
>   port 4: genev_sys_6081 (geneve: packet_type=ptap)
>   port 5: tap196a9595-b2
>   port 6: tap72f307a7-37
>   ..
>
> The only workaround so far seems to be decreasing the number of virtual
> routers, which is not sustainable in an environment that is in use.
>
> We checked the flows and it seems they grow from ~70k up to ~120-130k when
> the issue happens:
> # ovs-ofctl dump-flows br-int | wc -l
> 78413
>
> Thank you for your help,
>
> Cristi
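Also, for the "over 4096 resubmit actions" warning, it may be worth feeding
the flow from that log line back into ofproto/trace to see which stages of
the logical pipeline keep resubmitting the RA. Roughly like this (the flow
string below is trimmed from your warning line; adjust it to your setup):

ovs-appctl ofproto/trace br-int 'icmp6,in_port=1,dl_vlan=101,dl_src=00:00:5e:00:02:65,dl_dst=33:33:00:00:00:01,ipv6_src=fe80::200:5eff:fe00:265,ipv6_dst=ff02::1,nw_ttl=255,icmp_type=134,icmp_code=0'

The trace output shows the whole resubmit chain, so it should make clear
whether that multicast to ff02::1 is being replicated to every router port
on the provider network.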
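While reproducing, it could also help to keep an eye on the datapath flow
pressure with:

ovs-appctl upcall/show

That reports the current datapath flow count against the flow limit, plus
the flow dump duration, which should show whether the revalidators are
falling behind while the main thread is pegged.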
Regards,
Tiago Pires

_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
