Hi,

On Tue, 9 Dec 2025 at 11:08 Cristian Contescu <[email protected]> wrote:
> Hi Tiago,
>
> Thank you for your quick reply. We have actually just tested this in one
> of our clusters and it seems that the load issue is gone.
> Are you aware of any possible drawbacks of setting this?

Indeed, if you rely on GARPs from an upstream router for some kind of
failover mechanism, this would no longer work.

Tiago Pires

> Thank you again,
> Cristi
>
> On Tue, Dec 9, 2025 at 1:07 PM Tiago Pires <[email protected]>
> wrote:
>> Hi Cristian,
>>
>> On Tue, Dec 9, 2025 at 6:42 AM Cristian Contescu via discuss
>> <[email protected]> wrote:
>> >
>> > Hello everyone,
>> >
>> > We wanted to check with the community about a strange issue we saw
>> > happening in the following scenario.
>> >
>> > In order to scale out one of our environments, we increased the IPv4
>> > provider network from a /22 to a /20 in OpenStack. After doing so, we
>> > noticed that OVS started using 100% CPU (also visible in the logs):
>> >
>> > 2025-12-04T14:18:07Z|26750|poll_loop|INFO|wakeup due to [POLLIN] on fd
>> > 425 (/var/run/openvswitch/br-int.mgmt<->) at ../lib/stream-fd.c:157
>> > (101% CPU usage)
>> >
>> > When the CPU spiked to 100% (the ovs-vswitchd main process; the
>> > handler and revalidator threads were not nearly as busy), we started
>> > seeing packet loss regardless of traffic type (IPv4 / IPv6).
>> >
>> > After that we reverted the change, and saw that the same issue appears
>> > when we gradually increase the number of virtual routers in another
>> > environment (with less traffic) that also has a /20 provider network.
>> >
>> > Do you know of any recent fixes related to the following, or has
>> > anyone experienced a similar issue and can point us to some options
>> > to evaluate?
>> >
>> > Our setup:
>> >
>> > - dual-stack external network (VLAN type) with a /20 IPv4 subnet
>> >   (increased from a /22) and a /64 IPv6 subnet
>> > - virtual routers are connected to the external network
>> > - dual-stack tenant networks are possible
>> > - for IPv4 we use distributed floating IPs and SNAT (+DNAT)
>> > - for IPv6, tenant networks are public and advertised via the
>> >   ovn-bgp-agent to the physical routers, with the next hop on the
>> >   external network
>> > - our OpenStack setup is based on openstack-helm deployed on physical
>> >   nodes
>> >
>> > So as of now, our current findings are:
>> > - a correlation between the number of virtual routers and the CPU
>> >   usage increase
>> > - a potential correlation with the provider network being a /20
>> >   instead of a /22 (larger broadcast domain / more broadcast traffic)
>>
>> Did you set broadcast-arps-to-all-routers=false in the provider
>> network's logical switch? For example:
>>
>>   ovn-nbctl --no-leader-only set logical_switch <ID-Logical-Switch> \
>>       other_config:broadcast-arps-to-all-routers=false
>>
>> Looking at your scenario, I think it will probably reduce this load
>> spike in ovs-vswitchd.
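>>
>> If you need to find the logical switch's UUID first, something along
>> these lines should work (a rough sketch; with the OVN ML2 driver the
>> provider network usually appears as neutron-<network-uuid>):
>>
>>   # list all logical switches with their UUIDs and names
>>   ovn-nbctl --no-leader-only ls-list
>>
>>   # after setting the option, confirm it was actually stored
>>   ovn-nbctl --no-leader-only get logical_switch <ID-Logical-Switch> \
>>       other_config:broadcast-arps-to-all-routers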
>> > - a potential RA IPv6 / multicast flood when the issue happens, based
>> >   on the tcpdumps we investigated
>> >
>> > What you did that made the problem appear:
>> > To replicate this issue, we increased the provider network from /22
>> > to /20 in OpenStack and on the physical routers connecting the
>> > external network.
>> >
>> > Another way to replicate the issue is to simply increase the number
>> > of virtual routers on an existing /20 provider network in a different
>> > OpenStack environment.
>> >
>> > What you expected to happen:
>> > - OVS has the same load as before and doesn't reach 100% CPU usage,
>> >   and thus no packet loss
>> > - OVS is able to sustain the /20 provider network from OpenStack
>> >
>> > What actually happened:
>> > - As soon as the OVS main process reached 100% CPU usage in the logs,
>> >   we started detecting packet loss.
>> > - We also see "|WARN|over 4096 resubmit actions on bridge" errors:
>> >
>> > neutron openvswitch-zg86k openvswitch-vswitchd
>> > 2025-12-08T14:33:28Z|00016|ofproto_dpif_xlate(handler35)|WARN|over
>> > 4096 resubmit actions on bridge br-int while processing
>> > icmp6,in_port=1,dl_vlan=101,dl_vlan_pcp=0,vlan_tci1=0x0000,dl_src=00:00:5e:00:02:65,dl_dst=33:33:00:00:00:01,ipv6_src=fe80::200:5eff:fe00:265,ipv6_dst=ff02::1,ipv6_label=0x00000,nw_tos=224,nw_ecn=0,nw_ttl=255,nw_frag=no,icmp_type=134,icmp_code=0
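>> >
>> > If it helps the investigation: the flow from that warning can be fed
>> > back into ofproto/trace to see where the resubmits happen. A rough,
>> > untested sketch using a subset of the fields from the warning above:
>> >
>> >   ovs-appctl ofproto/trace br-int \
>> >       'icmp6,in_port=1,dl_vlan=101,dl_src=00:00:5e:00:02:65,dl_dst=33:33:00:00:00:01,ipv6_src=fe80::200:5eff:fe00:265,ipv6_dst=ff02::1,icmp_type=134,icmp_code=0'
>> >
>> > The trace output lists every table lookup and resubmit, which should
>> > show which flows fan the RA out to all the router ports.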
>> >
>> > Versions of various components:
>> >
>> > ovs-vswitchd --version
>> > ovs-vswitchd (Open vSwitch) 3.3.4
>> >
>> > ovn-controller --version
>> > ovn-controller 24.03.6
>> > Open vSwitch Library 3.3.4
>> > OpenFlow versions 0x6:0x6
>> > SB DB Schema 20.33.0
>> >
>> > No local patches.
>> >
>> > Kernel:
>> > # cat /proc/version
>> > Linux version 6.8.0-52-generic (buildd@lcy02-amd64-046)
>> > (x86_64-linux-gnu-gcc-13 (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0, GNU
>> > ld (GNU Binutils for Ubuntu) 2.42) #53-Ubuntu SMP PREEMPT_DYNAMIC Sat
>> > Jan 11 00:06:25 UTC 2025
>> >
>> > # ovs-dpctl show
>> > system@ovs-system:
>> >   lookups: hit:25791010986 missed:954723789 lost:3319115
>> >   flows: 1185
>> >   masks: hit:96181163550 total:35 hit/pkt:3.60
>> >   cache: hit:19162151183 hit-rate:71.65%
>> >   caches:
>> >     masks-cache: size:256
>> >   port 0: ovs-system (internal)
>> >   port 1: br-ex (internal)
>> >   port 2: bond0
>> >   port 3: br-int (internal)
>> >   port 4: genev_sys_6081 (geneve: packet_type=ptap)
>> >   port 5: tap196a9595-b2
>> >   port 6: tap72f307a7-37
>> >   ..
>> >   ..
>> >
>> > The only workaround seems to be decreasing the number of virtual
>> > routers, which is not sustainable in a busy environment.
>> >
>> > We checked the flows, and they seem to grow from ~70k up to ~120-130k
>> > when the issue happens:
>> > # ovs-ofctl dump-flows br-int | wc -l
>> > 78413
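>> >
>> > For reference, a simple way to watch the flow growth together with
>> > the datapath upcall statistics while reproducing is something like
>> > this (a rough sketch, run on the affected node):
>> >
>> >   # sample the OpenFlow table size and the upcall handler stats
>> >   watch -n 10 'ovs-ofctl dump-flows br-int | wc -l; ovs-appctl upcall/show'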
>> >
>> > Thank you for your help,
>> >
>> > Cristi
>>
>> Regards,
>>
>> Tiago Pires

_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
