On Wed, Sep 24, 2025 at 02:10:47PM +0700, Shawn Ming wrote:
> Hi Felix,
> 
> Thanks for your advice.

Hi Shawn,

> 
> 1. To some extent, I believe that limiting ARP requests by setting
> broadcast-arps-to-all-routers=false might help address our issue. My
> concern, however, is whether applying this change might temporarily
> disrupt our service or current network connectivity, so we want to
> carefully assess the risk before proceeding. I also have a specific
> scenario in mind:
> - We're using keepalived for a service running on the /21 network. If
> one interface goes down, the IP mapping will shift to another
> interface.
> In that case, how would ARP resolution behave if
> broadcast-arps-to-all-routers is disabled? Could this impact service
> availability?

Yes, such GARPs would no longer work.
An alternative might be to use a virtual MAC address that is the same
across both interfaces, so that only the MAC address moves. But I have
never tried this in that context.
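
For reference, the change itself would be a one-liner like the one
below; the switch name is a placeholder for whatever Neutron created
for your /21 network. The keepalived fragment is only a rough, untested
sketch of the virtual-MAC idea (use_vmac); interface, VRID and VIP are
placeholders as well:

  # Stop flooding ARP requests for unknown IPs to routers on the /21
  # Logical_Switch (look up the name with "ovn-nbctl ls-list"):
  ovn-nbctl set Logical_Switch <provider-net-21> \
      other_config:broadcast-arps-to-all-routers=false

  # Untested keepalived sketch: with use_vmac the VIP always uses the
  # VRRP virtual MAC (00:00:5e:00:01:<VRID in hex>), so a failover
  # would not depend on GARPs announcing a changed MAC address.
  # "bond1", VRID 51 and the VIP below are placeholders.
  vrrp_instance VI_21 {
      interface bond1
      virtual_router_id 51
      priority 100
      use_vmac
      vmac_xmit_base
      virtual_ipaddress {
          192.168.1.10/21
      }
  }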

> 
> 2. On a related note, I’d like your opinion regarding the Open vSwitch
> process/container on the /21 network.
> Even after I set max-revalidator and max-idle to 10s and 100s
> respectively - which, according to the documentation, should mean each
> flow has a timeout of 10s - I still observe that the ovs-vswitchd
> process (and the openvswitch_vswitchd container as a whole) continues
> to consume a significant amount of CPU resources (around 300–600%
> according to top/htop and docker stats). The revalidator also seems to
> be running hard.
> In your experience, is this level of CPU usage normal or acceptable?
> Looking forward to your opinion on this!

We have an ovs-vswitchd process that regularly runs with up to 20 cores,
nearly all of that spent in the revalidators. So 3-6 cores is definitely
doable.
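
To see where the CPU time actually goes, it might help to look at the
individual threads; the revalidator and handler threads show up by
name. A rough sketch, assuming the process is visible from the host:

  # Per-thread CPU usage of ovs-vswitchd:
  top -H -p $(pidof ovs-vswitchd)

  # Or a one-shot view, sorted by CPU:
  ps -T -p $(pidof ovs-vswitchd) -o tid,comm,%cpu --sort=-%cpu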

If you actually have max-idle=100s then datapath flows should have a
timeout of 100s. max-revalidator only controls the latest point at which
revalidation is triggered (if I understood it correctly).

https://developers.redhat.com/articles/2022/10/19/open-vswitch-revalidator-process-explained#how_the_revalidator_threads_operate
is quite helpful here.
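
It might also be worth double-checking what is actually configured in
the database, since the values there are in milliseconds. A small
sketch; run it wherever ovs-vsctl can reach the local ovsdb:

  # Show the currently configured other_config values (milliseconds):
  ovs-vsctl get Open_vSwitch . other_config

  # A 100 second datapath flow timeout would then be:
  #   ovs-vsctl set Open_vSwitch . other_config:max-idle=100000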

"ovs-appctl upcall/show" might be helpful to take a look at. Especially
the dump duration and if the flow limit is still at the default of
200000, or if it has been reduced by the revalidators taking too long.
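
A loop like the one below (using the container name from your docker
stats output) would show how the flow count, flow limit and dump
duration evolve over time; just a sketch:

  while true; do
      date
      docker exec openvswitch_vswitchd ovs-appctl upcall/show
      sleep 1
  done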

Thanks,
Felix


> 
> Best regards,
> Shawn
> 
> 
> On Mon, Sep 22, 2025 at 1:48 PM Felix Huettner
> <[email protected]> wrote:
> >
> > On Mon, Sep 22, 2025 at 09:18:28AM +0700, Shawn Ming wrote:
> > > Hi Felix, all,
> >
> > Hi Shawn,
> >
> > >
> > > Thanks Felix for your detailed notes and suggestions. I did some
> > > additional investigation as you recommended, and here’s what I found:
> > >
> > > 1. Using tcpdump, I observed that roughly ~2,000 ARP requests are
> > > being sent from the router to all physical nodes - including those
> > > without any VMs in the /21 network.
> > > Most ARP requests are for IPs that appear to be invalid (either
> > > unallocated or currently down), while valid IPs do not generate nearly
> > > as many requests.
> >
> > I would guess that this /21 network is a publicly routable one that is
> > internet accessible? In my experience this is quite normal then.
> > Random scanners on the internet will scan your /21 range and send
> > requests there. Upstream routers do not seem to cache ARP misses and
> > will therefore send an ARP request for each packet they get for unused
> > IPs.
> >
> > >
> > > 2. Interestingly, only the compute nodes that host VMs attached to the
> > > /21 CIDR show latency spikes and high CPU usage.
> > > Other nodes receive the same ARP flood but don’t seem to be affected
> > > in the same way.
> >
> > Could you try setting other_config:broadcast-arps-to-all-routers=false
> > on the Logical_Switch that represents this /21 network (if the
> > implications are acceptable)?
> >
> > If a Logical_Switch gets an ARP request for an IP it does not know,
> > it will flood the request to all attached Logical_Switch_Ports and
> > potentially routers. If you have a lot of such requests, that can be
> > quite inefficient.
> >
> > We built the setting above for exactly that purpose.
> >
> > Please note that if you set this, only ARP requests with a known
> > destination IP will be processed. If you rely on GARPs from your
> > upstream router for some kind of failover mechanism, that would no
> > longer work.
> >
> > Thanks a lot,
> > Felix
> >
> > >
> > > 3. From what I’ve gathered, this behavior might be linked somehow to
> > > how megaflow handling works in OVS/OVN. I’m still digging into the
> > > details.
> > >
> > > I’d appreciate any further insights from you and everyone in the 
> > > community!
> > >
> > > Best regards,
> > > Shawn
> > >
> > > On Tue, Sep 16, 2025 at 9:19 PM Felix Huettner
> > > <[email protected]> wrote:
> > > >
> > > > On Mon, Sep 15, 2025 at 05:16:03PM +0700, Shawn Ming via discuss wrote:
> > > > > Hello all,
> > > > >
> > > > > I am running OpenStack (deployed via Kolla-Ansible) with Neutron using
> > > > > OVN as the networking backend. The `distributed_floating_ip` option is
> > > > > not enabled.
> > > > > I have encountered an issue related to large provider networks (CIDR
> > > > > /21) and would like to seek advice from the community.
> > > >
> > > > Hi Shawn,
> > > >
> > > > I'll note below what I saw; maybe it is helpful to you.
> > > >
> > > > >
> > > > > I./ Environment / Steps to reproduce:
> > > > > - OpenStack Caracal (2024.1) deployed with Kolla-Ansible.
> > > > > - Neutron backend: OVN version 24.03.2 (not setting 
> > > > > distributed_floating_ip).
> > > > > - Create a provider network with CIDR /21.
> > > > > - Deploy some VMs directly attached to this network.
> > > > > - Observe traffic and system behavior.
> > > > >
> > > > > II./ Observed behavior:
> > > > > Note: the actual gateway IP address in the logs has been replaced.
> > > > > 1. VMs attached to the /21 network frequently have latency spike and 
> > > > > packet loss
> > > > > root@vm4:~# ping 192.168.1.254
> > > > > PING 192.168.1.254 (192.168.1.254) 56(84) bytes of data.
> > > > > 64 bytes from 192.168.1.254: icmp_seq=4 ttl=64 time=6.86 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=5 ttl=64 time=49.1 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=6 ttl=64 time=7.74 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=7 ttl=64 time=7.68 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=9 ttl=64 time=0.850 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=10 ttl=64 time=1.40 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=8 ttl=64 time=2317 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=11 ttl=64 time=5.31 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=13 ttl=64 time=0.749 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=14 ttl=64 time=4.06 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=15 ttl=64 time=1.67 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=16 ttl=64 time=8.24 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=17 ttl=64 time=9.61 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=18 ttl=64 time=5.71 ms
> > > > > ^C
> > > > > --- 192.168.1.254 ping statistics ---
> > > > > 18 packets transmitted, 14 received, 22.2222% packet loss, time 
> > > > > 17148ms
> > > > > rtt min/avg/max/mdev = 0.749/173.252/2316.610/594.574 ms, pipe 3
> > > >
> > > > Not only is the latency spike strange, but also the packet reordering.
> > > >
> > > > >
> > > > > Meanwhile, VMs attached to /23 network do not
> > > > > root@vm5:~# ping 192.168.2.254
> > > > > PING 192.168.2.254 (192.168.2.254) 56(84) bytes of data.
> > > > > 64 bytes from 192.168.2.254: icmp_seq=1 ttl=64 time=1.04 ms
> > > > > 64 bytes from 192.168.2.254: icmp_seq=2 ttl=64 time=25.9 ms
> > > > > 64 bytes from 192.168.2.254: icmp_seq=3 ttl=64 time=5.05 ms
> > > > > 64 bytes from 192.168.2.254: icmp_seq=4 ttl=64 time=2.05 ms
> > > > > 64 bytes from 192.168.2.254: icmp_seq=5 ttl=64 time=0.523 ms
> > > > > 64 bytes from 192.168.2.254: icmp_seq=6 ttl=64 time=4.16 ms
> > > > > 64 bytes from 192.168.2.254: icmp_seq=7 ttl=64 time=0.798 ms
> > > > > 64 bytes from 192.168.2.254: icmp_seq=8 ttl=64 time=70.9 ms
> > > > > 64 bytes from 192.168.2.254: icmp_seq=9 ttl=64 time=1.54 ms
> > > > > 64 bytes from 192.168.2.254: icmp_seq=10 ttl=64 time=4.14 ms
> > > > > 64 bytes from 192.168.2.254: icmp_seq=11 ttl=64 time=6.88 ms
> > > > > 64 bytes from 192.168.2.254: icmp_seq=12 ttl=64 time=0.733 ms
> > > > > 64 bytes from 192.168.2.254: icmp_seq=13 ttl=64 time=1.01 ms
> > > > > 64 bytes from 192.168.2.254: icmp_seq=14 ttl=64 time=2.70 ms
> > > > > 64 bytes from 192.168.2.254: icmp_seq=15 ttl=64 time=26.5 ms
> > > > > ^C
> > > > > --- 192.168.2.254 ping statistics ---
> > > > > 15 packets transmitted, 15 received, 0% packet loss, time 14056ms
> > > > > rtt min/avg/max/mdev = 0.523/10.263/70.898/18.157 ms
> > > >
> > > > While the latency spikes are better, this is still a lot of
> > > > variation. That still does not feel healthy.
> > > >
> > > > >
> > > > > 2. We dedicated compute nodes hosting only one VM each for comparison
> > > > > (hosting VM in /21 network and in /23 network):
> > > > > 2.1. Compute node has VM in /21 network
> > > > > - OVS shows high CPU usage
> > > > > CONTAINER ID   NAME                   CPU %         MEM %     NET I/O
> > > > >  BLOCK I/O    PIDS
> > > > > d28dc099cc43   openvswitch_vswitchd   210.81%      0.12%     0B / 0B
> > > > > 0B / 401kB   126
> > > > >
> > > > > - OVS shows ARP flow records changing quickly and frequently
> > > > > (in second n)(openvswitch-vswitchd)[compute-node]# ovs-dpctl
> > > > > dump-flows | grep arp | grep "192.168.1.254" | wc -l
> > > > > 1184
> > > > > (in second n+1)(openvswitch-vswitchd)[compute-node]# ovs-dpctl
> > > > > dump-flows | grep arp | grep "192.168.1.254" | wc -l
> > > > > 628
> > > > > (in second n+2)(openvswitch-vswitchd)[compute-node]# ovs-dpctl
> > > > > dump-flows | grep arp | grep "192.168.1.254" | wc -l
> > > > > 1256
> > > > > (in second n+3)(openvswitch-vswitchd)[compute-node]# ovs-dpctl
> > > > > dump-flows | grep arp | grep "192.168.1.254" | wc -l
> > > > > 962
> > > >
> > > > I just compared this to what we see on our nodes.
> > > > There we mostly have one flow for newly sent ARP requests. Since the
> > > > peer sending the ARP request should cache the response, there should
> > > > be no reason to send them regularly.
> > > >
> > > > I would propose you look deeper into these different flows for a single
> > > > IP. It would probably be interesting to see what the differences between
> > > > them are.
> > > > Maybe you will also see something interesting if you run a tcpdump that
> > > > filters on ARP requests for that IP address. If you see a lot of these
> > > > requests, you can maybe find their cause.
> > > >
> > > > >
> > > > > - The number of OVS flows fluctuates, and packet drops occur even when
> > > > > no traffic is generated by the VM.
> > > > > (in second n)(openvswitch-vswitchd)[compute-node]# ovs-appctl 
> > > > > dpctl/show
> > > > > system@ovs-system:
> > > > >   lookups: hit:54725926968 missed:525971260 lost:69166
> > > > >   flows: 1962
> > > > >   masks: hit:57009090988 total:19 hit/pkt:1.03
> > > > >   cache: hit:54183648664 hit-rate:98.07%
> > > > >   caches:
> > > > >     masks-cache: size:256
> > > > > (in second n+1)(openvswitch-vswitchd)[compute-node]# ovs-appctl 
> > > > > dpctl/show
> > > > > system@ovs-system:
> > > > >   lookups: hit:54725931065 missed:525972068 lost:69509
> > > > >   flows: 2474
> > > > >   masks: hit:57009110492 total:19 hit/pkt:1.03
> > > > >   cache: hit:54183652139 hit-rate:98.07%
> > > > >   caches:
> > > > >     masks-cache: size:256
> > > > > (in second n+2)(openvswitch-vswitchd)[compute-node]# ovs-appctl 
> > > > > dpctl/show
> > > > > system@ovs-system:
> > > > >   lookups: hit:54725936481 missed:525972862 lost:69509
> > > > >   flows: 225
> > > > >   masks: hit:57009126369 total:12 hit/pkt:1.03
> > > > >   cache: hit:54183657403 hit-rate:98.07%
> > > > >   caches:
> > > > >     masks-cache: size:256
> > > >
> > > > What might be interesting here would be "ovs-appctl upcall/show".
> > > > It shows how many flows are installed over time and what the current
> > > > flow limit and dump duration are.
> > > >
> > > > >
> > > > > 2.2. Compute node has VM in /23 network show better results:
> > > > > - ARP flow count is stable.
> > > > > (in second n)(openvswitch-vswitchd)[compute-node]# ovs-dpctl
> > > > > dump-flows | grep arp | grep "192.168.2.254" | wc -l
> > > > > 403
> > > > > (in second n+1)(openvswitch-vswitchd)[compute-node]# ovs-dpctl
> > > > > dump-flows | grep arp | grep "192.168.2.254" | wc -l
> > > > > 403
> > > > > (in second n+2)(openvswitch-vswitchd)[compute-node]# ovs-dpctl
> > > > > dump-flows | grep arp | grep "192.168.2.254" | wc -l
> > > > > 402
> > > > > (in second n+3)(openvswitch-vswitchd)[compute-node]# ovs-dpctl
> > > > > dump-flows | grep arp | grep "192.168.2.254" | wc -l
> > > > > 397
> > > > >
> > > > > - Flow entries are stable.
> > > > > (in second n)(openvswitch-vswitchd)[compute-node]# ovs-appctl 
> > > > > dpctl/show
> > > > > system@ovs-system:
> > > > >   lookups: hit:54763442917 missed:539025268 lost:4603675
> > > > >   flows: 2666
> > > > >   masks: hit:60577538636 total:30 hit/pkt:1.10
> > > > >   cache: hit:54123911742 hit-rate:97.87%
> > > > >   caches:
> > > > >     masks-cache: size:256
> > > > > (in second n+1)(openvswitch-vswitchd)[compute-node]# ovs-appctl 
> > > > > dpctl/show
> > > > > system@ovs-system:
> > > > >   lookups: hit:54763450196 missed:539025306 lost:4603675
> > > > >   flows: 2670
> > > > >   masks: hit:60577547869 total:31 hit/pkt:1.10
> > > > >   cache: hit:54123918904 hit-rate:97.87%
> > > > >   caches:
> > > > >     masks-cache: size:256
> > > > > (in second n+2)(openvswitch-vswitchd)[compute-node]# ovs-appctl 
> > > > > dpctl/show
> > > > > system@ovs-system:
> > > > >   lookups: hit:54763458923 missed:539025355 lost:4603675
> > > > >   flows: 2669
> > > > >   masks: hit:60577558873 total:31 hit/pkt:1.10
> > > > >   cache: hit:54123927487 hit-rate:97.87%
> > > > >   caches:
> > > > >     masks-cache: size:256
> > > > >   port 0: ovs-system (internal)
> > > > >   port 1: br-ex (internal)
> > > > >   port 2: bond1
> > > > >   port 3: br-int (internal)
> > > > >   port 4: genev_sys_6081 (geneve: packet_type=ptap)
> > > > >   port 5: tap19510eb6-89
> > > > >   port 6: tap6ff45ca7-64
> > > > >   port 7: tap6b851650-70
> > > > >   port 8: tap0d5f11f9-80
> > > > >
> > > > > - OVS CPU usage remains normal (almost no spike).
> > > > > CONTAINER ID   NAME                   CPU %        MEM %     NET I/O
> > > > > BLOCK I/O     PIDS
> > > > > 6262f4bc6ab1   openvswitch_vswitchd   11.16%      0.16%     0B / 0B
> > > > > 0B / 45.7MB   127
> > > > >
> > > > > 3. Workaround
> > > > > - As a temporary workaround, increasing `max-idle` to 1000000 and
> > > > > `max-revalidator` values to 10000 appears to reduce the problem for
> > > > > the VM in CIDR /21 (default values: `max-idle=10000ms~10s`,
> > > > > `max-revalidator=500ms` per documentation).
> > > > > 3.1. OVS ARP flow count remains stable (no fluctuation).
> > > > > (in second n)(openvswitch-vswitchd)[compute-node]# ovs-dpctl
> > > > > dump-flows | grep arp | grep "192.168.1.254" | wc -l
> > > > > 1902
> > > > > (in second n+1)(openvswitch-vswitchd)[compute-node]# ovs-dpctl
> > > > > dump-flows | grep arp | grep "192.168.1.254" | wc -l
> > > > > 1902
> > > > > (in second n+2)(openvswitch-vswitchd)[compute-node]# ovs-dpctl
> > > > > dump-flows | grep arp | grep "192.168.1.254" | wc -l
> > > > > 1903
> > > > >
> > > > > 3.2. Flow entries fluctuate less.
> > > > > (in second n)(openvswitch-vswitchd)[compute-node]# ovs-appctl 
> > > > > dpctl/show
> > > > > system@ovs-system:
> > > > >   lookups: hit:53647897445 missed:569251988 lost:8108302
> > > > >   flows: 2697
> > > > >   masks: hit:56660348845 total:31 hit/pkt:1.05
> > > > >   cache: hit:52974334763 hit-rate:97.71%
> > > > >   caches:
> > > > >     masks-cache: size:256
> > > > > (in second n+1)(openvswitch-vswitchd)[compute-node]# ovs-appctl 
> > > > > dpctl/show
> > > > > system@ovs-system:
> > > > >   lookups: hit:53647898935 missed:569251992 lost:8108302
> > > > >   flows: 2701
> > > > >   masks: hit:56660350555 total:31 hit/pkt:1.05
> > > > >   cache: hit:52974336237 hit-rate:97.71%
> > > > >   caches:
> > > > >     masks-cache: size:256
> > > > > (in second n+2)(openvswitch-vswitchd)[compute-node]# ovs-appctl 
> > > > > dpctl/show
> > > > > system@ovs-system:
> > > > >   lookups: hit:53647900325 missed:569251995 lost:8108302
> > > > >   flows: 2704
> > > > >   masks: hit:56660352110 total:31 hit/pkt:1.05
> > > > >   cache: hit:52974337617 hit-rate:97.71%
> > > > >   caches:
> > > > >     masks-cache: size:256
> > > > >
> > > > > 3.3. But OVS CPU usage remains high.
> > > > > CONTAINER ID   NAME                   CPU %        MEM %     NET I/O
> > > > > BLOCK I/O    PIDS
> > > > > 67fea86bbb86   openvswitch_vswitchd   487.15%      0.20%     0B / 0B
> > > > > 0B / 333MB   137
> > > > >
> > > > > 3.4. Ping to the gateway (from inside the VM) shows improved results.
> > > > > root@vm4:~# ping 192.168.1.254
> > > > > PING 192.168.1.254 (192.168.1.254) 56(84) bytes of data.
> > > > > 64 bytes from 192.168.1.254: icmp_seq=1 ttl=64 time=7.33 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=2 ttl=64 time=1.78 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=3 ttl=64 time=0.624 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=4 ttl=64 time=24.2 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=5 ttl=64 time=1.81 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=6 ttl=64 time=4.98 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=7 ttl=64 time=5.89 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=8 ttl=64 time=6.57 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=9 ttl=64 time=5.26 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=10 ttl=64 time=1.07 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=11 ttl=64 time=3.34 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=12 ttl=64 time=2.53 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=13 ttl=64 time=14.8 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=14 ttl=64 time=29.4 ms
> > > > > 64 bytes from 192.168.1.254: icmp_seq=15 ttl=64 time=2.11 ms
> > > > > ^C
> > > > > --- 192.168.1.254 ping statistics ---
> > > > > 15 packets transmitted, 15 received, 0% packet loss, time 14039ms
> > > > > rtt min/avg/max/mdev = 0.624/7.442/29.369/8.363 ms
> > > > >
> > > > > Has anyone encountered a similar problem before? If so, could you
> > > > > share what the root cause was in your case, and what you found to be
> > > > > the most effective solution?
> > > >
> > > > The workaround above seems to work mostly by keeping flows installed
> > > > in the datapath for longer. If it works, that means there seem to be a
> > > > lot of different clients (probably around these 1902) that regularly
> > > > send ARP requests, but do not send them often enough for the flows to
> > > > stay in the datapath.
> > > >
> > > > So I would propose investigating where these ARP requests come from.
> > > >
> > > > Hope it helps in some way.
> > > > Felix
> > > >
> > > > > Thanks in advance!
> > > > > _______________________________________________
> > > > > discuss mailing list
> > > > > [email protected]
> > > > > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss