Okay, the dupes come from the flooding that happens when a packet passes through the flow rules without matching on the MAC and lands at the default action. Got it.

It also seems our outages came on the heels of IPv6 probes messing up the routing. We now have IPv6 disabled in the kernel on our network hosts (we left it lingering on the compute hosts) and our environment is stable. I'm guessing you all either run dual-stacked or also have IPv6 disabled in the kernel in your environments? It's probably not a great feature at this point.

Thanks for listening.

peter

---- Original Message ----
From: Peter Eisch
To: ovs-dev@openvswitch.org
Sent: Thu, Sep 20, 2018, 08:27 AM
Subject: Re: [ovs-dev] New config with an issue

The MAC address of the default gateway (the local DVR gateway) ages out of OVS and then craziness happens. DVR breaks and packets egress through the L3 agents. If the traffic egresses through the active L3, bad things happen: the active L3 either generates "no route to host" or forwards the packet, or both. In the example below the ARP ages out on the hypervisor's DVR and then it heals pretty quickly:

64 bytes from 10.179.0.13: icmp_seq=1927 ttl=54 time=29.001 ms
64 bytes from 10.179.0.13: icmp_seq=1928 ttl=54 time=28.875 ms
64 bytes from 10.179.0.13: icmp_seq=1929 ttl=54 time=29.097 ms
64 bytes from 10.179.0.13: icmp_seq=1929 ttl=54 time=29.109 ms (DUP!)
64 bytes from 10.179.0.13: icmp_seq=1930 ttl=54 time=28.243 ms
64 bytes from 10.179.0.13: icmp_seq=1930 ttl=54 time=28.251 ms (DUP!)
64 bytes from 10.179.0.13: icmp_seq=1931 ttl=54 time=28.514 ms
64 bytes from 10.179.0.13: icmp_seq=1932 ttl=54 time=28.304 ms
64 bytes from 10.179.0.13: icmp_seq=1933 ttl=54 time=28.343 ms

If the traffic happens to go through the inactive L3, the packets are forwarded without ceremony through its dvr-snat. Eventually we land back at the local DVR and we're good for five or more minutes.

peter
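One way to watch this happen on the hypervisor is to look at the learned-MAC table and the catch-all flow directly. A rough sketch, assuming the integration bridge is named br-int and that the default 300-second MAC aging is in play (the bridge name and aging value are assumptions, adjust for your deployment):

    # learned MACs on the integration bridge; the gateway MAC disappearing
    # from this table lines up with the flood/DUP behaviour described above
    ovs-appctl fdb/show br-int

    # the default entry (NORMAL action) that floods frames whose destination
    # MAC has not been learned
    ovs-ofctl dump-flows br-int | grep -i normal

    # OVS ages learned MACs after 300 seconds by default; the interval can
    # be raised per bridge while diagnosing
    ovs-vsctl set Bridge br-int other_config:mac-aging-time=900

For the IPv6 change mentioned in the newest message in this thread, the usual kernel knobs are along these lines (a sketch; persist them in /etc/sysctl.conf or a sysctl.d drop-in as appropriate for your hosts):

    sysctl -w net.ipv6.conf.all.disable_ipv6=1
    sysctl -w net.ipv6.conf.default.disable_ipv6=1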
On Wed, Sep 19, 2018 at 02:37 PM, Peter Eisch wrote:

Hi, it was suggested I ask for pointers on this list (not cross-posting).

The environment: Queens running 2.8.x with a hypervisor (dvr) and two L3 agents (dvr-snat) in HA config. DVR mode, hybrid firewall, and we want to avoid NAT north-south because the site is smallish (13 hypervisors in all).

The issue: A node is created and works mostly well. I see the north->south traffic arrive via the snat interface on the active L3 host. After some time, the egress (south->north) traffic stops routing out the FIP on the hypervisor and instead transits through the L3 host. It proceeds to jump around over time between the three hosts at what appear to be five-minute boundaries. At some point, though, the node becomes unreachable until the egress flips again. This tells me there's an ARP entry timing out at 300 seconds.

Is there a cookbook for how to better diagnose our configuration issue, or which config options to pay close attention to?

Respectfully,

peter
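A rough sketch of how one might narrow down the 300-second flip (everything in angle brackets is a placeholder; which namespace holds the relevant neighbour entry, qrouter, fip or snat, depends on the DVR layout):

    # list the Neutron namespaces on the host in question
    ip netns list

    # watch the neighbour/ARP entry for the upstream gateway while the flip
    # happens (look for REACHABLE dropping to STALE or FAILED)
    ip netns exec <qrouter-or-fip-namespace> ip neigh show

    # see whether re-resolving the gateway restores the local egress path
    ip netns exec <qrouter-or-fip-namespace> arping -c 3 -I <interface> <gateway-ip>

    # capture on the external interface of each L3 host to confirm which one
    # is actually carrying south->north traffic at a given moment
    tcpdump -ni <external-interface> icmp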