Okay, the dupes come from flooding, which happens when a packet falls through the flow rules without matching the MAC and lands at the default action.  Got it.  It also seems our outages came on the heels of IPv6 probes messing up the routing.  We now have IPv6 disabled in the kernel on our network hosts (we left it lingering on the compute hosts) and our environment is stable.
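For anyone wanting to reproduce the workaround, it was roughly the standard kernel sysctls (how you persist them will vary by distro):

  # disable IPv6 on all current interfaces and any added later
  sysctl -w net.ipv6.conf.all.disable_ipv6=1
  sysctl -w net.ipv6.conf.default.disable_ipv6=1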
I'm guessing you all typically run dual-stacked, or do you also have IPv6 disabled in the kernel in your environments?  It's probably not a great feature at this point.
Thanks for listening.
peter
---- Original Message ----
From: Peter Eisch
To: ovs-dev@openvswitch.org
Sent: Thu, Sep 20, 2018, 08:27 AM
Subject: Re: [ovs-dev] New config with an issue
The MAC address of the default gateway (the local dvr gateway) ages out of ovs and then craziness happens.  DVR then breaks and packets egress through the L3 agents.  If it egresses through the active L3, bad stuff happens: the active L3 either generates "no route to host" or forwards the packet, or both.  In the example below, the arp ages out on the hypervisor's dvr and then it heals pretty quickly:
64 bytes from 10.179.0.13: icmp_seq=1927 ttl=54 time=29.001 ms
64 bytes from 10.179.0.13: icmp_seq=1928 ttl=54 time=28.875 ms
64 bytes from 10.179.0.13: icmp_seq=1929 ttl=54 time=29.097 ms
64 bytes from 10.179.0.13: icmp_seq=1929 ttl=54 time=29.109 ms (DUP!)
64 bytes from 10.179.0.13: icmp_seq=1930 ttl=54 time=28.243 ms
64 bytes from 10.179.0.13: icmp_seq=1930 ttl=54 time=28.251 ms (DUP!)
64 bytes from 10.179.0.13: icmp_seq=1931 ttl=54 time=28.514 ms
64 bytes from 10.179.0.13: icmp_seq=1932 ttl=54 time=28.304 ms
64 bytes from 10.179.0.13: icmp_seq=1933 ttl=54 time=28.343 ms
If it happens to go through the inactive L3, the packets are forwarded without ceremony through its dvr-snat.  Eventually we land back at the local dvr and we're good for five or more minutes.
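In case anyone wants to watch it happen: OVS's MAC learning table defaults to a 300-second idle timeout, which lines up with the five-minute window.  Something like this should show the entry aging out and, as a blunt workaround, raise the timeout (br-int is the usual OpenStack integration bridge; adjust for your setup):

  # dump the MAC learning table and watch the gateway MAC disappear
  ovs-appctl fdb/show br-int
  # raise the MAC aging timeout to an hour (value is in seconds)
  ovs-vsctl set bridge br-int other_config:mac-aging-time=3600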
peter
On Wed, Sep 19, 2018 at 02:37 PM, Peter Eisch wrote:
Hi, it was suggested I ask for pointers on this list rather than cross-posting.

The environment:  Queens running OVS 2.8.x with a hypervisor (dvr) and two L3 agents (dvr-snat) in HA config.  DVR mode, hybrid firewall, and we want to avoid any NAT north-south because the site is smallish (13 hypervisors in all).

The issue:  A node is created and works mostly well.  I see the north->south traffic arrive via the snat interface on the active l3 host.  After some time, the egress (south->north) traffic stops routing out the fip on the hypervisor and instead transits through the l3 host.  It proceeds to jump around over time between the three hosts on what appear to be five-minute boundaries.  At some point, though, the node becomes unreachable until the egress flips again.

This tells me there's an arp entry timing out at 300 seconds.  Is there a cookbook for diagnosing our configuration issue, or a list of config options we should pay close attention to?
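Is watching the ARP table inside the router namespace on the hypervisor the right place to look?  E.g. (the qrouter UUID below is a placeholder for your router's ID):

  # find the router namespace on the node
  ip netns list | grep qrouter
  # watch the gateway's ARP entry age out (replace <uuid> with the router ID)
  ip netns exec qrouter-<uuid> ip -4 neigh show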

Respectfully,

peter
_______________________________________________
dev mailing list
dev@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev