I have been testing the patches and they work as expected (L3HA failovers, N/S, E/W, etc.), but I have found a couple of issues. I'm not sure the second one is a real issue, but I will describe it too; if it isn't, we can move that discussion to disc...@openvswitch.org.
1) Expiry of the chassisredirect port MAC in the switch CAM table

In N/S routing, whenever traffic needs to be handled by the master chassis for a router, the dst.mac is the MAC of the chassisredirect port. The switch knows about that MAC because it is announced via gARP at the L2 level. The problem is that, for incoming N to S traffic, the router pipelines rewrite src.mac to the MAC of the router's internal leg before sending the packet to the destination chassis. Because of that, the chassisredirect MAC/VLAN combination is never relearned from outgoing traffic on the right port (the master gateway chassis port), and the entry eventually expires after 300 seconds. From that moment on, any traffic directed to the chassisredirect port MAC is flooded until another gARP is sent. Everything seems to work fine at a very small scale, but this would really kill the network in real-life conditions.

You can see it live here: https://www.youtube.com/watch?v=VDwoXbZqUto (sorry for the audio, which is missing in a couple of non-important moments; not sure why).

The problematic MAC in that video is "fa:16:3e:48:66:e7", the one of this chassisredirect port:

    logical_port  : "cr-lrp-4823af55-cd17-4de8-8120-6d13c44dc86b"
    mac           : ["fa:16:3e:48:66:e7 172.24.4.8/24"]
    nat_addresses : []
    options       : {distributed-port="lrp-4823af55-cd17-4de8-8120-6d13c44dc86b"}
    parent_port   : []
    tag           : []
    tunnel_key    : 3
    type          : chassisredirect

Here I can think of two options:

a) Make sure the traffic is not fully processed by the lrouter flows on the gateway chassis, and let the packet egress the host with src.mac set to the chassisredirect MAC. That would make the switches relearn the MAC/VLAN to port association every time a packet flows N to S.

b) (which I believe doesn't work) Make sure gARPs never stop being sent, at intervals shorter than 300 seconds. This would not be a valid solution, since CAM table entries can be expired early on switches if the table overflows.
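To make the aging problem in "1" concrete, here is a minimal sketch of a learning switch with a 300-second aging timer. It is only a toy model, not how any real switch or OVN is implemented: the class, the function, the internal leg MAC, the VLAN ID and the port name are all made up for illustration; only the chassisredirect MAC and the 300-second figure come from the description above.

```python
AGING_SECONDS = 300                  # aging time mentioned above

CR_MAC = "fa:16:3e:48:66:e7"         # chassisredirect port MAC from the record above
INTERNAL_MAC = "fa:16:3e:00:00:01"   # hypothetical router internal leg MAC
VLAN = 100                           # hypothetical VLAN ID
GW_PORT = "gw-chassis-port"          # hypothetical switch port name


class LearningSwitch:
    """Toy CAM table: (vlan, mac) -> (port, time the MAC was last seen as source)."""

    def __init__(self, aging=AGING_SECONDS):
        self.aging = aging
        self.cam = {}

    def learn(self, now, vlan, src_mac, in_port):
        # Learning only ever happens on the *source* MAC of a frame.
        self.cam[(vlan, src_mac)] = (in_port, now)

    def lookup(self, now, vlan, dst_mac):
        # Return the egress port, or "flood" if the entry is unknown or aged out.
        entry = self.cam.get((vlan, dst_mac))
        if entry is None or now - entry[1] >= self.aging:
            self.cam.pop((vlan, dst_mac), None)
            return "flood"
        return entry[0]


def first_flood_time(egress_src_mac, duration=600, step=10):
    """One gARP at t=0, then every `step` seconds the gateway chassis sends a
    frame with `egress_src_mac` and another host sends a frame towards CR_MAC.
    Returns the first time that frame is flooded, or None if it never is."""
    sw = LearningSwitch()
    sw.learn(0, VLAN, CR_MAC, GW_PORT)                # the initial gARP
    for now in range(step, duration + 1, step):
        sw.learn(now, VLAN, egress_src_mac, GW_PORT)  # N to S frame leaving the gateway
        if sw.lookup(now, VLAN, CR_MAC) == "flood":   # frame directed to the gateway
            return now
    return None


print(first_flood_time(INTERNAL_MAC))  # current behaviour: src.mac is the internal leg
print(first_flood_time(CR_MAC))        # option a): src.mac is the chassisredirect MAC
```

In this model, egressing with the internal leg MAC never refreshes the chassisredirect entry, so traffic towards it starts flooding once the aging time has passed, while egressing with the chassisredirect MAC (option a) keeps the entry alive for as long as N to S traffic flows.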
2) MAC flipping on E/W traffic

This is easier to see in this blog post: https://ajo.es/ovn-distributed-ew-on-vlan/#the-end-oh-no
If you want the TL;DR version for more context, go to the top: https://ajo.es/ovn-distributed-ew-on-vlan/

Where the VLAN/MAC combination lives is not really important, since we never direct traffic to that MAC; all the lrouter flow processing happens in OpenFlow before the packet leaves the host.

My worry here is whether, for a switch, it is enough to disable port flapping protection, as we already have to do for L3HA (where a MAC can move between ports depending on master/backup status), or whether the higher rate of port flapping can be problematic. For example, I could imagine the switch logging every port flap, but I don't know if that would be the case.

One solution for this could be:

a) Make sure that packets leaving the host carry the host MAC address of the physical interface of the provider bridge to which the Logical Switch has a localport attached. That would be fine, since that MAC address is never matched on destination, but we would also need to restore the original MAC with another lflow when the packet arrives at the final chassis. (As far as I've been told, this is what neutron/dvr does for VLAN tenant networks.)

I plan to start working (with some help from Anil) on a follow-up patch to make sure "1" does not happen, and then on "2" if we confirm that it is problematic.

Best,
Miguel Ángel.

On Tue, Jul 10, 2018 at 8:25 AM, Miguel Angel Ajo Pelayo <majop...@redhat.com> wrote:

> Anil, good work! Thank you.
>
> I'm reviewing the patches and the behaviour of the series to make sure
> everything is all right.
>
> E/W distributed L3 routing over L2 is an interesting problem; I'm
> documenting what I see to share it on this thread.
>
> Best,
> Miguel Ángel
>
>
> On Mon, Jun 25, 2018 at 9:33 AM Anil Venkata <anilvenk...@redhat.com>
> wrote:
>
>> On Sat, Jun 16, 2018 at 12:05 AM, Ben Pfaff <b...@ovn.org> wrote:
>>
>> > On Thu, Jun 07, 2018 at 02:59:46PM +0530, vkomm...@redhat.com wrote:
>> > > From: Venkata Anil <vkomm...@redhat.com>
>> > >
>> > > This patch avoids tunneling and instead uses source tenant vlan network
>> > > across hypervisors for traffic from vlan network on local hypervisor
>> > > towards gateway hypervisor hosting chassis redirect port.
>> > >
>> > > On the local hypervisor, when the packet enters logical router ingress
>> > > pipeline from tenant vlan network, router will set REGBIT_NAT_REDIRECT
>> > > and redirect the packet to gateway hypervisor, which is hosting the
>> > > chassis redirect port, using tenant vlan network.
>> > > Packet travelling across hypervisors will have source vlan tag and
>> > > distributed gateway port MAC as destination MAC (other packet data
>> > > unchanged).
>> > >
>> > > Gateway hypervisor will check the vlan tag and destination MAC and
>> > > resubmit it to router logical ingress pipeline for routing and finding
>> > > the logical output port (i.e. it treats this packet as coming from the
>> > > local patch port connected to tenant vlan network for routing).
>> > >
>> > > No changes done for return path as return path to source hypervisor
>> > > always uses tenant vlan networks.
>> >
>> > Thanks a lot for revising the patch series.
>> >
>> > We've had a lot of churn in ovn-controller over the last week, and it
>> > has caused some patch rejects for this patch series. Would you mind
>> > rebasing and reposting it?
>> >
>>
>> Thanks Ben. Sorry for the delay, I was on vacation. I will rebase it now.
>>
>> Thanks
>> Anil
>> _______________________________________________
>> dev mailing list
>> d...@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>