Hi Girish, yes, that's what we concluded in the last OVN meeting. Sorry that I forgot to update the thread here.

On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail <gmoodalb...@gmail.com> wrote:
>
> Hello all,
>
> To kind of proceed with the proposed fixes, with minimal impact, is the
> following a reasonable approach?
>
> Add an option, namely dynamic_neigh_routes={true|false}, for a gateway
> router. With this option enabled, the nextHop IP's MAC will be learned
> through an ARP request on the physical network. The ARP request will be
> flooded on the L2 broadcast domain (for both join switch and external
> switch).
>

The RFC patch fulfils this purpose:
https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
I am working on the formal patch.

> Add an option, namely learn_from_arp_request={true|false}, for a gateway
> router. The option is interpreted as below:
> "true" - learn the MAC/IP binding and add a new MAC_Binding entry
> (default behavior)
> "false" - if there is a MAC_Binding for that IP and the MAC is different,
> then update that MAC/IP binding. The external entity might be trying to
> advertise the new MAC for that IP. (If we don't do this, then we will
> never learn External VIP to MAC changes)
>
> (Irrespective of whether learn_from_arp_request is true or false, always
> do this -- if the TPA is on the router, add a new entry (it means the
> remote wants to communicate with this node, so it makes sense to learn
> the remote as well))
>

I am working on this as well, but delayed a little. I hope to have something this week.

>
> For now, I think it is fine for ARP packets to be broadcast on the
> tunnel for the `join` switch case. If it becomes a problem, then we can
> start looking around changing the logical flows.
>
> Thanks everyone for the lively discussion.
>
> Regards,
> ~Girish
>
> On Thu, May 28, 2020 at 7:33 AM Tim Rozet <tro...@redhat.com> wrote:
>>
>>
>>
>> On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>
>>> On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
>>> > Hi all
>>> >
>>> > Sorry for top posting.
>>> > I want to thank you all for the discussion and
>>> > give also some feedback from OpenStack perspective which is affected
>>> > by the problem described here.
>>> >
>>> > In OpenStack, it's kind of common to have a shared external network
>>> > (logical switch with a localnet port) across many tenants. Each tenant
>>> > user may create their own router where their instances will be
>>> > connected to access the external network.
>>> >
>>> > In such scenario, we are hitting the issue described here. In
>>> > particular in our tests we exercise 3K VIFs (with 1 FIP) each spanning
>>> > 300 LS; each LS connected to a LR (ie. 300 LRs) and that router
>>> > connected to the public LS. This is creating a huge problem in terms
>>> > of performance and tons of events due to the MAC_Binding entries
>>> > generated as a consequence of the GARPs sent for the floating IPs.
>>> >
>>>
>>> Just as an addition to this, GARPs wouldn't be the only reason why all
>>> routers would learn the MAC_Binding. Even if we wouldn't be sending
>>> GARPs for the FIPs, when a VM that's behind a FIP would send traffic to
>>> the outside, the router will generate an ARP request for the next hop
>>> using the FIP-IP and FIP-MAC. This will be broadcasted to all routers
>>> connected to the public LS and will trigger them to learn the
>>> FIP-IP:FIP-MAC binding.
>>
>>
>> Yeah we shouldn't be learning on regular ARP requests.
>>
>>>
>>>
>>> > Thanks,
>>> > Daniel
>>> >
>>> >
>>> > On Thu, May 28, 2020 at 10:51 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>> >>
>>> >> On 5/28/20 8:34 AM, Han Zhou wrote:
>>> >>>
>>> >>>
>>> >>> On Wed, May 27, 2020 at 1:10 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>> >>>>
>>> >>>> Hi Girish, Han,
>>> >>>>
>>> >>>> On 5/26/20 11:51 PM, Han Zhou wrote:
>>> >>>>>
>>> >>>>>
>>> >>>>> On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail <gmoodalb...@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Tue, May 26, 2020 at 12:42 PM Han Zhou <zhou...@gmail.com> wrote:
>>> >>>>>>>
>>> >>>>>>> Hi Girish,
>>> >>>>>>>
>>> >>>>>>> Thanks for the summary. I agree with you that GARP request v.s.
>>> >>>>>>> reply is irrelevant to the problem here.
>>> >>>>
>>> >>>> Well, actually I think GARP request vs reply is relevant (at least for
>>> >>>> case 1 below) because if OVN would be generating GARP replies we
>>> >>>> wouldn't need the priority 80 flow to determine if an ARP request packet
>>> >>>> is actually an OVN self originated GARP that needs to be flooded in the
>>> >>>> L2 broadcast domain.
>>> >>>>
>>> >>>> On the other hand, router3 would be learning mac_binding IP2,M2 from the
>>> >>>> GARP reply originated by router2 and vice versa so we'd have to restrict
>>> >>>> flooding of GARP replies to non-patch ports.
>>> >>>>
>>> >>>
>>> >>> Hi Dumitru, the point was that, on the external LS, the GRs will have to
>>> >>> send ARP requests to resolve unknown IPs (at least for the external GW),
>>> >>> and it has to be broadcasted, which will cause all the GRs to learn all
>>> >>> MACs of other GRs. This is regardless of the GARP behavior.
>>> >>> You are right that if we only consider the Join switch then the GARP
>>> >>> request v.s. reply does make a difference. However, GARP request/reply
>>> >>> may be really needed only on the external LS.
>>> >>>
>>> >>
>>> >> Ok, but do you see an easy way to determine if we need to add the
>>> >> logical flows that flood self originated GARP packets on a given logical
>>> >> switch? Right now we add them on all switches.
>>> >>
>>> >>>>>>> Please see my comment inline below.
>>> >>>>>>>
>>> >>>>>>> On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail <gmoodalb...@gmail.com> wrote:
>>> >>>>>>>>
>>> >>>>>>>> Hello Dumitru,
>>> >>>>>>>>
>>> >>>>>>>> There are several things that are being discussed on this thread.
>>> >>>>>>>> Let me see if I can tease them out for clarity.
>>> >>>>>>>>
>>> >>>>>>>> 1. All the router IPs are known to OVN (the join switch case)
>>> >>>>>>>> 2. Some IPs are known and some are not known (the external logical
>>> >>>>>>>>    switch that connects to physical network case).
>>> >>>>>>>>
>>> >>>>>>>> Let us look at each of the case above:
>>> >>>>>>>>
>>> >>>>>>>> 1. Join Switch Case
>>> >>>>>>>>
>>> >>>>>>>> +----------------+        +----------------+
>>> >>>>>>>> |   l3gateway    |        |   l3gateway    |
>>> >>>>>>>> |    router2     |        |    router3     |
>>> >>>>>>>> +-------------+--+        +-+--------------+
>>> >>>>>>>>            IP2,M2         IP3,M3
>>> >>>>>>>>               |             |
>>> >>>>>>>>            +--+-------------+---+
>>> >>>>>>>>            |    join switch     |
>>> >>>>>>>>            +---------+----------+
>>> >>>>>>>>                      |
>>> >>>>>>>>                   IP1,M1
>>> >>>>>>>>              +-------+--------+
>>> >>>>>>>>              |  distributed   |
>>> >>>>>>>>              |     router     |
>>> >>>>>>>>              +----------------+
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> Say, GR router2 wants to send the packet out to DR and that we
>>> >>>>>>>> don't have static mappings of MAC to IP in lr_in_arp_resolve table on GR
>>> >>>>>>>> router2 (with Han's patch of dynamic_neigh_routes=true for all the
>>> >>>>>>>> Gateway Routers). With this in mind, when an ARP request is sent out by
>>> >>>>>>>> router2's hypervisor the packet should be directly sent to the
>>> >>>>>>>> distributed router alone. Your commit 32f5ebb0622 (ovn-northd: Limit
>>> >>>>>>>> ARP/ND broadcast domain whenever possible) should have allowed only
>>> >>>>>>>> unicast. However, in ls_in_l2_lkup table we have
>>> >>>>>>>>
>>> >>>>>>>> table=19(ls_in_l2_lkup ), priority=80 , match=(eth.src == { M2 } && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood"; output;)
>>> >>>>>>>> table=19(ls_in_l2_lkup ), priority=75 , match=(flags[1] == 0 && arp.op == 1 && arp.tpa == { IP1}), action=(outport = "jtor-router2"; output;)
>>> >>>>>>>>
>>> >>>>>>>> As you can see, `priority=80` rule will always be hit and sent out
>>> >>>>>>>> to all the GRs. The `priority=75` rule is never hit. So, we will see ARP
>>> >>>>>>>> packets on the GENEVE tunnel. So, we need to change `priority=80` to
>>> >>>>>>>> match GARP request packets. That way, for the known OVN IPs case we
>>> >>>>>>>> don't do broadcast.
>>> >>>>>>>
>>> >>>>>>> Since the solution to case 2) below (i.e.
>>> >>>>>>> learn_from_arp_request=false) solves the problem of case 1), too, I
>>> >>>>>>> think we don't need this change just for case 1). As @Dumitru Ceara
>>> >>>>>>> mentioned, there is some cost because it adds extra flows. It would be
>>> >>>>>>> a significant amount of flows if there are a lot of dnat_and_snat IPs.
>>> >>>>>>> What do you think?
>>> >>>>
>>> >>>> I think the following might be a solution, although with the cost of
>>> >>>> adding as many flows as dnat_and_snat IPs are configured:
>>> >>>>
>>> >>>> - priority 80: explicitly determine if an ARP request is a self
>>> >>>>   originated GARP for configured IP addresses and dnat_and_snat IPs (by
>>> >>>>   matching on all eth.src and arp.tpa pairs) and if so flood on all
>>> >>>>   non-patch ports.
>>> >>>> - priority 75: if arp.tpa is owned by an OVN logical router port,
>>> >>>>   "unicast" it only on the patch port towards the router.
>>> >>>> - priority 1: flood any broadcast packet.
>>> >>>>
>>> >>>> Together with the learn_from_arp_request=false knob this would cover
>>> >>>> both case 1 (join switch) and case 2 (external switch).
>>> >>>>
>>> >>>> Wdyt?
>>> >>>>
>>> >>> Would the "learn_from_arp_request=false knob" cover both cases? If yes,
>>> >>> we don't need to add more flows of priority 80, or more accurately:
>>> >>> whether to update the priority-80 flows is not directly related to the
>>> >>> current problem.
>>> >>>
>>> >>
>>> >> Yes, it would, except for the fact that the ARP requests would still be
>>> >> flooded to all routers (and ignored at the destination). Which is afaiu
>>> >> what Girish was worried about. In order to address that part too I'm
>>> >> afraid we have to update the priority-80 flows.
>>> >>
>>> >> Regards,
>>> >> Dumitru
>>> >>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Han, yes it will work. However, my only concern is that we would send
>>> >>>>>> all these ARP requests via tunnel to each of 1000 hypervisors and these
>>> >>>>>> hypervisors will just drop them on the floor.
>>> >>>>>> when they see learn_from_arp_request=false.
>>> >>>>>
>>> >>>>> I think maybe it is not a problem since it happens only once on the Join
>>> >>>>> switch. Once the MAC is learned, it won't broadcast again. It may be
>>> >>>>> more of a problem on the external LS if periodical GARP is required
>>> >>>>> there. However, I'd suggest to have some test and see if it is really a
>>> >>>>> problem, before trying to solve it.
>>> >>>>>
>>> >>>>>>
>>> >>>>>> Han, Dumitru,
>>> >>>>>>
>>> >>>>>> Why can't we swap the priorities of the above two flows so that the
>>> >>>>>> ARP request for NextHop IP known to OVN will be always sent via
>>> >>>>>> `unicast`?
>>> >>>>>
>>> >>>>> If swapped, even GARP won't get broadcasted. Maybe that's not the
>>> >>>>> desired behavior.
>>> >>>>>
>>> >>>>
>>> >>>> This is definitely not desired as we'd be hitting the prio 75 flow that
>>> >>>> would send the self originated GARP request (IPx) packet back towards
>>> >>>> the router port that owns IPx.
>>> >>>>
>>> >>>>>>
>>> >>>>>> Regards,
>>> >>>>>> ~Girish
>>> >>>>>>
>>> >>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> 2. External Logical Switch Case
>>> >>>>>>>>
>>> >>>>>>>>                    10.10.10.0/24
>>> >>>>>>>> -------------------------+--------------------------
>>> >>>>>>>>                          |
>>> >>>>>>>>                      localnet
>>> >>>>>>>>                    +-----+-----+
>>> >>>>>>>>                    | external  |
>>> >>>>>>>>       +------------+    LS1    +-------------+
>>> >>>>>>>>       |            +-----+-----+             |
>>> >>>>>>>>       |                  |                   |
>>> >>>>>>>>  10.10.10.2         10.10.10.3          10.10.10.4
>>> >>>>>>>>     SNAT               SNAT                SNAT
>>> >>>>>>>> +-----+-----+      +-----+-----+       +-----------+
>>> >>>>>>>> | l3gateway |      | l3gateway |       | l3gateway |
>>> >>>>>>>> |   node1   |      |   node2   |       |   node3   |
>>> >>>>>>>> +-----------+      +-----------+       +-----------+
>>> >>>>>>>>
>>> >>>>>>>> In this case, we have some of the IPs in OVN and some in the
>>> >>>>>>>> physical network. If we fix (1) above, all the ARP requests for the
>>> >>>>>>>> OVN's router IPs will be unicast.
>>> >>>>>>>> However, all the ARP requests to
>>> >>>>>>>> external IPs, say 10.10.10.1 on the "physical router", will be
>>> >>>>>>>> broadcast. Now, we will see these ARP broadcasts on all the L3 gateway
>>> >>>>>>>> routers. With 'learn_from_arp_request=false' [a], the MAC_Binding
>>> >>>>>>>> table will not explode for both ARP and GARP requests.
>>> >>>>>>>>
>>> >>>>>>>> So, I don't think GARP requests and replies are the issue here?
>>> >>>>>>>> Furthermore, learning from the GARP replies is blocked on certain
>>> >>>>>>>> routers. For example:
>>> >>>>>>>> https://www.juniper.net/documentation/en_US/junose15.1/topics/concept/ip-gratuitous-arps-transmission-overview.html
>>> >>>>>>>> says "By default, updating the ARP cache on GARP replies is disabled on
>>> >>>>>>>> the router.". So, our NAT addresses mapping will not be learnt.
>>> >>>>
>>> >>>> Just as a side note, the above doesn't mean Juniper boxes don't support
>>> >>>> learning from GARP replies, just that they'd need extra configuration. I
>>> >>>> don't necessarily think that's a bad thing if properly documented in OVN
>>> >>>> that we would be generating GARP replies.
>>> >>>>
>>> >>>> Regards,
>>> >>>> Dumitru
>>> >>>>
>>> >>>>>>>>
>>> >>>>>>>> Regards,
>>> >>>>>>>> ~Girish
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> [a] - From Han's mail, the meaning of learn_from_arp_request=false
>>> >>>>>>>> --> if the TPA is on the router, add a new entry (it means the
>>> >>>>>>>> remote wants to communicate with this node, so it makes sense to
>>> >>>>>>>> learn the remote as well). Otherwise, ignore it and no new
>>> >>>>>>>> entry added.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>
>>> >>>>>> --
>>> >>>>>> You received this message because you are subscribed to the Google
>>> >>>>>> Groups "ovn-kubernetes" group.
>>> >>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>> >>>>>> an email to ovn-kubernetes+unsubscr...@googlegroups.com.
>>> >>>>>> To view this discussion on the web visit
>>> >>>>>> https://groups.google.com/d/msgid/ovn-kubernetes/CAAF2STRnem2PeSahuwhro1t%2BQJxchZNC7viq8n-ngM9KU%2B%2B-Xw%40mail.gmail.com .
>>> >>
>>> >> _______________________________________________
>>> >> discuss mailing list
>>> >> disc...@openvswitch.org
>>> >> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
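
For readers following the thread, the learn_from_arp_request semantics proposed above can be modelled roughly as below. This is only an illustrative Python sketch, not OVN code: the Router class, handle_arp_request(), and the returned strings are hypothetical names made up for this example. It also assumes that an existing binding whose MAC changed is updated regardless of the option's value, which the thread states explicitly only for the "false" case.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical stand-in for a gateway router and its MAC_Binding rows."""
    ips: set[str]                                # IPs owned by this router
    learn_from_arp_request: bool = True          # proposed option (default true)
    mac_bindings: dict[str, str] = field(default_factory=dict)  # IP -> MAC


def handle_arp_request(router: Router, spa: str, sha: str, tpa: str) -> str:
    """Decide what to do with the sender (spa, sha) of an ARP request for tpa.

    Returns "added", "updated", "unchanged", or "ignored".
    """
    known_mac = router.mac_bindings.get(spa)

    if known_mac == sha:
        return "unchanged"

    if known_mac is not None:
        # The sender advertises a new MAC for an IP we already track.
        # Assumption: this update happens for both option values.
        router.mac_bindings[spa] = sha
        return "updated"

    # Unknown sender: learn it if learning is enabled, or if the request
    # targets one of this router's own IPs (the remote wants to talk to us).
    if router.learn_from_arp_request or tpa in router.ips:
        router.mac_bindings[spa] = sha
        return "added"

    return "ignored"
```

With learn_from_arp_request=False, a broadcast ARP request from another gateway router for the physical router's IP would fall through to "ignored", so no MAC_Binding row is created for it; a request that targets one of the router's own IPs is still learned.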