Hi Girish, Venu,

I sent an RFC patch series for the solution discussed. Could you give it a try when you get the chance?
Thanks,
Han

On Tue, Jun 9, 2020 at 10:04 AM Han Zhou <zhou...@gmail.com> wrote:
>
> On Tue, Jun 9, 2020 at 9:06 AM Venugopal Iyer <venugop...@nvidia.com> wrote:
>
>> Sorry for the delay, Han, a quick question below:
>>
>> From: ovn-kuberne...@googlegroups.com <ovn-kuberne...@googlegroups.com> On Behalf Of Han Zhou
>> Sent: Wednesday, June 3, 2020 4:27 PM
>> To: Girish Moodalbail <gmoodalb...@gmail.com>
>> Cc: Tim Rozet <tro...@redhat.com>; Dumitru Ceara <dce...@redhat.com>; Daniel Alvarez Sanchez <dalva...@redhat.com>; Dan Winship <danwins...@redhat.com>; ovn-kuberne...@googlegroups.com; ovs-discuss <ovs-discuss@openvswitch.org>; Michael Cambria <mcamb...@redhat.com>; Venugopal Iyer <venugop...@nvidia.com>
>> Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table
>>
>> Hi Girish, yes, that's what we concluded in last OVN meeting, but sorry that I forgot to update here.
>>
>> On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail <gmoodalb...@gmail.com> wrote:
>> >
>> > Hello all,
>> >
>> > To kind of proceed with the proposed fixes, with minimal impact, is the following a reasonable approach?
>> >
>> > Add an option, namely dynamic_neigh_routes={true|false}, for a gateway router. With this option enabled, the nextHop IP's MAC will be learned through a ARP request on the physical network. The ARP request will be flooded on the L2 broadcast domain (for both join switch and external switch).
>> >
>>
>> The RFC patch fulfils this purpose: https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
>>
>> I am working on the formal patch.
>>
>> > Add an option, namely learn_from_arp_request={true|false}, for a gateway router.
>> > The option is interpreted as below:
>> > "true" - learn the MAC/IP binding and add a new MAC_Binding entry (default behavior)
>> > "false" - if there is a MAC_binding for that IP and the MAC is different, then update that MAC/IP binding. The external entity might be trying to advertise the new MAC for that IP. (If we don't do this, then we will never learn External VIP to MAC changes)
>> >
>> > (Irrespective of whether learn_from_arp_request is true or false, always do this -- if the TPA is on the router, add a new entry (it means the remote wants to communicate with this node, so it makes sense to learn the remote as well))
>> >
>>
>> I am working on this as well, but delayed a little. I hope to have something this week.
>>
>> [vi> ] Just wanted to check if this should be learn_From_unsolicit_arp (unsolicited ARP request or reply) instead of learn_from_arp_request? This is just to protect from potential rogue usage of GARP reply flooding the MAC bindings?
>>
>
> Hi Venu, as discussed earlier in this thread it is hard to check if it is GARP in OVN from the router ingress pipeline. The proposal here cares about ARP request only. It seems the best option so far.
>
>> Thanks,
>>
>> -venu
>>
>> >
>> > For now, I think it is fine for ARP packets to be broadcasted on the tunnel for the `join` switch case. If it becomes a problem, then we can start looking around changing the logical flows.
>> >
>> > Thanks everyone for the lively discussion.
>> >
>> > Regards,
>> > ~Girish
>> >
>> > On Thu, May 28, 2020 at 7:33 AM Tim Rozet <tro...@redhat.com> wrote:
>> >>
>> >> On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara <dce...@redhat.com> wrote:
>> >>>
>> >>> On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
>> >>> > Hi all
>> >>> >
>> >>> > Sorry for top posting.
>> >>> > I want to thank you all for the discussion and give also some feedback from OpenStack perspective which is affected by the problem described here.
>> >>> >
>> >>> > In OpenStack, it's kind of common to have a shared external network (logical switch with a localnet port) across many tenants. Each tenant user may create their own router where their instances will be connected to access the external network.
>> >>> >
>> >>> > In such scenario, we are hitting the issue described here. In particular in our tests we exercise 3K VIFs (with 1 FIP) each spanning 300 LS; each LS connected to a LR (ie. 300 LRs) and that router connected to the public LS. This is creating a huge problem in terms of performance and tons of events due to the MAC_Binding entries generated as a consequence of the GARPs sent for the floating IPs.
>> >>> >
>> >>>
>> >>> Just as an addition to this, GARPs wouldn't be the only reason why all routers would learn the MAC_Binding. Even if we wouldn't be sending GARPs for the FIPs, when a VM that's behind a FIP would send traffic to the outside, the router will generate an ARP request for the next hop using the FIP-IP and FIP-MAC. This will be broadcasted to all routers connected to the public LS and will trigger them to learn the FIP-IP:FIP-MAC binding.
>> >>
>> >> Yeah we shouldn't be learning on regular ARP requests.
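
To make the learn_from_arp_request semantics proposed in this thread concrete, here is a minimal Python sketch of the learning decision. It is a toy model under stated assumptions, not OVN code: the GatewayRouter class, its fields and on_arp_request() are hypothetical names, and the MAC_Binding table is reduced to a plain dict keyed by IP.

```python
from dataclasses import dataclass, field


@dataclass
class GatewayRouter:
    """Toy model of one gateway router port (hypothetical, not OVN code)."""
    own_ips: set
    learn_from_arp_request: bool = True          # "true" is the default behavior
    mac_binding: dict = field(default_factory=dict)  # IP -> MAC

    def on_arp_request(self, spa, sha, tpa):
        """Apply the proposed rule to one received ARP request.

        spa/sha: sender IP/MAC, tpa: target IP.
        Returns True if the binding table changed.
        """
        should_learn = (
            self.learn_from_arp_request   # "true": always learn
            or tpa in self.own_ips        # request is addressed to this router
            or spa in self.mac_binding    # "false": only refresh a known IP
        )
        if not should_learn:
            return False
        changed = self.mac_binding.get(spa) != sha
        self.mac_binding[spa] = sha
        return changed


def broadcast(routers, spa, sha, tpa):
    """Deliver one broadcast ARP request to every router; count table changes."""
    return sum(r.on_arp_request(spa, sha, tpa) for r in routers)


if __name__ == "__main__":
    # 300 routers on a shared external switch, none of them owning the target IP.
    for knob in (True, False):
        routers = [GatewayRouter({f"10.0.{i}.1"}, knob) for i in range(300)]
        learned = broadcast(routers, "10.10.10.2", "aa:bb:cc:00:00:02", "10.10.10.1")
        print(f"learn_from_arp_request={knob}: {learned} new bindings")
```

With the knob set to false, a broadcast ARP request for an external next hop changes nothing on routers that do not own the target IP, which is the behavior the thread relies on to stop the MAC_Binding growth.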
>> >>
>> >>> > Thanks,
>> >>> > Daniel
>> >>> >
>> >>> > On Thu, May 28, 2020 at 10:51 AM Dumitru Ceara <dce...@redhat.com> wrote:
>> >>> >>
>> >>> >> On 5/28/20 8:34 AM, Han Zhou wrote:
>> >>> >>>
>> >>> >>> On Wed, May 27, 2020 at 1:10 AM Dumitru Ceara <dce...@redhat.com> wrote:
>> >>> >>>>
>> >>> >>>> Hi Girish, Han,
>> >>> >>>>
>> >>> >>>> On 5/26/20 11:51 PM, Han Zhou wrote:
>> >>> >>>>>
>> >>> >>>>> On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail <gmoodalb...@gmail.com> wrote:
>> >>> >>>>>>
>> >>> >>>>>> On Tue, May 26, 2020 at 12:42 PM Han Zhou <zhou...@gmail.com> wrote:
>> >>> >>>>>>>
>> >>> >>>>>>> Hi Girish,
>> >>> >>>>>>>
>> >>> >>>>>>> Thanks for the summary. I agree with you that GARP request v.s. reply is irrelevant to the problem here.
>> >>> >>>>
>> >>> >>>> Well, actually I think GARP request vs reply is relevant (at least for case 1 below) because if OVN would be generating GARP replies we wouldn't need the priority 80 flow to determine if an ARP request packet is actually an OVN self originated GARP that needs to be flooded in the L2 broadcast domain.
>> >>> >>>>
>> >>> >>>> On the other hand, router3 would be learning mac_binding IP2,M2 from the GARP reply originated by router2 and vice versa so we'd have to restrict flooding of GARP replies to non-patch ports.
>> >>> >>>>
>> >>> >>>
>> >>> >>> Hi Dumitru, the point was that, on the external LS, the GRs will have to send ARP requests to resolve unknown IPs (at least for the external GW), and it has to be broadcasted, which will cause all the GRs learn all MACs of other GRs. This is regardless of the GARP behavior. You are right that if we only consider the Join switch then the GARP request v.s. reply does make a difference. However, GARP request/reply may be really needed only on the external LS.
>> >>> >>>
>> >>> >>
>> >>> >> Ok, but do you see an easy way to determine if we need to add the logical flows that flood self originated GARP packets on a given logical switch? Right now we add them on all switches.
>> >>> >>
>> >>> >>>>>>> Please see my comment inline below.
>> >>> >>>>>>>
>> >>> >>>>>>> On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail <gmoodalb...@gmail.com> wrote:
>> >>> >>>>>>>>
>> >>> >>>>>>>> Hello Dumitru,
>> >>> >>>>>>>>
>> >>> >>>>>>>> There are several things that are being discussed on this thread. Let me see if I can tease them out for clarity.
>> >>> >>>>>>>>
>> >>> >>>>>>>> 1. All the router IPs are known to OVN (the join switch case)
>> >>> >>>>>>>> 2. Some IPs are known and some are not known (the external logical switch that connects to physical network case).
>> >>> >>>>>>>>
>> >>> >>>>>>>> Let us look at each of the case above:
>> >>> >>>>>>>>
>> >>> >>>>>>>> 1. Join Switch Case
>> >>> >>>>>>>>
>> >>> >>>>>>>> +----------------+        +----------------+
>> >>> >>>>>>>> |   l3gateway    |        |   l3gateway    |
>> >>> >>>>>>>> |    router2     |        |    router3     |
>> >>> >>>>>>>> +-------------+--+        +-+--------------+
>> >>> >>>>>>>>            IP2,M2         IP3,M3
>> >>> >>>>>>>>               |             |
>> >>> >>>>>>>>            +--+-------------+---+
>> >>> >>>>>>>>            |    join switch     |
>> >>> >>>>>>>>            +---------+----------+
>> >>> >>>>>>>>                      |
>> >>> >>>>>>>>                   IP1,M1
>> >>> >>>>>>>>              +-------+--------+
>> >>> >>>>>>>>              |  distributed   |
>> >>> >>>>>>>>              |     router     |
>> >>> >>>>>>>>              +----------------+
>> >>> >>>>>>>>
>> >>> >>>>>>>> Say, GR router2 wants to send the packet out to DR and that we don't have static mappings of MAC to IP in lr_in_arp_resolve table on GR router2 (with Han's patch of dynamic_neigh_routes=true for all the Gateway Routers). With this in mind, when an ARP request is sent out by router2's hypervisor the packet should be directly sent to the distributed router alone. Your commit 32f5ebb0622 (ovn-northd: Limit ARP/ND broadcast domain whenever possible) should have allowed only unicast. However, in ls_in_l2_lkup table we have
>> >>> >>>>>>>>
>> >>> >>>>>>>> table=19(ls_in_l2_lkup ), priority=80 , match=(eth.src == { M2 } && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood"; output;)
>> >>> >>>>>>>> table=19(ls_in_l2_lkup ), priority=75 , match=(flags[1] == 0 && arp.op == 1 && arp.tpa == { IP1}), action=(outport = "jtor-router2"; output;)
>> >>> >>>>>>>>
>> >>> >>>>>>>> As you can see, `priority=80` rule will always be hit and sent out to all the GRs. The `priority=75` rule is never hit. So, we will see ARP packets on the GENEVE tunnel. So, we need to change `priority=80` to match GARP request packets.
>> >>> >>>>>>>> That way, for the known OVN IPs case we don't do broadcast.
>> >>> >>>>>>>
>> >>> >>>>>>> Since the solution to case 2) below (i.e. learn_from_arp_request=false) solves the problem of case 1), too, I think we don't need this change just for case 1). As @Dumitru Ceara mentioned, there is some cost because it adds extra flows. It would be significant amount of flows if there are a lot of snat_and_dnat IPs. What do you think?
>> >>> >>>>
>> >>> >>>> I think the following might be a solution, although with the cost of adding as many flows as dnat_and_snat IPs are configured:
>> >>> >>>>
>> >>> >>>> - priority 80: explicitly determine if an ARP request is a self originated GARP for configured IP addresses and dnat_and_snat IPs (by matching on all eth.src and arp.tpa pairs) and if so flood on all non-patch ports.
>> >>> >>>> - priority 75: if arp.tpa is owned by an OVN logical router port, "unicast" it only on the patch port towards the router.
>> >>> >>>> - priority 1: flood any broadcast packet.
>> >>> >>>>
>> >>> >>>> Together with the learn_from_arp_request=false knob this would cover both case 1 (join switch) and case 2 (external switch).
>> >>> >>>>
>> >>> >>>> Wdyt?
>> >>> >>>>
>> >>> >>> Would the "learn_from_arp_request=false knob" cover both cases? If yes, we don't need to add more flows of priority 80, or more accurately: whether to update the priority-80 flows is not directly related to the current problem.
>> >>> >>>
>> >>> >>
>> >>> >> Yes, it would, except for the fact that the ARP requests would still be flooded to all routers (and ignored at the destination). Which is afaiu what Girish was worried about. In order to address that part too I'm afraid we have to update the priority-80 flows.
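
The three-priority lookup proposed above (80: self originated GARP, 75: known router IP, 1: flood) can be sketched roughly as follows. This is only an illustration of the matching order, assuming packets and router ports are plain Python dicts; l2_lookup(), the "jtor-dr" port name and the FLOOD_NON_PATCH label are made up for the example and do not correspond to real OVN flows or identifiers.

```python
FLOOD_ALL = "_MC_flood"              # multicast group name as shown in the thread
FLOOD_NON_PATCH = "flood_non_patch"  # hypothetical label: all non-patch ports


def l2_lookup(pkt, router_ports):
    """Pick an output for a packet on the logical switch.

    pkt: dict with "eth_src", "arp_op" and "arp_tpa" keys.
    router_ports: dict mapping patch port name -> (MAC, IP) of the router port
    behind it. Rules are checked from highest to lowest priority.
    """
    is_arp_request = pkt.get("arp_op") == 1

    # priority 80: self originated GARP, i.e. an ARP request whose source MAC
    # and target IP both belong to the same router port.
    if is_arp_request and any(
        pkt["eth_src"] == mac and pkt["arp_tpa"] == ip
        for mac, ip in router_ports.values()
    ):
        return FLOOD_NON_PATCH

    # priority 75: target IP is owned by an OVN router port, so send the
    # request only to the patch port of that router.
    if is_arp_request:
        for port, (_mac, ip) in router_ports.items():
            if pkt["arp_tpa"] == ip:
                return port

    # priority 1: anything else that is broadcast gets flooded.
    return FLOOD_ALL


if __name__ == "__main__":
    ports = {"jtor-dr": ("M1", "IP1"), "jtor-router2": ("M2", "IP2")}
    for tpa in ("IP2", "IP1", "IP9"):
        pkt = {"eth_src": "M2", "arp_op": 1, "arp_tpa": tpa}
        print(tpa, "->", l2_lookup(pkt, ports))
```

Because the priority-80 rule here matches on the (eth.src, arp.tpa) pair rather than on eth.src alone, an ordinary ARP request from router2 for IP1 falls through to the priority-75 rule instead of being flooded, at the cost of one match per configured address.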
>> >>> >>
>> >>> >> Regards,
>> >>> >> Dumitru
>> >>> >>
>> >>> >>>>>>
>> >>> >>>>>> Han, yes it will work. However, my only concern is that we would send all these ARP requests via tunnel to each of 1000 hypervisors and these hypervisors will just drop them on the floor when they see learn_from_arp_request=false.
>> >>> >>>>>
>> >>> >>>>> I think maybe it is not a problem since it happens only once on the Join switch. Once the MAC is learned, it won't broadcast again. It may be more of a problem on the external LS if periodical GARP is required there. However, I'd suggest to have some test and see if it is really a problem, before trying to solve it.
>> >>> >>>>>
>> >>> >>>>>>
>> >>> >>>>>> Han, Dumitru,
>> >>> >>>>>>
>> >>> >>>>>> Why can't we swap the priorities of the above two flows so that the ARP request for NextHop IP known to OVN will be always sent via `unicast`?
>> >>> >>>>>
>> >>> >>>>> If swapped, even GARP won't get broadcasted. Maybe that's not the desired behavior.
>> >>> >>>>>
>> >>> >>>>
>> >>> >>>> This is definitely not desired as we'd be hitting the prio 75 flow that would send the self originated GARP request (IPx) packet back towards the router port that owns IPx.
>> >>> >>>>
>> >>> >>>>>>
>> >>> >>>>>> Regards,
>> >>> >>>>>> ~Girish
>> >>> >>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>> 2. External Logical Switch Case
>> >>> >>>>>>>>
>> >>> >>>>>>>>                    10.10.10.0/24
>> >>> >>>>>>>> -------------------------+--------------------------
>> >>> >>>>>>>>                          |
>> >>> >>>>>>>>                      localnet
>> >>> >>>>>>>>                    +-----+-----+
>> >>> >>>>>>>>                    | external  |
>> >>> >>>>>>>>       +------------+    LS1    +-------------+
>> >>> >>>>>>>>       |            +-----+-----+             |
>> >>> >>>>>>>>       |                  |                   |
>> >>> >>>>>>>>  10.10.10.2         10.10.10.3          10.10.10.4
>> >>> >>>>>>>>     SNAT               SNAT                SNAT
>> >>> >>>>>>>> +-----+-----+      +-----+-----+       +-----------+
>> >>> >>>>>>>> | l3gateway |      | l3gateway |       | l3gateway |
>> >>> >>>>>>>> |   node1   |      |   node2   |       |   node3   |
>> >>> >>>>>>>> +-----------+      +-----------+       +-----------+
>> >>> >>>>>>>>
>> >>> >>>>>>>> In this case, we have some of the IPs in OVN and some in the physical network. If we fix (1) above, all the ARP requests for the OVN's router IPs will be unicast. However, all the ARP requests to external IPs, say 10.10.10.1 on the "physical router", will be broadcast. Now, we will see these ARP broadcasts on all the L3 gateway routers. With 'learn_from_arp_request=false' [a], then the MAC_Binding table will not explode for both ARP and GARP requests.
>> >>> >>>>>>>>
>> >>> >>>>>>>> So, I don't think GARP requests and replies is the issue here? Furthermore, learning from the GARP replies are blocked on certain routers. For example: https://www.juniper.net/documentation/en_US/junose15.1/topics/concept/ip-gratuitous-arps-transmission-overview.html says "By default, updating the ARP cache on GARP replies is disabled on the router.". So, our NAT addresses mapping will not be learnt.
>> >>> >>>>
>> >>> >>>> Just as a side note, the above doesn't mean Juniper boxes don't support learning from GARP replies, just that they'd need extra configuration. I don't necessarily think that's a bad thing if properly documented in OVN that we would be generating GARP replies.
>> >>> >>>>
>> >>> >>>> Regards,
>> >>> >>>> Dumitru
>> >>> >>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>> Regards,
>> >>> >>>>>>>> ~Girish
>> >>> >>>>>>>>
>> >>> >>>>>>>> [a] - From Han's mail, the meaning of learn_from_arp_request=false --> if the TPA is on the router, add a new entry (it means the remote wants to communicate with this node, so it makes sense to learn the remote as well). Otherwise, ignore it and no new entry added.
_______________________________________________ discuss mailing list disc...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-discuss