Hi Girish, Venu,

I sent an RFC patch series for the solution we discussed. Could you give it a
try when you get the chance?

Thanks,
Han

On Tue, Jun 9, 2020 at 10:04 AM Han Zhou <zhou...@gmail.com> wrote:

>
>
> On Tue, Jun 9, 2020 at 9:06 AM Venugopal Iyer <venugop...@nvidia.com>
> wrote:
>
>> Sorry for the delay, Han, a quick question below:
>>
>>
>>
>> *From:* ovn-kuberne...@googlegroups.com <ovn-kuberne...@googlegroups.com>
>> *On Behalf Of *Han Zhou
>> *Sent:* Wednesday, June 3, 2020 4:27 PM
>> *To:* Girish Moodalbail <gmoodalb...@gmail.com>
>> *Cc:* Tim Rozet <tro...@redhat.com>; Dumitru Ceara <dce...@redhat.com>;
>> Daniel Alvarez Sanchez <dalva...@redhat.com>; Dan Winship <
>> danwins...@redhat.com>; ovn-kuberne...@googlegroups.com; ovs-discuss <
>> ovs-discuss@openvswitch.org>; Michael Cambria <mcamb...@redhat.com>;
>> Venugopal Iyer <venugop...@nvidia.com>
>> *Subject:* Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve
>> table
>>
>>
>>
>>
>> Hi Girish, yes, that's what we concluded in the last OVN meeting; sorry
>> that I forgot to update here.
>>
>>
>> On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail <gmoodalb...@gmail.com>
>> wrote:
>> >
>> > Hello all,
>> >
>> > To kind of proceed with the proposed fixes, with minimal impact, is the
>> > following a reasonable approach?
>> >
>> > Add an option, namely dynamic_neigh_routes={true|false}, for a gateway
>> > router. With this option enabled, the nextHop IP's MAC will be learned
>> > through an ARP request on the physical network. The ARP request will be
>> > flooded on the L2 broadcast domain (for both the join switch and the
>> > external switch).
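The model below gives a rough sense of why preinstalled neighbor flows in lr_in_arp_resolve grow so quickly, and what dynamic resolution changes. Both functions are hypothetical simplifications, not OVN code. They assume that each router on a shared switch gets one static flow for every other router-port IP on that switch. They also assume that with dynamic_neigh_routes=true, a router ends up with MAC_Binding entries only for the next hops it actually ARPs for.

```python
def static_arp_resolve_flows(num_routers: int) -> int:
    """Hypothetical model: every router attached to a shared switch gets one
    static lr_in_arp_resolve flow per *other* router-port IP on that switch."""
    return num_routers * (num_routers - 1)


def dynamic_bindings(num_routers: int, next_hops_per_router: int = 1) -> int:
    """Hypothetical model: with dynamic resolution, a router only learns
    MAC_Binding entries for the next hops it actually sends ARP requests for."""
    return num_routers * next_hops_per_router


if __name__ == "__main__":
    # 1000 gateway routers plus the distributed router on one join switch.
    print(static_arp_resolve_flows(1001))  # 1001000
    # Each gateway router only needs to resolve the distributed router.
    print(dynamic_bindings(1000))          # 1000
```

The static count grows with the square of the number of routers on the switch. Under these assumptions, the dynamic count grows only with the number of next hops each router actually uses.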
>>
>> >
>>
>>
>>
>> The RFC patch fulfils this purpose:
>> https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
>>
>> I am working on the formal patch.
>>
>>
>>
>> > Add an option, namely learn_from_arp_request={true|false}, for a
>> > gateway router. The option is interpreted as below:
>> > "true" - learn the MAC/IP binding and add a new MAC_Binding entry
>> > (default behavior)
>> > "false" - if there is a MAC_Binding for that IP and the MAC is
>> > different, then update that MAC/IP binding. The external entity might be
>> > trying to advertise the new MAC for that IP. (If we don't do this, then we
>> > will never learn external VIP to MAC changes.)
>> >
>> > (Irrespective of whether learn_from_arp_request is true or false, always
>> > do this -- if the TPA is on the router, add a new entry (it means the
>> > remote wants to communicate with this node, so it makes sense to learn
>> > the remote as well))
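The proposed learn_from_arp_request semantics can be sketched as a small decision function. This is a hypothetical model of one gateway router's MAC_Binding view, written under the rules listed above. It is not the ovn-controller implementation, and the class and method names are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class GatewayRouter:
    """Hypothetical model of a gateway router's MAC_Binding table."""
    own_ips: Set[str]
    learn_from_arp_request: bool = True
    mac_bindings: Dict[str, str] = field(default_factory=dict)

    def handle_arp_request(self, spa: str, sha: str, tpa: str) -> None:
        """Apply the proposed rules to an incoming ARP request.

        spa/sha are the sender IP/MAC; tpa is the target IP.
        """
        if self.learn_from_arp_request or tpa in self.own_ips:
            # Default behavior, or the remote is talking to us: always learn.
            self.mac_bindings[spa] = sha
        elif spa in self.mac_bindings and self.mac_bindings[spa] != sha:
            # Knob is off: only refresh an existing entry whose MAC changed
            # (e.g. an external VIP moved to a new MAC).
            self.mac_bindings[spa] = sha
        # Otherwise ignore the request and add no new entry.


gr = GatewayRouter(own_ips={"10.10.10.2"}, learn_from_arp_request=False)
# Another GR ARPs for the physical router: not aimed at us, nothing learned.
gr.handle_arp_request(spa="10.10.10.3", sha="0a:00:00:00:00:03", tpa="10.10.10.1")
# The physical router ARPs for our IP: always learned.
gr.handle_arp_request(spa="10.10.10.1", sha="0a:00:00:00:00:01", tpa="10.10.10.2")
# A GARP-style request moving 10.10.10.1 to a new MAC: existing entry refreshed.
gr.handle_arp_request(spa="10.10.10.1", sha="0a:00:00:00:00:99", tpa="10.10.10.1")
print(gr.mac_bindings)  # {'10.10.10.1': '0a:00:00:00:00:99'}
```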
>>
>> >
>>
>>
>>
>> I am working on this as well, but it got delayed a little. I hope to have
>> something this week.
>>
>> *[vi> ] Just wanted to check if this should be learn_From_unsolicit_arp
>> (unsolicited ARP request or reply) instead of learn_from_arp_request? This
>> is just to protect against potential rogue use of GARP replies flooding
>> the MAC bindings.*
>>
>>
>>
>
> Hi Venu, as discussed earlier in this thread, it is hard to tell whether a
> packet is a GARP from the OVN router ingress pipeline. The proposal here
> cares about ARP requests only. It seems the best option so far.
>
>
>> *Thanks,*
>>
>>
>>
>> *-venu*
>>
>>
>>
>> >
>> > For now, I think it is fine for ARP packets to be broadcast on the
>> > tunnel for the `join` switch case. If it becomes a problem, then we can
>> > start looking at changing the logical flows.
>> >
>> > Thanks everyone for the lively discussion.
>> >
>> > Regards,
>> > ~Girish
>> >
>> > On Thu, May 28, 2020 at 7:33 AM Tim Rozet <tro...@redhat.com> wrote:
>> >>
>> >>
>> >>
>> >> On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara <dce...@redhat.com>
>> >> wrote:
>> >>>
>> >>> On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
>> >>> > Hi all
>> >>> >
>> >>> > Sorry for top posting. I want to thank you all for the discussion
>> >>> > and also give some feedback from the OpenStack perspective, which is
>> >>> > affected by the problem described here.
>> >>> >
>> >>> > In OpenStack, it's fairly common to have a shared external network
>> >>> > (logical switch with a localnet port) across many tenants. Each
>> >>> > tenant user may create their own router, to which their instances
>> >>> > are connected in order to access the external network.
>> >>> >
>> >>> > In such a scenario, we are hitting the issue described here. In
>> >>> > particular, in our tests we exercise 3K VIFs (each with 1 FIP) spread
>> >>> > across 300 LSes; each LS is connected to an LR (i.e. 300 LRs), and
>> >>> > each of those routers is connected to the public LS. This is creating
>> >>> > a huge problem in terms of performance and tons of events due to the
>> >>> > MAC_Binding entries generated as a consequence of the GARPs sent for
>> >>> > the floating IPs.
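Here is some rough arithmetic for the scale above. It assumes, and this is our assumption rather than something stated in the thread, that each FIP belongs to exactly one router. It also assumes that every other router on the public LS learns that FIP's binding from its GARP.

```python
def mac_bindings_from_garps(num_fips: int, num_routers: int) -> int:
    """Worst-case MAC_Binding rows when every non-owning router learns each FIP.

    Assumes each FIP is owned by exactly one router and that all routers share
    the same public logical switch, so each GARP reaches every other router.
    """
    return num_fips * (num_routers - 1)


print(mac_bindings_from_garps(3000, 300))  # 897000
```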
>> >>> >
>> >>>
>> >>> Just as an addition to this, GARPs wouldn't be the only reason why all
>> >>> routers would learn the MAC_Binding. Even if we weren't sending GARPs
>> >>> for the FIPs, when a VM behind a FIP sends traffic to the outside, the
>> >>> router will generate an ARP request for the next hop using the FIP-IP
>> >>> and FIP-MAC. This will be broadcast to all routers connected to the
>> >>> public LS and will trigger them to learn the FIP-IP:FIP-MAC binding.
>> >>
>> >>
>> >> Yeah we shouldn't be learning on regular ARP requests.
>> >>
>> >>>
>> >>>
>> >>> > Thanks,
>> >>> > Daniel
>> >>> >
>> >>> >
>> >>> > On Thu, May 28, 2020 at 10:51 AM Dumitru Ceara <dce...@redhat.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> On 5/28/20 8:34 AM, Han Zhou wrote:
>> >>> >>>
>> >>> >>>
>> >>> >>> On Wed, May 27, 2020 at 1:10 AM Dumitru Ceara <dce...@redhat.com>
>> >>> >>> wrote:
>> >>> >>>>
>> >>> >>>> Hi Girish, Han,
>> >>> >>>>
>> >>> >>>> On 5/26/20 11:51 PM, Han Zhou wrote:
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>> On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail
>> >>> >>>>> <gmoodalb...@gmail.com> wrote:
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> On Tue, May 26, 2020 at 12:42 PM Han Zhou <zhou...@gmail.com>
>> >>> >>>>>> wrote:
>> >>> >>>>>>>
>> >>> >>>>>>> Hi Girish,
>> >>> >>>>>>>
>> >>> >>>>>>> Thanks for the summary. I agree with you that GARP request vs.
>> >>> >>>>>>> reply is irrelevant to the problem here.
>> >>> >>>>
>> >>> >>>> Well, actually I think GARP request vs. reply is relevant (at least
>> >>> >>>> for case 1 below), because if OVN generated GARP replies we wouldn't
>> >>> >>>> need the priority 80 flow to determine whether an ARP request packet
>> >>> >>>> is actually an OVN self-originated GARP that needs to be flooded in
>> >>> >>>> the L2 broadcast domain.
>> >>> >>>>
>> >>> >>>> On the other hand, router3 would be learning mac_binding IP2,M2 from
>> >>> >>>> the GARP reply originated by router2 and vice versa, so we'd have to
>> >>> >>>> restrict flooding of GARP replies to non-patch ports.
>> >>> >>>>
>> >>> >>>
>> >>> >>> Hi Dumitru, the point was that, on the external LS, the GRs will
>> >>> >>> have to send ARP requests to resolve unknown IPs (at least for the
>> >>> >>> external GW), and those have to be broadcast, which will cause all
>> >>> >>> the GRs to learn the MACs of all other GRs. This is regardless of the
>> >>> >>> GARP behavior. You are right that if we only consider the join switch
>> >>> >>> then GARP request vs. reply does make a difference. However, GARP
>> >>> >>> requests/replies may really be needed only on the external LS.
>> >>> >>>
>> >>> >>
>> >>> >> Ok, but do you see an easy way to determine whether we need to add
>> >>> >> the logical flows that flood self-originated GARP packets on a given
>> >>> >> logical switch? Right now we add them on all switches.
>> >>> >>
>> >>> >>>>>>> Please see my comment inline below.
>> >>> >>>>>>>
>> >>> >>>>>>> On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail
>> >>> >>>>>>> <gmoodalb...@gmail.com> wrote:
>> >>> >>>>>>>>
>> >>> >>>>>>>> Hello Dumitru,
>> >>> >>>>>>>>
>> >>> >>>>>>>> There are several things that are being discussed on this
>> >>> >>>>>>>> thread. Let me see if I can tease them out for clarity.
>> >>> >>>>>>>>
>> >>> >>>>>>>> 1. All the router IPs are known to OVN (the join switch case)
>> >>> >>>>>>>> 2. Some IPs are known and some are not known (the external
>> >>> >>>>>>>> logical switch that connects to the physical network case).
>> >>> >>>>>>>>
>> >>> >>>>>>>> Let us look at each of the case above:
>> >>> >>>>>>>>
>> >>> >>>>>>>> 1. Join Switch Case
>> >>> >>>>>>>>
>> >>> >>>>>>>> +----------------+        +----------------+
>> >>> >>>>>>>> |   l3gateway    |        |   l3gateway    |
>> >>> >>>>>>>> |    router2     |        |    router3     |
>> >>> >>>>>>>> +-------------+--+        +-+--------------+
>> >>> >>>>>>>>             IP2,M2         IP3,M3
>> >>> >>>>>>>>               |             |
>> >>> >>>>>>>>            +--+-------------+---+
>> >>> >>>>>>>>            |    join switch     |
>> >>> >>>>>>>>            +---------+----------+
>> >>> >>>>>>>>                      |
>> >>> >>>>>>>>                   IP1,M1
>> >>> >>>>>>>>              +-------+--------+
>> >>> >>>>>>>>              |  distributed   |
>> >>> >>>>>>>>              |     router     |
>> >>> >>>>>>>>              +----------------+
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>> Say GR router2 wants to send a packet out to the DR, and we
>> >>> >>>>>>>> don't have static MAC-to-IP mappings in the lr_in_arp_resolve
>> >>> >>>>>>>> table on GR router2 (with Han's patch of
>> >>> >>>>>>>> dynamic_neigh_routes=true for all the Gateway Routers). With
>> >>> >>>>>>>> this in mind, when an ARP request is sent out by router2's
>> >>> >>>>>>>> hypervisor, the packet should be sent directly to the
>> >>> >>>>>>>> distributed router alone. Your commit 32f5ebb0622 (ovn-northd:
>> >>> >>>>>>>> Limit ARP/ND broadcast domain whenever possible) should have
>> >>> >>>>>>>> allowed only unicast. However, in the ls_in_l2_lkup table we have
>> >>> >>>>>>>>
>> >>> >>>>>>>>   table=19(ls_in_l2_lkup      ), priority=80   , match=(eth.src == { M2 } && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood"; output;)
>> >>> >>>>>>>>   table=19(ls_in_l2_lkup      ), priority=75   , match=(flags[1] == 0 && arp.op == 1 && arp.tpa == { IP1}), action=(outport = "jtor-router2"; output;)
>> >>> >>>>>>>>
>> >>> >>>>>>>> As you can see, the `priority=80` rule will always be hit and
>> >>> >>>>>>>> the packet will be sent out to all the GRs. The `priority=75`
>> >>> >>>>>>>> rule is never hit. So, we will see ARP packets on the GENEVE
>> >>> >>>>>>>> tunnel. So, we need to change `priority=80` to match only GARP
>> >>> >>>>>>>> request packets. That way, for the known OVN IPs case we don't
>> >>> >>>>>>>> broadcast.
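To see why the priority-75 flow is shadowed, we can model the two ls_in_l2_lkup flows quoted above as a highest-priority-first match. This is a simplified, hypothetical model: it drops the `flags[1]` and `nd_ns` terms and uses symbolic addresses. It is not how ovs-vswitchd actually classifies packets.

```python
from typing import Callable, List, NamedTuple, Optional


class Packet(NamedTuple):
    eth_src: str
    arp_op: int   # 1 = request, 2 = reply
    arp_tpa: str


class Flow(NamedTuple):
    priority: int
    match: Callable[[Packet], bool]
    outport: str


def lookup(flows: List[Flow], pkt: Packet) -> Optional[str]:
    """Return the outport of the highest-priority flow that matches pkt."""
    for flow in sorted(flows, key=lambda f: f.priority, reverse=True):
        if flow.match(pkt):
            return flow.outport
    return None


M2, IP1 = "M2", "IP1"
flows = [
    # priority 80: any ARP request sourced from router2's MAC is flooded.
    Flow(80, lambda p: p.eth_src == M2 and p.arp_op == 1, "_MC_flood"),
    # priority 75: an ARP request for IP1 goes to a single patch port.
    Flow(75, lambda p: p.arp_op == 1 and p.arp_tpa == IP1, "jtor-router2"),
]

# router2 resolving the DR's IP1: the priority-80 flow wins, so it is flooded.
print(lookup(flows, Packet(eth_src=M2, arp_op=1, arp_tpa=IP1)))  # _MC_flood
```

Every ARP request from M2 satisfies the priority-80 match, so the narrower priority-75 flow can only take effect for requests sourced from other MACs.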
>> >>> >>>>>>>
>> >>> >>>>>>> Since the solution to case 2) below (i.e.
>> >>> >>>>>>> learn_from_arp_request=false) solves the problem of case 1), too,
>> >>> >>>>>>> I think we don't need this change just for case 1). As @Dumitru
>> >>> >>>>>>> Ceara mentioned, there is some cost because it adds extra flows.
>> >>> >>>>>>> It would be a significant number of flows if there are a lot of
>> >>> >>>>>>> dnat_and_snat IPs. What do you think?
>> >>> >>>>
>> >>> >>>> I think the following might be a solution, although at the cost of
>> >>> >>>> adding as many flows as there are dnat_and_snat IPs configured:
>> >>> >>>>
>> >>> >>>> - priority 80: explicitly determine whether an ARP request is a
>> >>> >>>> self-originated GARP for configured IP addresses and dnat_and_snat
>> >>> >>>> IPs (by matching on all eth.src and arp.tpa pairs) and, if so, flood
>> >>> >>>> it on all non-patch ports.
>> >>> >>>> - priority 75: if arp.tpa is owned by an OVN logical router port,
>> >>> >>>> "unicast" it only on the patch port towards the router.
>> >>> >>>> - priority 1: flood any broadcast packet.
>> >>> >>>>
>> >>> >>>> Together with the learn_from_arp_request=false knob, this would
>> >>> >>>> cover both case 1 (join switch) and case 2 (external switch).
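Dumitru's three-tier proposal can be sketched as an ordered classification of ARP requests on one logical switch. The address pairs and port names (`patch-to-dr` and so on) are hypothetical placeholders, and the return strings stand in for flow actions; this is not ovn-northd output.

```python
from typing import NamedTuple


class ArpRequest(NamedTuple):
    eth_src: str
    arp_tpa: str


# Hypothetical per-switch configuration derived from the NB database.
SELF_GARP_PAIRS = {("M2", "IP2"), ("M3", "IP3")}  # (eth.src, arp.tpa) owned by OVN
ROUTER_PORT_BY_IP = {
    "IP1": "patch-to-dr",
    "IP2": "patch-to-router2",
    "IP3": "patch-to-router3",
}


def l2_lookup(pkt: ArpRequest) -> str:
    """Classify an ARP request in the proposed priority order."""
    # priority 80: self-originated GARP -> flood on non-patch ports only.
    if (pkt.eth_src, pkt.arp_tpa) in SELF_GARP_PAIRS:
        return "flood-non-patch-ports"
    # priority 75: target IP owned by an OVN router port -> unicast to it.
    if pkt.arp_tpa in ROUTER_PORT_BY_IP:
        return ROUTER_PORT_BY_IP[pkt.arp_tpa]
    # priority 1: any other broadcast is flooded everywhere.
    return "flood-all"


print(l2_lookup(ArpRequest("M2", "IP1")))         # patch-to-dr
print(l2_lookup(ArpRequest("M2", "IP2")))         # flood-non-patch-ports
print(l2_lookup(ArpRequest("M2", "10.10.10.1")))  # flood-all
```

Evaluating the GARP check first is what keeps router2's own GARP for IP2 from being "unicast" back to patch-to-router2. That is the concern raised further down about swapping the priority 80 and 75 flows.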
>> >>> >>>>
>> >>> >>>> Wdyt?
>> >>> >>>>
>> >>> >>> Would the "learn_from_arp_request=false knob" cover both cases? If
>> >>> >>> yes, we don't need to add more priority-80 flows, or more
>> >>> >>> accurately: whether to update the priority-80 flows is not directly
>> >>> >>> related to the current problem.
>> >>> >>>
>> >>> >>
>> >>> >> Yes, it would, except for the fact that the ARP requests would still
>> >>> >> be flooded to all routers (and ignored at the destination), which is
>> >>> >> afaiu what Girish was worried about. In order to address that part
>> >>> >> too, I'm afraid we have to update the priority-80 flows.
>> >>> >>
>> >>> >> Regards,
>> >>> >> Dumitru
>> >>> >>
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> Han, yes it will work. However, my only concern is that we would
>> >>> >>>>>> send all these ARP requests via tunnel to each of 1000 hypervisors,
>> >>> >>>>>> and these hypervisors will just drop them on the floor when they
>> >>> >>>>>> see learn_from_arp_request=false.
>> >>> >>>>>
>> >>> >>>>> I think maybe it is not a problem, since it happens only once on the
>> >>> >>>>> join switch. Once the MAC is learned, it won't broadcast again. It
>> >>> >>>>> may be more of a problem on the external LS if periodic GARP is
>> >>> >>>>> required there. However, I'd suggest doing some tests to see if it
>> >>> >>>>> is really a problem before trying to solve it.
>> >>> >>>>>
>> >>> >>>>>>
>> >>> >>>>>> Han, Dumitru,
>> >>> >>>>>>
>> >>> >>>>>> Why can't we swap the priorities of the above two flows so that
>> >>> >>>>>> the ARP request for a NextHop IP known to OVN will always be sent
>> >>> >>>>>> via `unicast`?
>> >>> >>>>>
>> >>> >>>>> If swapped, even GARPs won't get broadcast. Maybe that's not the
>> >>> >>>>> desired behavior.
>> >>> >>>>>
>> >>> >>>>
>> >>> >>>> This is definitely not desired, as we'd be hitting the prio 75 flow
>> >>> >>>> that would send the self-originated GARP request (IPx) packet back
>> >>> >>>> towards the router port that owns IPx.
>> >>> >>>>
>> >>> >>>>>>
>> >>> >>>>>> Regards,
>> >>> >>>>>> ~Girish
>> >>> >>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>> 2. External Logical Switch Case
>> >>> >>>>>>>>
>> >>> >>>>>>>>                        10.10.10.0/24
>> >>> >>>>>>>>    -------------------------+--------------------------
>> >>> >>>>>>>>                             |
>> >>> >>>>>>>>                          localnet
>> >>> >>>>>>>>                       +-----+-----+
>> >>> >>>>>>>>                       | external  |
>> >>> >>>>>>>>          +------------+    LS1    +-------------+
>> >>> >>>>>>>>          |            +-----+-----+             |
>> >>> >>>>>>>>          |                  |                   |
>> >>> >>>>>>>>      10.10.10.2         10.10.10.3          10.10.10.4
>> >>> >>>>>>>>         SNAT               SNAT                SNAT
>> >>> >>>>>>>>    +-----+-----+      +-----+-----+       +-----------+
>> >>> >>>>>>>>    | l3gateway |      | l3gateway |       | l3gateway |
>> >>> >>>>>>>>    |   node1   |      |   node2   |       |   node3   |
>> >>> >>>>>>>>    +-----------+      +-----------+       +-----------+
>> >>> >>>>>>>>
>> >>> >>>>>>>> In this case, we have some of the IPs in OVN and some in the
>> >>> >>>>>>>> physical network. If we fix (1) above, all the ARP requests for
>> >>> >>>>>>>> OVN's router IPs will be unicast. However, all the ARP requests
>> >>> >>>>>>>> for external IPs, say 10.10.10.1 on the "physical router", will
>> >>> >>>>>>>> be broadcast, and we will see these ARP broadcasts on all the L3
>> >>> >>>>>>>> gateway routers. With 'learn_from_arp_request=false' [a], the
>> >>> >>>>>>>> MAC_Binding table will not explode for either ARP or GARP
>> >>> >>>>>>>> requests.
>> >>> >>>>>>>>
>> >>> >>>>>>>> So, I don't think GARP requests vs. replies are the issue here.
>> >>> >>>>>>>> Furthermore, learning from GARP replies is blocked on certain
>> >>> >>>>>>>> routers. For example,
>> >>> >>>>>>>> https://www.juniper.net/documentation/en_US/junose15.1/topics/concept/ip-gratuitous-arps-transmission-overview.html
>> >>> >>>>>>>> says "By default, updating the ARP cache on GARP replies is
>> >>> >>>>>>>> disabled on the router." So, our NAT address mappings will not
>> >>> >>>>>>>> be learnt.
>> >>> >>>>
>> >>> >>>> Just as a side note, the above doesn't mean Juniper boxes don't
>> >>> >>>> support learning from GARP replies, just that they'd need extra
>> >>> >>>> configuration. I don't necessarily think that's a bad thing if we
>> >>> >>>> properly document in OVN that we would be generating GARP replies.
>> >>> >>>>
>> >>> >>>> Regards,
>> >>> >>>> Dumitru
>> >>> >>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>> Regards,
>> >>> >>>>>>>> ~Girish
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>> [a] - From Han's mail, the meaning of learn_from_arp_request=false
>> >>> >>>>>>>> --> if the TPA is on the router, add a new entry (it means the
>> >>> >>>>>>>>>     remote wants to communicate with this node, so it makes sense
>> >>> >>>>>>>>>     to learn the remote as well). Otherwise, ignore it and add no
>> >>> >>>>>>>>>     new entry.
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>
>> >>> >>>>>> --
>> >>> >>>>>> You received this message because you are subscribed to the Google
>> >>> >>>>>> Groups "ovn-kubernetes" group.
>> >>> >>>>>> To unsubscribe from this group and stop receiving emails from it,
>> >>> >>>>>> send an email to ovn-kubernetes+unsubscr...@googlegroups.com.
>> >>> >>>>>> To view this discussion on the web visit
>> >>> >>>>>> https://groups.google.com/d/msgid/ovn-kubernetes/CAAF2STRnem2PeSahuwhro1t%2BQJxchZNC7viq8n-ngM9KU%2B%2B-Xw%40mail.gmail.com.
>> >>> >>>>
>> >>> >>>
>> >>> >>
>> >>> >> _______________________________________________
>> >>> >> discuss mailing list
>> >>> >> disc...@openvswitch.org
>> >>> >> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>> >>> >
>> >>>
>>
>>
>
