Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-07-14 Thread Tim Rozet
Thanks for the update Girish. Are you planning on submitting an
ovn-k8s patch to enable these?

Tim Rozet
Red Hat CTO Networking Team


On Mon, Jul 13, 2020 at 9:37 PM Girish Moodalbail 
wrote:

> Hello Han,
>
> On the #openvswitch IRC channel I had provided an update on your patch
> working great on our test setup. That update was for the L3 Gateway Router
> option called *learn_from_arp_request="true|false"*. With that option in
> place, the number of entries in the MAC binding table has been significantly
> reduced.
>
> However, I had not provided an update on the single join switch tests.
> Sincere apologies for the delay. We just got that code to work last week,
> and we have an update. This is for the option called
> *dynamic_neigh_routers="true|false"* on the L3 Gateway Router. It works as
> expected. With that option in place, for each of the L3 Gateway Routers I
> see just these 3 entries:
>
>   table=12(lr_in_arp_resolve  ), priority=500  , match=(ip4.mcast ||
> ip6.mcast), action=(next;)
>   table=12(lr_in_arp_resolve  ), priority=0, match=(ip4),
> action=(get_arp(outport, reg0); next;)
>   table=12(lr_in_arp_resolve  ), priority=0, match=(ip6),
> action=(get_nd(outport, xxreg0); next;)
>
> Before, on a 1000 node cluster with 1000 Gateway Routers we would see 1000
> entries per Gateway Router and therefore a total of 1M entries in the
> cluster. Now, that is not the case.
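>
> For reference, both options are set on the gateway router's Logical_Router
> record in the northbound database; a minimal sketch (the router name
> GR-node1 is illustrative, not taken from this thread):
>
>   ovn-nbctl set Logical_Router GR-node1 options:dynamic_neigh_routers=true
>   ovn-nbctl set Logical_Router GR-node1 options:learn_from_arp_request=false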
>
> Thank you!
>
> Regards,
> ~Girish
>
>
> On Wed, Jun 10, 2020 at 12:04 PM Han Zhou  wrote:
>
>>
>>
>> On Wed, Jun 10, 2020 at 12:03 PM Han Zhou  wrote:
>>
>>> Hi Girish, Venu,
>>>
>>> I sent an RFC patch series for the solution discussed. Could you give it
>>> a try when you get the chance?
>>>
>>
>> Oops, I forgot the link:
>> https://patchwork.ozlabs.org/project/openvswitch/list/?series=182602
>>
>>>
>>> Thanks,
>>> Han
>>>
>>> On Tue, Jun 9, 2020 at 10:04 AM Han Zhou  wrote:
>>>
>>>>
>>>>
>>>> On Tue, Jun 9, 2020 at 9:06 AM Venugopal Iyer 
>>>> wrote:
>>>>
>>>>> Sorry for the delay, Han, a quick question below:
>>>>>
>>>>>
>>>>>
>>>>> *From:* ovn-kuberne...@googlegroups.com <
>>>>> ovn-kuberne...@googlegroups.com> *On Behalf Of *Han Zhou
>>>>> *Sent:* Wednesday, June 3, 2020 4:27 PM
>>>>> *To:* Girish Moodalbail 
>>>>> *Cc:* Tim Rozet ; Dumitru Ceara ;
>>>>> Daniel Alvarez Sanchez ; Dan Winship <
>>>>> danwins...@redhat.com>; ovn-kuberne...@googlegroups.com; ovs-discuss <
>>>>> ovs-discuss@openvswitch.org>; Michael Cambria ;
>>>>> Venugopal Iyer 
>>>>> *Subject:* Re: [ovs-discuss] [OVN] flow explosion in
>>>>> lr_in_arp_resolve table
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Hi Girish, yes, that's what we concluded in last OVN meeting, but
>>>>> sorry that I forgot to update here.
>>>>>
>>>>>
>>>>> On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail <
>>>>> gmoodalb...@gmail.com> wrote:
>>>>> >
>>>>> > Hello all,
>>>>> >
>>>>> > To kind of proceed with the proposed fixes, with minimal impact, is
>>>>> the following a reasonable approach?
>>>>> >
>>>>> > Add an option, namely dynamic_neigh_routes={true|false}, for a
>>>>> gateway router. With this option enabled, the next-hop IP's MAC will be
>>>>> learned through an ARP request on the physical network. The ARP request
>>>>> will be flooded on the L2 broadcast domain (for both the join switch and
>>>>> the external switch).
>>>>>
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> The RFC patch fulfils this purpose:
>>>>> https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
>>>>>
>>>>> I am working on the formal patch.
>>>>>
>>>>>
>>>>>
>>>>> > Add an option, namely learn_from_arp_request={true|false}, for a
>>>>> gateway router. The option is interpreted as below:
>>>>> > "true" 

Re: [ovs-discuss] [OVN] running bfd on ecmp routes?

2020-06-16 Thread Tim Rozet
Thanks Han. See inline.
Tim Rozet
Red Hat CTO Networking Team


On Tue, Jun 16, 2020 at 1:45 PM Han Zhou  wrote:

>
>
> On Mon, Jun 15, 2020 at 7:22 AM Tim Rozet  wrote:
>
>> Hi All,
>> While looking into using ecmp routes for an OVN router I noticed there is
>> no support for BFD on these routes. Would it be possible to add this
>> capability? I would like the next hop to be removed from the openflow group
>> if BFD detection for that next hop goes down. My routes in this case would
>> be on a GR for N/S external next hop and not going across a tunnel as it
>> egresses.
>>
>> Thanks,
>>
>> Tim Rozet
>> Red Hat CTO Networking Team
>>
>> Hi Tim,
>
> Thanks for bringing this up. Yes, it is desirable to have BFD support for
> OVN routers. Here are my thoughts.
>
> In general, OVN routers are distributed. It is not easy to tell which node
> should be responsible for the BFD session, especially for handling the
> response packets. Even if we managed to implement this, the node that detects
> the failure needs to populate the information to the central SB DB, so that
> the information is distributed to all nodes and the distributed route is
> updated.
>

Right, in a distributed case it would mean the BFD endpoint would be under
the network managed by OVN, and therefore reside on the same node where the
port for that endpoint resides. In the ovn-kubernetes context, it is a pod
running on a node connected to the DR.

>
> In your particular case, it may be easier, since the gateway router is
> physically located on a single node. ovn-controller on the GR node can
> maintain BFD sessions with the nexthops. If a session is down,
> ovn-controller may take action locally to enforce the change.
>

Yeah for the external network case this makes sense. I went ahead and filed
a BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1847570

>
> For both cases, more details may need to be sorted out.
>
> Alternatively, it shouldn't be hard to have an external monitoring
> service/agent that talks BFD with the nexthops, and reacts to session
> status changes by updating ECMP routes in OVN NB.
>
Yeah, I have a plan to work around this for now, using a networking health
check and signaling from K8s. The problem is that this is much slower than
real BFD, but it is better than nothing.
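
As a sketch of that workaround (the router name and addresses below are
illustrative, not taken from this thread), the health-check agent could swap
ECMP next hops in the NB database with ovn-nbctl:

  # next hop 172.16.0.3 failed its health check; keep only the healthy one
  ovn-nbctl lr-route-del GR-node1 0.0.0.0/0
  ovn-nbctl lr-route-add GR-node1 0.0.0.0/0 172.16.0.2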


>
> Thanks,
> Han
>


[ovs-discuss] [OVN] running bfd on ecmp routes?

2020-06-15 Thread Tim Rozet
Hi All,
While looking into using ECMP routes for an OVN router I noticed there is
no support for BFD on these routes. Would it be possible to add this
capability? I would like the next hop to be removed from the OpenFlow group
if BFD detection for that next hop goes down. My routes in this case would
be on a GR for N/S external next hop and not going across a tunnel as it
egresses.

Thanks,

Tim Rozet
Red Hat CTO Networking Team


Re: [ovs-discuss] RFC - OVN end to end packet tracing - ovn-global-trace

2020-06-10 Thread Tim Rozet
On Wed, Jun 10, 2020 at 3:36 AM Dumitru Ceara  wrote:

> On 6/9/20 3:47 PM, Tim Rozet wrote:
> > Hi Dumitru,
>
> Hi Tim,
>
> > Thanks for the detailed explanation. It makes sense and would like to
> > comment on a few things you touched on:
> > 1. I do think we need to somehow functionally trigger conntrack when we
> > do ofproto-trace. It's the only way to know what the real session state
> > ends up being, and we need to be able to follow that for some of the
> > complex bugs where packets are getting dropped after they enter a CT
> > based flow.
> > 2. For your ovn-global-trace, it would be great if that could return a
> > json or other parsable format, so that we could build on top of it with
> > a tool + GUI to graphically show where the problem is in the network.
>
> Ack.
>
> > 3. We really need better user guides on this stuff. Your email is the
> > best tutorial I've seen yet :) I didn't even know about the
> > ovs-tcpundump command, or ovn-detrace (until you told me previously). It
> > would be great to add an ovn troubleshooting guide or something to the
> docs.
> >
>
> I was planning on sending a patch to update the OVN docs but didn't get
> the chance to do it yet.
>
> > As an administrator I would like to have GUI showing all of the logical
> > switch ports (skydive as an example, already does this) and then click
> > on a specific port that someone has reported an issue on. At that point
> > I can click on the port and ask it to tcpdump me the traffic coming out
> > of it. From there, I can select which packet I care about and attempt to
> > do an ovn-global-trace on it, which will then show me where the packet
> > is getting dropped and why. I think this would be the ideal behavior.
> >
>
> That would be cool. Using your example (skydive) though, I guess one
> could also come up with a solution that directly uses the tools already
> existing in OVS/OVN essentially performing the steps that something like
> ovn-global-trace would do.
>

They could, but I think it would be better off living in OVN and then
consumed by something above it.


>
> Thanks,
> Dumitru
>
> > Tim Rozet
> > Red Hat CTO Networking Team
> >
> >
> > On Mon, Jun 8, 2020 at 7:53 AM Dumitru Ceara  > <mailto:dce...@redhat.com>> wrote:
> >
> > Hi everyone,
> >
> > CC-ing ovn-kubernetes mailing list as I know there's interest about
> this
> > there too.
> >
> > OVN currently has a couple of tools that help
> > tracing/tracking/simulating what would happen to packets within OVN,
> > some examples:
> >
> > 1. ovn-trace
> > 2. ovs-appctl ofproto/trace ... | ovn-detrace
> >
> > They're both really useful and provide lots of information, but with both
> > of them it's quite hard to get an overview of the end-to-end packet
> > processing in OVN for a given packet. Therefore both solutions have
> > disadvantages when trying to troubleshoot production deployments.
> Some
> > examples:
> >
> > a. ovn-trace will not take into account any potential issues with
> > translating logical flows to openflow so if there's a bug in the
> > translation we'll not be able to detect it by looking at ovn-trace
> > output. There is the --ovs switch but the user would have to somehow
> > determine on which hypervisor to query for the openflows
> corresponding
> > to logical flows/SB entities.
> >
> > b. "ovs-appctl ofproto/trace ... | ovn-detrace" works quite well when
> > used on a single node but as soon as traffic gets tunneled to a
> > different hypervisor the user has to figure out the changes that were
> > performed on the packet on the source hypervisor and adapt the
> > packet/flow to include the tunnel information to be used when running
> > ofproto/trace on the destination hypervisor.
> >
> > c. both ovn-trace and ofproto/trace support minimal hints to specify
> the
> > new conntrack state after conntrack recirculation but that turns out
> to
> > be not enough even in simple scenarios when NAT is involved [0].
> >
> > In a production deployment one of the scenarios one would have to
> > troubleshoot is:
> >
> > "Given this OVN deployment on X nodes why isn't this specific
> > packet/traffic that is received on logical port P1 doesn't
> reach/reach
> > port P2."
> >
> > Assuming that point "c" above is addressed somehow (there are

Re: [ovs-discuss] RFC - OVN end to end packet tracing - ovn-global-trace

2020-06-09 Thread Tim Rozet
Hi Dumitru,
Thanks for the detailed explanation. It makes sense, and I would like to
comment on a few things you touched on:
1. I do think we need to somehow functionally trigger conntrack when we do
ofproto-trace. It's the only way to know what the real session state ends
up being, and we need to be able to follow that for some of the complex
bugs where packets are getting dropped after they enter a CT based flow.
2. For your ovn-global-trace, it would be great if that could return a json
or other parsable format, so that we could build on top of it with a tool +
GUI to graphically show where the problem is in the network.
3. We really need better user guides on this stuff. Your email is the best
tutorial I've seen yet :) I didn't even know about the
ovs-tcpundump command, or ovn-detrace (until you told me previously). It
would be great to add an ovn troubleshooting guide or something to the docs.

As an administrator I would like to have a GUI showing all of the logical
switch ports (skydive as an example, already does this) and then click on a
specific port that someone has reported an issue on. At that point I can
click on the port and ask it to tcpdump me the traffic coming out of it.
From there, I can select which packet I care about and attempt to do an
ovn-global-trace on it, which will then show me where the packet is getting
dropped and why. I think this would be the ideal behavior.

Tim Rozet
Red Hat CTO Networking Team


On Mon, Jun 8, 2020 at 7:53 AM Dumitru Ceara  wrote:

> Hi everyone,
>
> CC-ing ovn-kubernetes mailing list as I know there's interest about this
> there too.
>
> OVN currently has a couple of tools that help
> tracing/tracking/simulating what would happen to packets within OVN,
> some examples:
>
> 1. ovn-trace
> 2. ovs-appctl ofproto/trace ... | ovn-detrace
>
> They're both really useful and provide lots of information, but with both
> of them it's quite hard to get an overview of the end-to-end packet
> processing in OVN for a given packet. Therefore both solutions have
> disadvantages when trying to troubleshoot production deployments. Some
> examples:
>
> a. ovn-trace will not take into account any potential issues with
> translating logical flows to openflow so if there's a bug in the
> translation we'll not be able to detect it by looking at ovn-trace
> output. There is the --ovs switch but the user would have to somehow
> determine on which hypervisor to query for the openflows corresponding
> to logical flows/SB entities.
>
> b. "ovs-appctl ofproto/trace ... | ovn-detrace" works quite well when
> used on a single node but as soon as traffic gets tunneled to a
> different hypervisor the user has to figure out the changes that were
> performed on the packet on the source hypervisor and adapt the
> packet/flow to include the tunnel information to be used when running
> ofproto/trace on the destination hypervisor.
>
> c. both ovn-trace and ofproto/trace support minimal hints to specify the
> new conntrack state after conntrack recirculation but that turns out to
> be not enough even in simple scenarios when NAT is involved [0].
>
> In a production deployment one of the scenarios one would have to
> troubleshoot is:
>
> "Given this OVN deployment on X nodes why isn't this specific
> packet/traffic that is received on logical port P1 doesn't reach/reach
> port P2."
>
> Assuming that point "c" above is addressed somehow (there are a few
> suggestions on how to do that [1]) it's still quite a lot of work for
> the engineer doing the troubleshooting to gather all the interesting
> information. One would probably do something like:
>
> 1. connect to the node running the southbound database and get the
> chassis where the logical port is bound:
>
> chassis=$(ovn-sbctl --bare --columns chassis list port_binding P1)
> hostname=$(ovn-sbctl --bare --columns hostname list chassis $chassis)
>
> 2. connect to $hostname and determine the OVS ofport id of the interface
> corresponding to P1:
>
> in_port=$(ovs-vsctl --bare --columns ofport find interface
> external_ids:iface-id=P1)
> iface=$(ovs-vsctl --bare --columns name find interface
> external_ids:iface-id=P1)
>
> 3. get a hexdump of the packet to be traced (or the flow), for example,
> on $hostname:
> flow=$(tcpdump -xx -c 1 -i $iface $pkt_filter | ovs-tcpundump)
>
> 4. run ofproto/trace on $hostname (potentially piping output to
> ovn-detrace):
>
> ovs-appctl ofproto/trace br-int in_port=$in_port $flow | ovn-detrace
> --ovnnb=$NB_CONN --ovnsb=$SB_CONN
>
> 5. In the best case the packet is fully processed on the current node
> (e.g., is dropped or forwarded out a local VIF).
>
> 6. In the worst case the packet needs to be tunneled to a remote
> hypervisor for egress on a r
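
For reference, the per-node steps above can be strung together into a rough
shell sketch (it assumes shell access to the SB DB node and to $hostname, and
that $pkt_filter, $NB_CONN and $SB_CONN are already defined):

  # on the node running the southbound database
  chassis=$(ovn-sbctl --bare --columns chassis list port_binding P1)
  hostname=$(ovn-sbctl --bare --columns hostname list chassis $chassis)

  # on $hostname: find the OVS port for P1, capture one packet, and trace it
  in_port=$(ovs-vsctl --bare --columns ofport find interface external_ids:iface-id=P1)
  iface=$(ovs-vsctl --bare --columns name find interface external_ids:iface-id=P1)
  flow=$(tcpdump -xx -c 1 -i $iface $pkt_filter | ovs-tcpundump)
  ovs-appctl ofproto/trace br-int in_port=$in_port $flow | \
      ovn-detrace --ovnnb=$NB_CONN --ovnsb=$SB_CONN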

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-28 Thread Tim Rozet
On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara  wrote:

> On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
> > Hi all
> >
> > Sorry for top posting. I want to thank you all for the discussion and
> > give also some feedback from OpenStack perspective which is affected
> > by the problem described here.
> >
> > In OpenStack, it's kind of common to have a shared external network
> > (logical switch with a localnet port) across many tenants. Each tenant
> > user may create their own router where their instances will be
> > connected to access the external network.
> >
> > In such scenario, we are hitting the issue described here. In
> > particular in our tests we exercise 3K VIFs (with 1 FIP) each spanning
> > 300 LS; each LS connected to a LR (ie. 300 LRs) and that router
> > connected to the public LS. This is creating a huge problem in terms
> > of performance and tons of events due to the MAC_Binding entries
> > generated as a consequence of the GARPs sent for the floating IPs.
> >
>
> Just as an addition to this, GARPs wouldn't be the only reason why all
> routers would learn the MAC_Binding. Even if we wouldn't be sending
> GARPs for the FIPs, when a VM that's behind a FIP would send traffic to
> the outside, the router will generate an ARP request for the next hop
> using the FIP-IP and FIP-MAC. This will be broadcasted to all routers
> connected to the public LS and will trigger them to learn the
> FIP-IP:FIP-MAC binding.
>

Yeah we shouldn't be learning on regular ARP requests.


>
> > Thanks,
> > Daniel
> >
> >
> > On Thu, May 28, 2020 at 10:51 AM Dumitru Ceara 
> wrote:
> >>
> >> On 5/28/20 8:34 AM, Han Zhou wrote:
> >>>
> >>>
> >>> On Wed, May 27, 2020 at 1:10 AM Dumitru Ceara  >>> > wrote:
> 
>  Hi Girish, Han,
> 
>  On 5/26/20 11:51 PM, Han Zhou wrote:
> >
> >
> > On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail
> >>> mailto:gmoodalb...@gmail.com>
> > >>
> wrote:
> >>
> >>
> >>
> >> On Tue, May 26, 2020 at 12:42 PM Han Zhou  >>> 
> > >> wrote:
> >>>
> >>> Hi Girish,
> >>>
> >>> Thanks for the summary. I agree with you that GARP request v.s.
> reply
> > is irrelavent to the problem here.
> 
>  Well, actually I think GARP request vs reply is relevant (at least for
>  case 1 below) because if OVN would be generating GARP replies we
>  wouldn't need the priority 80 flow to determine if an ARP request
> packet
>  is actually an OVN self originated GARP that needs to be flooded in
> the
>  L2 broadcast domain.
> 
>  On the other hand, router3 would be learning mac_binding IP2,M2 from
> the
>  GARP reply originated by router2 and vice versa so we'd have to
> restrict
>  flooding of GARP replies to non-patch ports.
> 
> >>>
> >>> Hi Dumitru, the point was that, on the external LS, the GRs will have
> to
> >>> send ARP requests to resolve unknown IPs (at least for the external
> GW),
> >>> and it has to be broadcasted, which will cause all the GRs learn all
> >>> MACs of other GRs. This is regardless of the GARP behavior. You are
> >>> right that if we only consider the Join switch then the GARP request
> >>> v.s. reply does make a difference. However, GARP request/reply may be
> >>> really needed only on the external LS.
> >>>
> >>
> >> Ok, but do you see an easy way to determine if we need to add the
> >> logical flows that flood self originated GARP packets on a given logical
> >> switch? Right now we add them on all switches.
> >>
> >>> Please see my comment inline below.
> >>>
> >>> On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail
> > mailto:gmoodalb...@gmail.com>
> >>> >> wrote:
> 
>  Hello Dumitru,
> 
>  There are several things that are being discussed on this thread.
> > Let me see if I can tease them out for clarity.
> 
>  1. All the router IPs are known to OVN (the join switch case)
>  2. Some IPs are known and some are not known (the external logical
> > switch that connects to physical network case).
> 
>  Let us look at each of the case above:
> 
>  1. Join Switch Case
> 
>  ++++
>  |   l3gateway||   l3gateway|
>  |router2 ||router3 |
>  +-+--++-+--+
>  IP2,M2 IP3,M3
>    | |
> +--+-+---+
> |join switch |
> +-+--+
>   |
>    IP1,M1
>   +---++
>   

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-21 Thread Tim Rozet
On Thu, May 21, 2020 at 8:45 PM Venugopal Iyer 
wrote:

> Hi, Han:
>
> 
> From: ovn-kuberne...@googlegroups.com 
> on behalf of Han Zhou 
> Sent: Thursday, May 21, 2020 4:42 PM
> To: Tim Rozet
> Cc: Venugopal Iyer; Dumitru Ceara; Girish Moodalbail; Han Zhou; Dan
> Winship; ovs-discuss; ovn-kuberne...@googlegroups.com; Michael Cambria
> Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table
>
>
>
>
> On Thu, May 21, 2020 at 2:35 PM Tim Rozet  tro...@redhat.com>> wrote:
> I think that if you directly connect GR to DR you don't need to learn any
> ARP with packet_in and you can preprogram the static entries. Each GR will
> have 1 entry for the DR, while the DR will have N entries for N
> nodes.
>
> Hi Tim, as mentioned by Girish, directly connecting GRs to DR requires N
> ports on the DR and also requires a lot of small subnets, which is not
> desirable. And since changes are needed anyway in OVN to support that, we
> moved forward with the current approach of avoiding the static ARP flows to
> solve the problem instead of directly connecting GRs to DR.
>
> Why is that not desirable? They are all private subnets with /30 (if using
ipv4). If IPv6, it's even less of a concern from an addressing perspective.

> The real issue with ARP learning comes from the GR-External. You have
> to learn these, and from my conversation with Girish it seems like every GR
> is adding an entry on every ARP request it sees. This means 1 GR sends ARP
> request to external L2 network and every GR sees the ARP request and adds
> an entry. I think the behavior should be:
>
> GRs only add ARP entries when:
>
>   1.  An ARP Response is sent to it
>   2.  The GR receives a GARP broadcast, and already has an entry in his
> cache for that IP (Girish mentioned this is similar to linux arp_accept
> behavior)
>
> For 2), it is expensive to do in OVN because OpenFlow doesn't support a
> match condition of "field1 == field2", which is required to check if the
> incoming ARP request is a GARP, i.e. SPA == TPA. However, it is ok to
> support something similar like linux arp_accept configuration but slightly
> different. In OVN we can configure it to alllow/disable learning from all
> ARP requests to IPs not belonging to the router, including GARPs. Would
> that solve the problem here? (@Venugopal Iyer<mailto:venugop...@nvidia.com>
> brought up the same thing about "arp_accept". I hope this reply addresses
> that as well)
>

I think the issue there is if you have an external device which is using a
VIP and it fails over, it will usually send a GARP to announce the MAC
change. In this case, if you ignore GARP, what happens? You won't send
another ARP because OVN programs the ARP entry forever and doesn't expire
it, right? So you won't learn the new MAC and will keep sending packets to a
dead MAC?

>
>  I can't think of any side effects to this, so seems fine to me to do
> so. Believe linux behaves that way w.r.t. ARP request
>  anyway (assuming I am reading it right).
>
> https://elixir.bootlin.com/linux/v5.7-rc6/source/net/ipv4/arp.c (L874)
>
>
> thanks,
>
> -venu
>
> In addition, as Michael Cambria pointed out in our weekly meeting, these
> ARP cache entries should have expiry timers on them. If they are
> permanently learned, you will end up with a growing ARP table over time,
> and end up in the same place. We can probably just program the GR ARP flows
> with an idle_timeout and have the flow removed. What do you think?
>
> This has been discussed before. It is also mentioned in the TODO.rst.
> However, it has not been taken care of because no good solution has been
> found yet. It can be done, but it will be expensive and the gains are not
> worth the costs. Accepting ARP requests partially reduces the need for ARP
> expiration. It is true that it could still be a problem in some scenarios,
> but so far we haven't heard of any use case that has a hard dependency on this.
>
> Should I file a bugzilla outlining the above so we can have proper
> tracking?
>
> I think bugzilla is out of the control of OVN community, so please feel
> free to file or not file ;)
>

Sorry, folks from OVN had told me you use Bugzilla to track OVN bugs, and
not JIRA or GitHub. What bug tracking system do you use, if not BZ?

>
> Thanks,
> Han
>
>
> Thanks,
>
> Tim Rozet
> Red Hat CTO Networking Team
>
>
> On Thu, May 21, 2020 at 5:01 PM Han Zhou  zhou...@gmail.com>> wrote:
>
>
> On Thu, May 21, 2020 at 10:33 AM Venugopal Iyer  <mailto:venugop...@nvidia.com>> wrote:
> Han,
>
> just a quick question below..
>
> __

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-21 Thread Tim Rozet
I think that if you directly connect GR to DR you don't need to learn any
ARP with packet_in and you can preprogram the static entries. Each GR will
have 1 entry for the DR, while the DR will have N entries for N
nodes.

The real issue with ARP learning comes from the GR-External. You have
to learn these, and from my conversation with Girish it seems like every GR
is adding an entry on every ARP request it sees. This means 1 GR sends an ARP
request to the external L2 network and every GR sees the ARP request and adds
an entry. I think the behavior should be:

GRs only add ARP entries when:

   1. An ARP *Response* is sent to it
   2. The GR receives a GARP broadcast, and already has an entry in its
   cache for that IP (Girish mentioned this is similar to Linux arp_accept
   behavior)

In addition, as Michael Cambria pointed out in our weekly meeting, these
ARP cache entries should have expiry timers on them. If they are
permanently learned, you will end up with a growing ARP table over time,
and end up in the same place. We can probably just program the GR ARP flows
with an idle_timeout and have the flow removed. What do you think?
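
As an illustration of the idle_timeout idea (purely a sketch: the table
number and match below are made up and are not the actual flows that
ovn-controller installs), an OpenFlow flow with an idle timeout ages out on
its own after the given period of inactivity:

  # remove this ARP-related flow automatically after 300s without a hit
  ovs-ofctl add-flow br-int "table=99,idle_timeout=300,priority=100,arp,actions=NORMAL"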

Should I file a bugzilla outlining the above so we can have proper tracking?

Thanks,

Tim Rozet
Red Hat CTO Networking Team


On Thu, May 21, 2020 at 5:01 PM Han Zhou  wrote:

>
>
> On Thu, May 21, 2020 at 10:33 AM Venugopal Iyer 
> wrote:
>
>> Han,
>>
>> just a quick question below..
>>
>> 
>> From: ovn-kuberne...@googlegroups.com 
>> on behalf of Girish Moodalbail 
>> Sent: Tuesday, May 19, 2020 11:09 PM
>> To: Han Zhou
>> Cc: Han Zhou; Dan Winship; ovs-discuss; ovn-kuberne...@googlegroups.com
>> Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table
>>
>>
>> Hello Han,
>>
>> Please see in-line:
>>
>> On Sat, May 16, 2020 at 11:17 PM Han Zhou > zhou...@gmail.com>> wrote:
>>
>>
>> On Sat, May 16, 2020 at 12:13 PM Girish Moodalbail > <mailto:gmoodalb...@gmail.com>> wrote:
>> Hello Han,
>>
>> Can you please explain how the dynamic resolution of the IP-to-MAC will
>> work with this new option set?
>>
>> Say the packet is being forwarded from router2 towards the distributed
>> router? So, nexthop (reg0) is set to IP1 and we need to find the MAC
>> address M1 to set eth.dst to.
>>
>> ++++
>> |   l3gateway||   l3gateway|
>> |router2 ||router3 |
>> +-+--++-+--+
>> IP2,M2 IP3,M3
>>   | |
>>+--+-+---+
>>|join switch |
>>+-+--+
>>  |
>>   IP1,M1
>>  +---++
>>  |  distributed   |
>>  | router |
>>  ++
>>
>> The MAC M1 will obviously not be in the MAC_binding table. On the hypervisor
>> where the packet originated, the router2's port and the distributed
>> router's port are locally present. So, does this result in a PACKET_IN to
>> the ovn-controller and the resolution happens there?
>>
>> Yes there will be a PACKET_IN, and then:
>> 1. ovn-controller will generate the ARP request for IP1, and send
>> PACKET_OUT to OVS.
>> 2. The ARP request will be delivered to the distributed router pipeline
>> only, because of a special handling of ARP in OVN for IPs of router ports,
>> although it is a broadcast. (It would have been broadcasted to all GRs
>> without that special handling)
>> 3. The distributed router pipeline should learn the IP-MAC binding of
>> IP2-M2 (through a PACKET_IN to ovn-controller), and at the same time send
>> ARP reply to the router2 in the distributed router pipeline.
>> 4. Router2 pipeline will handle the ARP response and learn the IP-MAC
>> binding of IP1-M1 (through a PACKET_IN to ovn-controller).
>>
>> Unfortunately, the ARP request (who-has IP1) from router2 is broadcast
>> out to all of the chassis through the Geneve tunnel. The other gateway routers
>> learn the source MAC 'M2'. Now, each of the gateway routers has an entry
>> for (IP2, M2) in the MAC binding table on their respective rtoj-
>> router port. So, the MAC_Binding table will now have N x N entries, where N
>> is the number of gateway routers.
>>
>> Per your explanation above, the ARP request should not have been broadcast,
>> right?
>>
>>
>>  proba

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-13 Thread Tim Rozet
I went ahead and filed:
https://bugzilla.redhat.com/show_bug.cgi?id=1835386

Tim Rozet
Red Hat CTO Networking Team


On Mon, May 11, 2020 at 1:27 AM Han Zhou  wrote:

>
>
> On Sat, May 9, 2020 at 5:01 PM Girish Moodalbail 
> wrote:
> >
> > Hello Han, Tim
> >
> > Please see in-line:
> >
> >
> >>>>
> >>>> Hello Han,
> >>>>
> >>>> I did consider distributed gateway port. However, there are two
> issues with it
> >>>>
> >>>> 1. In order to support K8s NodePort services we need to create a
> North-South LB and L3 gateway is a perfect solution for that. AFAIK,
> >>>>DGP doesn't support it
> >>
> >>
> >> In fact DGP supports LB (at least from code
> https://github.com/ovn-org/ovn/blob/master/northd/ovn-northd.c#L9318),
> but the ovn-nb manpage may need an update.
> >
> >
> > I see
> >
> >>
> >>
> >>>>
> >>>> 2. Datapath performance would be bad with DGP. We want the packet
> meant for the host or the Internet to exit out of the hypervisor on which
> the pod exists. The L3 gateway router provides us with this functionality.
> With dgp and with OVN supporting only one instance of it, packets
> unnecessarily gets forwarded over tunnel to dgp chassis for SNATing and
> then gets forwarded back over tunnel to the host to just exit out locally.
> >>
> >>
> >> This is related to the changes needed for DGP (the first point I
> mentioned in the previous email). In the diagram I drew, there will be 1000
> DGPs, each residing on a chassis, just to make sure north-south traffic can
> be forwarded on the local chassis without going through a central node,
> just like how it works today in ovn-k8s. However, maybe this is not a small
> change, because today the NAT and LB processing on such LRs (LRs with DGP)
> are all based on the assumption that there is only one DGP. For example,
> the NB schema would also need to be changed so that the NAT/LB rules for a
> router can specify DGP to determine the central processing location for
> those rules.
> >
> >
> > Correct
> >
> >>
> >>
> >> So, to summarize, if we can make multi-DGP work, it would be the best
> solution for the ovn-k8s scenario. If we can't (either because of a design
> problem, or because it is too big an effort for the gains), maybe configurably
> avoiding the static neighbour flows is a good way to go. Both options
> requires changes in OVN.
> >
> >
> > Han, optimizing the neighbor cache from the current O(n^2) to something
> scalable will be ideal for short-term. I am hoping that the changes to OVN
> will not be as complicated as multi-DGP work and other changes to OVN
> proposed on this email thread.
> >
> >
> >>
> >> Without changes in OVN, a further optimization based on your current
> workaround, as Tim has suggested, is to replace the large
> number of small join LSes (and LRPs and patch ports on both sides) with the
> same number of directly connected LRPs.
> >
> >
> > Han and Tim,
> >
> > OVN supports peering two distributed routers without a logical
> switch; however, it doesn't support connecting a distributed router and an
> l3 gateway router directly as peers. I remember very clearly this being
> mentioned in the ovn-architecture man page.
> >
> > -8<--8<-
> >
> >The distributed router and the
> >gateway router are  connected  by  another  logical  switch,
>  sometimes
> >referred  to  as a ``join’’ logical switch. (OVN logical routers
> may be
> >connected to one another directly, without an intervening
>  switch,  but
> >the  OVN  implementation only supports gateway logical routers
> that are
> >connected to logical switches. Using a join logical switch also
> reduces
> >the  number  of  IP addresses needed on the distributed router.)
> >
> > -8<--8<-
> >
> > Before splitting the OVN join logical switch into several small logical
> switches, I did try directly connecting the LR to each of the node-specific
> LR using a point-to-point link but it didn't work. Since this was
> corroborated by the man page, I didn't debug the topology and moved on to
> splitting the `join` logical switch.
>
> You are right. So this *improvement* would also need changes in OVN,
> and the benefit seems less obvious than the other two options.
>
> >
> > Regards,
> > ~Girish
> >
> >>>>


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-09 Thread Tim Rozet
So we can get rid of the join logical switch. This might be a dumb
question, but why do we need an external switch? In the local gateway mode:

pod --- logical switch --- DR --- join switch (to remove) --- GR
169.x.x.2 --- external switch --- 169.x.x.1 Linux host

There's no reason in the above to have an external switch that I can see.

Perhaps in the shared gateway mode it is necessary if all of the nodes
externally attach to the same L2 network.

Tim Rozet
Red Hat CTO Networking Team


On Fri, May 8, 2020 at 4:13 PM Lorenzo Bianconi 
wrote:

> > On Wed, May 6, 2020 at 11:41 PM Han Zhou  wrote:
> >
> > >
> > >
> > > On Wed, May 6, 2020 at 12:49 AM Numan Siddique  wrote:
> > > >
>
> [...]
>
> > > > I forgot to mention, Lorenzo have similar ideas for moving the arp
> > > resolve lflows for NAT entries to mac_binding rows.
> > > >
> > >
> > > I am hesitant about the approach of moving to mac_binding as a solution
> > > to this particular problem, because:
> > > 1. Although the cost of each mac_binding entry may be much lower than
> > > that of a logical flow entry, it would still be O(n^2), since LRP is part
> > > of the key in the table.
> > >
> >
> > Agree. I realize it now.
>
> Hi Han and Numan,
>
> what about moving to the mac_binding table just the entries related to NAT
> where we configured the external MAC address, since this info is known in
> advance? I can share a PoC I developed a few weeks ago.
>
> Regards,
> Lorenzo
>
> >
> > Thanks
> > Numan
> >
> >
> > > 2. It is better to separate the static and dynamic part clearly.
> Moving to
> > > mac_binding will lose this clarity in data, and also the ownership of
> the
> > > data as well (now mac_binding entries are added only by
> ovn-controllers).
> > > Although I am not in favor of solving the problem with this approach
> > > (because of 1)), maybe it makes sense to reduce number of logical
> flows as
> > > a general improvement by moving all neighbour information to
> mac_binding
> > > for scalability. If we do so, I would suggest to figure out a way to
> keep
> > > the data clarity between static and dynamic part.
> > >
> > > For this particular problem, we just don't want the static part
> populated
> > > because most of them are not needed except one per LRP. However, even
> > > before considering optionally disabling the static part, I wanted to
> > > understand firstly why separating the join LS would not solve the
> problem.
> > >
> > > >>
> > > >>
> > > >> Thanks
> > > >> Numan
> > > >>
> > > >>>
> > > >>> > 2. In most places in ovn-kubernetes, our MAC addresses are
> > > >>> > programmatically related to the corresponding IP addresses, and
> in
> > > >>> > places where that's not currently true, we could try to make it
> true,
> > > >>> > and then perhaps the thousands of rules could just be replaced
> by a
> > > >>> > single rule?
> > > >>> >
> > > >>> This may be a good idea, but I am not sure how to implement in OVN
> to
> > > make it generic, since most OVN users can't make such assumption.
> > > >>>
> > > >>> On the other hand, why wouldn't splitting the join logical switch
> to
> > > 1000 LSes solve the problem? I understand that there will be 1000 more
> > > datapaths, and 1000 more LRPs, but these are all O(n), which is much
> more
> > > efficient than the O(n^2) exploding. What's the other scale issues
> created
> > > by this?
> > > >>>
> > > >>> In addition, Girish, for the external LS, I am not sure why can't
> it
> > > be shared, if all the nodes are connected to a single L2 network. (If
> they
> > > are connected to separate L2 networks, different external LSes should
> be
> > > created, at least according to current OVN model).
> > > >>>
> > > >>> Thanks,
> > > >>> Han


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-09 Thread Tim Rozet
Girish, Han,
From my understanding, the GR (per node) <> DR link is a local subnet and
you don't want the overhead of many switch objects in OVN, but you also
don't want all the GRs connecting to a single switch, to avoid a large L2
domain. Isn't the simple solution to allow connecting routers to each other
without an intermediary switch?
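
For reference, a sketch of what such direct router peering looks like with
ovn-nbctl today (the names, MACs, and /30 subnet below are illustrative; as
noted elsewhere in this thread, the ovn-architecture man page says such
peering is not supported for l3gateway routers):

  ovn-nbctl lrp-add DR dr-to-gr1 00:00:00:10:00:01 100.64.1.1/30 peer=gr1-to-dr
  ovn-nbctl lrp-add GR-node1 gr1-to-dr 00:00:00:10:00:02 100.64.1.2/30 peer=dr-to-gr1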

Tim Rozet
Red Hat CTO Networking Team


On Fri, May 8, 2020 at 3:17 AM Girish Moodalbail 
wrote:

>
>
> On Thu, May 7, 2020 at 11:24 PM Han Zhou  wrote:
>
>> (Add the MLs back)
>>
>> On Thu, May 7, 2020 at 4:01 PM Girish Moodalbail 
>> wrote:
>>
>>> Hello Han,
>>>
>>> Sorry, I was monitoring the ovn-kubernetes google group and didn't see
>>> your emails till now.
>>>
>>>
>>>>
>>>> On the other hand, why wouldn't splitting the join logical switch into
>>>> 1000 LSes solve the problem? I understand that there will be 1000 more
>>>> datapaths, and 1000 more LRPs, but these are all O(n), which is much more
>>>> efficient than the O(n^2) exploding. What's the other scale issues created
>>>> by this?
>>>>
>>>
>>> Splitting a single join logical switch into 1000 different logical
>>> switches is how I have resolved the problem now. However, with this design
>>> I see the following issues.
>>> (1) Complexity
>>>    where one logical switch should have sufficed, we now need to create
>>> 1000 logical switches just to work around the O(n^2) logical flows
>>> (2) IPAM management
>>>   - before I had one IP subnet 100.64.0.0/16 for the single logical
>>> switch and depended on OVN IPAM to allocate IPs off of that subnet
>>>   - now I need to first do subnet management (break a /16 into /29 CIDRs)
>>> in OVN K8s and then assign each subnet to each join logical switch
>>> (3) each of these join logical switches is a distributed switch. The flows
>>> related to each one of them will be present in each hypervisor. This will
>>> increase the number of OpenFlow flows. However, from the OVN K8s point of
>>> view this logical switch is essentially pinned to a hypervisor and its role
>>> is to connect the hypervisor's l3gateway to the distributed router.
>>>
>>> We are trying to simplify the OVN logical topology for OVN K8s so that
>>> the number of logical flows (and therefore the number of OpenFlow flows)
>>> are reduced and that reduces the pressure on ovn-northd, OVN SB DB, and
>>> finally ovn-controller processes.
>>>
>>> Every node in an OVN K8s cluster adds 4 resources. So, in a 1000-node
>>> k8s cluster we will have 4000 + 1 (distributed router). This ends up
>>> creating around 250K OpenFlow rules in each of the hypervisors. This number
>>> is just to support the initial logical topology. I am not accounting for
>>> any flows that will be generated for k8s network policies, services, and so
>>> on.
>>>
>>>
>>>>
>>>> In addition, Girish, for the external LS, I am not sure why it can't be
>>>> shared, if all the nodes are connected to a single L2 network. (If they are
>>>> connected to separate L2 networks, different external LSes should be
>>>> created, at least according to current OVN model).
>>>>
>>>
>>> Yes, the plan was to share the same external LS with all of the L3
>>> gateway routers since they are all on the same broadcast domain. However,
>>> we will end up with the same 2M logical flows since a single external LS
>>> connects all the L3 gateway routers on the same broadcast domain.
>>>
>>> In short, for a 1000-node K8s cluster, if we reduce the logical flow
>>> explosion, then we can reduce the number of logical resources in the OVN K8s
>>> topology by 1998 (1000 join LSes will become 1 and 1000 external LSes will
>>> become 1).
>>>
>>>
>> Ok, so now we are not satisfied with even O(n), and instead we want to
>> make it O(1) for some of the resources.
>> I think the major problem is the per-node gateway routers. They do not
>> really seem necessary in theory. Ideally the topology can be simplified with the
>> concept of distributed gateway ports, on a single logical router (the join
>> router), and then we can remove all the join LSes and gateway routers,
>> something like below:
>>
>> +--+
>> |external logical switch   |
>> +-+-++-+
>>   | ||
>>
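
For context, a router port is made a distributed gateway port by binding it
to a chassis; a minimal sketch (the port name, chassis name, and priority
below are illustrative):

  ovn-nbctl lrp-set-gateway-chassis lrp-join-node1 node1 20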