[ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-01 Thread Girish Moodalbail
Hello all,

Say, the logical topology is as defined below. We have a logical router
connected to 1000 gateway routers through a join switch. This is a
1000-hypervisor OVN Kubernetes (k8s) cluster, wherein each gateway router
is bound to its respective hypervisor.

+-----------+ +-----------+      +-----------+ +-----------+
|  Gateway  | |  Gateway  | ...  |  Gateway  | |  Gateway  |
| Router-1  | | Router-2  |      |Router-999 | |Router-1000|
+-----+-----+ +-----+-----+      +-----+-----+ +-----+-----+
 100.64.0.2    100.64.0.3              |        100.64.x.y
      |             |                  |             |
+-----+-------------+------------------+-------------+-----+
|                   join logical_switch                    |
|                     (100.64.0.0/16)                      |
+----------------------------+-----------------------------+
                             |
                             |
                        100.64.0.1
                   +---------+------+
                   | logical router |
                   +----------------+

If we now look at table=12 (lr_in_arp_resolve) in the ingress pipeline of
Gateway Router-1, you will see 2000 logical flow entries (1000 for IPv4
and 1000 for IPv6), wherein each entry resolves a next-hop IP address
(corresponding to one of the 1000 routers) to a destination Ethernet
address. For example, on Gateway Router-1:
(outport == rtoj-gr1 && reg0 == 100.64.0.3), action=(eth.dst=m3; next;)
(outport == rtoj-gr1 && reg0 == 100.64.0.4), action=(eth.dst=m4; next;)

Each router will have 2000 entries, and in total we will have 2000 * 1000 =
2M entries. That is a lot of flows for the OVN SB DB to digest.
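The back-of-the-envelope count above can be sketched in Python (the
function and parameter names are illustrative, not part of OVN):

```python
def arp_resolve_lflows(num_gateway_routers: int, address_families: int = 2) -> int:
    """Static lr_in_arp_resolve entries with a single shared join switch.

    Each gateway router carries one next-hop resolution per peer router
    on the join switch, for each address family (IPv4 and IPv6), so the
    total grows quadratically with the number of routers.
    """
    per_router = address_families * num_gateway_routers   # 2000 for n=1000
    return per_router * num_gateway_routers               # O(n^2) overall

total = arp_resolve_lflows(1000)
print(total)  # 2000000
```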

In the topology above, the only intended path is north-south, between each
gateway router and the logical router. There is no east-west traffic
between the gateway routers.

We addressed this issue by creating 1000 join logical switches, where each
join logical switch connects one gateway router to the logical router.
However, this creates a lot of logical resources and other scale issues in
OVN. Also, there are other places in the OVN Kubernetes logical topology
that we could optimize by creating just one logical switch instead of
thousands (for example: instead of a separate external logical switch with
a localnet port connecting each gateway router to the physical network, we
could have just one for the whole logical topology).

Is there another way to solve the above problem while keeping the single
join logical switch?

Regards,
~Girish
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss



Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-01 Thread Dan Winship
On 5/1/20 12:37 PM, Girish Moodalbail wrote:
> If we now look at table=12 (lr_in_arp_resolve) in the ingress pipeline
> of Gateway Router-1, then you will see that there will be 2000 logical
> flow entries...

> In the topology above, the only intended path is North-South between
> each gateway router and the logical router. There is no east-west
> traffic between the gateway routers

> Is there an another way to solve the above problem with just keeping the
> single join logical switch?

Two thoughts:

1. In openshift-sdn, the bridge doesn't try to handle ARP itself. It
just lets ARP requests pass through normally, and lets ARP replies pass
through normally as long as they are correct (i.e., it doesn't let
spoofing through). This means fewer flows but more traffic. Maybe that's
the right tradeoff?

2. In most places in ovn-kubernetes, our MAC addresses are
programmatically related to the corresponding IP addresses, and in
places where that's not currently true, we could try to make it true,
and then perhaps the thousands of rules could just be replaced by a
single rule?
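Dan's point (2) can be illustrated: if every MAC is a pure function of the
corresponding IP, a router could compute eth.dst from reg0 with one rule
instead of thousands. A minimal sketch of such a mapping (the 0a:58 prefix
is the one ovn-kubernetes uses for IPv4-derived pod MACs; treat the exact
scheme as illustrative):

```python
import ipaddress

def mac_from_ipv4(ip: str, prefix: bytes = b"\x0a\x58") -> str:
    """Derive a MAC address from an IPv4 address by embedding its 4 bytes
    after a fixed 2-byte prefix."""
    octets = ipaddress.IPv4Address(ip).packed
    return ":".join(f"{b:02x}" for b in prefix + octets)

print(mac_from_ipv4("100.64.0.3"))  # 0a:58:64:40:00:03
```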

-- Dan



Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-05 Thread Han Zhou
On Fri, May 1, 2020 at 2:14 PM Dan Winship  wrote:
>
> On 5/1/20 12:37 PM, Girish Moodalbail wrote:
> > If we now look at table=12 (lr_in_arp_resolve) in the ingress pipeline
> > of Gateway Router-1, then you will see that there will be 2000 logical
> > flow entries...
>
> > In the topology above, the only intended path is North-South between
> > each gateway router and the logical router. There is no east-west
> > traffic between the gateway routers
>
> > Is there an another way to solve the above problem with just keeping the
> > single join logical switch?
>
> Two thoughts:
>
> 1. In openshift-sdn, the bridge doesn't try to handle ARP itself. It
> just lets ARP requests pass through normally, and lets ARP replies pass
> through normally as long as they are correct (ie, it doesn't let
> spoofing through). This means fewer flows but more traffic. Maybe that's
> the right tradeoff?
>
The 2M entries here are not for the ARP responder; they are more
equivalent to the neighbour table (or ARP cache) on each LR. The ARP
responder resides in the LS (join logical switch), which is O(n) rather
than O(n^2), so it is not a problem here.

However, a similar idea may work here to avoid the O(n^2) scale issue.
For the neighbour table, OVN actually has two parts: one is statically
built, which accounts for the 2M entries mentioned in this case, and the
other is dynamic ARP resolution - the mac_binding table, which is
populated dynamically by handling ARP messages. To solve the problem here,
it is possible to change OVN to support configuring an LR to skip the
static neighbour table and rely only on dynamic ARP resolution. In this
case, all the gateway routers can be configured to not use static ARP
resolving, and eventually there will be only 2 entries (one for IPv4 and
one for IPv6) for each gateway router in the mac_binding table for the
north-south traffic to the join router. (Of course, there will still be
the same number of mac_bindings in each router for the external traffic on
the other side of the gateway routers.)

This change seems straightforward, but I am not sure if there are any
corner cases.
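Concretely, the per-router knob Han describes might be applied like the
sketch below. The option name dynamic_neigh_routers and the nbctl
invocation are assumptions for illustration, not a confirmed OVN interface:

```python
# Hypothetical knob: tell OVN to skip building the static neighbour table
# for a router and rely on dynamic ARP/ND resolution (mac_binding) only.
# The option name "dynamic_neigh_routers" is an assumption here.
def disable_static_neigh_cmd(router: str) -> list:
    return ["ovn-nbctl", "set", "logical_router", router,
            "options:dynamic_neigh_routers=true"]

# One invocation per gateway router in a 1000-node cluster:
cmds = [disable_static_neigh_cmd(f"GR_node{i}") for i in range(1, 1001)]
print(len(cmds))  # 1000
```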

> 2. In most places in ovn-kubernetes, our MAC addresses are
> programmatically related to the corresponding IP addresses, and in
> places where that's not currently true, we could try to make it true,
> and then perhaps the thousands of rules could just be replaced by a
> single rule?
>
This may be a good idea, but I am not sure how to implement it in OVN in a
generic way, since most OVN users can't make such an assumption.

On the other hand, why wouldn't splitting the join logical switch into
1000 LSes solve the problem? I understand that there will be 1000 more
datapaths and 1000 more LRPs, but these are all O(n), which is much more
efficient than the O(n^2) explosion. What are the other scale issues
created by this?

In addition, Girish, for the external LS, I am not sure why it can't be
shared, if all the nodes are connected to a single L2 network. (If they
are connected to separate L2 networks, different external LSes should be
created, at least according to the current OVN model.)

Thanks,
Han
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-06 Thread Numan Siddique
On Wed, May 6, 2020 at 12:28 AM Han Zhou  wrote:

>
>
> On Fri, May 1, 2020 at 2:14 PM Dan Winship  wrote:
> >
> > On 5/1/20 12:37 PM, Girish Moodalbail wrote:
> > > If we now look at table=12 (lr_in_arp_resolve) in the ingress pipeline
> > > of Gateway Router-1, then you will see that there will be 2000 logical
> > > flow entries...
> >
> > > In the topology above, the only intended path is North-South between
> > > each gateway router and the logical router. There is no east-west
> > > traffic between the gateway routers
> >
> > > Is there an another way to solve the above problem with just keeping
> the
> > > single join logical switch?
> >
> > Two thoughts:
> >
> > 1. In openshift-sdn, the bridge doesn't try to handle ARP itself. It
> > just lets ARP requests pass through normally, and lets ARP replies pass
> > through normally as long as they are correct (ie, it doesn't let
> > spoofing through). This means fewer flows but more traffic. Maybe that's
> > the right tradeoff?
> >
> The 2M entries here is not for ARP responder, but more equivalent to the
> neighbour table (or ARP cache), on each LR. The ARP responder resides in
> the LS (join logical switch), which is O(n) instead of O(n^2), so it is not
> a problem here.
>
> However, a similar idea may works here to avoid the O(n^2) scale issue.
> For the neighbour table, actually OVN has two parts, one is statically
> build, which is the 2M entires mentioned in this case, and the other is the
> dynamic ARP resolve - the mac_binding table, which is dynamically populated
> by handling ARP messages. To solve the problem here, it is possible to
> change OVN to support configuring a LR to avoid static neighbour table, and
> relies only on dynamic ARP resolving. In this case, all the gateway routers
> can be configured as not using static ARP resolving, and eventually there
> will be only 2 entries (one for IPv4 and one for IPv6) for each gateway
> router in mac_binding table for the north-south traffic to the join router.
> (of source there will be still same amount of mac_bindings in each router
> for the external traffic on the other side of the gateway routers).
>
> This change seems straightforward, but I am not sure if there is any
> corner cases.
>

Maybe ovn-northd, instead of adding these lflows in lr_in_arp_resolve,
could create mac_binding table rows in the SB DB?
This would result in fewer logical flows at the cost of more mac_binding
entries. The number of OF flows would still remain the same.
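A sketch of what ovn-northd would pre-populate under this idea. The column
names (logical_port, ip, mac, datapath) follow the OVN southbound
MAC_Binding schema; the helper and values are illustrative. Note that
because (logical_port, ip) is the key, the row count is still quadratic in
the number of routers:

```python
from itertools import product

def prepopulated_mac_bindings(routers):
    """MAC_Binding-style rows ovn-northd would create instead of lflows.

    Column names follow the OVN SB schema; values are illustrative."""
    ips = {r: f"100.64.0.{i + 2}" for i, r in enumerate(routers)}
    rows = []
    for a, b in product(routers, repeat=2):
        if a != b:  # a router needs no binding for itself
            rows.append({"logical_port": f"rtoj-{a}",
                         "ip": ips[b],
                         "mac": "<resolved-mac>",
                         "datapath": f"{a}-datapath"})
    return rows

rows = prepopulated_mac_bindings(["gr1", "gr2", "gr3"])
print(len(rows))  # 6: still n * (n - 1), i.e. O(n^2)
```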

Thanks
Numan


> > 2. In most places in ovn-kubernetes, our MAC addresses are
> > programmatically related to the corresponding IP addresses, and in
> > places where that's not currently true, we could try to make it true,
> > and then perhaps the thousands of rules could just be replaced by a
> > single rule?
> >
> This may be a good idea, but I am not sure how to implement in OVN to make
> it generic, since most OVN users can't make such assumption.
>
> On the other hand, why wouldn't splitting the join logical switch to 1000
> LSes solve the problem? I understand that there will be 1000 more
> datapaths, and 1000 more LRPs, but these are all O(n), which is much more
> efficient than the O(n^2) exploding. What's the other scale issues created
> by this?
>
> In addition, Girish, for the external LS, I am not sure why can't it be
> shared, if all the nodes are connected to a single L2 network. (If they are
> connected to separate L2 networks, different external LSes should be
> created, at least according to current OVN model).
>
> Thanks,
> Han


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-06 Thread Numan Siddique
On Wed, May 6, 2020 at 12:56 PM Numan Siddique  wrote:

>
>
> On Wed, May 6, 2020 at 12:28 AM Han Zhou  wrote:
>
>>
>>
>> On Fri, May 1, 2020 at 2:14 PM Dan Winship  wrote:
>> >
>> > On 5/1/20 12:37 PM, Girish Moodalbail wrote:
>> > > If we now look at table=12 (lr_in_arp_resolve) in the ingress pipeline
>> > > of Gateway Router-1, then you will see that there will be 2000 logical
>> > > flow entries...
>> >
>> > > In the topology above, the only intended path is North-South between
>> > > each gateway router and the logical router. There is no east-west
>> > > traffic between the gateway routers
>> >
>> > > Is there an another way to solve the above problem with just keeping
>> the
>> > > single join logical switch?
>> >
>> > Two thoughts:
>> >
>> > 1. In openshift-sdn, the bridge doesn't try to handle ARP itself. It
>> > just lets ARP requests pass through normally, and lets ARP replies pass
>> > through normally as long as they are correct (ie, it doesn't let
>> > spoofing through). This means fewer flows but more traffic. Maybe that's
>> > the right tradeoff?
>> >
>> The 2M entries here is not for ARP responder, but more equivalent to the
>> neighbour table (or ARP cache), on each LR. The ARP responder resides in
>> the LS (join logical switch), which is O(n) instead of O(n^2), so it is not
>> a problem here.
>>
>> However, a similar idea may works here to avoid the O(n^2) scale issue.
>> For the neighbour table, actually OVN has two parts, one is statically
>> build, which is the 2M entires mentioned in this case, and the other is the
>> dynamic ARP resolve - the mac_binding table, which is dynamically populated
>> by handling ARP messages. To solve the problem here, it is possible to
>> change OVN to support configuring a LR to avoid static neighbour table, and
>> relies only on dynamic ARP resolving. In this case, all the gateway routers
>> can be configured as not using static ARP resolving, and eventually there
>> will be only 2 entries (one for IPv4 and one for IPv6) for each gateway
>> router in mac_binding table for the north-south traffic to the join router.
>> (of source there will be still same amount of mac_bindings in each router
>> for the external traffic on the other side of the gateway routers).
>>
>> This change seems straightforward, but I am not sure if there is any
>> corner cases.
>>
>
> May be ovn-northd instead of adding these lflows in lr_in_arp_resolve, can
> probably create a mac_binding table row in SB DB ?
> This would result in less logical flows at the cost of more mac_binding
> entries. The number of OF flows would still remain the same.
>

I forgot to mention: Lorenzo has similar ideas about moving the ARP
resolve lflows for NAT entries to mac_binding rows.


>
> Thanks
> Numan
>
>
>> > 2. In most places in ovn-kubernetes, our MAC addresses are
>> > programmatically related to the corresponding IP addresses, and in
>> > places where that's not currently true, we could try to make it true,
>> > and then perhaps the thousands of rules could just be replaced by a
>> > single rule?
>> >
>> This may be a good idea, but I am not sure how to implement in OVN to
>> make it generic, since most OVN users can't make such assumption.
>>
>> On the other hand, why wouldn't splitting the join logical switch to 1000
>> LSes solve the problem? I understand that there will be 1000 more
>> datapaths, and 1000 more LRPs, but these are all O(n), which is much more
>> efficient than the O(n^2) exploding. What's the other scale issues created
>> by this?
>>
>> In addition, Girish, for the external LS, I am not sure why can't it be
>> shared, if all the nodes are connected to a single L2 network. (If they are
>> connected to separate L2 networks, different external LSes should be
>> created, at least according to current OVN model).
>>
>> Thanks,
>> Han


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-06 Thread Han Zhou
On Wed, May 6, 2020 at 12:49 AM Numan Siddique  wrote:
>
>
>
> On Wed, May 6, 2020 at 12:56 PM Numan Siddique  wrote:
>>
>>
>>
>> On Wed, May 6, 2020 at 12:28 AM Han Zhou  wrote:
>>>
>>>
>>>
>>> On Fri, May 1, 2020 at 2:14 PM Dan Winship 
wrote:
>>> >
>>> > On 5/1/20 12:37 PM, Girish Moodalbail wrote:
>>> > > If we now look at table=12 (lr_in_arp_resolve) in the ingress
pipeline
>>> > > of Gateway Router-1, then you will see that there will be 2000
logical
>>> > > flow entries...
>>> >
>>> > > In the topology above, the only intended path is North-South between
>>> > > each gateway router and the logical router. There is no east-west
>>> > > traffic between the gateway routers
>>> >
>>> > > Is there an another way to solve the above problem with just
keeping the
>>> > > single join logical switch?
>>> >
>>> > Two thoughts:
>>> >
>>> > 1. In openshift-sdn, the bridge doesn't try to handle ARP itself. It
>>> > just lets ARP requests pass through normally, and lets ARP replies
pass
>>> > through normally as long as they are correct (ie, it doesn't let
>>> > spoofing through). This means fewer flows but more traffic. Maybe
that's
>>> > the right tradeoff?
>>> >
>>> The 2M entries here is not for ARP responder, but more equivalent to
the neighbour table (or ARP cache), on each LR. The ARP responder resides
in the LS (join logical switch), which is O(n) instead of O(n^2), so it is
not a problem here.
>>>
>>> However, a similar idea may works here to avoid the O(n^2) scale issue.
For the neighbour table, actually OVN has two parts, one is statically
build, which is the 2M entires mentioned in this case, and the other is the
dynamic ARP resolve - the mac_binding table, which is dynamically populated
by handling ARP messages. To solve the problem here, it is possible to
change OVN to support configuring a LR to avoid static neighbour table, and
relies only on dynamic ARP resolving. In this case, all the gateway routers
can be configured as not using static ARP resolving, and eventually there
will be only 2 entries (one for IPv4 and one for IPv6) for each gateway
router in mac_binding table for the north-south traffic to the join router.
(of source there will be still same amount of mac_bindings in each router
for the external traffic on the other side of the gateway routers).
>>>
>>> This change seems straightforward, but I am not sure if there is any
corner cases.
>>
>>
>> May be ovn-northd instead of adding these lflows in lr_in_arp_resolve,
can probably create a mac_binding table row in SB DB ?
>> This would result in less logical flows at the cost of more mac_binding
entries. The number of OF flows would still remain the same.
>
>
> I forgot to mention, Lorenzo have similar ideas for moving the arp
resolve lflows for NAT entries to mac_binding rows.
>

I am hesitant about the approach of moving to mac_binding as a solution to
this particular problem, because:
1. Although the cost of each mac_binding entry may be much lower than that
of a logical flow entry, it would still be O(n^2), since the LRP is part
of the key in the table.
2. It is better to separate the static and dynamic parts clearly. Moving
to mac_binding would lose this clarity in the data, and also the ownership
of the data (currently mac_binding entries are added only by
ovn-controllers).
Although I am not in favor of solving the problem with this approach
(because of 1), maybe it makes sense to reduce the number of logical flows
as a general improvement by moving all neighbour information to
mac_binding for scalability. If we do so, I would suggest figuring out a
way to keep the clarity between the static and dynamic parts.

For this particular problem, we just don't want the static part populated,
because most of those entries are not needed, except one per LRP. However,
even before considering optionally disabling the static part, I want to
understand first why separating the join LS would not solve the problem.

>>
>>
>> Thanks
>> Numan
>>
>>>
>>> > 2. In most places in ovn-kubernetes, our MAC addresses are
>>> > programmatically related to the corresponding IP addresses, and in
>>> > places where that's not currently true, we could try to make it true,
>>> > and then perhaps the thousands of rules could just be replaced by a
>>> > single rule?
>>> >
>>> This may be a good idea, but I am not sure how to implement in OVN to
make it generic, since most OVN users can't make such assumption.
>>>
>>> On the other hand, why wouldn't splitting the join logical switch to
1000 LSes solve the problem? I understand that there will be 1000 more
datapaths, and 1000 more LRPs, but these are all O(n), which is much more
efficient than the O(n^2) exploding. What's the other scale issues created
by this?
>>>
>>> In addition, Girish, for the external LS, I am not sure why can't it be
shared, if all the nodes are connected to a single L2 network. (If they are
connected to separate L2 networks, different external LSes should be
created, at least according to current OVN model).
>>>
>>> Thanks,
>>>

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-06 Thread Numan Siddique
On Wed, May 6, 2020 at 11:41 PM Han Zhou  wrote:

>
>
> On Wed, May 6, 2020 at 12:49 AM Numan Siddique  wrote:
> >
> >
> >
> > On Wed, May 6, 2020 at 12:56 PM Numan Siddique  wrote:
> >>
> >>
> >>
> >> On Wed, May 6, 2020 at 12:28 AM Han Zhou  wrote:
> >>>
> >>>
> >>>
> >>> On Fri, May 1, 2020 at 2:14 PM Dan Winship 
> wrote:
> >>> >
> >>> > On 5/1/20 12:37 PM, Girish Moodalbail wrote:
> >>> > > If we now look at table=12 (lr_in_arp_resolve) in the ingress
> pipeline
> >>> > > of Gateway Router-1, then you will see that there will be 2000
> logical
> >>> > > flow entries...
> >>> >
> >>> > > In the topology above, the only intended path is North-South
> between
> >>> > > each gateway router and the logical router. There is no east-west
> >>> > > traffic between the gateway routers
> >>> >
> >>> > > Is there an another way to solve the above problem with just
> keeping the
> >>> > > single join logical switch?
> >>> >
> >>> > Two thoughts:
> >>> >
> >>> > 1. In openshift-sdn, the bridge doesn't try to handle ARP itself. It
> >>> > just lets ARP requests pass through normally, and lets ARP replies
> pass
> >>> > through normally as long as they are correct (ie, it doesn't let
> >>> > spoofing through). This means fewer flows but more traffic. Maybe
> that's
> >>> > the right tradeoff?
> >>> >
> >>> The 2M entries here is not for ARP responder, but more equivalent to
> the neighbour table (or ARP cache), on each LR. The ARP responder resides
> in the LS (join logical switch), which is O(n) instead of O(n^2), so it is
> not a problem here.
> >>>
> >>> However, a similar idea may works here to avoid the O(n^2) scale
> issue. For the neighbour table, actually OVN has two parts, one is
> statically build, which is the 2M entires mentioned in this case, and the
> other is the dynamic ARP resolve - the mac_binding table, which is
> dynamically populated by handling ARP messages. To solve the problem here,
> it is possible to change OVN to support configuring a LR to avoid static
> neighbour table, and relies only on dynamic ARP resolving. In this case,
> all the gateway routers can be configured as not using static ARP
> resolving, and eventually there will be only 2 entries (one for IPv4 and
> one for IPv6) for each gateway router in mac_binding table for the
> north-south traffic to the join router. (of source there will be still same
> amount of mac_bindings in each router for the external traffic on the other
> side of the gateway routers).
> >>>
> >>> This change seems straightforward, but I am not sure if there is any
> corner cases.
> >>
> >>
> >> May be ovn-northd instead of adding these lflows in lr_in_arp_resolve,
> can probably create a mac_binding table row in SB DB ?
> >> This would result in less logical flows at the cost of more mac_binding
> entries. The number of OF flows would still remain the same.
> >
> >
> > I forgot to mention, Lorenzo have similar ideas for moving the arp
> resolve lflows for NAT entries to mac_binding rows.
> >
>
> I am hesitate to the approach of moving to mac_binding as solution to this
> particular problem, because:
> 1. Although cost of each mac_binding entry may be much lower than a
> logical flow entry, it would still be O(n^2), since LRP is part of the key
> in the table.
>

Agree. I realize it now.

Thanks
Numan


> 2. It is better to separate the static and dynamic part clearly. Moving to
> mac_binding will lose this clarity in data, and also the ownership of the
> data as well (now mac_binding entries are added only by ovn-controllers).
> Although I am not in favor of solving the problem with this approach
> (because of 1)), maybe it makes sense to reduce number of logical flows as
> a general improvement by moving all neighbour information to mac_binding
> for scalability. If we do so, I would suggest to figure out a way to keep
> the data clarity between static and dynamic part.
>
> For this particular problem, we just don't want the static part populated
> because most of them are not needed except one per LRP. However, even
> before considering optionally disabling the static part, I wanted to
> understand firstly why separating the join LS would not solve the problem.
>
> >>
> >>
> >> Thanks
> >> Numan
> >>
> >>>
> >>> > 2. In most places in ovn-kubernetes, our MAC addresses are
> >>> > programmatically related to the corresponding IP addresses, and in
> >>> > places where that's not currently true, we could try to make it true,
> >>> > and then perhaps the thousands of rules could just be replaced by a
> >>> > single rule?
> >>> >
> >>> This may be a good idea, but I am not sure how to implement in OVN to
> make it generic, since most OVN users can't make such assumption.
> >>>
> >>> On the other hand, why wouldn't splitting the join logical switch to
> 1000 LSes solve the problem? I understand that there will be 1000 more
> datapaths, and 1000 more LRPs, but these are all O(n), which is much more
> efficient than the O(n^2) exploding. What's the other scale 

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-07 Thread Han Zhou
(Add the MLs back)

On Thu, May 7, 2020 at 4:01 PM Girish Moodalbail 
wrote:

> Hello Han,
>
> Sorry, I was monitoring the ovn-kubernetes google group and didn't see
> your emails till now.
>
>
>>
>> On the other hand, why wouldn't splitting the join logical switch to 1000
>> LSes solve the problem? I understand that there will be 1000 more
>> datapaths, and 1000 more LRPs, but these are all O(n), which is much more
>> efficient than the O(n^2) exploding. What's the other scale issues created
>> by this?
>>
>
> Splitting the single join logical switch into 1000 different logical
> switches is how I have resolved the problem for now. However, with this
> design I see the following issues.
> (1) Complexity
>    where one logical switch should have sufficed, we now need to create
> 1000 logical switches just to work around the O(n^2) logical flows
> (2) IPAM management
>   - before, I had one IP subnet 100.64.0.0/16 for the single logical
> switch and depended on OVN IPAM to allocate IPs from that subnet
>   - now, I need to first do subnet management (break the /16 into /29
> CIDRs) in OVN K8s and then assign each subnet to its join logical switch
> (3) each of these join logical switches is a distributed switch. The
> flows related to each one of them will be present on each hypervisor.
> This will increase the number of OpenFlow flows. However, from the OVN
> K8s point of view this logical switch is essentially pinned to a
> hypervisor, and its role is to connect the hypervisor's l3gateway to the
> distributed router.
>
> We are trying to simplify the OVN logical topology for OVN K8s so that
> the number of logical flows (and therefore the number of OpenFlow flows)
> is reduced, which in turn reduces the pressure on the ovn-northd, OVN SB
> DB, and ovn-controller processes.
>
> Every node in an OVN K8s cluster adds 4 resources. So, in a 1000-node
> k8s cluster we will have 4000 + 1 (the distributed router). This ends up
> creating around 250K OpenFlow rules on each hypervisor. This number is
> just to support the initial logical topology; I am not accounting for
> any flows that will be generated for k8s network policies, services, and
> so on.
>
>
>>
>> In addition, Girish, for the external LS, I am not sure why can't it be
>> shared, if all the nodes are connected to a single L2 network. (If they are
>> connected to separate L2 networks, different external LSes should be
>> created, at least according to current OVN model).
>>
>
> Yes, the plan was to share the same external LS among all of the L3
> gateway routers, since they are all on the same broadcast domain.
> However, we will end up with the same 2M logical flows, since a single
> external LS connects all the L3 gateway routers on the same broadcast
> domain.
>
> In short, for a 1000-node K8s cluster, if we reduce the logical flow
> explosion, then we can reduce the number of logical resources in the OVN
> K8s topology by 1998 (1000 join LSes will become 1, and 1000 external
> LSes will become 1).
>
>
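Girish's IPAM point (2) above, carving the single /16 into per-node /29s,
can be sketched with Python's ipaddress module (subnet sizes are the ones
from the thread):

```python
import ipaddress

# Split the join network 100.64.0.0/16 into /29s, one per join logical
# switch, as described for the per-node-join design.
join_net = ipaddress.ip_network("100.64.0.0/16")
per_node = list(join_net.subnets(new_prefix=29))

print(len(per_node))   # 8192 /29s available; 1000 nodes need 1000
print(per_node[0])     # 100.64.0.0/29
print(per_node[1])     # 100.64.0.8/29
```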
Ok, so now we are not satisfied with even O(n); instead we want to make it
O(1) for some of the resources.
I think the major problem is the per-node gateway routers, which do not
seem really necessary in theory. Ideally the topology can be simplified
with the concept of distributed gateway ports on a single logical router
(the join router), and then we can remove all the join LSes and gateway
routers, something like below:

+------------------------------------------------------------+
|                  external logical switch                   |
+-----+-------------+--------------------------+-------------+
      |             |                          |
+-----+-----+ +-----+-----+        +-----------+-----+
| dgp1@node1| | dgp2@node2|  ...   | dgp1000@node1000|
+-----+-----+ +-----+-----+        +-----------+-----+
      |             |                          |
+-----+-------------+--------------------------+-------------+
|                      logical router                        |
+------------------------------------------------------------+

(dgp = distributed gateway port)

This way, you only need one router and also one external logical switch,
and there won't be the O(n^2) flow explosion for ARP resolving, because
you have only 1 LR. The number of logical routers and switches becomes
O(1). The number of router ports is still O(n), but it is also halved.
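An illustrative, approximate tally of the logical resources under the two
designs discussed here (the bookkeeping follows the counts mentioned in
the thread and is not exact):

```python
def resource_counts(n: int) -> dict:
    """Rough resource tally for n nodes under the two topologies.

    per_node_join: n gateway routers + 1 distributed router, n join
    switches, and two router ports per join switch (one on each side).
    dgp: a single logical router with n distributed gateway ports and
    one shared external switch.
    """
    return {
        "per_node_join": {"routers": n + 1, "switches": n,
                          "router_ports": 2 * n},
        "dgp": {"routers": 1, "switches": 1, "router_ports": n},
    }

print(resource_counts(1000)["dgp"])
```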

In reality, there are some problems with this solution that need to be
addressed.

Firstly, it would require some change in OVN, because currently OVN has a
limitation that each LR can have only one gateway router port. However,
there doesn't seem to be anything fundamental that would prevent us from
removing that restriction to support multiple distributed gateway ports on
a single LR. I'd like to hear from more OVN folks in case there is some
reason we shouldn't do this.

The other thing that I am not so sure about is connecting the logical
router to the external logical switch through multiple ports. This means
we will have multiple ports of the logical router on the same subnet,
which is something we usual

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-07 Thread Numan Siddique
On Fri, May 8, 2020 at 11:54 AM Han Zhou  wrote:

> (Add the MLs back)
>
> On Thu, May 7, 2020 at 4:01 PM Girish Moodalbail 
> wrote:
>
>> Hello Han,
>>
>> Sorry, I was monitoring the ovn-kubernetes google group and didn't see
>> your emails till now.
>>
>>
>>>
>>> On the other hand, why wouldn't splitting the join logical switch to
>>> 1000 LSes solve the problem? I understand that there will be 1000 more
>>> datapaths, and 1000 more LRPs, but these are all O(n), which is much more
>>> efficient than the O(n^2) exploding. What's the other scale issues created
>>> by this?
>>>
>>
>> Splitting a single join logical switch into 1000 different logical
>> switches is how I have resolved the problem for now. However, with this
>> design I see the following issues:
>> (1) Complexity
>>    where one logical switch should have sufficed, we now need to create
>> 1000 logical switches just to work around the O(n^2) logical flows
>> (2) IPAM management
>>   - before, I had one IP subnet 100.64.0.0/16 for the single logical
>> switch and depended on OVN IPAM to allocate IPs off of that subnet
>>   - now I need to first do subnet management (break the /16 into /29
>> CIDRs) in OVN K8s and then assign each subnet to one of the join logical
>> switches
>> (3) each of these join logical switches is a distributed switch. The flows
>> related to each one of them will be present in each hypervisor. This will
>> increase the number of OpenFlow flows. However, from the OVN K8s point of
>> view this logical switch is essentially pinned to a hypervisor and its role
>> is to connect the hypervisor's l3gateway to the distributed router.
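The subnet-management burden described in point (2) can be sketched with Python's ipaddress module. The /16 and /29 sizes come from the message above; the per-node assignment scheme itself is only an assumption:

```python
import ipaddress

# Carve the 100.64.0.0/16 transit range into per-node /29 join subnets.
# Each /29 has 6 usable host addresses, enough for the DR port and the
# GR port of one node, with room to spare.
join_cidr = ipaddress.ip_network("100.64.0.0/16")
per_node = list(join_cidr.subnets(new_prefix=29))

print(len(per_node))        # 8192 /29s available for 1000 nodes
first = per_node[0]
dr_ip, gr_ip = list(first.hosts())[:2]
print(first, dr_ip, gr_ip)  # 100.64.0.0/29 100.64.0.1 100.64.0.2
```

This is exactly the bookkeeping (allocate, track, and release one subnet per node) that a single shared join LS with OVN IPAM avoids.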
>>
>> We are trying to simplify the OVN logical topology for OVN K8s so that
>> the number of logical flows (and therefore the number of OpenFlow flows)
>> is reduced, which in turn reduces the pressure on the ovn-northd, OVN SB
>> DB, and ovn-controller processes.
>>
>> Every node in an OVN K8s cluster adds 4 resources. So, in a 1000-node
>> k8s cluster we will have 4000 + 1 (the distributed router). This ends up
>> creating around 250K OpenFlow rules in each hypervisor. This number just
>> supports the initial logical topology; I am not accounting for any flows
>> that will be generated for k8s network policies, services, and so on.
>>
>>
>>>
>>> In addition, Girish, for the external LS, I am not sure why it can't be
>>> shared, if all the nodes are connected to a single L2 network. (If they are
>>> connected to separate L2 networks, different external LSes should be
>>> created, at least according to the current OVN model.)
>>>
>>
>> Yes, the plan was to share the same external LS with all of the L3
>> gateway routers since they are all on the same broadcast domain. However,
>> we will end up with the same 2M logical flows since a single external LS
>> connects all the L3 gateway routers on the same broadcast domain.
>>
>> In short, for a 1000-node K8s cluster, if we reduce the logical flow
>> explosion, then we can reduce the number of logical resources in the OVN
>> K8s topology by 1998 (1000 join LSes will become 1, and 1000 external
>> LSes will become 1).
>>
>>
I'd be happy if ovn-kube makes use of logical ro

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-08 Thread Girish Moodalbail

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-08 Thread Han Zhou
On Fri, May 8, 2020 at 6:41 AM Tim Rozet  wrote:

> Girish, Han,
> From my understanding the GR (per node) <> DR link is a local subnet and
> you don't want the overhead of many switch objects in OVN, but you also
> don't want all the GRs connecting to a single switch, creating a large L2
> domain. Isn't the simple solution to allow connecting routers to each other
> without an intermediary switch?
>
>
Tim Rozet
> Red Hat CTO Networking Team
>
>
Hi Tim,

Thanks for the suggestion. This should be an improvement, but it doesn't
completely solve the problem mentioned by Girish:
- Subnet management for the large number of transit subnets is still needed.
- For the external logical switch, this doesn't help.
It is still O(n) in the number of datapaths, the same as the approach of
splitting the join LS, but it is more optimal, because for each of the
direct connections between the LR and the GRs the cost of an intermediate
LS datapath is avoided. I think it is worth trying.

Hi Girish, for the DGP solution, please see my comments below:


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-08 Thread Lorenzo Bianconi
> On Wed, May 6, 2020 at 11:41 PM Han Zhou  wrote:
> 
> >
> >
> > On Wed, May 6, 2020 at 12:49 AM Numan Siddique  wrote:
> > >

[...]

> > > I forgot to mention, Lorenzo has similar ideas for moving the ARP
> > > resolve lflows for NAT entries to mac_binding rows.
> > >
> >
> > I am hesitant about the approach of moving to mac_binding as a solution
> > to this particular problem, because:
> > 1. Although the cost of each mac_binding entry may be much lower than a
> > logical flow entry, it would still be O(n^2), since the LRP is part of
> > the key in the table.
> >
> 
> Agree. I realize it now.
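Han's O(n^2) point can be made concrete by counting keys: the SB MAC_Binding table is keyed by (logical_port, ip), so on a shared join switch every router port would still need a row per peer IP. A rough counting sketch (the formula is an assumption mirroring the flow counts discussed earlier in the thread):

```python
# SB MAC_Binding rows needed on a shared join LS: the table is keyed by
# (logical_port, ip), so each of the n+1 router ports (n GRs + 1 DR) can
# hold an entry for every other port's IP, for IPv4 and IPv6 alike.
# Cheaper per row than a logical flow, but still quadratic in n.
def mac_binding_rows(n_gateway_routers):
    ports = n_gateway_routers + 1
    return ports * (ports - 1) * 2

print(mac_binding_rows(1000))  # 2002000, the same order as the 2M lflows
```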

Hi Han and Numan,

what about moving to the mac_binding table just the entries related to NAT
where we configured the external MAC address, since this info is known in
advance? I can share a PoC I developed a few weeks ago.

Regards,
Lorenzo

> 
> Thanks
> Numan
> 
> 
> > 2. It is better to separate the static and dynamic parts clearly. Moving
> > to mac_binding will lose this clarity in the data, and the ownership of
> > the data as well (today mac_binding entries are added only by
> > ovn-controllers). Although I am not in favor of solving the problem with
> > this approach (because of 1)), maybe it makes sense to reduce the number
> > of logical flows as a general improvement by moving all neighbour
> > information to mac_binding for scalability. If we do so, I would suggest
> > figuring out a way to keep the clarity between the static and dynamic
> > parts of the data.
> >
> > For this particular problem, we just don't want the static part
> > populated, because most of the entries are not needed, except one per
> > LRP. However, even before considering optionally disabling the static
> > part, I wanted to understand firstly why separating the join LS would
> > not solve the problem.
> >
> > >>
> > >>
> > >> Thanks
> > >> Numan
> > >>
> > >>>
> > >>> > 2. In most places in ovn-kubernetes, our MAC addresses are
> > >>> > programmatically related to the corresponding IP addresses, and in
> > >>> > places where that's not currently true, we could try to make it true,
> > >>> > and then perhaps the thousands of rules could just be replaced by a
> > >>> > single rule?
> > >>> >
> > >>> This may be a good idea, but I am not sure how to implement it in
> > >>> OVN generically, since most OVN users can't make such an assumption.
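The "MACs programmatically related to IPs" idea quoted above can be sketched as follows. ovn-kubernetes derives pod MACs by prefixing the four IPv4 address bytes with a fixed two-byte prefix; the `0a:58` prefix used here is that convention, treated as an assumption rather than something stated in this thread:

```python
import ipaddress

# Derive a MAC from an IPv4 address: a fixed 2-byte prefix followed by the
# 4 address bytes. With such a scheme a single generic flow could, in
# principle, compute eth.dst from the next-hop IP instead of needing one
# lr_in_arp_resolve flow per peer.
def mac_from_ipv4(ip: str, prefix: bytes = b"\x0a\x58") -> str:
    packed = prefix + ipaddress.IPv4Address(ip).packed
    return ":".join(f"{b:02x}" for b in packed)

print(mac_from_ipv4("100.64.0.3"))  # 0a:58:64:40:00:03
```

As Han notes, OVN itself cannot assume this mapping for all users, which is why it cannot easily replace the static resolve flows in the general case.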
> > >>>
> > >>> On the other hand, why wouldn't splitting the join logical switch to
> > 1000 LSes solve the problem? I understand that there will be 1000 more
> > datapaths, and 1000 more LRPs, but these are all O(n), which is much more
> > efficient than the O(n^2) exploding. What's the other scale issues created
> > by this?
> > >>>
> > >>> In addition, Girish, for the external LS, I am not sure why can't it
> > be shared, if all the nodes are connected to a single L2 network. (If they
> > are connected to separate L2 networks, different external LSes should be
> > created, at least according to current OVN model).
> > >>>
> > >>> Thanks,
> > >>> Han
> > >>> ___
> > >>> discuss mailing list
> > >>> disc...@openvswitch.org
> > >>> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> >



Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-09 Thread Numan Siddique
On Sat, May 9, 2020 at 6:44 PM Tim Rozet  wrote:

> So we can get rid of the join logical switch. This might be a dumb
> question, but why do we need an external switch? In the local gateway mode:
>
> pod --- logical switch --- DR --- join switch (to remove) --- GR
> 169.x.x.2 --- external switch --- 169.x.x.1 Linux host
>
> There's no reason in the above to have an external switch that I can see.
>

If there is no external switch, i.e., a logical switch with a localnet
port, then how will the packet go out from br-int?
The packet is supposed to go out from br-int via the patch port connecting
to the provider bridge.

Thanks
Numan


>
> Perhaps in the shared gateway mode it is necessary if all of the nodes
> externally attach to the same L2 network.
>
> Tim Rozet
> Red Hat CTO Networking Team
>
>

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-09 Thread Tim Rozet
Girish, Han,
From my understanding the GR (per node) <> DR link is a local subnet and
you don't want the overhead of many switch objects in OVN, but you also
don't want all the GRs connecting to a single switch, creating a large L2
domain. Isn't the simple solution to allow connecting routers to each other
without an intermediary switch?

Tim Rozet
Red Hat CTO Networking Team



Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-09 Thread Tim Rozet
So we can get rid of the join logical switch. This might be a dumb
question, but why do we need an external switch? In the local gateway mode:

pod --- logical switch --- DR --- join switch (to remove) --- GR
169.x.x.2 --- external switch --- 169.x.x.1 Linux host

There's no reason in the above to have an external switch that I can see.

Perhaps in the shared gateway mode it is necessary if all of the nodes
externally attach to the same L2 network.

Tim Rozet
Red Hat CTO Networking Team




Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-09 Thread Girish Moodalbail
Hello Han, Tim

Please see in-line:



> Hello Han,
>>>
>>> I did consider distributed gateway port. However, there are two issues
>>> with it
>>>
>>> 1. In order to support K8s NodePort services we need to create a
>>> North-South LB and L3 gateway is a perfect solution for that. AFAIK,
>>>DGP doesn't support it
>>>
>>
> In fact DGP supports LB (at least from code
> https://github.com/ovn-org/ovn/blob/master/northd/ovn-northd.c#L9318),
> but the ovn-nb manpage may need an update.
>

I see


>
>
>>> 2. Datapath performance would be bad with DGP. We want packets meant
>>> for the host or the Internet to exit the hypervisor on which the pod
>>> exists. The L3 gateway router provides us with this functionality. With
>>> DGP, and with OVN supporting only one instance of it, packets
>>> unnecessarily get forwarded over the tunnel to the DGP chassis for
>>> SNATing and then get forwarded back over the tunnel to the host just to
>>> exit locally.
>>>
>>
> This is related to the changes needed for DGP (the first point I mentioned
> in my previous email). In the diagram I drew, there would be 1000 DGPs,
> each residing on a chassis, just to make sure north-south traffic can be
> forwarded on the local chassis without going through a central node, just
> like how it works today in ovn-k8s. However, maybe this is not a small
> change, because today the NAT and LB processing on such LRs (LRs with a
> DGP) is all based on the assumption that there is only one DGP. For
> example, the NB schema would also need to be changed so that the NAT/LB
> rules for a router can specify a DGP to determine the central processing
> location for those rules.
>

Correct


>
> So, to summarize, if we can make multi-DGP work, it would be the best
> solution for the ovn-k8s scenario. If we can't (either because of a design
> problem, or because it is too big an effort for the gains), then making
> the static neighbour flows configurably avoidable is a good way to go.
> Both options require changes in OVN.
>

Han, optimizing the neighbor cache from the current O(n^2) to something
scalable would be ideal in the short term. I am hoping that the changes to
OVN will not be as complicated as the multi-DGP work and the other changes
to OVN proposed in this email thread.
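For concreteness, here is the arithmetic behind the two growth rates, using
the 1000-node figures from the top of the thread (a back-of-the-envelope
sketch, not OVN code):

```python
# Static lr_in_arp_resolve flows today: each of the n gateway routers
# carries one IPv4 and one IPv6 resolve entry for every router on the
# join switch, so the total grows as O(n^2).
n = 1000                         # gateway routers on the join switch
per_router = 2 * n               # 1000 IPv4 + 1000 IPv6 entries
total_static = n * per_router
print(per_router, total_static)  # 2000 2000000

# With purely dynamic (on-demand) resolution, north-south-only traffic
# needs just the distributed router's bindings on each GR: O(n) overall.
total_dynamic = n * 2            # one IPv4 + one IPv6 binding per GR
print(total_dynamic)             # 2000
```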



> Without changes in OVN, a further optimization that can be made on top of
> your current workaround is what Tim has suggested: replace the large
> number of small join LSes (and the LRPs and patch ports on both sides)
> with the same number of directly connected LRPs.
>

Han and Tim,

OVN supports peering two distributed routers without a logical switch in
between; however, it doesn't support connecting a distributed router and an
L3 gateway router directly as peers. I clearly remember this being
mentioned in the ovn-architecture man page.

-8<--8<-

   The distributed router and the
   gateway router are connected by another logical switch, sometimes
   referred to as a ``join'' logical switch. (OVN logical routers may be
   connected to one another directly, without an intervening switch, but
   the OVN implementation only supports gateway logical routers that are
   connected to logical switches. Using a join logical switch also reduces
   the number of IP addresses needed on the distributed router.)

-8<--8<-

Before splitting the OVN join logical switch into several small logical
switches, I did try directly connecting the LR to each of the node-specific
LRs using a point-to-point link, but it didn't work. Since this was
corroborated by the man page, I didn't debug the topology and moved on to
splitting the `join` logical switch.
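For reference, the direct peering that was attempted looks roughly like the
following (a hypothetical sketch: the router names, port names, MACs, and
/30 subnets are all made up, and per the man page excerpt above this
configuration is not expected to work when one side is a gateway router):

```shell
# Hypothetical direct LR-to-LR peering, one point-to-point link per node.
# Distributed-router side:
ovn-nbctl lrp-add distributed_router dr-to-gr1 00:00:00:00:01:01 \
    100.64.0.1/30 peer=gr1-to-dr
# Gateway-router side:
ovn-nbctl lrp-add GR_node1 gr1-to-dr 00:00:00:00:01:02 \
    100.64.0.2/30 peer=dr-to-gr1
```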

Regards,
~Girish




Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-10 Thread Han Zhou
On Sat, May 9, 2020 at 5:01 PM Girish Moodalbail 
wrote:
> Before splitting the OVN join logical switch into several small logical
switches, I did try directly connecting the LR to each of the node-specific
LR using a point-to-point link but it didn't work. Since this was
corroborated by the man page, I didn't debug the topology and moved on to
splitting the `join` logical switch.

You are right. So this *improvement* would also need changes in OVN, and
the benefit seems less obvious than in the other two options.



Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-13 Thread Tim Rozet
I went ahead and filed:
https://bugzilla.redhat.com/show_bug.cgi?id=1835386

Tim Rozet
Red Hat CTO Networking Team


On Mon, May 11, 2020 at 1:27 AM Han Zhou  wrote:



Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-16 Thread Han Zhou
On Tue, May 5, 2020 at 11:57 AM Han Zhou  wrote:
>
>
>
> On Fri, May 1, 2020 at 2:14 PM Dan Winship  wrote:
> >
> > On 5/1/20 12:37 PM, Girish Moodalbail wrote:
> > > If we now look at table=12 (lr_in_arp_resolve) in the ingress pipeline
> > > of Gateway Router-1, then you will see that there will be 2000 logical
> > > flow entries...
> >
> > > In the topology above, the only intended path is North-South between
> > > each gateway router and the logical router. There is no east-west
> > > traffic between the gateway routers
> >
> > > Is there an another way to solve the above problem with just keeping
the
> > > single join logical switch?
> >
> > Two thoughts:
> >
> > 1. In openshift-sdn, the bridge doesn't try to handle ARP itself. It
> > just lets ARP requests pass through normally, and lets ARP replies pass
> > through normally as long as they are correct (ie, it doesn't let
> > spoofing through). This means fewer flows but more traffic. Maybe that's
> > the right tradeoff?
> >
> The 2M entries here are not for the ARP responder; they are more akin to
the neighbour table (or ARP cache) on each LR. The ARP responder resides in
the LS (the join logical switch), which is O(n) instead of O(n^2), so it is
not a problem here.
>
> However, a similar idea may work here to avoid the O(n^2) scale issue.
For the neighbour table, OVN actually has two parts: one is statically
built, which accounts for the 2M entries mentioned in this case; the other
is dynamic ARP resolution, i.e. the mac_binding table, which is populated
by handling ARP messages. To solve the problem here, it is possible to
change OVN to support configuring an LR to skip the static neighbour table
and rely only on dynamic ARP resolution. In this case, all the gateway
routers can be configured not to use static ARP resolution, and eventually
there will be only 2 entries (one for IPv4 and one for IPv6) for each
gateway router in the mac_binding table for the north-south traffic to the
join router. (Of course, there will still be the same number of mac_binding
entries in each router for the external traffic on the other side of the
gateway routers.)
>
> This change seems straightforward, but I am not sure if there are any
corner cases.

Hi Girish,

I've sent an RFC patch here for the above proposal:
https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
For this use case, just set options:dynamic_neigh_routes=true for all the
Gateway Routers. Could you try it in your scale environment and see if it
solves the problem?
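Concretely, applying the knob from the RFC patch would presumably look
something like this per gateway router (a sketch: `GR_node1` is a made-up
router name, and the option name is taken from the patch above and may
change before it is merged):

```shell
# Hypothetical usage: enable dynamic neighbour resolution on one gateway
# router; repeat (or loop) for each of the 1000 GRs.
ovn-nbctl set Logical_Router GR_node1 options:dynamic_neigh_routes=true

# Confirm the option took effect:
ovn-nbctl get Logical_Router GR_node1 options:dynamic_neigh_routes
```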

Thanks,
Han



Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-16 Thread Girish Moodalbail
On Sat, May 16, 2020 at 12:36 AM Han Zhou  wrote:

Thanks Han for the patch. Will give it a try and let you know.

Regards,
~Girish




Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-16 Thread Girish Moodalbail
Hello Han,

Can you please explain how the dynamic resolution of the IP-to-MAC will
work with this new option set?

Say a packet is being forwarded from router2 towards the distributed
router. The next hop (reg0) is set to IP1, and we need to find the MAC
address M1 to set eth.dst to.

+----------------+    +----------------+
|   l3gateway    |    |   l3gateway    |
|    router2     |    |    router3     |
+--------+-------+    +-------+--------+
      IP2,M2                IP3,M3
         |                    |
      +--+--------------------+--+
      |       join switch        |
      +------------+-------------+
                   |
                IP1,M1
          +--------+-------+
          |  distributed   |
          |     router     |
          +----------------+

The MAC M1 will obviously not be in the MAC_binding table. On the
hypervisor where the packet originated, router2's port and the distributed
router's port are both locally present. So, does this result in a PACKET_IN
to ovn-controller, with the resolution happening there?

How does the resolution of IP3-to-M3 happen on gateway router2? Will an ARP
request packet be broadcast on the join switch in this case?

Regards,
~Girish


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-16 Thread Han Zhou
On Sat, May 16, 2020 at 12:13 PM Girish Moodalbail 
wrote:

> Hello Han,
>
> Can you please explain how the dynamic resolution of the IP-to-MAC will
> work with this new option set?
>
> Say a packet is being forwarded from router2 towards the distributed
> router. The next hop (reg0) is set to IP1, and we need to find the MAC
> address M1 to set eth.dst to.
>
> +----------------+    +----------------+
> |   l3gateway    |    |   l3gateway    |
> |    router2     |    |    router3     |
> +--------+-------+    +-------+--------+
>       IP2,M2                IP3,M3
>          |                    |
>       +--+--------------------+--+
>       |       join switch        |
>       +------------+-------------+
>                    |
>                 IP1,M1
>           +--------+-------+
>           |  distributed   |
>           |     router     |
>           +----------------+
>
> The MAC M1 will obviously not be in the MAC_binding table. On the
> hypervisor where the packet originated, router2's port and the distributed
> router's port are both locally present. So, does this result in a
> PACKET_IN to ovn-controller, with the resolution happening there?
>

Yes, there will be a PACKET_IN, and then:
1. ovn-controller will generate the ARP request for IP1 and send a
PACKET_OUT to OVS.
2. The ARP request will be delivered to the distributed router pipeline
only, because of special handling in OVN for ARPs targeting router port
IPs, even though it is a broadcast. (It would have been broadcast to all
GRs without that special handling.)
3. The distributed router pipeline should learn the IP-MAC binding of
IP2-M2 (through a PACKET_IN to ovn-controller), and at the same time send
ARP reply to the router2 in the distributed router pipeline.
4. Router2 pipeline will handle the ARP response and learn the IP-MAC
binding of IP1-M1 (through a PACKET_IN to ovn-controller).
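The effect of this learning sequence on table sizes can be mimicked with a
toy cache model (purely illustrative Python, not OVN code; only one address
family is modelled, so IPv4 plus IPv6 would double the counts):

```python
# Toy mac_binding model: each datapath keeps its own neighbour cache,
# filled in lazily by the ARP request/reply + PACKET_IN sequence above.
n = 1000
grs = [f"GR{i}" for i in range(n)]
cache = {dp: {} for dp in grs + ["DR"]}

def resolve(src, dst):
    """src forwards to dst: both ends learn each other's binding
    (the reply teaches src; the request itself teaches dst)."""
    cache[src].setdefault(dst, f"mac-of-{dst}")
    cache[dst].setdefault(src, f"mac-of-{src}")

# North-south only (the ovn-k8s case): every GR talks only to the DR.
for g in grs:
    resolve(g, "DR")
print(sum(len(c) for c in cache.values()))  # 2000 entries: O(n)

# Hypothetical full GR-to-GR mesh, for contrast:
for a in grs:
    for b in grs:
        if a != b:
            resolve(a, b)
print(sum(len(c) for c in cache.values()))  # 1001000 entries: O(n^2)
```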


>
> How about the resolution of IP3-to-M3 happen on gateway router2? Will
> there be an ARP request packet that will be broadcasted on the join switch
> for this case?
>

I think in the ovn-k8s use case, as you described before, this should not
happen. However, if it does happen, it works similarly to the steps above,
except that in steps 2) and 3) the ARP request and response will be sent
between the chassis through the tunnel. If this happens between all pairs
of GRs, then there will again be O(n^2) MAC_Binding entries.

I haven't tested the GR scenario yet, so I can't guarantee it works as
expected. Please let me know if you see any problems. I will submit a
formal patch with more test cases if it is confirmed in your environment.

Thanks,
Han



Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-17 Thread Girish Moodalbail
Thanks, Han, for the explanation. Yes, there is no east-west traffic
between the GRs (I was just curious). So, if the ARP request/response
between a GR and the DR is confined to the same chassis, then there
shouldn't be an O(n^2) explosion, per your explanation.

Will get back to you on how the test goes in the next few days.

Regards,
Girish

On Sat, May 16, 2020 at 11:17 PM Han Zhou  wrote:

>
>
> On Sat, May 16, 2020 at 12:13 PM Girish Moodalbail 
> wrote:
>
>> Hello Han,
>>
>> Can you please explain how the dynamic resolution of the IP-to-MAC will
>> work with this new option set?
>>
>> Say the packet is being forwarded from router2 towards the distributed
>> router? So, nexthop (reg0) is set to IP1 and we need to find the MAC
>> address M1 to set eth.dst to.
>>
>> ++++
>> |   l3gateway||   l3gateway|
>> |router2 ||router3 |
>> +-+--++-+--+
>> IP2,M2 IP3,M3
>>   | |
>>+--+-+---+
>>|join switch |
>>+-+--+
>>  |
>>   IP1,M1
>>  +---++
>>  |  distributed   |
>>  | router |
>>  ++
>>
>> The MAC M1 will not obviously in the MAC_binding table. On the hypervisor
>> where the packet originated, the router2's port and the distributed
>> router's port are locally present. So, does this result in a PACKET_IN to
>> the ovn-controller and the resolution happens there?
>>
>
> Yes there will be a PACKET_IN, and then:
> 1. ovn-controller will generate the ARP request for IP1, and send
> PACKET_OUT to OVS.
> 2. The ARP request will be delivered to the distributed router pipeline
> only, because of a special handling of ARP in OVN for IPs of router ports,
> although it is a broadcast. (It would have been broadcasted to all GRs
> without that special handling)
> 3. The distributed router pipeline should learn the IP-MAC binding of
> IP2-M2 (through a PACKET_IN to ovn-controller), and at the same time send
> ARP reply to the router2 in the distributed router pipeline.
> 4. Router2 pipeline will handle the ARP response and learn the IP-MAC
> binding of IP1-M1 (through a PACKET_IN to ovn-controller).
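The four steps above can be sketched as a tiny, self-contained simulation. This is illustrative Python only, not OVN code; the class name MacBindingTable and the port names (rtoj-gr2, rtoj-dr) are hypothetical stand-ins for the real Southbound Mac_Binding rows.

```python
# Minimal sketch (illustrative only) of the dynamic mac_binding learning
# sequence described in steps 1-4 above. Names are hypothetical.

class MacBindingTable:
    """Per-datapath IP -> MAC cache, populated via PACKET_IN events."""
    def __init__(self):
        self.entries = {}  # (logical_port, ip) -> mac

    def learn(self, port, ip, mac):
        self.entries[(port, ip)] = mac

    def lookup(self, port, ip):
        return self.entries.get((port, ip))

gr2 = MacBindingTable()   # gateway router 2
dr = MacBindingTable()    # distributed router

# Steps 1-2: GR2 has no binding for IP1, so ovn-controller emits an ARP
# request that is steered only to the distributed router's pipeline.
assert gr2.lookup("rtoj-gr2", "100.64.0.1") is None

# Step 3: the DR learns IP2 -> M2 from the request and sends the reply.
dr.learn("rtoj-dr", "100.64.0.2", "M2")

# Step 4: GR2 learns IP1 -> M1 from the ARP reply.
gr2.learn("rtoj-gr2", "100.64.0.1", "M1")

assert gr2.lookup("rtoj-gr2", "100.64.0.1") == "M1"
assert dr.lookup("rtoj-dr", "100.64.0.2") == "M2"
```

Note the key property: only the two datapaths that exchanged traffic hold entries, so the table stays O(n) for pure north-south traffic.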
>
>
>>
>> How about the resolution of IP3-to-M3 happen on gateway router2? Will
>> there be an ARP request packet that will be broadcasted on the join switch
>> for this case?
>>
>
> I think in the use case of ovn-k8s, as you described before, this should
> not happen. However, if this does happen, it is similar to the above steps,
> except that in steps 2) and 3) the ARP request and response will be sent
> between the chassis through a tunnel. If this happens between all pairs of
> GRs, then there will again be O(n^2) MAC_Binding entries.
>
> I haven't tested the GR scenario yet, so I can't guarantee it works as
> expected. Please let me know if you see any problems. I will submit formal
> patch with more test cases if it is confirmed in your environment.
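For scale context, the static-versus-dynamic trade-off Han describes can be counted directly. This is a back-of-the-envelope sketch; the helper names are made up, and the counts assume the 1000-GR, dual-stack topology from this thread.

```python
# Back-of-the-envelope comparison (counts taken from this thread; helper
# names are hypothetical) of static lr_in_arp_resolve flows vs. dynamic
# mac_binding entries for N gateway routers behind one join switch.

def static_arp_resolve_flows(n_routers, address_families=2):
    # Each router's lr_in_arp_resolve table holds one entry per peer
    # router per address family; summed over all routers.
    per_router = n_routers * address_families
    return n_routers * per_router

def dynamic_north_south_bindings(n_routers, address_families=2):
    # With only GR<->DR traffic: each GR learns the DR's address (one
    # per family), and the DR learns one per GR per family.
    return 2 * n_routers * address_families

assert static_arp_resolve_flows(1000) == 2_000_000   # the "2M" in this thread
assert dynamic_north_south_bindings(1000) == 4_000   # a few thousand rows
```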
>
> Thanks,
> Han
>
>
>>
>> Regards,
>> ~Girish
>>
>> On Sat, May 16, 2020 at 10:25 AM Girish Moodalbail 
>> wrote:
>>
>>>
>>>
>>> On Sat, May 16, 2020 at 12:36 AM Han Zhou  wrote:
>>>


 On Tue, May 5, 2020 at 11:57 AM Han Zhou  wrote:
 >
 >
 >
 > On Fri, May 1, 2020 at 2:14 PM Dan Winship 
 wrote:
 > >
 > > On 5/1/20 12:37 PM, Girish Moodalbail wrote:
 > > > If we now look at table=12 (lr_in_arp_resolve) in the ingress
 pipeline
 > > > of Gateway Router-1, then you will see that there will be 2000
 logical
 > > > flow entries...
 > >
 > > > In the topology above, the only intended path is North-South
 between
 > > > each gateway router and the logical router. There is no east-west
 > > > traffic between the gateway routers
 > >
 > > > Is there an another way to solve the above problem with just
 keeping the
 > > > single join logical switch?
 > >
 > > Two thoughts:
 > >
 > > 1. In openshift-sdn, the bridge doesn't try to handle ARP itself. It
 > > just lets ARP requests pass through normally, and lets ARP replies
 pass
 > > through normally as long as they are correct (ie, it doesn't let
 > > spoofing through). This means fewer flows but more traffic. Maybe
 that's
 > > the right tradeoff?
 > >
 > The 2M entries here are not for the ARP responder, but are more equivalent to
 the neighbour table (or ARP cache) on each LR. The ARP responder resides
 in the LS (join logical switch), which is O(n) instead of O(n^2), so it is
 not a problem here.
 >
 > However, a similar idea may work here to avoid the O(n^2) scale
 issue. For the neighbour table, OVN actually has two parts: one is
 statically built, which is the 2M entries mentioned in this case, and the
 other is the dynamic ARP

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-19 Thread Girish Moodalbail
Hello Han,

Please see in-line:

On Sat, May 16, 2020 at 11:17 PM Han Zhou  wrote:

>
>
> On Sat, May 16, 2020 at 12:13 PM Girish Moodalbail 
> wrote:
>
>> Hello Han,
>>
>> Can you please explain how the dynamic resolution of the IP-to-MAC will
>> work with this new option set?
>>
>> Say the packet is being forwarded from router2 towards the distributed
>> router? So, nexthop (reg0) is set to IP1 and we need to find the MAC
>> address M1 to set eth.dst to.
>>
>> ++++
>> |   l3gateway||   l3gateway|
>> |router2 ||router3 |
>> +-+--++-+--+
>> IP2,M2 IP3,M3
>>   | |
>>+--+-+---+
>>|join switch |
>>+-+--+
>>  |
>>   IP1,M1
>>  +---++
>>  |  distributed   |
>>  | router |
>>  ++
>>
>> The MAC M1 will obviously not be in the MAC_binding table. On the hypervisor
>> where the packet originated, the router2's port and the distributed
>> router's port are locally present. So, does this result in a PACKET_IN to
>> the ovn-controller, with the resolution happening there?
>>
>
> Yes there will be a PACKET_IN, and then:
> 1. ovn-controller will generate the ARP request for IP1, and send
> PACKET_OUT to OVS.
> 2. The ARP request will be delivered to the distributed router pipeline
> only, because of a special handling of ARP in OVN for IPs of router ports,
> although it is a broadcast. (It would have been broadcasted to all GRs
> without that special handling)
> 3. The distributed router pipeline should learn the IP-MAC binding of
> IP2-M2 (through a PACKET_IN to ovn-controller), and at the same time send
> ARP reply to the router2 in the distributed router pipeline.
> 4. Router2 pipeline will handle the ARP response and learn the IP-MAC
> binding of IP1-M1 (through a PACKET_IN to ovn-controller).
>

Unfortunately, the ARP request (who has IP1) from router2 is broadcast out
to all of the chassis through the Geneve tunnel. The other gateway routers
learn the source MAC 'M2'. Now, each of the gateway routers has an entry
for (IP2, M2) in the MAC binding table on their respective rtoj-
router port. So, the MAC_Binding table will now have N x N entries, where N
is the number of gateway routers.

Per your explanation above, the ARP request should not have been broadcast,
right? Note that the direction of the ARP request is from the Gateway Router
to the Distributed Router.
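The blow-up described above can be counted with a simple model (illustrative Python, not OVN code): if every GR's ARP request is broadcast and every other GR learns the sender's binding from it, the table approaches N x N entries.

```python
# Simple model of MAC_Binding growth when ARP requests are broadcast:
# every GR that sees GR-k's request caches GR-k's (IP, MAC), so after
# each of the N GRs has ARPed once, the table holds about N*(N-1) rows
# on top of the intended GR<->DR entries.

def bindings_after_broadcast(n_gateways):
    learned = set()
    for sender in range(n_gateways):
        for listener in range(n_gateways):
            if listener != sender:
                # listener caches the sender's binding from the request
                learned.add((listener, sender))
    return len(learned)

assert bindings_after_broadcast(1000) == 1000 * 999  # ~1M unintended entries
```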

Regards,
~Girish



>
>
>>
>> How about the resolution of IP3-to-M3 happen on gateway router2? Will
>> there be an ARP request packet that will be broadcasted on the join switch
>> for this case?
>>
>
> I think in the use case of ovn-k8s, as you described before, this should
> not happen. However, if this does happen, it is similar to above steps,
> except that in steps 2) and 3) the ARP request and response will be sent
> between the chassis through a tunnel. If this happens between all pairs of
> GRs, then there will again be O(n^2) MAC_Binding entries.
>
> I haven't tested the GR scenario yet, so I can't guarantee it works as
> expected. Please let me know if you see any problems. I will submit formal
> patch with more test cases if it is confirmed in your environment.
>
> Thanks,
> Han
>
>
>>
>> Regards,
>> ~Girish
>>
>> On Sat, May 16, 2020 at 10:25 AM Girish Moodalbail 
>> wrote:
>>
>>>
>>>
>>> On Sat, May 16, 2020 at 12:36 AM Han Zhou  wrote:
>>>


 On Tue, May 5, 2020 at 11:57 AM Han Zhou  wrote:
 >
 >
 >
 > On Fri, May 1, 2020 at 2:14 PM Dan Winship 
 wrote:
 > >
 > > On 5/1/20 12:37 PM, Girish Moodalbail wrote:
 > > > If we now look at table=12 (lr_in_arp_resolve) in the ingress
 pipeline
 > > > of Gateway Router-1, then you will see that there will be 2000
 logical
 > > > flow entries...
 > >
 > > > In the topology above, the only intended path is North-South
 between
 > > > each gateway router and the logical router. There is no east-west
 > > > traffic between the gateway routers
 > >
 > > > Is there an another way to solve the above problem with just
 keeping the
 > > > single join logical switch?
 > >
 > > Two thoughts:
 > >
 > > 1. In openshift-sdn, the bridge doesn't try to handle ARP itself. It
 > > just lets ARP requests pass through normally, and lets ARP replies
 pass
 > > through normally as long as they are correct (ie, it doesn't let
 > > spoofing through). This means fewer flows but more traffic. Maybe
 that's
 > > the right tradeoff?
 > >
 > The 2M entries here are not for the ARP responder, but are more equivalent to
 the neighbour table (or ARP cache) on each LR. The ARP responder resides
 in the LS (join logical switch), which is O(n) instead of O(n^2), so it is
 n

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-21 Thread Venugopal Iyer
Han,

just a quick question below..


From: ovn-kuberne...@googlegroups.com  on 
behalf of Girish Moodalbail 
Sent: Tuesday, May 19, 2020 11:09 PM
To: Han Zhou
Cc: Han Zhou; Dan Winship; ovs-discuss; ovn-kuberne...@googlegroups.com
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

External email: Use caution opening links or attachments

Hello Han,

Please see in-line:

On Sat, May 16, 2020 at 11:17 PM Han Zhou <zhou...@gmail.com> wrote:


On Sat, May 16, 2020 at 12:13 PM Girish Moodalbail <gmoodalb...@gmail.com> wrote:
Hello Han,

Can you please explain how the dynamic resolution of the IP-to-MAC will work 
with this new option set?

Say the packet is being forwarded from router2 towards the distributed router? 
So, nexthop (reg0) is set to IP1 and we need to find the MAC address M1 to set 
eth.dst to.

++++
|   l3gateway||   l3gateway|
|router2 ||router3 |
+-+--++-+--+
IP2,M2 IP3,M3
  | |
   +--+-+---+
   |join switch |
   +-+--+
 |
  IP1,M1
 +---++
 |  distributed   |
 | router |
 ++

The MAC M1 will obviously not be in the MAC_binding table. On the hypervisor where
the packet originated, the router2's port and the distributed router's port are
locally present. So, does this result in a PACKET_IN to the ovn-controller, with
the resolution happening there?

Yes there will be a PACKET_IN, and then:
1. ovn-controller will generate the ARP request for IP1, and send PACKET_OUT to 
OVS.
2. The ARP request will be delivered to the distributed router pipeline only, 
because of a special handling of ARP in OVN for IPs of router ports, although 
it is a broadcast. (It would have been broadcasted to all GRs without that 
special handling)
3. The distributed router pipeline should learn the IP-MAC binding of IP2-M2 
(through a PACKET_IN to ovn-controller), and at the same time send ARP reply to 
the router2 in the distributed router pipeline.
4. Router2 pipeline will handle the ARP response and learn the IP-MAC binding 
of IP1-M1 (through a PACKET_IN to ovn-controller).

Unfortunately, the ARP request (who has IP1) from router2 is broadcast out to
all of the chassis through the Geneve tunnel. The other gateway routers learn the
source MAC 'M2'. Now, each of the gateway routers has an entry for (IP2, M2)
in the MAC binding table on their respective rtoj- router port. So, the
MAC_Binding table will now have N x N entries, where N is the number of gateway
routers.

Per your explanation above, the ARP request should not have been broadcast, right?


 probably obvious and I am missing it, but..
 I see the lflow to direct ARP request to the router port, instead of 
bcast. However,
 we also add flows to bcast self-originated (unsolicitated ?) arp requests 
(we should
 not see this  for router IPs, I suppose). But, given we just match on the 
source 
 MAC address  of the packet for such packets, does it differ from the ARP 
 request generated for Router IP?

thanks,

-venu

Note that the direction of  ARP request is from Gateway Router to Distributed 
Router.

Regards,
~Girish




How about the resolution of IP3-to-M3 happen on gateway router2? Will there be 
an ARP request packet that will be broadcasted on the join switch for this case?

I think in the use case of ovn-k8s, as you described before, this should not
happen. However, if this does happen, it is similar to the above steps, except that
in steps 2) and 3) the ARP request and response will be sent between the
chassis through a tunnel. If this happens between all pairs of GRs, then there
will again be O(n^2) MAC_Binding entries.

I haven't tested the GR scenario yet, so I can't guarantee it works as 
expected. Please let me know if you see any problems. I will submit formal 
patch with more test cases if it is confirmed in your environment.

Thanks,
Han


Regards,
~Girish

On Sat, May 16, 2020 at 10:25 AM Girish Moodalbail 
mailto:gmoodalb...@gmail.com>> wrote:


On Sat, May 16, 2020 at 12:36 AM Han Zhou 
mailto:zhou...@gmail.com>> wrote:


On Tue, May 5, 2020 at 11:57 AM Han Zhou mailto:hz...@ovn.org>> 
wrote:
>
>
>
> On Fri, May 1, 2020 at 2:14 PM Dan Winship 
> mailto:danwins...@redhat.com>> wrote:
> >
> > On 5/1/20 12:37 PM, Girish Moodalbail wrote:
> > > If we now look at table=12 (lr_in_arp_resolve) in the ingress pipeline
> > > of Gateway Router-1, then you will see that there will be 2000 logical
> > > flow entries...
> >
> > > In the topology above, the only intended path is North-South between
> > > each g

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-21 Thread Han Zhou
On Thu, May 21, 2020 at 10:33 AM Venugopal Iyer 
wrote:

> Han,
>
> just a quick question below..
>
> 
> From: ovn-kuberne...@googlegroups.com 
> on behalf of Girish Moodalbail 
> Sent: Tuesday, May 19, 2020 11:09 PM
> To: Han Zhou
> Cc: Han Zhou; Dan Winship; ovs-discuss; ovn-kuberne...@googlegroups.com
> Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table
>
>
> Hello Han,
>
> Please see in-line:
>
> On Sat, May 16, 2020 at 11:17 PM Han Zhou <zhou...@gmail.com> wrote:
>
>
> On Sat, May 16, 2020 at 12:13 PM Girish Moodalbail <gmoodalb...@gmail.com> wrote:
> Hello Han,
>
> Can you please explain how the dynamic resolution of the IP-to-MAC will
> work with this new option set?
>
> Say the packet is being forwarded from router2 towards the distributed
> router? So, nexthop (reg0) is set to IP1 and we need to find the MAC
> address M1 to set eth.dst to.
>
> ++++
> |   l3gateway||   l3gateway|
> |router2 ||router3 |
> +-+--++-+--+
> IP2,M2 IP3,M3
>   | |
>+--+-+---+
>|join switch |
>+-+--+
>  |
>   IP1,M1
>  +---++
>  |  distributed   |
>  | router |
>  ++
>
> The MAC M1 will obviously not be in the MAC_binding table. On the hypervisor
> where the packet originated, the router2's port and the distributed
> router's port are locally present. So, does this result in a PACKET_IN to
> the ovn-controller, with the resolution happening there?
>
> Yes there will be a PACKET_IN, and then:
> 1. ovn-controller will generate the ARP request for IP1, and send
> PACKET_OUT to OVS.
> 2. The ARP request will be delivered to the distributed router pipeline
> only, because of a special handling of ARP in OVN for IPs of router ports,
> although it is a broadcast. (It would have been broadcasted to all GRs
> without that special handling)
> 3. The distributed router pipeline should learn the IP-MAC binding of
> IP2-M2 (through a PACKET_IN to ovn-controller), and at the same time send
> ARP reply to the router2 in the distributed router pipeline.
> 4. Router2 pipeline will handle the ARP response and learn the IP-MAC
> binding of IP1-M1 (through a PACKET_IN to ovn-controller).
>
> Unfortunately, the ARP request (who has IP1) from router2 is broadcast
> out to all of the chassis through the Geneve tunnel. The other gateway routers
> learn the source MAC 'M2'. Now, each of the gateway routers has an entry
> for (IP2, M2) in the MAC binding table on their respective rtoj-
> router port. So, the MAC_Binding table will now have N x N entries, where N
> is the number of gateway routers.
>
> Per your explanation above, the ARP request should not have been broadcast,
> right?
>
>
>  probably obvious and I am missing it, but..
>  I see the lflow to direct the ARP request to the router port, instead of
> broadcast. However,
>  we also add flows to broadcast self-originated (unsolicited?) ARP
> requests (we should
>  not see these for router IPs, I suppose). But, given we just match on
> the source
>  MAC address of the packet for such packets, does it differ from the
> ARP
>  request generated for a router IP?
>
Good catch! That seems to be the reason why it is broadcast. I thought
the feature was only allowing GARPs to be broadcast, but it actually
allows any (G)ARP, including regular ARP requests generated by the LRs. It
could be an easy fix on top of commit 32f5ebb062 ("ovn-northd: Limit ARP/ND
broadcast domain whenever possible."), but I am not sure if there are other
concerns with doing that. @Dumitru Ceara  to comment on whether we can
restrict it to GARPs only.

On the other hand, in this use case, if there is any ARP from the
distributed router to any of the GRs, then all the GRs should have learned
the MAC binding IP1-M1 and won't send ARPs for IP1 any more, and thus
would not end up with N x N MAC bindings, right? In a real deployment it
may depend on which direction of traffic comes first. If it is always
from external to k8s workloads first, then yes, it will end up with N x N
MAC bindings eventually.


> thanks,
>
> -venu
>
> Note that the direction of  ARP request is from Gateway Router to
> Distributed Router.
>
> Regards,
> ~Girish
>
>
>
>
> How about the resolution of IP3-to-M3 happen on gateway router2? Will
> there be an ARP request p

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-21 Thread Tim Rozet
I think that if you directly connect each GR to the DR you don't need to learn
any ARP with packet_in, and you can preprogram the static entries. Each GR will
have 1 entry for the DR, while the DR will have N entries for N
nodes.

The real issue with ARP learning comes from the GR-to-external side. You have
to learn these, and from my conversation with Girish it seems like every GR
is adding an entry for every ARP request it sees. This means 1 GR sends an ARP
request to the external L2 network, and every GR sees the ARP request and adds
an entry. I think the behavior should be:

GRs only add ARP entries when:

   1. An ARP *Response* is sent to it
   2. The GR receives a GARP broadcast and already has an entry in its
   cache for that IP (Girish mentioned this is similar to the Linux arp_accept
   behavior)

In addition, as Michael Cambria pointed out in our weekly meeting, these
ARP cache entries should have expiry timers on them. If they are
permanently learned, you will end up with a growing ARP table over time
and end up in the same place. We can probably just program the GR ARP flows
with an idle_timeout and have the flows removed. What do you think?
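The idle-timeout idea maps onto an expiring-cache semantic. A minimal sketch follows, under the assumption that an idle timeout on the learned flows simply evicts entries that go unused for too long; this is illustrative Python, not the actual OVS flow mechanism.

```python
import time

# Minimal sketch of an ARP cache with idle expiry, mirroring what
# programming the learned GR flows with an OpenFlow idle_timeout would
# achieve: entries unused for `idle_timeout` seconds disappear instead
# of accumulating forever.

class ExpiringArpCache:
    def __init__(self, idle_timeout, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self.clock = clock
        self.entries = {}  # ip -> (mac, last_used)

    def learn(self, ip, mac):
        self.entries[ip] = (mac, self.clock())

    def lookup(self, ip):
        hit = self.entries.get(ip)
        if hit is None:
            return None
        mac, last_used = hit
        if self.clock() - last_used > self.idle_timeout:
            del self.entries[ip]   # idle too long: expire, force re-ARP
            return None
        self.entries[ip] = (mac, self.clock())  # refresh on use
        return mac

# Usage with a fake clock so the example is deterministic:
now = [0.0]
cache = ExpiringArpCache(idle_timeout=300, clock=lambda: now[0])
cache.learn("198.51.100.7", "aa:bb:cc:dd:ee:ff")
assert cache.lookup("198.51.100.7") == "aa:bb:cc:dd:ee:ff"
now[0] += 301          # idle longer than the timeout
assert cache.lookup("198.51.100.7") is None
```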

Should I file a bugzilla outlining the above so we can have proper tracking?

Thanks,

Tim Rozet
Red Hat CTO Networking Team


On Thu, May 21, 2020 at 5:01 PM Han Zhou  wrote:

>
>
> On Thu, May 21, 2020 at 10:33 AM Venugopal Iyer 
> wrote:
>
>> Han,
>>
>> just a quick question below..
>>
>> 
>> From: ovn-kuberne...@googlegroups.com 
>> on behalf of Girish Moodalbail 
>> Sent: Tuesday, May 19, 2020 11:09 PM
>> To: Han Zhou
>> Cc: Han Zhou; Dan Winship; ovs-discuss; ovn-kuberne...@googlegroups.com
>> Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table
>>
>>
>> Hello Han,
>>
>> Please see in-line:
>>
>> On Sat, May 16, 2020 at 11:17 PM Han Zhou > zhou...@gmail.com>> wrote:
>>
>>
>> On Sat, May 16, 2020 at 12:13 PM Girish Moodalbail > <mailto:gmoodalb...@gmail.com>> wrote:
>> Hello Han,
>>
>> Can you please explain how the dynamic resolution of the IP-to-MAC will
>> work with this new option set?
>>
>> Say the packet is being forwarded from router2 towards the distributed
>> router? So, nexthop (reg0) is set to IP1 and we need to find the MAC
>> address M1 to set eth.dst to.
>>
>> ++++
>> |   l3gateway||   l3gateway|
>> |router2 ||router3 |
>> +-+--++-+--+
>> IP2,M2 IP3,M3
>>   | |
>>+--+-+---+
>>|join switch |
>>+-+--+
>>  |
>>   IP1,M1
>>  +---++
>>  |  distributed   |
>>  | router |
>>  ++
>>
>> The MAC M1 will obviously not be in the MAC_binding table. On the hypervisor
>> where the packet originated, the router2's port and the distributed
>> router's port are locally present. So, does this result in a PACKET_IN to
>> the ovn-controller, with the resolution happening there?
>>
>> Yes there will be a PACKET_IN, and then:
>> 1. ovn-controller will generate the ARP request for IP1, and send
>> PACKET_OUT to OVS.
>> 2. The ARP request will be delivered to the distributed router pipeline
>> only, because of a special handling of ARP in OVN for IPs of router ports,
>> although it is a broadcast. (It would have been broadcasted to all GRs
>> without that special handling)
>> 3. The distributed router pipeline should learn the IP-MAC binding of
>> IP2-M2 (through a PACKET_IN to ovn-controller), and at the same time send
>> ARP reply to the router2 in the distributed router pipeline.
>> 4. Router2 pipeline will handle the ARP response and learn the IP-MAC
>> binding of IP1-M1 (through a PACKET_IN to ovn-controller).
>>
>> Unfortunately, the ARP request (who has IP1) from router2 is broadcast
>> out to all of the chassis through the Geneve tunnel. The other gateway routers
>> learn the source MAC 'M2'. Now, each of the gateway routers has an entry
>> for (IP2, M2) in the MAC binding table on their respective rtoj-
>> router port. So, the MAC_Binding table will now have N x N entries, where N
>> is the number of gateway routers.
>>
>> Per your explanation above, the ARP request should not have been broadcast,
>> right?
>>

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-21 Thread Venugopal Iyer
Hi, Han:


From: ovn-kuberne...@googlegroups.com  on 
behalf of Han Zhou 
Sent: Thursday, May 21, 2020 2:00 PM
To: Venugopal Iyer; Dumitru Ceara
Cc: Girish Moodalbail; Han Zhou; Dan Winship; ovs-discuss; 
ovn-kuberne...@googlegroups.com
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table




On Thu, May 21, 2020 at 10:33 AM Venugopal Iyer <venugop...@nvidia.com> wrote:
Han,

just a quick question below..


From: ovn-kuberne...@googlegroups.com on behalf of Girish Moodalbail <gmoodalb...@gmail.com>
Sent: Tuesday, May 19, 2020 11:09 PM
To: Han Zhou
Cc: Han Zhou; Dan Winship; ovs-discuss; ovn-kuberne...@googlegroups.com
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table


Hello Han,

Please see in-line:

On Sat, May 16, 2020 at 11:17 PM Han Zhou <zhou...@gmail.com> wrote:


On Sat, May 16, 2020 at 12:13 PM Girish Moodalbail <gmoodalb...@gmail.com> wrote:
Hello Han,

Can you please explain how the dynamic resolution of the IP-to-MAC will work 
with this new option set?

Say the packet is being forwarded from router2 towards the distributed router? 
So, nexthop (reg0) is set to IP1 and we need to find the MAC address M1 to set 
eth.dst to.

++++
|   l3gateway||   l3gateway|
|router2 ||router3 |
+-+--++-+--+
IP2,M2 IP3,M3
  | |
   +--+-+---+
   |join switch |
   +-+--+
 |
  IP1,M1
 +---++
 |  distributed   |
 | router |
 ++

The MAC M1 will obviously not be in the MAC_binding table. On the hypervisor where
the packet originated, the router2's port and the distributed router's port are
locally present. So, does this result in a PACKET_IN to the ovn-controller, with
the resolution happening there?

Yes there will be a PACKET_IN, and then:
1. ovn-controller will generate the ARP request for IP1, and send PACKET_OUT to 
OVS.
2. The ARP request will be delivered to the distributed router pipeline only, 
because of a special handling of ARP in OVN for IPs of router ports, although 
it is a broadcast. (It would have been broadcasted to all GRs without that 
special handling)
3. The distributed router pipeline should learn the IP-MAC binding of IP2-M2 
(through a PACKET_IN to ovn-controller), and at the same time send ARP reply to 
the router2 in the distributed router pipeline.
4. Router2 pipeline will handle the ARP response and learn the IP-MAC binding 
of IP1-M1 (through a PACKET_IN to ovn-controller).

Unfortunately, the ARP request (who has IP1) from router2 is broadcast out to
all of the chassis through the Geneve tunnel. The other gateway routers learn the
source MAC 'M2'. Now, each of the gateway routers has an entry for (IP2, M2)
in the MAC binding table on their respective rtoj- router port. So, the
MAC_Binding table will now have N x N entries, where N is the number of gateway
routers.

Per your explanation above, the ARP request should not have been broadcast, right?


 probably obvious and I am missing it, but..
 I see the lflow to direct the ARP request to the router port, instead of
broadcast. However,
 we also add flows to broadcast self-originated (unsolicited?) ARP requests
(we should
 not see these for router IPs, I suppose). But, given we just match on the
source
 MAC address of the packet for such packets, does it differ from the ARP
 request generated for a router IP?

Good catch! That seems to be the reason why it is broadcast. I thought the
feature was only allowing GARPs to be broadcast, but it actually allows any
(G)ARP, including regular ARP requests generated by the LRs. It could be an easy
fix on top of commit 32f5ebb062 ("ovn-northd: Limit ARP/ND broadcast domain
whenever possible."), but I am not sure if there are other concerns with doing
that. @Dumitru Ceara<mailto:dce...@redhat.com> to comment on whether we can
restrict it to GARPs only.

On the other hand, in this use case, if there is any ARP from the distributed
router to any of the GRs, then all the GRs should have learned the MAC binding
IP1-M1, and they won't send ARPs for IP1 any more, thus would not result
in N x N MAC bindings, right? In a real use case, it may depend on which
direction of traffic

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-21 Thread Han Zhou
On Thu, May 21, 2020 at 2:35 PM Tim Rozet  wrote:

> I think that if you directly connect each GR to the DR you don't need to learn
> any ARP with packet_in and you can preprogram the static entries. Each GR will
> have 1 entry for the DR, while the DR will have N entries for N
> nodes.
>

Hi Tim, as mentioned by Girish, directly connecting GRs to DR requires N
ports on the DR and also requires a lot of small subnets, which is not
desirable. And since changes are needed anyway in OVN to support that, we
moved forward with the current approach of avoiding the static ARP flows to
solve the problem instead of directly connecting GRs to DR.


> The real issue with ARP learning comes from the GR-External. You have
> to learn these, and from my conversation with Girish it seems like every GR
> is adding an entry on every ARP request it sees. This means 1 GR sends ARP
> request to external L2 network and every GR sees the ARP request and adds
> an entry. I think the behavior should be:
>
> GRs only add ARP entries when:
>
>1. An ARP *Response* is sent to it
>2. The GR receives a GARP broadcast and already has an entry in its
>cache for that IP (Girish mentioned this is similar to the Linux arp_accept
>behavior)
>
For 2), it is expensive to do in OVN because OpenFlow doesn't support a
match condition of "field1 == field2", which is required to check whether an
incoming ARP request is a GARP, i.e. SPA == TPA. However, it is ok to
support something similar to the Linux arp_accept configuration but slightly
different. In OVN we can configure it to allow/disable learning from all
ARP requests to IPs not belonging to the router, including GARPs. Would
that solve the problem here? (@Venugopal Iyer 
brought up the same thing about "arp_accept". I hope this reply addresses
that as well)
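For reference, the GARP test Han describes is just "sender protocol address equals target protocol address". A trivial sketch of that check follows (hypothetical helper, not OVN code); the point is that this per-packet field-to-field comparison is exactly what an OpenFlow flow match cannot express.

```python
# Sketch of GARP detection: a gratuitous ARP is an ARP request whose
# sender protocol address (SPA) equals its target protocol address
# (TPA). OpenFlow has no "arp_spa == arp_tpa" field-to-field match,
# which is why OVN cannot cheaply single out GARPs in a flow table.

ARP_OP_REQUEST = 1

def is_garp(arp_op, spa, tpa):
    """True for a gratuitous ARP request (SPA == TPA)."""
    return arp_op == ARP_OP_REQUEST and spa == tpa

assert is_garp(ARP_OP_REQUEST, "100.64.0.2", "100.64.0.2") is True
assert is_garp(ARP_OP_REQUEST, "100.64.0.2", "100.64.0.1") is False
```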

In addition, as Michael Cambria pointed out in our weekly meeting, these
> ARP cache entries should have expiry timers on them. If they are
> permanently learned, you will end up with a growing ARP table over time,
> and end up in the same place. We can probably just program the GR ARP flows
> with an idle_timeout and have the flow removed. What do you think?
>
This has been discussed before. It is also mentioned in TODO.rst.
However, it has not been taken care of because no good solution has been
found yet. It can be done, but it would be expensive and the gains are not
worth the costs. Accepting ARP requests partially reduces the need for ARP
expiration. It is true that it could still be a problem in some scenarios,
but so far we haven't heard of any use case that has a hard dependency on this.


> Should I file a bugzilla outlining the above so we can have proper
> tracking?
>

I think Bugzilla is outside the control of the OVN community, so please feel
free to file or not file ;)

Thanks,
Han


> Thanks,
>
> Tim Rozet
> Red Hat CTO Networking Team
>
>
> On Thu, May 21, 2020 at 5:01 PM Han Zhou  wrote:
>
>>
>>
>> On Thu, May 21, 2020 at 10:33 AM Venugopal Iyer 
>> wrote:
>>
>>> Han,
>>>
>>> just a quick question below..
>>>
>>> 
>>> From: ovn-kuberne...@googlegroups.com 
>>> on behalf of Girish Moodalbail 
>>> Sent: Tuesday, May 19, 2020 11:09 PM
>>> To: Han Zhou
>>> Cc: Han Zhou; Dan Winship; ovs-discuss; ovn-kuberne...@googlegroups.com
>>> Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve
>>> table
>>>
>>>
>>> Hello Han,
>>>
>>> Please see in-line:
>>>
>>> On Sat, May 16, 2020 at 11:17 PM Han Zhou >> zhou...@gmail.com>> wrote:
>>>
>>>
>>> On Sat, May 16, 2020 at 12:13 PM Girish Moodalbail <
>>> gmoodalb...@gmail.com<mailto:gmoodalb...@gmail.com>> wrote:
>>> Hello Han,
>>>
>>> Can you please explain how the dynamic resolution of the IP-to-MAC will
>>> work with this new option set?
>>>
>>> Say the packet is being forwarded from router2 towards the distributed
>>> router? So, nexthop (reg0) is set to IP1 and we need to find the MAC
>>> address M1 to set eth.dst to.
>>>
>>> ++++
>>> |   l3gateway||   l3gateway|
>>> |router2 ||router3 |
>>> +-+--++-+--+
>>> IP2,M2 IP3,M3
>>>   | |
>>>+--+-+---+
>>>|join switch |
>>>+-+--+
>>>  

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-21 Thread Venugopal Iyer
Hi, Han:


From: ovn-kuberne...@googlegroups.com  on 
behalf of Han Zhou 
Sent: Thursday, May 21, 2020 4:42 PM
To: Tim Rozet
Cc: Venugopal Iyer; Dumitru Ceara; Girish Moodalbail; Han Zhou; Dan Winship; 
ovs-discuss; ovn-kuberne...@googlegroups.com; Michael Cambria
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table




On Thu, May 21, 2020 at 2:35 PM Tim Rozet <tro...@redhat.com> wrote:
I think that if you directly connect each GR to the DR you don't need to learn any ARP
with packet_in and you can preprogram the static entries. Each GR will have 1
entry for the DR, while the DR will have N entries for N nodes.

Hi Tim, as mentioned by Girish, directly connecting GRs to DR requires N ports 
on the DR and also requires a lot of small subnets, which is not desirable. And 
since changes are needed anyway in OVN to support that, we moved forward with 
the current approach of avoiding the static ARP flows to solve the problem 
instead of directly connecting GRs to DR.


The real issue with ARP learning comes from the GR-External. You have to 
learn these, and from my conversation with Girish it seems like every GR is 
adding an entry on every ARP request it sees. This means 1 GR sends ARP request 
to external L2 network and every GR sees the ARP request and adds an entry. I 
think the behavior should be:

GRs only add ARP entries when:

  1.  An ARP Response is sent to it
  2.  The GR receives a GARP broadcast and already has an entry in its cache
for that IP (Girish mentioned this is similar to the Linux arp_accept behavior)

For 2), it is expensive to do in OVN because OpenFlow doesn't support a match
condition of "field1 == field2", which is required to check whether an incoming ARP
request is a GARP, i.e. SPA == TPA. However, it is ok to support something
similar to the Linux arp_accept configuration but slightly different. In OVN we
can configure it to allow/disable learning from all ARP requests to IPs not
belonging to the router, including GARPs. Would that solve the problem here?
(@Venugopal Iyer <venugop...@nvidia.com> brought up the same thing about
"arp_accept". I hope this reply addresses that as well)

 I can't think of any side effects to this, so it seems fine to me to do so.
I believe Linux behaves that way w.r.t. ARP requests anyway (assuming I am
reading it right).

https://elixir.bootlin.com/linux/v5.7-rc6/source/net/ipv4/arp.c (L874)


thanks,

-venu

In addition, as Michael Cambria pointed out in our weekly meeting, these ARP 
cache entries should have expiry timers on them. If they are permanently 
learned, you will end up with a growing ARP table over time, and end up in the 
same place. We can probably just program the GR ARP flows with an idle_timeout 
and have the flow removed. What do you think?
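Tim's idle-timeout idea amounts to an ARP cache whose entries age out when unused. A minimal software model of that behaviour (mimicking an OpenFlow flow with idle_timeout; this is not how ovn-controller is implemented):

```python
# Sketch of idle-timeout expiry for learned MAC bindings: an entry is dropped
# if it has not been used within `idle_timeout` seconds, like an OpenFlow
# flow installed with idle_timeout. Illustrative model only.

import time

class ArpCache:
    def __init__(self, idle_timeout: float):
        self.idle_timeout = idle_timeout
        self._entries = {}  # ip -> (mac, last_used timestamp)

    def learn(self, ip: str, mac: str) -> None:
        self._entries[ip] = (mac, time.monotonic())

    def lookup(self, ip: str):
        entry = self._entries.get(ip)
        if entry is None:
            return None
        mac, last_used = entry
        if time.monotonic() - last_used > self.idle_timeout:
            del self._entries[ip]  # expired: behaves like flow removal
            return None
        self._entries[ip] = (mac, time.monotonic())  # refresh on use
        return mac

cache = ArpCache(idle_timeout=0.1)
cache.learn("100.64.0.3", "aa:bb:cc:dd:ee:03")
print(cache.lookup("100.64.0.3"))  # aa:bb:cc:dd:ee:03
time.sleep(0.2)
print(cache.lookup("100.64.0.3"))  # None (entry expired)
```

With expiry, a stale binding eventually triggers a fresh ARP resolution instead of persisting forever; the cost, as Han notes below, is the extra flow churn.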

This has been discussed before. It is also mentioned in the TODO.rst. However, 
it has not been taken care of because no good solution has been found yet. It 
can be done, but it would be expensive and the gains are not worth the costs. 
Accepting ARP requests partially reduces the need for ARP expiration. It is 
true that it could still be a problem in some scenarios, but so far we haven't 
heard of any use case that has a hard dependency on this.

Should I file a bugzilla outlining the above so we can have proper tracking?

I think Bugzilla is outside the control of the OVN community, so please feel 
free to file or not file ;)

Thanks,
Han


Thanks,

Tim Rozet
Red Hat CTO Networking Team


On Thu, May 21, 2020 at 5:01 PM Han Zhou 
mailto:zhou...@gmail.com>> wrote:


On Thu, May 21, 2020 at 10:33 AM Venugopal Iyer 
mailto:venugop...@nvidia.com>> wrote:
Han,

just a quick question below..


From: ovn-kuberne...@googlegroups.com<mailto:ovn-kuberne...@googlegroups.com> 
mailto:ovn-kuberne...@googlegroups.com>> on 
behalf of Girish Moodalbail 
mailto:gmoodalb...@gmail.com>>
Sent: Tuesday, May 19, 2020 11:09 PM
To: Han Zhou
Cc: Han Zhou; Dan Winship; ovs-discuss; 
ovn-kuberne...@googlegroups.com<mailto:ovn-kuberne...@googlegroups.com>
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table


Hello Han,

Please see in-line:

On Sat, May 16, 2020 at 11:17 PM Han Zhou 
mailto:zhou...@gmail.com><mailto:zhou...@gmail.com<mailto:zhou...@gmail.com>>>
 wrote:


On Sat, May 16, 2020 at 12:13 PM Girish Moodalbail 
mailto:gmoodalb...@gmail.com><mailto:gmoodalb...@gmail.com<mailto:gmoodalb...@gmail.com>>>
 wrote:
Hello Han,

Can you please explain how the dynamic resolution of the IP-to-MAC will work 
with this new option set?

Say a packet is being forwarded from router2 towards the distributed router. 
So, the nexthop (reg0) is set to IP1 and we need to find 

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-21 Thread Tim Rozet
On Thu, May 21, 2020 at 8:45 PM Venugopal Iyer 
wrote:

> Hi, Han:
>
> 
> From: ovn-kuberne...@googlegroups.com 
> on behalf of Han Zhou 
> Sent: Thursday, May 21, 2020 4:42 PM
> To: Tim Rozet
> Cc: Venugopal Iyer; Dumitru Ceara; Girish Moodalbail; Han Zhou; Dan
> Winship; ovs-discuss; ovn-kuberne...@googlegroups.com; Michael Cambria
> Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table
>
>
>
>
> On Thu, May 21, 2020 at 2:35 PM Tim Rozet  tro...@redhat.com>> wrote:
> I think that if you directly connect GR to DR you don't need to learn any
> ARP with packet_in and you can preprogram the static entries. Each GR will
> have one entry for the DR, while the DR will have N entries for N
> nodes.
>
> Hi Tim, as mentioned by Girish, directly connecting GRs to DR requires N
> ports on the DR and also requires a lot of small subnets, which is not
> desirable. And since changes are needed anyway in OVN to support that, we
> moved forward with the current approach of avoiding the static ARP flows to
> solve the problem instead of directly connecting GRs to DR.
>
> Why is that not desirable? They are all private subnets with /30 (if using
ipv4). If IPv6, it's even less of a concern from an addressing perspective.

The real issue with ARP learning comes from the GR-External. You have
> to learn these, and from my conversation with Girish it seems like every GR
> is adding an entry on every ARP request it sees. This means 1 GR sends ARP
> request to external L2 network and every GR sees the ARP request and adds
> an entry. I think the behavior should be:
>
> GRs only add ARP entries when:
>
>   1.  An ARP Response is sent to it
>   2.  The GR receives a GARP broadcast, and already has an entry in his
> cache for that IP (Girish mentioned this is similar to linux arp_accept
> behavior)
>
> For 2), it is expensive to do in OVN because OpenFlow doesn't support a
> match condition of "field1 == field2", which is required to check if the
> incoming ARP request is a GARP, i.e. SPA == TPA. However, it is ok to
> support something similar like linux arp_accept configuration but slightly
> different. In OVN we can configure it to allow/disable learning from all
> ARP requests to IPs not belonging to the router, including GARPs. Would
> that solve the problem here? (@Venugopal Iyer<mailto:venugop...@nvidia.com>
> brought up the same thing about "arp_accept". I hope this reply addresses
> that as well)
>

I think the issue there is if you have an external device which is using a
VIP and it fails over, it will usually send a GARP to inform of the MAC
change. In this case, if you ignore the GARP, what happens? You won't send
another ARP, because OVN programs the ARP entry forever and doesn't expire
it, right? So you won't learn the new MAC and will keep sending packets to a
dead MAC?
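The failure mode Tim raises can be shown in a few lines: with no expiry and GARPs ignored, a binding learned for a VIP is never corrected after failover (illustrative model only, not OVN code):

```python
# Sketch of the stale-binding problem: entries are learned once and kept
# forever, and GARPs are dropped, so a VIP failover is never noticed.

arp_cache = {}

def learn_from_reply(ip: str, mac: str) -> None:
    arp_cache[ip] = mac  # learned from an ARP reply, kept forever (no expiry)

def handle_garp_ignored(ip: str, mac: str) -> None:
    pass  # GARP dropped: the cache is never updated

learn_from_reply("192.0.2.10", "aa:aa:aa:aa:aa:01")      # VIP active on node A
handle_garp_ignored("192.0.2.10", "aa:aa:aa:aa:aa:02")   # VIP fails over to B
print(arp_cache["192.0.2.10"])  # still aa:aa:aa:aa:aa:01 -> traffic to a dead MAC
```

This is why the thread converges on always updating an existing binding when a GARP arrives, even if learning of brand-new entries is restricted.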

>
>  I can't think of any side effects to this, so it seems fine to me to do
> so. I believe Linux behaves that way w.r.t. ARP requests anyway (assuming
> I am reading it right).
>
> https://elixir.bootlin.com/linux/v5.7-rc6/source/net/ipv4/arp.c (L874)
>
>
> thanks,
>
> -venu
>
> In addition, as Michael Cambria pointed out in our weekly meeting, these
> ARP cache entries should have expiry timers on them. If they are
> permanently learned, you will end up with a growing ARP table over time,
> and end up in the same place. We can probably just program the GR ARP flows
> with an idle_timeout and have the flow removed. What do you think?
>
> This has been discussed before. It is also mentioned in the TODO.rst.
> However, it has not been taken care of because no good solution has been
> found yet. It can be done, but it would be expensive and the gains are not
> worth the costs. Accepting ARP requests partially reduces the need for ARP
> expiration. It is true that it could still be a problem in some scenarios,
> but so far we haven't heard of any use case that has a hard dependency on
> this.
>
> Should I file a bugzilla outlining the above so we can have proper
> tracking?
>
> I think Bugzilla is outside the control of the OVN community, so please
> feel free to file or not file ;)
>

Sorry, folks from OVN had told me you use Bugzilla to track OVN bugs, not
JIRA or GitHub. What bug tracking system do you use, if not BZ?

>
> Thanks,
> Han
>
>
> Thanks,
>
> Tim Rozet
> Red Hat CTO Networking Team
>
>
> On Thu, May 21, 2020 at 5:01 PM Han Zhou  zhou...@gmail.com>> wrote:
>
>
> On Thu, May 21, 2020 at 10:33 AM Venugopal Iyer  <mailto:venugop...@nvidia.com>> wrote:
> Han,
>
> just a quick questi

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-21 Thread Girish Moodalbail
On Thu, May 21, 2020 at 6:58 PM Tim Rozet  wrote:

> On Thu, May 21, 2020 at 8:45 PM Venugopal Iyer 
> wrote:
>
>> Hi, Han:
>>
>> 
>> From: ovn-kuberne...@googlegroups.com 
>> on behalf of Han Zhou 
>> Sent: Thursday, May 21, 2020 4:42 PM
>> To: Tim Rozet
>> Cc: Venugopal Iyer; Dumitru Ceara; Girish Moodalbail; Han Zhou; Dan
>> Winship; ovs-discuss; ovn-kuberne...@googlegroups.com; Michael Cambria
>> Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table
>>
>>
>>
>>
>> On Thu, May 21, 2020 at 2:35 PM Tim Rozet > tro...@redhat.com>> wrote:
>> I think that if you directly connect GR to DR you don't need to learn any
>> ARP with packet_in and you can preprogram the static entries. Each GR will
>> have one entry for the DR, while the DR will have N entries for N
>> nodes.
>>
>> Hi Tim, as mentioned by Girish, directly connecting GRs to DR requires N
>> ports on the DR and also requires a lot of small subnets, which is not
>> desirable. And since changes are needed anyway in OVN to support that, we
>> moved forward with the current approach of avoiding the static ARP flows to
>> solve the problem instead of directly connecting GRs to DR.
>>
>> Why is that not desirable? They are all private subnets with /30 (if
> using ipv4). If IPv6, it's even less of a concern from an addressing
> perspective.
>

It is not just about the subnet management but also the additional logical
flows that get created between the two ways of connecting the DR and GRs.

Say we have a fix that efficiently allows one to connect 1000s of GRs using
a single logical switch; would you rather use that, or 1000 patch cables each
connecting a GR to the DR? It is not only an issue of subnet management for
those 1000 point-to-point connections; those 1000 patch ports are also local
to each chassis, so we need to understand, in such a topology, how many
additional logical flows get created in the SB and how many OpenFlow flows
get created on each of the 1000 chassis for those 1000 patch cables.


>
> The real issue with ARP learning comes from the GR-External. You have
>> to learn these, and from my conversation with Girish it seems like every GR
>> is adding an entry on every ARP request it sees. This means 1 GR sends ARP
>> request to external L2 network and every GR sees the ARP request and adds
>> an entry. I think the behavior should be:
>>
>> GRs only add ARP entries when:
>>
>>   1.  An ARP Response is sent to it
>>   2.  The GR receives a GARP broadcast, and already has an entry in his
>> cache for that IP (Girish mentioned this is similar to linux arp_accept
>> behavior)
>>
>> For 2), it is expensive to do in OVN because OpenFlow doesn't support a
>> match condition of "field1 == field2", which is required to check if the
>> incoming ARP request is a GARP, i.e. SPA == TPA. However, it is ok to
>> support something similar like linux arp_accept configuration but slightly
>> different. In OVN we can configure it to allow/disable learning from all
>> ARP requests to IPs not belonging to the router, including GARPs. Would
>> that solve the problem here? (@Venugopal Iyer> venugop...@nvidia.com>  brought up the same thing about "arp_accept". I
>> hope this reply addresses that as well)
>>
>
> I think the issue there is if you have an external device, which is using
> a VIP and it fails over, it will usually send GARP to inform of the mac
> change. In this case if you ignore GARP, what happens? You won't send
> another ARP because OVN programs the arp entry forever and doesn't expire
> it right? So you won't learn the new mac and keep sending packets to a dead
> mac?
>

I think we will have to support GARP; otherwise, VIPs will not work, as Tim
mentions. If we do learn from GARP, then as long as the GARP itself is not
originated by any of the 1000s of GRs, we should be fine.

Regards,
~Girish
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-21 Thread Han Zhou
On Thu, May 21, 2020 at 7:12 PM Girish Moodalbail 
wrote:

>
>
> On Thu, May 21, 2020 at 6:58 PM Tim Rozet  wrote:
>
>> On Thu, May 21, 2020 at 8:45 PM Venugopal Iyer 
>> wrote:
>>
>>> Hi, Han:
>>>
>>> 
>>> From: ovn-kuberne...@googlegroups.com 
>>> on behalf of Han Zhou 
>>> Sent: Thursday, May 21, 2020 4:42 PM
>>> To: Tim Rozet
>>> Cc: Venugopal Iyer; Dumitru Ceara; Girish Moodalbail; Han Zhou; Dan
>>> Winship; ovs-discuss; ovn-kuberne...@googlegroups.com; Michael Cambria
>>> Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve
>>> table
>>>
>>>
>>>
>>>
>>> On Thu, May 21, 2020 at 2:35 PM Tim Rozet >> tro...@redhat.com>> wrote:
>>> I think that if you directly connect GR to DR you don't need to learn
>>> any ARP with packet_in and you can preprogram the static entries. Each GR
>>> will have one entry for the DR, while the DR will have N entries for
>>> N nodes.
>>>
>>> Hi Tim, as mentioned by Girish, directly connecting GRs to DR requires N
>>> ports on the DR and also requires a lot of small subnets, which is not
>>> desirable. And since changes are needed anyway in OVN to support that, we
>>> moved forward with the current approach of avoiding the static ARP flows to
>>> solve the problem instead of directly connecting GRs to DR.
>>>
>>> Why is that not desirable? They are all private subnets with /30 (if
>> using ipv4). If IPv6, it's even less of a concern from an addressing
>> perspective.
>>
>
> It is not just about the subnet management but also the additional logical
> flows that get created between the two ways of connecting the DR and GRs.
>
> Say we have a fix that efficiently allows one to connect 1000s of GRs
> using a single logical switch; would you rather use that, or 1000 patch
> cables each connecting a GR to the DR? It is not only an issue of subnet
> management for those 1000 point-to-point connections; those 1000 patch
> ports are also local to each chassis, so we need to understand, in such a
> topology, how many additional logical flows get created in the SB and how
> many OpenFlow flows get created on each of the 1000 chassis for those
> 1000 patch cables.
>
>
>>
>> The real issue with ARP learning comes from the GR-External. You have
>>> to learn these, and from my conversation with Girish it seems like every GR
>>> is adding an entry on every ARP request it sees. This means 1 GR sends ARP
>>> request to external L2 network and every GR sees the ARP request and adds
>>> an entry. I think the behavior should be:
>>>
>>> GRs only add ARP entries when:
>>>
>>>   1.  An ARP Response is sent to it
>>>   2.  The GR receives a GARP broadcast, and already has an entry in his
>>> cache for that IP (Girish mentioned this is similar to linux arp_accept
>>> behavior)
>>>
>>> For 2), it is expensive to do in OVN because OpenFlow doesn't support a
>>> match condition of "field1 == field2", which is required to check if the
>>> incoming ARP request is a GARP, i.e. SPA == TPA. However, it is ok to
>>> support something similar like linux arp_accept configuration but slightly
>>> different. In OVN we can configure it to allow/disable learning from all
>>> ARP requests to IPs not belonging to the router, including GARPs. Would
>>> that solve the problem here? (@Venugopal Iyer>> venugop...@nvidia.com>  brought up the same thing about "arp_accept". I
>>> hope this reply addresses that as well)
>>>
>>
>> I think the issue there is if you have an external device, which is using
>> a VIP and it fails over, it will usually send GARP to inform of the mac
>> change. In this case if you ignore GARP, what happens? You won't send
>> another ARP because OVN programs the arp entry forever and doesn't expire
>> it right? So you won't learn the new mac and keep sending packets to a dead
>> mac?
>>
>
> I think we will have to support GARP; otherwise, VIPs will not work, as Tim
> mentions. If we do learn from GARP, then as long as the GARP itself is not
> originated by any of the 1000s of GRs, we should be fine.
>
Right, I didn't think this through. I thought it was just a configurable
option, but it seems we will always need to support GARP, so the option
becomes useless.
However, there is no easy way to 

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-22 Thread Venugopal Iyer
A couple of comments below:


From: ovn-kuberne...@googlegroups.com  on 
behalf of Han Zhou 
Sent: Thursday, May 21, 2020 7:43 PM
To: Girish Moodalbail
Cc: Tim Rozet; Venugopal Iyer; Dumitru Ceara; Han Zhou; Dan Winship; 
ovs-discuss; ovn-kuberne...@googlegroups.com; Michael Cambria
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table




On Thu, May 21, 2020 at 7:12 PM Girish Moodalbail 
mailto:gmoodalb...@gmail.com>> wrote:


On Thu, May 21, 2020 at 6:58 PM Tim Rozet 
mailto:tro...@redhat.com>> wrote:
On Thu, May 21, 2020 at 8:45 PM Venugopal Iyer 
mailto:venugop...@nvidia.com>> wrote:
Hi, Han:


From: ovn-kuberne...@googlegroups.com<mailto:ovn-kuberne...@googlegroups.com> 
mailto:ovn-kuberne...@googlegroups.com>> on 
behalf of Han Zhou mailto:zhou...@gmail.com>>
Sent: Thursday, May 21, 2020 4:42 PM
To: Tim Rozet
Cc: Venugopal Iyer; Dumitru Ceara; Girish Moodalbail; Han Zhou; Dan Winship; 
ovs-discuss; 
ovn-kuberne...@googlegroups.com<mailto:ovn-kuberne...@googlegroups.com>; 
Michael Cambria
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table




On Thu, May 21, 2020 at 2:35 PM Tim Rozet 
mailto:tro...@redhat.com><mailto:tro...@redhat.com<mailto:tro...@redhat.com>>>
 wrote:
I think that if you directly connect GR to DR you don't need to learn any ARP 
with packet_in and you can preprogram the static entries. Each GR will have one 
entry for the DR, while the DR will have N entries for N nodes.

Hi Tim, as mentioned by Girish, directly connecting GRs to DR requires N ports 
on the DR and also requires a lot of small subnets, which is not desirable. And 
since changes are needed anyway in OVN to support that, we moved forward with 
the current approach of avoiding the static ARP flows to solve the problem 
instead of directly connecting GRs to DR.

Why is that not desirable? They are all private subnets with /30 (if using 
ipv4). If IPv6, it's even less of a concern from an addressing perspective.

It is not just about the subnet management but also the additional logical 
flows that get created between the two ways of connecting the DR and GRs.

Say we have a fix that efficiently allows one to connect 1000s of GRs using a 
single logical switch; would you rather use that, or 1000 patch cables each 
connecting a GR to the DR? It is not only an issue of subnet management for 
those 1000 point-to-point connections; those 1000 patch ports are also local 
to each chassis, so we need to understand, in such a topology, how many 
additional logical flows get created in the SB and how many OpenFlow flows get 
created on each of the 1000 chassis for those 1000 patch cables.


The real issue with ARP learning comes from the GR-External. You have to 
learn these, and from my conversation with Girish it seems like every GR is 
adding an entry on every ARP request it sees. This means 1 GR sends ARP request 
to external L2 network and every GR sees the ARP request and adds an entry. I 
think the behavior should be:

GRs only add ARP entries when:

  1.  An ARP Response is sent to it
  2.  The GR receives a GARP broadcast, and already has an entry in his cache 
for that IP (Girish mentioned this is similar to linux arp_accept behavior)

For 2), it is expensive to do in OVN because OpenFlow doesn't support a match 
condition of "field1 == field2", which is required to check if the incoming ARP 
request is a GARP, i.e. SPA == TPA. However, it is ok to support something 
similar like linux arp_accept configuration but slightly different. In OVN we 
can configure it to allow/disable learning from all ARP requests to IPs not 
belonging to the router, including GARPs. Would that solve the problem here? 
(@Venugopal Iyer<mailto:venugop...@nvidia.com<mailto:venugop...@nvidia.com>>  
brought up the same thing about "arp_accept". I hope this reply addresses that 
as well)

I think the issue there is if you have an external device, which is using a VIP 
and it fails over, it will usually send GARP to inform of the mac change. In 
this case if you ignore GARP, what happens? You won't send another ARP because 
OVN programs the arp entry forever and doesn't expire it right? So you won't 
learn the new mac and keep sending packets to a dead mac?

I think we will have to support GARP; otherwise, VIPs will not work, as Tim 
mentions. If we do learn from GARP, then as long as the GARP itself is not 
originated by any of the 1000s of GRs, we should be fine.

Right, I didn't think this through. I thought it was just a configurable 
option, but it seems we will always need to support GARP, so the option becomes 
useless.
However, there is no easy way to achieve: "do lear

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-22 Thread Han Zhou
On Fri, May 22, 2020 at 8:39 AM Venugopal Iyer 
wrote:

> A couple of comments below:
>
>
>
>
>  I suppose the use of GARP as a reply v/s a request is not very clear;
> [1], Section 3 seems to offer a concise summary of this. If the application
> sends the GARP as a reply we are covered, but the question is, if the GARP
> is a request (which is allowed), then what should our response be? Tim is
> right, we can't ignore the request (more so since aging is not supported
> currently); however, "arp_accept" ignores the request for creating a new
> cache entry, not for updating an existing one (see last para below)
>
> [2]
> arp_accept - BOOLEAN
> Define behavior for gratuitous ARP frames who's IP is not
> already present in the ARP table:
> 0 - don't create new entries in the ARP table
> 1 - create new entries in the ARP table
>
> Both replies and requests type gratuitous arp will trigger the
> ARP table to be updated, if this setting is on.
>
> If the ARP table already contains the IP address of the
> gratuitous arp frame, the arp table will be updated regardless
> if this setting is on or off.
>
>  If we look up and get a hit, we should still process the GARP; only if
> we don't have a hit should we ignore it (instead of creating an entry).
> BTW, do we update today? If I understand the use of reg9[2] /
> REGBIT_LOOKUP_NEIGHBOR_RESULT (assuming lookup_arp returns 1 if the entry
> exists), I am not sure it does? Maybe I missed it ..
>
> thanks,
>
> -venu
>
> [1]https://www.ietf.org/rfc/rfc5227.txt
>
>
(Not sure why the indent format of your reply is not correct at least on my
client - it mixes all previous replies together so one cannot tell which
part was from whom, so I truncated all of them.)

Thanks Venu. I think this would work: we can add an option similar to but
different from arp_accept (because it is not easy for OVN to tell whether it
is a GARP in the ingress pipeline). The option could be named something like
learn_from_arp_request.
When an ARP request is received, always check if an old entry exists for the
SPA. If one exists and the MAC is different, update the mac-binding entry. If
the entry doesn't exist, check the option setting:
"true" - add a new entry.
"false" - if the TPA is on the router, add a new entry (it means the remote
wants to communicate with this node, so it makes sense to learn the remote
as well). Otherwise, ignore it and no new entry is added.
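Han's proposed decision procedure can be sketched as follows. This is my reading of the rules above; `handle_arp_request` and its return values are hypothetical, not ovn-northd code:

```python
# Sketch of the proposed learn_from_arp_request handling (hypothetical
# function, illustrating the decision rules described in this message).

def handle_arp_request(mac_bindings: dict, spa: str, smac: str,
                       tpa: str, router_ips: set,
                       learn_from_arp_request: bool) -> str:
    if spa in mac_bindings:
        # Always refresh an existing binding when the MAC changes; this is
        # what keeps VIP failover via GARP working.
        if mac_bindings[spa] != smac:
            mac_bindings[spa] = smac
            return "updated"
        return "unchanged"
    if learn_from_arp_request:
        mac_bindings[spa] = smac  # "true": learn from any ARP request
        return "learned"
    if tpa in router_ips:
        # "false": the sender wants to talk to this router, so learning it
        # is still worthwhile.
        mac_bindings[spa] = smac
        return "learned"
    return "ignored"

bindings = {}
# Option "false": request not addressed to the router -> ignored
print(handle_arp_request(bindings, "10.0.0.9", "aa:aa:aa:aa:aa:09",
                         "10.0.0.9", {"10.0.0.1"}, False))  # ignored
# Request addressed to one of the router's own IPs -> learned
print(handle_arp_request(bindings, "10.0.0.9", "aa:aa:aa:aa:aa:09",
                         "10.0.0.1", {"10.0.0.1"}, False))  # learned
```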

Do you think this works?
Regarding your question on lookup_arp(): today it looks up the same IP-MAC
binding, just to avoid an unnecessary update if the pair already exists and
hasn't changed.

Thanks,
Han


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-22 Thread Venugopal Iyer
Sorry, Han, for messing up the indents; it looks like my Outlook browser client 
is either not set up correctly, or doesn't work well.

Let me try from the app and see if it is any better..

From: ovn-kuberne...@googlegroups.com  On 
Behalf Of Han Zhou
Sent: Friday, May 22, 2020 1:51 PM
To: Venugopal Iyer 
Cc: Girish Moodalbail ; Tim Rozet ; 
Dumitru Ceara ; Han Zhou ; Dan Winship 
; ovs-discuss ; 
ovn-kuberne...@googlegroups.com; Michael Cambria 
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table




On Fri, May 22, 2020 at 8:39 AM Venugopal Iyer 
mailto:venugop...@nvidia.com>> wrote:
A couple of comments below:




 I suppose the use of GARP as a reply v/s a request is not very clear; [1], 
Section 3 seems to offer a concise summary of this. If the application sends 
the GARP as a reply we are covered, but the question is, if the GARP is a 
request (which is allowed), then what should our response be? Tim is right, we 
can't ignore the request (more so since aging is not supported currently); 
however, "arp_accept" ignores the request for creating a new cache entry, not 
for updating an existing one (see last para below)

[2]
arp_accept - BOOLEAN
Define behavior for gratuitous ARP frames who's IP is not
already present in the ARP table:
0 - don't create new entries in the ARP table
1 - create new entries in the ARP table

Both replies and requests type gratuitous arp will trigger the
ARP table to be updated, if this setting is on.

If the ARP table already contains the IP address of the
gratuitous arp frame, the arp table will be updated regardless
if this setting is on or off.

 If we look up and get a hit, we should still process the GARP; only if we 
don't have a hit should we ignore it (instead of creating an entry). BTW, do 
we update today? If I understand the use of reg9[2] / 
REGBIT_LOOKUP_NEIGHBOR_RESULT (assuming lookup_arp returns 1 if the entry 
exists), I am not sure it does? Maybe I missed it ..

thanks,

-venu

[1]https://www.ietf.org/rfc/rfc5227.txt

(Not sure why the indent format of your reply is not correct at least on my 
client - it mixes all previous replies together so one cannot tell which part 
was from whom, so I truncated all of them.)

Thanks Venu. I think this would work: we can add an option similar to but 
different from arp_accept (because it is not easy for OVN to tell whether it is 
a GARP in the ingress pipeline). The option could be named something like 
learn_from_arp_request.
When an ARP request is received, always check if an old entry exists for the 
SPA. If one exists and the MAC is different, update the mac-binding entry. If 
the entry doesn't exist, check the option setting:
"true" - add a new entry.
"false" - if the TPA is on the router, add a new entry (it means the remote 
wants to communicate with this node, so it makes sense to learn the remote as 
well). Otherwise, ignore it and no new entry is added.
[vi> ] yes, I believe that should work.

Do you think this works?
Regarding your question on lookup_arp(): today it looks up the same IP-MAC 
binding, just to avoid an unnecessary update if the pair already exists and 
hasn't changed.
thanks,

-venu

Thanks,
Han


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-22 Thread Girish Moodalbail
On Fri, May 22, 2020 at 1:51 PM Han Zhou  wrote:

>
>
> On Fri, May 22, 2020 at 8:39 AM Venugopal Iyer 
> wrote:
>
>> A couple of comments below:
>>
>>
>>
>>
>>  I suppose the use of GARP as a reply v/s a request is not very clear;
>> [1], Section 3 seems to offer a concise summary of this. If the application
>> sends GARP as
>>  a reply we are covered, but the question is if the GARP is a request
>> (which is allowed) then what our response should be. Tim is right, we can't
>> ignore
>>  the request (more so, since aging is not supported currently),
>> however "arp_accept" ignores the request for creating a new cache entry,
>> not updating
>>  an existing one (see last para below)
>>
>> [2]
>> arp_accept - BOOLEAN
>> Define behavior for gratuitous ARP frames who's IP is not
>> already present in the ARP table:
>> 0 - don't create new entries in the ARP table
>> 1 - create new entries in the ARP table
>>
>> Both replies and requests type gratuitous arp will trigger the
>> ARP table to be updated, if this setting is on.
>>
>> If the ARP table already contains the IP address of the
>> gratuitous arp frame, the arp table will be updated regardless
>> if this setting is on or off.
>>
>>  if we lookup and get a hit, we should still process the GARP; only
>> if we don't  have a hit, we should ignore (instead of
>>  creating an entry). BTW, do we update today? if I understand the use
>> of reg9[2] / REGBIT_LOOKUP_NEIGHBOR_RESULT (assuming lookup_arp
>>  returns 1 if entry exists), I am not sure it does? maybe I missed it
>> ..
>>
>> thanks,
>>
>> -venu
>>
>> [1]https://www.ietf.org/rfc/rfc5227.txt
>>
>>
> (Not sure why the indent format of your reply is not correct at least on
> my client - it mixes all previous replies together so one cannot tell which
> part was from whom, so I truncated all of them.)
>
> Thanks Venu. I think this would work: we can add an option similar to but
> different from arp_accept (because it is not easy for OVN to tell if it is
> GARP on the ingress pipeline). The option can be named like:
> learn_from_arp_request.
> When ARP request is received, always check if an old entry existed for the
> SPA. If existed and MAC is different, then update the mac-binding entry. If
> the entry doesn't exist, check the option setting:
> "true" - add a new entry.
> "false" - if the TPA is on the router, add a new entry (it means the
> remote wants to communicate with this node, so it makes sense to learn the
> remote as well). Otherwise, ignore it and no new entry added.
>
> Do you think this works?
>

I think this should work as well.

For the single join switch connected to 1000 GRs, it should work as well
(assuming your other fix for dynamic learning is present as well). However,
in this case, even with this option set we will still be sending the ARP
broadcast out from Node1 to each of the other 999 nodes. After the packets
have travelled through the tunnel, we are going to drop the packet on the
target hypervisor if `learn_from_arp_request=true'. As I understand, we are
waiting for a reply from @Dumitru Ceara to understand why such a flow is
required, correct?

Regards,
~Girish


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-25 Thread Dumitru Ceara
On 5/23/20 12:56 AM, Girish Moodalbail wrote:
> 
> 
> On Fri, May 22, 2020 at 1:51 PM Han Zhou  > wrote:
> 
> 
> 
> On Fri, May 22, 2020 at 8:39 AM Venugopal Iyer
> mailto:venugop...@nvidia.com>> wrote:
> 
> A couple of comments below:
> 
> 
> 
> 
>  I suppose the use of GARP as a reply v/s a request is not
> very clear; [1], Section 3 seems to offer a concise summary of
> this. If the application sends GARP as
>  a reply we are covered, but the question is if the GARP is
> a request (which is allowed) then what our response should be.
> Tim is right, we can't ignore
>  the request (more so, since aging is not supported
> currently), however "arp_accept" ignores the request for
> creating a new cache entry, not updating
>  an existing one (see last para below)
> 
> [2]
> arp_accept - BOOLEAN
>         Define behavior for gratuitous ARP frames who's IP is not
>         already present in the ARP table:
>         0 - don't create new entries in the ARP table
>         1 - create new entries in the ARP table
> 
>         Both replies and requests type gratuitous arp will
> trigger the
>         ARP table to be updated, if this setting is on.
> 
>         If the ARP table already contains the IP address of the
>         gratuitous arp frame, the arp table will be updated
> regardless
>         if this setting is on or off.
> 
>  if we lookup and get a hit, we should still process the
> GARP; only if we don't  have a hit, we should ignore (instead of
>  creating an entry). BTW, do we update today? if I
> understand the use of reg9[2] / REGBIT_LOOKUP_NEIGHBOR_RESULT
> (assuming lookup_arp
>  returns 1 if entry exists), I am not sure it does? maybe I
> missed it ..
> 
> thanks,
> 
> -venu
> 
> [1]https://www.ietf.org/rfc/rfc5227.txt
> 
> 
> (Not sure why the indent format of your reply is not correct at
> least on my client - it mixes all previous replies together so one
> cannot tell which part was from whom, so I truncated all of them.)
> 
> Thanks Venu. I think this would work: we can add an option similar
> to, but different from, arp_accept (because it is not easy for OVN to
> tell if it is a GARP on the ingress pipeline). The option can be named
> something like learn_from_arp_request.
> When an ARP request is received, always check whether an entry already
> exists for the SPA. If it exists and the MAC is different, update the
> mac-binding entry. If the entry doesn't exist, check the option setting:
> "true" - add a new entry.
> "false" - if the TPA is on the router, add a new entry (it means the
> remote wants to communicate with this node, so it makes sense to
> learn the remote as well). Otherwise, ignore it and add no new entry.
> 
> Do you think this works?
> 
> 
> I think this should work as well.
> 
> For the single join switch connected to 1000 GRs, it should work as well
> (assuming your other fix for dynamic learning is present as well).
> However, in this case,  even with this option set we will still be
> sending the ARP broadcast out from Node1 to each of the other 999 Nodes.
> After the packets have travelled through the tunnel, we are going to
> drop the packet on the target hypervisor, if
> `learn_from_arp_request=true'. As I understand, we are waiting for reply
> from @Dumitru Ceara  to understand why such a
> flow is required, correct?
> 

As Han pointed out, commit 32f5ebb062 ("ovn-northd: Limit ARP/ND
broadcast domain whenever possible.") added logical flows in the LS
S_SWITCH_IN_L2_LKUP stage to explicitly flood ARP/ND requests originated
from router owned IP interfaces. This was done for a couple of reasons:

1. ARP requests for destinations/next-hops outside OVN need to be
flooded in the broadcast domain anyway and would otherwise match the
lowest priority rule in S_SWITCH_IN_L2_LKUP that would flood them
nevertheless.

2. OVN sends periodic GARP requests for router owned IPs (i.e., NAT
addresses and logical_router_port addresses) to update external
switch/router FDB/ARP caches in scenarios like VM migration:
6bfbb4c24187 ("ovn: Send GARP on localnet."). These packets should be
flooded in the broadcast domain too.

I think we have a few options:

1. Change OVN behavior and use GARP replies instead of GARP requests.
The effect should be (almost [1]) the same from the external devices
perspective but the advantage is that we can completely remove the
logical flows that match on self originated ARP packets. This is quite
easy to achieve and I have a patch ready for it if we decide to go this way.
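To make option 1 concrete, here is a small sketch in plain Python (not OVN
code; the packet representation is an assumption for illustration) of what
distinguishes a gratuitous ARP request from a gratuitous ARP reply per RFC
5227: both carry SPA == TPA, and they differ only in arp.op (1 = request,
2 = reply).

```python
def is_garp(arp):
    # Gratuitous: the sender announces its own IP (SPA == TPA).
    return arp["spa"] == arp["tpa"]

def garp_kind(arp):
    """Classify a gratuitous ARP packet, or return None if not gratuitous."""
    if not is_garp(arp):
        return None
    return "garp-request" if arp["op"] == 1 else "garp-reply"

# OVN today announces NAT/LRP addresses with op == 1 (request); switching
# to op == 2 (reply) would mean self-originated announcements no longer
# match flows that classify on ARP requests.
print(garp_kind({"op": 1, "spa": "10.10.10.2", "tpa": "10.10.10.2"}))
# -> garp-request
print(garp_kind({"op": 2, "spa": "10.10.10.2", "tpa": "10.10.10.2"}))
# -> garp-reply
```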

2. Make the flows that match on self originated ARP traffic more
explicit and restrict them to GARP requests. For example, for a logi

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-26 Thread Venugopal Iyer
Hi, Dumitru:

-Original Message-
From: Dumitru Ceara  
Sent: Monday, May 25, 2020 3:55 AM
To: Girish Moodalbail ; Han Zhou 
Cc: Venugopal Iyer ; Tim Rozet ; Han 
Zhou ; Dan Winship ; ovs-discuss 
; ovn-kuberne...@googlegroups.com; Michael Cambria 

Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table


On 5/23/20 12:56 AM, Girish Moodalbail wrote:
>
>
> On Fri, May 22, 2020 at 1:51 PM Han Zhou  <mailto:zhou...@gmail.com>> wrote:
>
>
>
> On Fri, May 22, 2020 at 8:39 AM Venugopal Iyer
> mailto:venugop...@nvidia.com>> wrote:
>
> A couple of comments below:
>
>
>
>
>  I suppose the use of GARP as a reply v/s response is not
> very clear; [1], Section 3 seems to offer a concise summary of
> this. If the application sends GARP as
>  a reply we are covered, but the question is if the GARP is
> a request (which is allowed) then what our response should be.
> Tim is right, we can't ignore
>  the request (more so, since aging is not supported
> currently), however "arp_accept" ignores the request for
> creating a new cache entry, not updating
>  an existing one (see last para below)
>
> [2]
> arp_accept - BOOLEAN
> Define behavior for gratuitous ARP frames who's IP is not
> already present in the ARP table:
> 0 - don't create new entries in the ARP table
> 1 - create new entries in the ARP table
>
> Both replies and requests type gratuitous arp will
> trigger the
> ARP table to be updated, if this setting is on.
>
> If the ARP table already contains the IP address of the
> gratuitous arp frame, the arp table will be updated
> regardless
> if this setting is on or off.
>
>  if we lookup and get a hit, we should still process the
> GARP; only if we don't  have a hit, we should ignore (instead of
>  creating an entry). BTW, do we update today? if I
> understand the use of reg9[2] / REGBIT_LOOKUP_NEIGHBOR_RESULT
> (assuming lookup_arp
>  returns 1 if entry exists), I am not sure it does? maybe I
> missed it ..
>
> thanks,
>
> -venu
>
> [1]https://www.ietf.org/rfc/rfc5227.txt
>
>
> (Not sure why the indent format of your reply is not correct at
> least on my client - it mixes all previous replies together so one
> cannot tell which part was from whom, so I truncated all of them.)
>
> Thanks Venu. I think this would work: we can add an option similar
> but different from arp_accept (because it is not easy to OVN to tell
> if it is GARP on the ingress pipeline). The option can be named
> like: learn_from_arp_request.
> When ARP request is received, always check if an old entry existed
> for the SPA. If existed and MAC is different, then update the
> mac-binding entry. If the entry doesn't exist, check the option setting:
> "true" - add a new entry.
> "false" - if the TPA is on the router, add a new entry (it means the
> remote wants to communicate with this node, so it makes sense to
> learn the remote as well). Otherwise, ignore it and no new entry added.
>
> Do you think this works?
>
>
> I think this should work as well.
>
> For the single join switch connected to 1000 GRs, it should work as 
> well (assuming your other fix for dynamic learning is present as well).
> However, in this case,  even with this option set we will still be 
> sending the ARP broadcast out from Node1 to each of the other 999 Nodes.
> After the packets have travelled through the tunnel, we are going to 
> drop the packet on the target hypervisor, if 
> `learn_from_arp_request=true'. As I understand, we are waiting for 
> reply from @Dumitru Ceara <mailto:dce...@redhat.com> to understand why 
> such a flow is required, correct?
>

As Han pointed out, commit 32f5ebb062 ("ovn-northd: Limit ARP/ND broadcast 
domain whenever possible.") added logical flows in the LS S_SWITCH_IN_L2_LKUP 
stage to explicitly flood ARP/ND requests originated from router owned IP 
interfaces. This was done for a couple of reasons:

1. ARP requests for destinations/next-hops outside OVN need to be flooded in 
the broadcast domain anyway and would otherwise match the lowest priority rule 
in S_SWITCH_IN_L2_LKUP that would flood them nevertheless.

2. OVN sends periodic GARP requests for router owned IPs (i.e., NAT addresses 
and logical_router_port addresses) to update external switch/router FDB/ARP
caches in scenarios like VM migration: 6bfbb4c24187 ("ovn: Send GARP on
localnet."). These packets should be flooded in the broadcast domain too.

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-26 Thread Girish Moodalbail
Hello Dumitru,

There are several things that are being discussed on this thread. Let me
see if I can tease them out for clarity.

1. All the router IPs are known to OVN (the join switch case)
2. Some IPs are known and some are not (the external logical switch
that connects to the physical network case).

Let us look at each of the cases above:

1. Join Switch Case

+----------------+        +----------------+
|   l3gateway    |        |   l3gateway    |
|    router2     |        |    router3     |
+-------+--------+        +--------+-------+
     IP2,M2                     IP3,M3
        |                          |
     +--+--------------------------+--+
     |          join switch           |
     +----------------+---------------+
                      |
                   IP1,M1
             +--------+-------+
             |  distributed   |
             |     router     |
             +----------------+


Say, GR router2 wants to send a packet out to the DR, and we don't have
static mappings of MAC to IP in the lr_in_arp_resolve table on GR router2
(with Han's patch of dynamic_neigh_routes=true for all the Gateway Routers).
With this in mind, when an ARP request is sent out by router2's hypervisor,
the packet should be sent directly to the distributed router alone. Your
commit 32f5ebb0622 (ovn-northd: Limit ARP/ND broadcast domain whenever
possible) should have allowed only unicast. However, in the ls_in_l2_lkup
table we have

  table=19(ls_in_l2_lkup  ), priority=80   , match=(eth.src == { M2 }
&& (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood"; output;)
  table=19(ls_in_l2_lkup  ), priority=75   , match=(flags[1] == 0 &&
arp.op == 1 && arp.tpa == { IP1}), action=(outport = "jtor-router2";
output;)

As you can see, the `priority=80` rule will always be hit and the request
sent out to all the GRs; the `priority=75` rule is never hit, so we will
see ARP packets on the GENEVE tunnel. We need to change the `priority=80`
rule to match only GARP request packets. That way, for the known-OVN-IPs
case we don't broadcast.
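The shadowing between those two flows can be sketched in plain Python (a
rough model, not OVN code; the match fields and port names are simplified
stand-ins for the two logical flows quoted above):

```python
def select_flow(packet, flows):
    """Return the action of the highest-priority matching flow."""
    for prio, match, action in sorted(flows, key=lambda f: -f[0]):
        if match(packet):
            return action
    return "drop"

# Simplified stand-ins for the two quoted ls_in_l2_lkup flows.
flows = [
    # priority 80: any ARP request sourced from a router MAC is flooded
    (80, lambda p: p["eth_src"] == "M2" and p["arp_op"] == 1,
     'outport = "_MC_flood"; output;'),
    # priority 75: ARP request for a known OVN router IP is unicast
    (75, lambda p: p["flags1"] == 0 and p["arp_op"] == 1
         and p["arp_tpa"] == "IP1",
     'outport = "jtor-router2"; output;'),
]

# An ARP request from router2 for the DR's IP matches both rules, but
# priority 80 wins, so the packet is flooded instead of being unicast.
pkt = {"eth_src": "M2", "arp_op": 1, "flags1": 0, "arp_tpa": "IP1"}
print(select_flow(pkt, flows))  # -> outport = "_MC_flood"; output;
```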

2. External Logical Switch Case

              10.10.10.0/24
        ----------+-----------
                  |
               localnet
            +-----+-----+
            |  external |
       +----+    LS1    +----+
       |    +-----+-----+    |
       |          |          |
  10.10.10.2  10.10.10.3  10.10.10.4
     SNAT        SNAT        SNAT
+-----------+ +-----------+ +-----------+
| l3gateway | | l3gateway | | l3gateway |
|   node1   | |   node2   | |   node3   |
+-----------+ +-----------+ +-----------+

In this case, we have some of the IPs in OVN and some in the physical
network. If we fix (1) above, all the ARP requests for OVN's router IPs
will be unicast. However, all the ARP requests to external IPs, say
10.10.10.1 on the "physical router", will be broadcast. Now, we will see
these ARP broadcasts on all the L3 gateway routers. With
'learn_from_arp_request=false' [a], the MAC_Binding table will not
explode for either ARP or GARP requests.

So, I don't think GARP requests vs. replies are the issue here. Furthermore,
learning from GARP replies is blocked on certain routers. For example:
https://www.juniper.net/documentation/en_US/junose15.1/topics/concept/ip-gratuitous-arps-transmission-overview.html
says "By default, updating the ARP cache on GARP replies is disabled on the
router." So, our NAT address mappings will not be learnt.

Regards,
~Girish


[a] - From Han's mail, the meaning of learn_from_arp_request=false: if
the TPA is on the router, add a new entry (it means the remote wants to
communicate with this node, so it makes sense to learn the remote as
well); otherwise, ignore it and add no new entry.
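As a rough sketch of that learning rule (plain Python, not OVN code; the
function and parameter names are made up for illustration, and the
always-refresh behavior for existing entries follows Han's description
earlier in the thread):

```python
def handle_arp_request(bindings, spa, smac, tpa, router_ips,
                       learn_from_arp_request):
    """Update a per-router mac-binding table for one received ARP request.

    bindings: dict mapping IP -> MAC (stand-in for the MAC_Binding table)
    spa/smac: sender protocol / hardware address of the ARP request
    tpa: target protocol address
    router_ips: set of IPs owned by this logical router
    """
    if spa in bindings:
        # An existing entry is always refreshed, regardless of the option.
        bindings[spa] = smac
    elif learn_from_arp_request or tpa in router_ips:
        # A new entry is added only if the option allows it, or if the
        # request targets this router (the remote wants to talk to us).
        bindings[spa] = smac
    # Otherwise: ignore the request; no new entry is created.

bindings = {}
# Broadcast ARP for an external IP: ignored when the option is false.
handle_arp_request(bindings, "100.64.0.3", "m3", "10.10.10.1",
                   {"100.64.0.1"}, learn_from_arp_request=False)
print(bindings)  # -> {}
# ARP that targets one of this router's IPs: learned even when false.
handle_arp_request(bindings, "100.64.0.3", "m3", "100.64.0.1",
                   {"100.64.0.1"}, learn_from_arp_request=False)
print(bindings)  # -> {'100.64.0.3': 'm3'}
```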



On Mon, May 25, 2020 at 3:55 AM Dumitru Ceara  wrote:

> On 5/23/20 12:56 AM, Girish Moodalbail wrote:
> >
> >
> > On Fri, May 22, 2020 at 1:51 PM Han Zhou  > > wrote:
> >
> >
> >
> > On Fri, May 22, 2020 at 8:39 AM Venugopal Iyer
> > mailto:venugop...@nvidia.com>> wrote:
> >
> > A couple of comments below:
> >
> >
> >
> >
> >  I suppose the use of GARP as a reply v/s response is not
> > very clear; [1], Section 3 seems to offer a concise summary of
> > this. If the application sends GARP as
> >  a reply we are covered, but the question is if the GARP is
> > a request (which is allowed) then what our response should be.
> > Tim is right, we can't ignore
> >  the request (more so, since aging is not supported
> > currently), however "arp_accept" ignores the request for
> > creating a new cache entry, not updating
> >  an existing one (see last para below)
> >
> > [2]
> > arp_accept - BOOLEAN
> > Define behavior for

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-26 Thread Han Zhou
Hi Girish,

Thanks for the summary. I agree with you that GARP request v.s. reply is
irrelevant to the problem here.
Please see my comment inline below.

On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail 
wrote:
>
> Hello Dumitru,
>
> There are several things that are being discussed on this thread. Let me
see if I can tease them out for clarity.
>
> 1. All the router IPs are known to OVN (the join switch case)
> 2. Some IPs are known and some are not known (the external logical switch
that connects to physical network case).
>
> Let us look at each of the case above:
>
> 1. Join Switch Case
>
> ++++
> |   l3gateway||   l3gateway|
> |router2 ||router3 |
> +-+--++-+--+
> IP2,M2 IP3,M3
>   | |
>+--+-+---+
>|join switch |
>+-+--+
>  |
>   IP1,M1
>  +---++
>  |  distributed   |
>  | router |
>  ++
>
>
> Say, GR router2 wants to send the packet out to DR and that we don't have
static mappings of MAC to IP in lr_in_arp_resolve table on GR router2 (with
Han's patch of dynamic_neigh_routes=true for all the Gateway Routers). With
this in mind, when an ARP request is sent out by router2's hypervisor the
packet should be directly sent to the distributed router alone. Your commit
32f5ebb0622 (ovn-northd: Limit ARP/ND broadcast domain whenever possible)
should have allowed only unicast. However, in ls_in_l2_lkup table we have
>
>   table=19(ls_in_l2_lkup  ), priority=80   , match=(eth.src == { M2 }
&& (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood"; output;)
>   table=19(ls_in_l2_lkup  ), priority=75   , match=(flags[1] == 0 &&
arp.op == 1 && arp.tpa == { IP1}), action=(outport = "jtor-router2";
output;)
>
> As you can see, `priority=80` rule will always be hit and sent out to all
the GRs. The `priority=75` rule is never hit. So, we will see ARP packets
on the GENEVE tunnel. So, we need to change `priority=80` to match GARP
request packets. That way, for the known OVN IPs case we don't do broadcast.

Since the solution to case 2) below (i.e. learn_from_arp_request=false)
solves the problem of case 1) too, I think we don't need this change just
for case 1). As @Dumitru Ceara mentioned, there is some cost because it
adds extra flows. It would be a significant amount of flows if there are a
lot of snat_and_dnat IPs. What do you think?

>
> 2. External Logical Switch Case
>
>10.10.10.0/24
>-+--
> |
>  localnet
>   +-+-+
>   | external  |
>  ++LS1+-+
>  |+-+-+ |
>  |  |   |
>  10.10.10.2 10.10.10.3  10.10.10.4
> SNAT   SNATSNAT
>+-+-+  +-+-+   +---+
>| l3gateway |  | l3gateway |   | l3gateway |
>|   node1   |  |   node2   |   |   node3   |
>+---+  +---+   +---+
>
> In this case, we have some of the IPs in OVN and some in the physical
network. If we fix (1) above, all the ARP requests for the OVN's router IPs
will be unicast. However, all the ARP requests to external IPs, say
10.10.10.1 on the "physical router", will be broadcast. Now, we will see
these ARP broadcasts on all the L3 gateway routers. With
'learn_from_arp_request=false' [a], then the MAC_Binding table will not
explode for both ARP and GARP requests.
>
> So, I don't think GARP requests and replies is the issue here?
Furthermore, learning from the GARP replies are blocked on certain routers.
For example:
https://www.juniper.net/documentation/en_US/junose15.1/topics/concept/ip-gratuitous-arps-transmission-overview.html
 says "By default, updating the ARP cache on GARP replies is disabled on
the router.". So, our NAT addresses mapping will not be learnt.
>
> Regards,
> ~Girish
>
>
> [a] - From Han's mail, the meaning of learn_from_arp_request=false --> if
the TPA is on the router, add a new entry (it means the
> > remote wants to communicate with this node, so it makes sense to
> > learn the remote as well). Otherwise, ignore it and no new entry
added.
>
>
>
> On Mon, May 25, 2020 at 3:55 AM Dumitru Ceara  wrote:
>>
>> On 5/23/20 12:56 AM, Girish Moodalbail wrote:
>> >
>> >
>> > On Fri, May 22, 2020 at 1:51 PM Han Zhou > > > wrote:
>> >
>> >
>> >
>> > On Fri, May 22, 2020 at 8:39 AM Venugopal Iyer
>> > mailto:venugop...@nvidia.com>> wrote:
>> >
>> > A couple of comments below:
>> >
>> >
>> >
>> >
>> >  I suppos

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-26 Thread Girish Moodalbail
On Tue, May 26, 2020 at 12:42 PM Han Zhou  wrote:

> Hi Girish,
>
> Thanks for the summary. I agree with you that GARP request v.s. reply is
> irrelevant to the problem here.
> Please see my comment inline below.
>
> On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail 
> wrote:
> >
> > Hello Dumitru,
> >
> > There are several things that are being discussed on this thread. Let me
> see if I can tease them out for clarity.
> >
> > 1. All the router IPs are known to OVN (the join switch case)
> > 2. Some IPs are known and some are not known (the external logical
> switch that connects to physical network case).
> >
> > Let us look at each of the case above:
> >
> > 1. Join Switch Case
> >
> > ++++
> > |   l3gateway||   l3gateway|
> > |router2 ||router3 |
> > +-+--++-+--+
> > IP2,M2 IP3,M3
> >   | |
> >+--+-+---+
> >|join switch |
> >+-+--+
> >  |
> >   IP1,M1
> >  +---++
> >  |  distributed   |
> >  | router |
> >  ++
> >
> >
> > Say, GR router2 wants to send the packet out to DR and that we don't
> have static mappings of MAC to IP in lr_in_arp_resolve table on GR router2
> (with Han's patch of dynamic_neigh_routes=true for all the Gateway
> Routers). With this in mind, when an ARP request is sent out by router2's
> hypervisor the packet should be directly sent to the distributed router
> alone. Your commit 32f5ebb0622 (ovn-northd: Limit ARP/ND broadcast domain
> whenever possible) should have allowed only unicast. However, in
> ls_in_l2_lkup table we have
> >
> >   table=19(ls_in_l2_lkup  ), priority=80   , match=(eth.src == { M2
> } && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood"; output;)
> >   table=19(ls_in_l2_lkup  ), priority=75   , match=(flags[1] == 0 &&
> arp.op == 1 && arp.tpa == { IP1}), action=(outport = "jtor-router2";
> output;)
> >
> > As you can see, `priority=80` rule will always be hit and sent out to
> all the GRs. The `priority=75` rule is never hit. So, we will see ARP
> packets on the GENEVE tunnel. So, we need to change `priority=80` to match
> GARP request packets. That way, for the known OVN IPs case we don't do
> broadcast.
>
> Since the solution to case 2) below (i.e. learn_from_arp_request=false)
> solves the problem of case 1), too, I think we don't need this change just
> for case 1). As @Dumitru Ceara   mentioned, there is
> some cost because it adds extra flows. It would be significant amount of
> flows if there are a lot of snat_and_dnat IPs. What do you think?
>

Han, yes it will work. However, my only concern is that we would send all
these ARP requests via tunnel to each of the 1000 hypervisors, and these
hypervisors will just drop them on the floor when they see
learn_from_arp_request=false.

Han, Dumitru,

Why can't we swap the priorities of the above two flows so that the ARP
request for a NextHop IP known to OVN will always be sent via `unicast`?

Regards,
~Girish


> >
> > 2. External Logical Switch Case
> >
> >10.10.10.0/24
> >-+--
> > |
> >  localnet
> >   +-+-+
> >   | external  |
> >  ++LS1+-+
> >  |+-+-+ |
> >  |  |   |
> >  10.10.10.2 10.10.10.3  10.10.10.4
> > SNAT   SNATSNAT
> >+-+-+  +-+-+   +---+
> >| l3gateway |  | l3gateway |   | l3gateway |
> >|   node1   |  |   node2   |   |   node3   |
> >+---+  +---+   +---+
> >
> > In this case, we have some of the IPs in OVN and some in the physical
> network. If we fix (1) above, all the ARP requests for the OVN's router IPs
> will be unicast. However, all the ARP requests to external IPs, say
> 10.10.10.1 on the "physical router", will be broadcast. Now, we will see
> these ARP broadcasts on all the L3 gateway routers. With
> 'learn_from_arp_request=false' [a], then the MAC_Binding table will not
> explode for both ARP and GARP requests.
> >
> > So, I don't think GARP requests and replies is the issue here?
> Furthermore, learning from the GARP replies are blocked on certain routers.
> For example:
> https://www.juniper.net/documentation/en_US/junose15.1/topics/concept/ip-gratuitous-arps-transmission-overview.html
>  says "By default, updating the ARP cache on GARP replies is disabled on
> the router.". So, our NAT addresses mapping will not be learnt.
> >
> > Regards,
> > ~Girish
> >
> >
> > [a] - From Han's mail, the mea

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-26 Thread Han Zhou
On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail 
wrote:
>
>
>
> On Tue, May 26, 2020 at 12:42 PM Han Zhou  wrote:
>>
>> Hi Girish,
>>
>> Thanks for the summary. I agree with you that GARP request v.s. reply is
irrelevant to the problem here.
>> Please see my comment inline below.
>>
>> On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail 
wrote:
>> >
>> > Hello Dumitru,
>> >
>> > There are several things that are being discussed on this thread. Let
me see if I can tease them out for clarity.
>> >
>> > 1. All the router IPs are known to OVN (the join switch case)
>> > 2. Some IPs are known and some are not known (the external logical
switch that connects to physical network case).
>> >
>> > Let us look at each of the case above:
>> >
>> > 1. Join Switch Case
>> >
>> > ++++
>> > |   l3gateway||   l3gateway|
>> > |router2 ||router3 |
>> > +-+--++-+--+
>> > IP2,M2 IP3,M3
>> >   | |
>> >+--+-+---+
>> >|join switch |
>> >+-+--+
>> >  |
>> >   IP1,M1
>> >  +---++
>> >  |  distributed   |
>> >  | router |
>> >  ++
>> >
>> >
>> > Say, GR router2 wants to send the packet out to DR and that we don't
have static mappings of MAC to IP in lr_in_arp_resolve table on GR router2
(with Han's patch of dynamic_neigh_routes=true for all the Gateway
Routers). With this in mind, when an ARP request is sent out by router2's
hypervisor the packet should be directly sent to the distributed router
alone. Your commit 32f5ebb0622 (ovn-northd: Limit ARP/ND broadcast domain
whenever possible) should have allowed only unicast. However, in
ls_in_l2_lkup table we have
>> >
>> >   table=19(ls_in_l2_lkup  ), priority=80   , match=(eth.src == {
M2 } && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood"; output;)
>> >   table=19(ls_in_l2_lkup  ), priority=75   , match=(flags[1] == 0
&& arp.op == 1 && arp.tpa == { IP1}), action=(outport = "jtor-router2";
output;)
>> >
>> > As you can see, `priority=80` rule will always be hit and sent out to
all the GRs. The `priority=75` rule is never hit. So, we will see ARP
packets on the GENEVE tunnel. So, we need to change `priority=80` to match
GARP request packets. That way, for the known OVN IPs case we don't do
broadcast.
>>
>> Since the solution to case 2) below (i.e. learn_from_arp_request=false)
solves the problem of case 1), too, I think we don't need this change just
for case 1). As @Dumitru Ceara  mentioned, there is some cost because it
adds extra flows. It would be significant amount of flows if there are a
lot of snat_and_dnat IPs. What do you think?
>
>
> Han, yes it will work. However, my only concern is that we would send all
these ARP requests via tunnel to each of the 1000 hypervisors, and these
hypervisors will just drop them on the floor when they see
learn_from_arp_request=false.

I think maybe it is not a problem since it happens only once on the join
switch. Once the MAC is learned, it won't be broadcast again. It may be more
of a problem on the external LS if periodic GARPs are required there.
However, I'd suggest running some tests to see whether it is really a
problem before trying to solve it.

>
> Han, Dumitru,
>
> Why can't we swap the priorities of the above two flows so that the ARP
request for a NextHop IP known to OVN will always be sent via `unicast`?

If swapped, even GARPs won't get broadcast. Maybe that's not the desired
behavior.

>
> Regards,
> ~Girish
>
>>
>> >
>> > 2. External Logical Switch Case
>> >
>> >10.10.10.0/24
>> >-+--
>> > |
>> >  localnet
>> >   +-+-+
>> >   | external  |
>> >  ++LS1+-+
>> >  |+-+-+ |
>> >  |  |   |
>> >  10.10.10.2 10.10.10.3  10.10.10.4
>> > SNAT   SNATSNAT
>> >+-+-+  +-+-+   +---+
>> >| l3gateway |  | l3gateway |   | l3gateway |
>> >|   node1   |  |   node2   |   |   node3   |
>> >+---+  +---+   +---+
>> >
>> > In this case, we have some of the IPs in OVN and some in the physical
network. If we fix (1) above, all the ARP requests for the OVN's router IPs
will be unicast. However, all the ARP requests to external IPs, say
10.10.10.1 on the "physical router", will be broadcast. Now, we will see
these ARP broadcasts on all the L3 gateway routers. With
'learn_from_arp_request=false' [a], then the MAC_Binding table will not
explode for both ARP and GARP request

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-27 Thread Dumitru Ceara
Hi Girish, Han,

On 5/26/20 11:51 PM, Han Zhou wrote:
> 
> 
> On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail  > wrote:
>>
>>
>>
>> On Tue, May 26, 2020 at 12:42 PM Han Zhou  > wrote:
>>>
>>> Hi Girish,
>>>
>>> Thanks for the summary. I agree with you that GARP request v.s. reply
> is irrelevant to the problem here.

Well, actually I think GARP request vs reply is relevant (at least for
case 1 below), because if OVN were generating GARP replies we wouldn't
need the priority-80 flow to determine whether an ARP request packet is
actually an OVN self-originated GARP that needs to be flooded in the L2
broadcast domain.

On the other hand, router3 would be learning mac_binding IP2,M2 from the
GARP reply originated by router2 and vice versa so we'd have to restrict
flooding of GARP replies to non-patch ports.

>>> Please see my comment inline below.
>>>
>>> On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail
> mailto:gmoodalb...@gmail.com>> wrote:
>>> >
>>> > Hello Dumitru,
>>> >
>>> > There are several things that are being discussed on this thread.
> Let me see if I can tease them out for clarity.
>>> >
>>> > 1. All the router IPs are known to OVN (the join switch case)
>>> > 2. Some IPs are known and some are not known (the external logical
> switch that connects to physical network case).
>>> >
>>> > Let us look at each of the case above:
>>> >
>>> > 1. Join Switch Case
>>> >
>>> > ++        ++
>>> > |   l3gateway    |        |   l3gateway    |
>>> > |    router2     |        |    router3     |
>>> > +-+--+        +-+--+
>>> >             IP2,M2         IP3,M3          
>>> >               |             |                            
>>> >            +--+-+---+          
>>> >            |    join switch     |          
>>> >            +-+--+          
>>> >                      |                      
>>> >                   IP1,M1                    
>>> >              +---++            
>>> >              |  distributed   |            
>>> >              |     router     |            
>>> >              ++      
>>> >
>>> >
>>> > Say, GR router2 wants to send the packet out to DR and that we
> don't have static mappings of MAC to IP in lr_in_arp_resolve table on GR
> router2 (with Han's patch of dynamic_neigh_routes=true for all the
> Gateway Routers). With this in mind, when an ARP request is sent out by
> router2's hypervisor the packet should be directly sent to the
> distributed router alone. Your commit 32f5ebb0622 (ovn-northd: Limit
> ARP/ND broadcast domain whenever possible) should have allowed only
> unicast. However, in ls_in_l2_lkup table we have
>>> >
>>> >   table=19(ls_in_l2_lkup      ), priority=80   , match=(eth.src ==
> { M2 } && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood"; output;)
>>> >   table=19(ls_in_l2_lkup      ), priority=75   , match=(flags[1] ==
> 0 && arp.op == 1 && arp.tpa == { IP1}), action=(outport =
> "jtor-router2"; output;)
>>> >
>>> > As you can see, `priority=80` rule will always be hit and sent out
> to all the GRs. The `priority=75` rule is never hit. So, we will see ARP
> packets on the GENEVE tunnel. So, we need to change `priority=80` to
> match GARP request packets. That way, for the known OVN IPs case we
> don't do broadcast.
>>>
>>> Since the solution to case 2) below (i.e.
> learn_from_arp_request=false) solves the problem of case 1), too, I
> think we don't need this change just for case 1). As @Dumitru Ceara
>  mentioned, there is some cost because it adds extra flows. It would be
> significant amount of flows if there are a lot of snat_and_dnat IPs.
> What do you think?

I think the following might be a solution, although with the cost of
adding as many flows as there are configured dnat_and_snat IPs:

- priority 80: explicitly determine if an ARP request is a self
originated GARP for configured IP addresses and dnat_and_snat IPs (by
matching on all eth.src and arp.tpa pairs) and if so flood on all
non-patch ports.
- priority 75: if arp.tpa is owned by an OVN logical router port,
"unicast" it only on the patch port towards the router.
- priority 1: flood any broadcast packet.

Together with the learn_from_arp_request=false knob this would cover
both case 1 (join switch) and case 2 (external switch).

Wdyt?
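The three-tier scheme above can be sketched as follows (a rough Python
model, not ovn-northd code; the port names and the packet/table
representations are assumptions for illustration):

```python
def classify_arp(pkt, own_garp_pairs, router_ip_to_patch):
    """Decide where an ARP request entering the switch pipeline is sent.

    own_garp_pairs: set of (eth_src, arp_tpa) pairs for self-originated
        GARPs (configured IPs plus dnat_and_snat IPs).
    router_ip_to_patch: map of router-owned IP -> patch port name.
    """
    # priority 80: self-originated GARP -> flood, but only on non-patch
    # ports so the other routers don't learn it as a mac_binding.
    if (pkt["eth_src"], pkt["arp_tpa"]) in own_garp_pairs:
        return "flood-non-patch"
    # priority 75: target IP owned by an OVN logical router port ->
    # "unicast" only on the patch port towards that router.
    if pkt["arp_tpa"] in router_ip_to_patch:
        return router_ip_to_patch[pkt["arp_tpa"]]
    # priority 1: anything else (e.g. external next hops) is flooded.
    return "flood"

garps = {("M1", "IP1")}            # DR announcing its own IP
patches = {"IP1": "jtor-DR"}       # hypothetical patch port name
print(classify_arp({"eth_src": "M2", "arp_tpa": "IP1"}, garps, patches))
# -> jtor-DR
print(classify_arp({"eth_src": "M2", "arp_tpa": "10.10.10.1"},
                   garps, patches))  # -> flood
```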

>>
>>
>> Han, yes it will work. However, my only concern is that we would send
> all these ARP requests via tunnel to each of 1000 hypervisors and these
> hypervisors will just drop them on the floor. when they see
> learn_from_arp_request=false.
> 
> I think maybe it is not a problem since it happens only once on the Join
> switch. Once the MAC is learned, it won't broadcast again. It may be
> more of a problem on the external LS if periodical GARP is required
> there. However, I'd suggest to have some test and see if it is really a
> problem, before trying to solve it.
> 
>>
>> Han, D

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-27 Thread Han Zhou
On Wed, May 27, 2020 at 1:10 AM Dumitru Ceara  wrote:
>
> Hi Girish, Han,
>
> On 5/26/20 11:51 PM, Han Zhou wrote:
> >
> >
> > On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail  > > wrote:
> >>
> >>
> >>
> >> On Tue, May 26, 2020 at 12:42 PM Han Zhou  > > wrote:
> >>>
> >>> Hi Girish,
> >>>
> >>> Thanks for the summary. I agree with you that GARP request v.s. reply
> > is irrelevant to the problem here.
>
> Well, actually I think GARP request vs reply is relevant (at least for
> case 1 below) because if OVN would be generating GARP replies we
> wouldn't need the priority 80 flow to determine if an ARP request packet
> is actually an OVN self originated GARP that needs to be flooded in the
> L2 broadcast domain.
>
> On the other hand, router3 would be learning mac_binding IP2,M2 from the
> GARP reply originated by router2 and vice versa so we'd have to restrict
> flooding of GARP replies to non-patch ports.
>

Hi Dumitru, the point was that, on the external LS, the GRs will have to
send ARP requests to resolve unknown IPs (at least for the external GW),
and those have to be broadcast, which will cause all the GRs to learn the
MACs of all the other GRs. This is regardless of the GARP behavior. You are
right that if we only consider the join switch, then GARP request v.s.
reply does make a difference. However, GARP requests/replies may really be
needed only on the external LS.

> >>> Please see my comment inline below.
> >>>
> >>> On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail
> > mailto:gmoodalb...@gmail.com>> wrote:
> >>> >
> >>> > Hello Dumitru,
> >>> >
> >>> > There are several things that are being discussed on this thread.
> > Let me see if I can tease them out for clarity.
> >>> >
> >>> > 1. All the router IPs are known to OVN (the join switch case)
> >>> > 2. Some IPs are known and some are not known (the external logical
> > switch that connects to physical network case).
> >>> >
> >>> > Let us look at each of the case above:
> >>> >
> >>> > 1. Join Switch Case
> >>> >
> >>> > ++++
> >>> > |   l3gateway||   l3gateway|
> >>> > |router2 ||router3 |
> >>> > +-+--++-+--+
> >>> > IP2,M2 IP3,M3
> >>> >   | |
> >>> >+--+-+---+
> >>> >|join switch |
> >>> >+-+--+
> >>> >  |
> >>> >   IP1,M1
> >>> >  +---++
> >>> >  |  distributed   |
> >>> >  | router |
> >>> >  ++
> >>> >
> >>> >
> >>> > Say, GR router2 wants to send the packet out to DR and that we
> > don't have static mappings of MAC to IP in lr_in_arp_resolve table on GR
> > router2 (with Han's patch of dynamic_neigh_routes=true for all the
> > Gateway Routers). With this in mind, when an ARP request is sent out by
> > router2's hypervisor the packet should be directly sent to the
> > distributed router alone. Your commit 32f5ebb0622 (ovn-northd: Limit
> > ARP/ND broadcast domain whenever possible) should have allowed only
> > unicast. However, in ls_in_l2_lkup table we have
> >>> >
> >>> >   table=19(ls_in_l2_lkup  ), priority=80   , match=(eth.src ==
> > { M2 } && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood";
output;)
> >>> >   table=19(ls_in_l2_lkup  ), priority=75   , match=(flags[1] ==
> > 0 && arp.op == 1 && arp.tpa == { IP1}), action=(outport =
> > "jtor-router2"; output;)
> >>> >
> >>> > As you can see, the `priority=80` rule will always match, so the
> > packet is flooded to all the GRs and the `priority=75` rule is never
> > hit. As a result we will see ARP packets on the GENEVE tunnels. So, we
> > need to change `priority=80` to match only GARP request packets. That
> > way, for the known OVN IPs case we don't broadcast.
> >>>
> >>> Since the solution to case 2) below (i.e.
> > learn_from_arp_request=false) solves the problem of case 1), too, I
> > think we don't need this change just for case 1). As @Dumitru Ceara
> >  mentioned, there is some cost because it adds extra flows. It would be
> > a significant amount of flows if there are a lot of dnat_and_snat IPs.
> > What do you think?
>
> I think the following might be a solution, although at the cost of
> adding as many flows as there are configured dnat_and_snat IPs:
>
> - priority 80: explicitly determine if an ARP request is a self
> originated GARP for configured IP addresses and dnat_and_snat IPs (by
> matching on all eth.src and arp.tpa pairs) and if so flood on all
> non-patch ports.
> - priority 75: if arp.tpa is owned by an OVN logical router port,
> "unicast" it only on the patch port towards the router.
> - priority 1: flood any broadcast packet.
>
> Together with the learn_from_arp_request=false knob this would cover
> both case 1 (join switch) and case 2 (external switch).
>
> Wdyt?
>
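A sketch of the proposed three-priority scheme, written in the same
ls_in_l2_lkup flow syntax quoted earlier in the thread (the matches and the
restricted flood group are illustrative placeholders, not the final patch;
placeholders are marked with angle brackets):

```
  /* self-originated GARP for a configured or dnat_and_snat IP:
     flood, but only toward non-patch ports */
  table=19(ls_in_l2_lkup), priority=80,
      match=(eth.src == {M2} && arp.op == 1 &&
             arp.tpa == {IP2, <dnat_and_snat IPs>}),
      action=(outport = <flood group excluding patch ports>; output;)

  /* target IP owned by an OVN logical router port:
     "unicast" only on the patch port toward that router */
  table=19(ls_in_l2_lkup), priority=75,
      match=(flags[1] == 0 && arp.op == 1 && arp.tpa == {IP1}),
      action=(outport = "jtor-router2"; output;)

  /* any other broadcast: flood as today */
  table=19(ls_in_l2_lkup), priority=1,
      match=(eth.mcast),
      action=(outport = "_MC_flood"; output;)
```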

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-28 Thread Dumitru Ceara
On 5/28/20 8:34 AM, Han Zhou wrote:
> Hi Dumitru, the point was that, on the external LS, the GRs will have to
> send ARP requests to resolve unknown IPs (at least for the external GW),
> and it has to be broadcasted, which will cause all the GRs learn all
> MACs of other GRs. This is regardless of the GARP behavior. You are
> right that if we only consider the Join switch then the GARP request
> v.s. reply does make a difference. However, GARP request/reply may be
> really needed only on the external LS.
> 

Ok, but do you see an easy way to determine if we need to add the
logical flows that flood self originated GARP packets on a given logical
switch? Right now we add them on all switches.


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-28 Thread Daniel Alvarez Sanchez
Hi all

Sorry for top posting. I want to thank you all for the discussion and
give also some feedback from OpenStack perspective which is affected
by the problem described here.

In OpenStack, it's kind of common to have a shared external network
(logical switch with a localnet port) across many tenants. Each tenant
user may create their own router where their instances will be
connected to access the external network.

In such a scenario, we are hitting the issue described here. In
particular, in our tests we exercise 3K VIFs (1 FIP each) spanning
300 LSes; each LS is connected to an LR (i.e. 300 LRs) and each such
router is connected to the public LS. This creates a huge problem in
terms of performance, and tons of events due to the MAC_Binding entries
generated as a consequence of the GARPs sent for the floating IPs.

Thanks,
Daniel
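Daniel's numbers imply back-of-the-envelope MAC_Binding growth along these
lines (a rough sketch only; it assumes every router attached to the shared
external network learns every flooded binding, which is the behavior under
discussion):

```python
# Rough Southbound MAC_Binding scale estimate for the OpenStack
# scenario described above.
fips = 3000      # 3K VIFs, 1 FIP each, all on the public LS
routers = 300    # 300 tenant LRs attached to the public LS

# Each router learns one MAC_Binding row per flooded FIP GARP/ARP...
bindings_from_fips = fips * routers

# ...plus each router learns every other router's external address.
bindings_from_routers = routers * (routers - 1)

total = bindings_from_fips + bindings_from_routers
print(total)  # -> 989700
```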



Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-28 Thread Dumitru Ceara
On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
> In such scenario, we are hitting the issue described here. In
> particular in our tests we exercise 3K VIFs (with 1 FIP) each spanning
> 300 LS; each LS connected to a LR (ie. 300 LRs) and that router
> connected to the public LS. This is creating a huge problem in terms
> of performance and tons of events due to the MAC_Binding entries
> generated as a consequence of the GARPs sent for the floating IPs.
> 

Just as an addition to this, GARPs wouldn't be the only reason why all
routers would learn the MAC_Binding. Even if we wouldn't be sending
GARPs for the FIPs, when a VM that's behind a FIP would send traffic to
the outside, the router will generate an ARP request for the next hop
using the FIP-IP and FIP-MAC. This will be broadcasted to all routers
connected to the public LS and will trigger them to learn the
FIP-IP:FIP-MAC binding.


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-05-28 Thread Tim Rozet
On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara  wrote:

>
> Just as an addition to this, GARPs wouldn't be the only reason why all
> routers would learn the MAC_Binding. Even if we wouldn't be sending
> GARPs for the FIPs, when a VM that's behind a FIP would send traffic to
> the outside, the router will generate an ARP request for the next hop
> using the FIP-IP and FIP-MAC. This will be broadcasted to all routers
> connected to the public LS and will trigger them to learn the
> FIP-IP:FIP-MAC binding.
>

Yeah we shouldn't be learning on regular ARP requests.



Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-06-03 Thread Girish Moodalbail
Hello all,

To proceed with the proposed fixes with minimal impact, is the
following a reasonable approach?

   1. Add an option, namely dynamic_neigh_routes={true|false}, for a
   gateway router. With this option enabled, the next-hop IP's MAC will be
   learned through an ARP request on the physical network. The ARP request
   will be flooded on the L2 broadcast domain (for both the join switch and
   the external switch).

   2. Add an option, namely learn_from_arp_request={true|false}, for a
   gateway router. The option is interpreted as below:
   "true" - learn the MAC/IP binding and add a new MAC_Binding entry
   (default behavior)
   "false" - if there is a MAC_Binding for that IP and the MAC is
   different, then update that MAC/IP binding. The external entity might be
   trying to advertise a new MAC for that IP. (If we don't do this, then we
   will never learn external VIP-to-MAC changes.)

   (Irrespective of whether learn_from_arp_request is true or false, always
   do this -- if the TPA is on the router, add a new entry. It means the
   remote wants to communicate with this node, so it makes sense to learn
   the remote as well.)
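The decision logic in item 2 can be sketched as follows (a minimal model
with hypothetical helper names; the real implementation would live in
ovn-controller's flow/pinctrl code, not in Python):

```python
# Sketch of the proposed learn_from_arp_request semantics.
# mac_bindings maps IP -> MAC, standing in for the SB MAC_Binding table.

def should_learn(sender_ip, sender_mac, tpa, router_owned_ips,
                 mac_bindings, learn_from_arp_request):
    """Decide whether an incoming ARP request should create or update
    a MAC_Binding for (sender_ip, sender_mac)."""
    # Always learn when the target is one of our own IPs: the remote
    # wants to talk to us, so we will likely need its binding anyway.
    if tpa in router_owned_ips:
        return True
    if learn_from_arp_request:
        return True  # default behavior: learn from any ARP request
    # learn_from_arp_request=false: only refresh an existing binding
    # whose MAC changed (e.g. an external VIP moving to a new MAC).
    return sender_ip in mac_bindings and mac_bindings[sender_ip] != sender_mac

bindings = {"172.16.0.10": "00:00:00:00:00:aa"}
owned = {"172.16.0.1"}

# Unknown sender ARPing for an unrelated IP: not learned.
print(should_learn("172.16.0.50", "00:00:00:00:00:50", "172.16.0.99",
                   owned, bindings, False))  # False
# Known IP advertising a different MAC: binding is updated.
print(should_learn("172.16.0.10", "00:00:00:00:00:bb", "172.16.0.99",
                   owned, bindings, False))  # True
# Remote ARPing for one of our own IPs: learned regardless.
print(should_learn("172.16.0.50", "00:00:00:00:00:50", "172.16.0.1",
                   owned, bindings, False))  # True
```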


For now, I think it is fine for ARP packets to be broadcast over the
tunnels for the `join` switch case. If it becomes a problem, then we can
start looking at changing the logical flows.

Thanks everyone for the lively discussion.

Regards,
~Girish


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-06-03 Thread Han Zhou
Hi Girish, yes, that's what we concluded in the last OVN meeting, but
sorry that I forgot to update here.

On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail 
wrote:
>
> Hello all,
>
> To kind of proceed with the proposed fixes, with minimal impact, is the
following a reasonable approach?
>
> Add an option, namely dynamic_neigh_routes={true|false}, for a gateway
router. With this option enabled, the nextHop IP's MAC will be learned
through a ARP request on the physical network. The ARP request will be
flooded on the L2 broadcast domain (for both join switch and external
switch).
>

The RFC patch fulfils this purpose:
https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
I am working on the formal patch.

> Add an option, namely learn_from_arp_request={true|false}, for a gateway
router. The option is interpreted as below:\
> "true" - learn the MAC/IP binding and add a new MAC_Binding entry
(default behavior)
> "false" - if there is a MAC_binding for that IP and the MAC is different,
then update that MAC/IP binding. The external entity might be trying to
advertise the new MAC for that IP. (If we don't do this, then we will never
learn External VIP to MAC changes)
>
> (Irrespective of, learn_from_arp_request is true or false, always do this
-- if the TPA is on the router, add a new entry (it means the remote wants
to communicate with this node, so it makes sense to learn the remote as
well))
>

I am working on this as well, but delayed a little. I hope to have
something this week.
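Once both patches land, enabling the new behavior per gateway router would
presumably look something like the following (option names taken from the
proposal above; the exact knobs could still change during review):

```shell
# Hypothetical usage, assuming the options keep the proposed names.
# Resolve next hops via ARP instead of static lr_in_arp_resolve flows:
ovn-nbctl set Logical_Router GR-node1 options:dynamic_neigh_routes=true
# Don't learn MAC_Bindings from every broadcast ARP request:
ovn-nbctl set Logical_Router GR-node1 options:learn_from_arp_request=false
```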


Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-06-03 Thread Girish Moodalbail
No worries, thanks for the update Han.

Once you have the patch, we can test your changes on our cluster and
provide you an update.

Regards,
~Girish

> >>> > >> wrote:
> >>> >>>
> >>> >>> Hi Girish,
> >>> >>>
> >>> >>> Thanks for the summary. I agree with you that GARP request
> v.s. reply
> >>> > is irrelavent to the problem here.
> >>> 
> >>>  Well, actually I think GARP request vs reply is relevant (at
> least for
> >>>  case 1 below) because if OVN would be generating GARP replies we
> >>

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-06-09 Thread Venugopal Iyer
Sorry for the delay, Han, a quick question below:

From: ovn-kuberne...@googlegroups.com  On 
Behalf Of Han Zhou
Sent: Wednesday, June 3, 2020 4:27 PM
To: Girish Moodalbail 
Cc: Tim Rozet ; Dumitru Ceara ; Daniel 
Alvarez Sanchez ; Dan Winship ; 
ovn-kuberne...@googlegroups.com; ovs-discuss ; 
Michael Cambria ; Venugopal Iyer 
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

External email: Use caution opening links or attachments

Hi Girish, yes, that's what we concluded in the last OVN meeting, but sorry that 
I forgot to update here.

On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail 
mailto:gmoodalb...@gmail.com>> wrote:
>
> Hello all,
>
> To kind of proceed with the proposed fixes, with minimal impact, is the 
> following a reasonable approach?
>
> Add an option, namely dynamic_neigh_routes={true|false}, for a gateway 
> router. With this option enabled, the nextHop IP's MAC will be learned 
> through a ARP request on the physical network. The ARP request will be 
> flooded on the L2 broadcast domain (for both join switch and external switch).
>

The RFC patch fulfils this purpose: 
https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
I am working on the formal patch.

> Add an option, namely learn_from_arp_request={true|false}, for a gateway 
> router. The option is interpreted as below:\
> "true" - learn the MAC/IP binding and add a new MAC_Binding entry (default 
> behavior)
> "false" - if there is a MAC_binding for that IP and the MAC is different, 
> then update that MAC/IP binding. The external entity might be trying to 
> advertise the new MAC for that IP. (If we don't do this, then we will never 
> learn External VIP to MAC changes)
>
> (Irrespective of, learn_from_arp_request is true or false, always do this -- 
> if the TPA is on the router, add a new entry (it means the remote wants to 
> communicate with this node, so it makes sense to learn the remote as well))
>

I am working on this as well, but delayed a little. I hope to have something 
this week.
[vi> ] Just wanted to check if this should be learn_from_unsolicited_arp 
(covering unsolicited ARP requests and replies) instead of 
learn_from_arp_request? This is just to protect against potential rogue use of 
GARP replies flooding the MAC bindings.

Thanks,

-venu

>
> For now, I think it is fine for ARP packets to be broadcasted on the tunnel 
> for the `join` switch case. If it becomes a problem, then we can start 
> looking around changing the logical flows.
>
> Thanks everyone for the lively discussion.
>
> Regards,
> ~Girish
>
> On Thu, May 28, 2020 at 7:33 AM Tim Rozet 
> mailto:tro...@redhat.com>> wrote:
>>
>>
>>
>> On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara 
>> mailto:dce...@redhat.com>> wrote:
>>>
>>> On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
>>> > Hi all
>>> >
>>> > Sorry for top posting. I want to thank you all for the discussion and
>>> > give also some feedback from OpenStack perspective which is affected
>>> > by the problem described here.
>>> >
>>> > In OpenStack, it's kind of common to have a shared external network
>>> > (logical switch with a localnet port) across many tenants. Each tenant
>>> > user may create their own router where their instances will be
>>> > connected to access the external network.
>>> >
>>> > In such scenario, we are hitting the issue described here. In
>>> > particular in our tests we exercise 3K VIFs (with 1 FIP) each spanning
>>> > 300 LS; each LS connected to a LR (ie. 300 LRs) and that router
>>> > connected to the public LS. This is creating a huge problem in terms
>>> > of performance and tons of events due to the MAC_Binding entries
>>> > generated as a consequence of the GARPs sent for the floating IPs.
>>> >
>>>
>>> Just as an addition to this, GARPs wouldn't be the only reason why all
>>> routers would learn the MAC_Binding. Even if we wouldn't be sending
>>> GARPs for the FIPs, when a VM that's behind a FIP would send traffic to
>>> the outside, the router will generate an ARP request for the next hop
>>> using the FIP-IP and FIP-MAC. This will be broadcasted to all routers
>>> connected to the public LS and will trigger them to learn the
>>> FIP-IP:FIP-MAC binding.
>>
>>
>> Yeah we shouldn't be learning on regular ARP requests.
>>
>>>
>>>
>>> > Thanks,
>>> > Daniel
>>> >
>>> >

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-06-09 Thread Han Zhou
On Tue, Jun 9, 2020 at 9:06 AM Venugopal Iyer  wrote:

> Sorry for the delay, Han, a quick question below:
>
>
>
> Hi Girish, yes, that's what we concluded in last OVN meeting, but sorry
> that I forgot to update here.
>
>
> On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail 
> wrote:
> >
> > Hello all,
> >
> > To kind of proceed with the proposed fixes, with minimal impact, is the
> following a reasonable approach?
> >
> > Add an option, namely dynamic_neigh_routes={true|false}, for a gateway
> router. With this option enabled, the nextHop IP's MAC will be learned
> through a ARP request on the physical network. The ARP request will be
> flooded on the L2 broadcast domain (for both join switch and external
> switch).
>
> >
>
>
>
> The RFC patch fulfils this purpose:
> https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
>
> I am working on the formal patch.
>
>
>
> > Add an option, namely learn_from_arp_request={true|false}, for a gateway
> router. The option is interpreted as below:\
> > "true" - learn the MAC/IP binding and add a new MAC_Binding entry
> (default behavior)
> > "false" - if there is a MAC_binding for that IP and the MAC is
> different, then update that MAC/IP binding. The external entity might be
> trying to advertise the new MAC for that IP. (If we don't do this, then we
> will never learn External VIP to MAC changes)
> >
> > (Irrespective of, learn_from_arp_request is true or false, always do
> this -- if the TPA is on the router, add a new entry (it means the remote
> wants to communicate with this node, so it makes sense to learn the remote
> as well))
>
> >
>
>
>
> I am working on this as well, but delayed a little. I hope to have
> something this week.
>
> *[vi> ] Just wanted to check if this should be learn_From_unsolicit_arp
> (unsolicited ARP request or reply) instead of learn_from_arp_request? This
> is just to protect from potential rogue usage of  GARP reply flooding the
> MAC bindings.?*
>
>
>

Hi Venu, as discussed earlier in this thread, it is hard for OVN to tell
whether an ARP packet is gratuitous from within the router ingress pipeline.
The proposal here considers ARP requests only, which seems the best option so
far.
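For illustration, the usual host-side way to identify a gratuitous ARP is that the sender protocol address equals the target protocol address (RFC 5227-style announcements). A minimal sketch of that heuristic is below; note it catches announcements but not every unsolicited reply, which is part of why detecting GARP in the router pipeline is tricky. The function name and addresses are illustrative only:

```python
def is_gratuitous_arp(spa: str, tpa: str) -> bool:
    """RFC 5227-style heuristic: a gratuitous ARP (request or reply)
    announces the sender's own address, so SPA == TPA."""
    return spa == tpa

# A GARP announcing 100.64.0.2 targets its own IP:
assert is_gratuitous_arp("100.64.0.2", "100.64.0.2")
# An ordinary next-hop resolution request does not:
assert not is_gratuitous_arp("100.64.0.2", "100.64.0.1")
```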


> *Thanks,*
>
>
>
> *-venu*
>
>
>
> >
> > For now, I think it is fine for ARP packets to be broadcasted on the
> tunnel for the `join` switch case. If it becomes a problem, then we can
> start looking around changing the logical flows.
> >
> > Thanks everyone for the lively discussion.
> >
> > Regards,
> > ~Girish
> >
> > On Thu, May 28, 2020 at 7:33 AM Tim Rozet  wrote:
> >>
> >>
> >>
> >> On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara 
> wrote:
> >>>
> >>> On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
> >>> > Hi all
> >>> >
> >>> > Sorry for top posting. I want to thank you all for the discussion and
> >>> > give also some feedback from OpenStack perspective which is affected
> >>> > by the problem described here.
> >>> >
> >>> > In OpenStack, it's kind of common to have a shared external network
> >>> > (logical switch with a localnet port) across many tenants. Each
> tenant
> >>> > user may create their own router where their instances will be
> >>> > connected to access the external network.
> >>> >
> >>> > In such scenario, we are hitting the issue described here. In
> >>> > particular in our tests we exercise 3K VIFs (with 1 FIP) each
> spanning
> >>> > 300 LS; each LS connected to a LR (ie. 300 LRs) and that router
> >>> > connected to the public LS. This is creating a huge problem in terms
> >>> > of performance and tons of events due to the MAC_Binding entries
> >>> > generated as a consequence of the GARPs sent for the floating IPs.
> >>> >
> >>>
> >>> Just as an addition to this, GARPs wouldn't be the only reason why all
> >>> routers would learn the MAC_Binding.

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-06-10 Thread Han Zhou
Hi Girish, Venu,

I sent an RFC patch series for the solution discussed. Could you give it a
try when you get the chance?

Thanks,
Han

On Tue, Jun 9, 2020 at 10:04 AM Han Zhou  wrote:

>
>
> On Tue, Jun 9, 2020 at 9:06 AM Venugopal Iyer 
> wrote:
>
>> Sorry for the delay, Han, a quick question below:
>>
>>
>>
>> Hi Girish, yes, that's what we concluded in last OVN meeting, but sorry
>> that I forgot to update here.
>>
>>
>> On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail 
>> wrote:
>> >
>> > Hello all,
>> >
>> > To kind of proceed with the proposed fixes, with minimal impact, is the
>> following a reasonable approach?
>> >
>> > Add an option, namely dynamic_neigh_routes={true|false}, for a gateway
>> router. With this option enabled, the nextHop IP's MAC will be learned
>> through a ARP request on the physical network. The ARP request will be
>> flooded on the L2 broadcast domain (for both join switch and external
>> switch).
>>
>> >
>>
>>
>>
>> The RFC patch fulfils this purpose:
>> https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
>>
>> I am working on the formal patch.
>>
>>
>>
>> > Add an option, namely learn_from_arp_request={true|false}, for a
>> gateway router. The option is interpreted as below:\
>> > "true" - learn the MAC/IP binding and add a new MAC_Binding entry
>> (default behavior)
>> > "false" - if there is a MAC_binding for that IP and the MAC is
>> different, then update that MAC/IP binding. The external entity might be
>> trying to advertise the new MAC for that IP. (If we don't do this, then we
>> will never learn External VIP to MAC changes)
>> >
>> > (Irrespective of, learn_from_arp_request is true or false, always do
>> this -- if the TPA is on the router, add a new entry (it means the remote
>> wants to communicate with this node, so it makes sense to learn the remote
>> as well))
>>
>> >
>>
>>
>>
>> I am working on this as well, but delayed a little. I hope to have
>> something this week.
>>
>> *[vi> ] Just wanted to check if this should be learn_From_unsolicit_arp
>> (unsolicited ARP request or reply) instead of learn_from_arp_request? This
>> is just to protect from potential rogue usage of  GARP reply flooding the
>> MAC bindings.?*
>>
>>
>>
>
> Hi Venu, as discussed earlier in this thread it is hard to check if it is
> GARP in OVN from the router ingress pipeline. The proposal here cares about
> ARP request only. It seems the best option so far.
>
>
>> *Thanks,*
>>
>>
>>
>> *-venu*
>>
>>
>>
>> >
>> > For now, I think it is fine for ARP packets to be broadcasted on the
>> tunnel for the `join` switch case. If it becomes a problem, then we can
>> start looking around changing the logical flows.
>> >
>> > Thanks everyone for the lively discussion.
>> >
>> > Regards,
>> > ~Girish
>> >
>> > On Thu, May 28, 2020 at 7:33 AM Tim Rozet  wrote:
>> >>
>> >>
>> >>
>> >> On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara 
>> wrote:
>> >>>
>> >>> On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
>> >>> > Hi all
>> >>> >
>> >>> > Sorry for top posting. I want to thank you all for the discussion
>> and
>> >>> > give also some feedback from OpenStack perspective which is affected
>> >>> > by the problem described here.
>> >>> >
>> >>> > In OpenStack, it's kind of common to have a shared external network
>> >>> > (logical switch with a localnet port) across many tenants. Each
>> tenant
> >>> > user may create their own router where their instances will be
> >>> > connected to access the external network.

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-06-10 Thread Han Zhou
On Wed, Jun 10, 2020 at 12:03 PM Han Zhou  wrote:

> Hi Girish, Venu,
>
> I sent a RFC patch series for the solution discussed. Could you give it a
> try when you get the chance?
>

Oops, I forgot the link:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=182602

>
> Thanks,
> Han
>
> On Tue, Jun 9, 2020 at 10:04 AM Han Zhou  wrote:
>
>>
>>
>> On Tue, Jun 9, 2020 at 9:06 AM Venugopal Iyer 
>> wrote:
>>
>>> Sorry for the delay, Han, a quick question below:
>>>
>>>
>>>
>>> Hi Girish, yes, that's what we concluded in last OVN meeting, but sorry
>>> that I forgot to update here.
>>>
>>>
>>> On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail 
>>> wrote:
>>> >
>>> > Hello all,
>>> >
>>> > To kind of proceed with the proposed fixes, with minimal impact, is
>>> the following a reasonable approach?
>>> >
>>> > Add an option, namely dynamic_neigh_routes={true|false}, for a gateway
>>> router. With this option enabled, the nextHop IP's MAC will be learned
>>> through a ARP request on the physical network. The ARP request will be
>>> flooded on the L2 broadcast domain (for both join switch and external
>>> switch).
>>>
>>> >
>>>
>>>
>>>
>>> The RFC patch fulfils this purpose:
>>> https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
>>>
>>> I am working on the formal patch.
>>>
>>>
>>>
>>> > Add an option, namely learn_from_arp_request={true|false}, for a
>>> gateway router. The option is interpreted as below:\
>>> > "true" - learn the MAC/IP binding and add a new MAC_Binding entry
>>> (default behavior)
>>> > "false" - if there is a MAC_binding for that IP and the MAC is
>>> different, then update that MAC/IP binding. The external entity might be
>>> trying to advertise the new MAC for that IP. (If we don't do this, then we
>>> will never learn External VIP to MAC changes)
>>> >
>>> > (Irrespective of, learn_from_arp_request is true or false, always do
>>> this -- if the TPA is on the router, add a new entry (it means the remote
>>> wants to communicate with this node, so it makes sense to learn the remote
>>> as well))
>>>
>>> >
>>>
>>>
>>>
>>> I am working on this as well, but delayed a little. I hope to have
>>> something this week.
>>>
>>> *[vi> ] Just wanted to check if this should be learn_From_unsolicit_arp
>>> (unsolicited ARP request or reply) instead of learn_from_arp_request? This
>>> is just to protect from potential rogue usage of  GARP reply flooding the
>>> MAC bindings.?*
>>>
>>>
>>>
>>
>> Hi Venu, as discussed earlier in this thread it is hard to check if it is
>> GARP in OVN from the router ingress pipeline. The proposal here cares about
>> ARP request only. It seems the best option so far.
>>
>>
>>> *Thanks,*
>>>
>>>
>>>
>>> *-venu*
>>>
>>>
>>>
>>> >
>>> > For now, I think it is fine for ARP packets to be broadcasted on the
>>> tunnel for the `join` switch case. If it becomes a problem, then we can
>>> start looking around changing the logical flows.
>>> >
>>> > Thanks everyone for the lively discussion.
>>> >
>>> > Regards,
>>> > ~Girish
>>> >
>>> > On Thu, May 28, 2020 at 7:33 AM Tim Rozet  wrote:
>>> >>
>>> >>
>>> >>
>>> >> On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara 
>>> wrote:
>>> >>>
> >>> >>> On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-06-10 Thread Girish Moodalbail
Thanks Han. We will give this a try on our cluster and get back to you soon.

Regards,
~Girish

On Wed, Jun 10, 2020 at 12:04 PM Han Zhou  wrote:

>
>
> On Wed, Jun 10, 2020 at 12:03 PM Han Zhou  wrote:
>
>> Hi Girish, Venu,
>>
>> I sent a RFC patch series for the solution discussed. Could you give it a
>> try when you get the chance?
>>
>
> Oops, I forgot the link:
> https://patchwork.ozlabs.org/project/openvswitch/list/?series=182602
>
>>
>> Thanks,
>> Han
>>
>> On Tue, Jun 9, 2020 at 10:04 AM Han Zhou  wrote:
>>
>>>
>>>
>>> On Tue, Jun 9, 2020 at 9:06 AM Venugopal Iyer 
>>> wrote:
>>>
>>>> Sorry for the delay, Han, a quick question below:
>>>>
>>>>
>>>>
>>>> Hi Girish, yes, that's what we concluded in last OVN meeting, but sorry
>>>> that I forgot to update here.
>>>>
>>>>
>>>> On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail 
>>>> wrote:
>>>> >
>>>> > Hello all,
>>>> >
>>>> > To kind of proceed with the proposed fixes, with minimal impact, is
>>>> the following a reasonable approach?
>>>> >
>>>> > Add an option, namely dynamic_neigh_routes={true|false}, for a
>>>> gateway router. With this option enabled, the nextHop IP's MAC will be
>>>> learned through a ARP request on the physical network. The ARP request will
>>>> be flooded on the L2 broadcast domain (for both join switch and external
>>>> switch).
>>>>
>>>> >
>>>>
>>>>
>>>>
>>>> The RFC patch fulfils this purpose:
>>>> https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
>>>>
>>>> I am working on the formal patch.
>>>>
>>>>
>>>>
>>>> > Add an option, namely learn_from_arp_request={true|false}, for a
>>>> gateway router. The option is interpreted as below:\
>>>> > "true" - learn the MAC/IP binding and add a new MAC_Binding entry
>>>> (default behavior)
>>>> > "false" - if there is a MAC_binding for that IP and the MAC is
>>>> different, then update that MAC/IP binding. The external entity might be
>>>> trying to advertise the new MAC for that IP. (If we don't do this, then we
>>>> will never learn External VIP to MAC changes)
>>>> >
>>>> > (Irrespective of, learn_from_arp_request is true or false, always do
>>>> this -- if the TPA is on the router, add a new entry (it means the remote
>>>> wants to communicate with this node, so it makes sense to learn the remote
>>>> as well))
>>>>
>>>> >
>>>>
>>>>
>>>>
>>>> I am working on this as well, but delayed a little. I hope to have
>>>> something this week.
>>>>
>>>> *[vi> ] Just wanted to check if this should be learn_From_unsolicit_arp
>>>> (unsolicited ARP request or reply) instead of learn_from_arp_request? This
>>>> is just to protect from potential rogue usage of  GARP reply flooding the
>>>> MAC bindings.?*
>>>>
>>>>
>>>>
>>>
>>> Hi Venu, as discussed earlier in this thread it is hard to check if it
>>> is GARP in OVN from the router ingress pipeline. The proposal here cares
>>> about ARP request only. It seems the best option so far.
>>>
>>>
>>>> *Thanks,*
>>>>
>>>>
>>>>
>>>> *-venu*
>>>>
>>>>
>>>>
>>>> >
> >>>> > For now, I think it is fine for ARP packets to be broadcasted on the
> >>>> tunnel for the `join` switch case.

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-07-13 Thread Girish Moodalbail
Hello Han,

On the #openvswitch IRC channel I had provided an update that your patch is
working great on our test setup. That update was for the L3 Gateway Router
option called *learn_from_arp_request="true|false"*. With that option in
place, the number of entries in the MAC_Binding table has been significantly
reduced.
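For reference, a sketch of how these options would be enabled per gateway router, assuming they land as Logical_Router options under the names used in this thread (the final patches may rename them; "GR-1" is a hypothetical router name):

```shell
# Hypothetical until the patches merge: option names as proposed in this thread.
ovn-nbctl set logical_router GR-1 options:learn_from_arp_request=false
ovn-nbctl set logical_router GR-1 options:dynamic_neigh_routers=true
```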

However, I had not provided an update on the single join switch tests.
Sincere apologies for the delay. We just got that code to work last week,
and we have an update. This is for the option called
*dynamic_neigh_routers="true|false"* on the L3 Gateway Router. It works as
expected.  With that option in place, for all of the L3 Gateway Routers I
see just 3 entries as expected:

  table=12(lr_in_arp_resolve  ), priority=500  , match=(ip4.mcast ||
ip6.mcast), action=(next;)
  table=12(lr_in_arp_resolve  ), priority=0, match=(ip4),
action=(get_arp(outport, reg0); next;)
  table=12(lr_in_arp_resolve  ), priority=0, match=(ip6),
action=(get_nd(outport, xxreg0); next;)

Before, on a 1000 node cluster with 1000 Gateway Routers we would see 1000
entries per Gateway Router and therefore a total of 1M entries in the
cluster. Now, that is not the case.
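The scale numbers above reduce to simple arithmetic; the hypothetical helper below just encodes the per-router counts from this thread (IPv4 entries only, as in the 1000-node report; IPv6 doubles the "before" figure):

```python
def total_arp_resolve_flows(n_gateway_routers: int,
                            dynamic_neigh_routers: bool) -> int:
    """Total lr_in_arp_resolve logical flows across all gateway routers
    sharing one join switch."""
    if dynamic_neigh_routers:
        # 3 fixed flows per router: the mcast, get_arp and get_nd entries.
        per_router = 3
    else:
        # One next-hop resolution entry per peer gateway router.
        per_router = n_gateway_routers
    return per_router * n_gateway_routers

assert total_arp_resolve_flows(1000, dynamic_neigh_routers=False) == 1_000_000
assert total_arp_resolve_flows(1000, dynamic_neigh_routers=True) == 3_000
```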

Thank you!

Regards,
~Girish


On Wed, Jun 10, 2020 at 12:04 PM Han Zhou  wrote:

>
>
> On Wed, Jun 10, 2020 at 12:03 PM Han Zhou  wrote:
>
>> Hi Girish, Venu,
>>
>> I sent a RFC patch series for the solution discussed. Could you give it a
>> try when you get the chance?
>>
>
> Oops, I forgot the link:
> https://patchwork.ozlabs.org/project/openvswitch/list/?series=182602
>
>>
>> Thanks,
>> Han
>>
>> On Tue, Jun 9, 2020 at 10:04 AM Han Zhou  wrote:
>>
>>>
>>>
>>> On Tue, Jun 9, 2020 at 9:06 AM Venugopal Iyer 
>>> wrote:
>>>
>>>> Sorry for the delay, Han, a quick question below:
>>>>
>>>>
>>>>
>>>> Hi Girish, yes, that's what we concluded in last OVN meeting, but sorry
>>>> that I forgot to update here.
>>>>
>>>>
>>>> On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail 
>>>> wrote:
>>>> >
>>>> > Hello all,
>>>> >
>>>> > To kind of proceed with the proposed fixes, with minimal impact, is
>>>> the following a reasonable approach?
>>>> >
>>>> > Add an option, namely dynamic_neigh_routes={true|false}, for a
>>>> gateway router. With this option enabled, the nextHop IP's MAC will be
>>>> learned through a ARP request on the physical network. The ARP request will
>>>> be flooded on the L2 broadcast domain (for both join switch and external
>>>> switch).
>>>>
>>>> >
>>>>
>>>>
>>>>
>>>> The RFC patch fulfils this purpose:
>>>> https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
>>>>
>>>> I am working on the formal patch.
>>>>
>>>>
>>>>
>>>> > Add an option, namely learn_from_arp_request={true|false}, for a
>>>> gateway router. The option is interpreted as below:\
>>>> > "true" - learn the MAC/IP binding and add a new MAC_Binding entry
>>>> (default behavior)
>>>> > "false" - if there is a MAC_binding for that IP and the MAC is
>>>> different, then update that MAC/IP binding. The external entity might be
>>>> trying to advertise the new MAC for that IP. (If we don't do this, then we
>>>> will never learn External VIP to MAC changes)
>>>> >
>>>> > (Irrespective of, learn_from_arp_request is true or false, always do
>>>> this -- if the TPA is on the router, add a new entry (it means the remote
> >>>> wants to communicate with this node, so it makes sense to learn the
> >>>> remote as well))

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-07-14 Thread Tim Rozet
Thanks for the update Girish. Are you planning on submitting an
ovn-k8s patch to enable these?

Tim Rozet
Red Hat CTO Networking Team


On Mon, Jul 13, 2020 at 9:37 PM Girish Moodalbail 
wrote:

> Hello Han,
>
> On the #openvswitch IRC channel I had provided an update on your patch
> working great on our test setup. That update was for the L3 Gateway Router
> option called* learn_from_arp_request="true|false".* With that option in
> place, the number of entries in the MAC binding table has significantly
> reduced.
>
> However, I had not provided an update on the single join switch tests.
> Sincere apologies for the delay. We just got that code to work last week,
> and we have an update. This is for the option called
> *dynamic_neigh_routers="true|false"* on the L3 Gateway Router. It works
> as expected.  With that option in place, for all of the L3 Gateway Routers
> I see just 3 entries as expected:
>
>   table=12(lr_in_arp_resolve  ), priority=500  , match=(ip4.mcast ||
> ip6.mcast), action=(next;)
>   table=12(lr_in_arp_resolve  ), priority=0, match=(ip4),
> action=(get_arp(outport, reg0); next;)
>   table=12(lr_in_arp_resolve  ), priority=0, match=(ip6),
> action=(get_nd(outport, xxreg0); next;)
>
> Before, on a 1000 node cluster with 1000 Gateway Routers we would see 1000
> entries per Gateway Router and therefore a total of 1M entries in the
> cluster. Now, that is not the case.
>
> Thank you!
>
> Regards,
> ~Girish
>
>
> On Wed, Jun 10, 2020 at 12:04 PM Han Zhou  wrote:
>
>>
>>
>> On Wed, Jun 10, 2020 at 12:03 PM Han Zhou  wrote:
>>
>>> Hi Girish, Venu,
>>>
>>> I sent a RFC patch series for the solution discussed. Could you give it
>>> a try when you get the chance?
>>>
>>
>> Oops, I forgot the link:
>> https://patchwork.ozlabs.org/project/openvswitch/list/?series=182602
>>
>>>
>>> Thanks,
>>> Han
>>>
>>> On Tue, Jun 9, 2020 at 10:04 AM Han Zhou  wrote:
>>>
>>>>
>>>>
>>>> On Tue, Jun 9, 2020 at 9:06 AM Venugopal Iyer 
>>>> wrote:
>>>>
>>>>> Sorry for the delay, Han, a quick question below:
>>>>>
>>>>>
>>>>>
>>>>> Hi Girish, yes, that's what we concluded in last OVN meeting, but
>>>>> sorry that I forgot to update here.
>>>>>
>>>>>
>>>>> On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail <
>>>>> gmoodalb...@gmail.com> wrote:
>>>>> >
>>>>> > Hello all,
>>>>> >
>>>>> > To kind of proceed with the proposed fixes, with minimal impact, is
>>>>> the following a reasonable approach?
>>>>> >
>>>>> > Add an option, namely dynamic_neigh_routes={true|false}, for a
>>>>> gateway router. With this option enabled, the nextHop IP's MAC will be
>>>>> learned through a ARP request on the physical network. The ARP request 
>>>>> will
>>>>> be flooded on the L2 broadcast domain (for both join switch and external
>>>>> switch).
>>>>>
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> The RFC patch fulfils this purpose:
>>>>> https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hz...@ovn.org/
>>>>>
>>>>> I am working on the formal patch.
>>>>>
>>>>>
>>>>>
>>>>> > Add an option, namely learn_from_arp_request={true|false}, for a
>>>>> gateway router. The option is interpreted as below:\
> >>>>> > "true" - learn the MAC/IP binding and add a new MAC_Binding entry
> >>>>> (default behavior)

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-07-14 Thread Girish Moodalbail
Yes, we are going to submit a patch to the ovn-k8s repo to enable those
options on L3 Gateway Routers. I am going to wait until these changes land in
the OVN repo before submitting, since the options may still be renamed.

Regards,
~Girish

On Tue, Jul 14, 2020 at 7:33 AM Tim Rozet  wrote:

> Thanks for the update Girish. Are you planning on submitting an
> ovn-k8s patch to enable these?
>
> Tim Rozet
> Red Hat CTO Networking Team
>
>
> On Mon, Jul 13, 2020 at 9:37 PM Girish Moodalbail 
> wrote:
>
>> Hello Han,
>>
>> On the #openvswitch IRC channel I had provided an update on your patch
>> working great on our test setup. That update was for the L3 Gateway Router
>> option called *learn_from_arp_request="true|false"*. With that option in
>> place, the number of entries in the MAC_Binding table has dropped
>> significantly.
>>
>> However, I had not provided an update on the single join switch tests.
>> Sincere apologies for the delay. We got that code working last week, and
>> we have an update. This is for the option called
>> *dynamic_neigh_routers="true|false"* on the L3 Gateway Router. It works
>> as expected. With that option in place, each of the L3 Gateway Routers
>> now carries just these 3 entries:
>>
>>   table=12(lr_in_arp_resolve  ), priority=500  , match=(ip4.mcast ||
>> ip6.mcast), action=(next;)
>>   table=12(lr_in_arp_resolve  ), priority=0, match=(ip4),
>> action=(get_arp(outport, reg0); next;)
>>   table=12(lr_in_arp_resolve  ), priority=0, match=(ip6),
>> action=(get_nd(outport, xxreg0); next;)
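For anyone reproducing this, here is a sketch of how the two options discussed in this thread might be enabled on a gateway router. The option names are taken from the RFC patches in this thread and may have been renamed before merging; `GR-1` is a placeholder router name:

```shell
# Resolve next hops via dynamic ARP/ND (MAC_Binding lookups) instead of
# pre-populating per-neighbor lr_in_arp_resolve flows.
# Option name from the RFC patch; may differ in the merged version.
ovn-nbctl set logical_router GR-1 options:dynamic_neigh_routers=true

# Only learn MAC bindings from ARP requests that target this router's own
# IPs, rather than from every ARP request flooded on the broadcast domain.
ovn-nbctl set logical_router GR-1 options:learn_from_arp_request=false
```

With `dynamic_neigh_routers` set, next-hop MACs are resolved on demand via `get_arp()`/`get_nd()` against the MAC_Binding table, which is what leaves only the three generic flows shown above.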
>>
>> Before, on a 1000-node cluster with 1000 Gateway Routers, we would see
>> 1000 such entries per Gateway Router, for a total of 1M entries across
>> the cluster. Now that is no longer the case.
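The reduction described above can be checked with back-of-the-envelope arithmetic (a hypothetical calculation using the numbers from this thread: 1000 gateway routers, one IPv4 `lr_in_arp_resolve` entry per peer router before the change, and the 3 fixed flows shown above after it):

```python
routers = 1000

# Before: each gateway router carried one lr_in_arp_resolve entry per
# peer router on the join switch, so the total grows quadratically.
flows_before = routers * routers

# After: each router needs only the 3 generic flows shown above
# (multicast skip, get_arp for IPv4, get_nd for IPv6).
flows_after = routers * 3

print(flows_before, flows_after)  # 1000000 3000
```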
>>
>> Thank you!
>>
>> Regards,
>> ~Girish
>>
>>
>> On Wed, Jun 10, 2020 at 12:04 PM Han Zhou  wrote:
>>
>>>
>>>
>>> On Wed, Jun 10, 2020 at 12:03 PM Han Zhou  wrote:
>>>
>>>> Hi Girish, Venu,
>>>>
>>>> I sent a RFC patch series for the solution discussed. Could you give it
>>>> a try when you get the chance?
>>>>
>>>
>>> Oops, I forgot the link:
>>> https://patchwork.ozlabs.org/project/openvswitch/list/?series=182602
>>>
>>>>
>>>> Thanks,
>>>> Han
>>>>
>>>> On Tue, Jun 9, 2020 at 10:04 AM Han Zhou  wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 9, 2020 at 9:06 AM Venugopal Iyer 
>>>>> wrote:
>>>>>
>>>>>> Sorry for the delay, Han, a quick question below:
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* ovn-kuberne...@googlegroups.com <
>>>>>> ovn-kuberne...@googlegroups.com> *On Behalf Of *Han Zhou
>>>>>> *Sent:* Wednesday, June 3, 2020 4:27 PM
>>>>>> *To:* Girish Moodalbail 
>>>>>> *Cc:* Tim Rozet ; Dumitru Ceara ;
>>>>>> Daniel Alvarez Sanchez ; Dan Winship <
>>>>>> danwins...@redhat.com>; ovn-kuberne...@googlegroups.com; ovs-discuss
>>>>>> ; Michael Cambria ;
>>>>>> Venugopal Iyer 
>>>>>> *Subject:* Re: [ovs-discuss] [OVN] flow explosion in
>>>>>> lr_in_arp_resolve table
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Girish, yes, that's what we concluded in last OVN meeting, but
>>>>>> sorry that I forgot to update here.
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail <
>>>>>> gmoodalb...@gmail.com> wrote:
>>>>>> >
>>>>>> > Hello all,
>>>>>> >
>>>>>> > To kind of proceed with the proposed fixes, with minimal impact, is
>>>>>> the following a reasonable approach?
>>>>>> >
>>>>>> > Add an option, namely dynamic_neigh_routes={true|false}, for a
>>>>>> gateway router. With this option enabled, the nextHop IP's MAC will be
>>>>>> learned through a ARP request on the physical network. The ARP request 
>>>>>> will
>>>>>> be fl

Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

2020-07-14 Thread Han Zhou
Thanks Girish for the update! I will submit formal patches for ovn.

Regards,
Han

On Tue, Jul 14, 2020 at 8:32 AM Girish Moodalbail 
wrote:

> Yes, we are going to submit a patch to the ovn-k8s repo to enable those
> options on L3 Gateway Routers. I am going to wait until these changes land
> in the OVN repo before submitting, since I don't know whether the options
> will be renamed.
>
> Regards,
> ~Girish
>