Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-29 Thread Felix Huettner via discuss
Hi Robin and everyone,

now onto mail number two :)
I will try to just add some pros and cons for the ideas, plus some
comments below them.

On Thu, Sep 28, 2023 at 06:28:46PM +0200, Robin Jarry via discuss wrote:
> Hello OVN community,
>
> This is a follow up on the message I have sent today [1]. That second
> part focuses on some ideas I have to remove the limitations that were
> mentioned in the previous email.
>
> [1] 
> https://mail.openvswitch.org/pipermail/ovs-discuss/2023-September/052695.html
>
> If you didn't read it, my goal is to start a discussion about how we
> could improve OVN on the following topics:
>
> - Reduce the memory and CPU footprint of ovn-controller, ovn-northd.
> - Support scaling of L2 connectivity across larger clusters.
> - Simplify CMS interoperability.
> - Allow support for alternative datapath implementations.
>
> Disclaimer:
>
> This message does not mention anything about L3/L4 features of OVN.
> I didn't have time to work on these, yet. I hope we can discuss how
> these fit with my ideas.

I tried to add some L3 implications below as well, but I am by no means
an expert in these and just added my current understanding there.

>
> Distributed mac learning
> 
> ========================
> Use one OVS bridge per logical switch with mac learning enabled. Only
> create the bridge if the logical switch has a port bound to the local
> chassis.
>
> Pros:
>
> - Minimal openflow rules required in each bridge (ACLs and NAT mostly).
> - No central mac binding table required.
> - Mac table aging comes for free.
> - Zero access to southbound DB for learned addresses nor for aging.
- Uses the native switching feature of OVS, instead of reimplementing it
  in OVN

>
> Cons:
>
> - How to manage seamless upgrades?
> - Requires ovn-controller to move/plug ports in the correct bridge.
> - Multiple openflow connections (one per managed bridge).
> - Requires ovn-trace to be reimplemented differently (maybe other tools
>   as well).
- No central information anymore on mac bindings. All nodes need to
  update their data individually.
- Each bridge also generates a Linux network interface. I do not know if
  there is some kind of limit on the number of Linux interfaces or OVS
  bridges somewhere.
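
For illustration, this is roughly what ovn-controller would have to do
per logical switch under this proposal (bridge/port names are
hypothetical; plain OVS bridges already flood and learn with the default
NORMAL action):

```
# Minimal sketch, assuming a logical switch "ls1" with one local VIF "tap1".
ovs-vsctl add-br ls1
ovs-vsctl add-port ls1 tap1
# Mac aging is tunable per bridge, and learned entries stay local to the node:
ovs-vsctl set bridge ls1 other-config:mac-aging-time=300
ovs-appctl fdb/show ls1
```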

Would you still preprovision static mac addresses on the bridge for all
port_bindings we know the mac address from, or would you rather leave
that up for learning as well?

I do not know if there is some kind of performance/optimization penalty
for moving packets between different bridges.

You also cannot use only the logical switches that have a local port
bound. Assume the following topology:
+---+ +---+ +---+ +---+ +---+ +---+ +---+
|vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
+---+ +---+ +---+ +---+ +---+ +---+ +---+
vm1 and vm2 are both running on the same hypervisor. Creating only local
logical switches would mean only ls1 and ls3 are available on that
hypervisor. This would break the connection between the two vms, which
in the current implementation would just traverse the two logical
routers.
I guess we would need to create bridges for each locally reachable
logical switch. I am concerned about the potentially significant
increase in bridges and openflow connections this brings.

>
> Use multicast for overlay networks
> ==================================
>
> Use a unique 24bit VNI per overlay network. Derive a multicast group
> address from that VNI. Use VXLAN address learning [2] to remove the need
> for ovn-controller to know the destination chassis for every mac address
> in advance.
>
> [2] https://datatracker.ietf.org/doc/html/rfc7348#section-4.2
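
For concreteness, a purely illustrative derivation could map the 24 VNI
bits into the administratively scoped 239.0.0.0/8 range (RFC 2365):

```
# Hypothetical VNI -> multicast group mapping. VNI 4097 = 0x001001
# maps to 239.0.16.1.
vni=4097
group="239.$(( (vni >> 16) & 0xff )).$(( (vni >> 8) & 0xff )).$(( vni & 0xff ))"
echo "$group"
```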
>
> Pros:
>
> - Nodes do not need to know about others in advance. The control plane
>   load is distributed across the cluster.
> - 24bit VNI allows for more than 16 million logical switches. No need
>   for extended GENEVE tunnel options.
Note that using vxlan at the moment significantly reduces the ovn
featureset. This is because the geneve header options are currently used
for data that would not fit into the vxlan vni.

From ovn-architecture.7.xml:
```
The maximum number of networks is reduced to 4096.
The maximum number of ports per network is reduced to 2048.
ACLs matching against logical ingress port identifiers are not supported.
OVN interconnection feature is not supported.
```
> - Limited and scoped "flooding" with IGMP/MLD snooping enabled in
>   top-of-rack switches. Multicast is only used for BUM traffic.
> - Only one VXLAN output port per implemented logical switch on a given
>   chassis.
Would this actually work with one VXLAN output port? Would you not need
one port per target node to send unicast traffic (as you otherwise flood
all packets to all participating nodes)?

>
> Cons:
>
> - OVS does not support VXLAN address learning yet.
> - The number of usable multicast groups in a fabric network may be
>   limited?
> - How to manage seamless upgrades and interoperability with older OVN
>   versions?
- This pushes all logic related to cha

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-29 Thread Robin Jarry via discuss
Felix Huettner, Sep 29, 2023 at 15:23:
> > Distributed mac learning
> > 
[snip]
> >
> > Cons:
> >
> > - How to manage seamless upgrades?
> > - Requires ovn-controller to move/plug ports in the correct bridge.
> > - Multiple openflow connections (one per managed bridge).
> > - Requires ovn-trace to be reimplemented differently (maybe other tools
> >   as well).
>
> - No central information anymore on mac bindings. All nodes need to
>   update their data individually
> - Each bridge generates also a linux network interface. I do not know if
>   there is some kind of limit to the linux interfaces or the ovs bridges
>   somewhere.

That's a good point. However, only the bridges related to the logical
networks implemented on a given chassis would need to be created there.
Even with the largest OVN deployments, I doubt this would be
a limitation.

> Would you still preprovision static mac addresses on the bridge for all
> port_bindings we know the mac address from, or would you rather leave
> that up for learning as well?

I would leave everything dynamic.

> I do not know if there is some kind of performance/optimization penality
> for moving packets between different bridges.

As far as I know, once the openflow pipeline has been resolved into
a datapath flow, there is no penalty.
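
This can be verified on a live system: even when the pipeline spans
several bridges connected with patch ports, only one datapath flow is
installed (a quick check, assuming traffic is already flowing):

```
# Dump the datapath flows after sending traffic between the bridges.
# Patch ports do not appear here: they are resolved away at flow
# translation time, so crossing bridges adds no per-packet cost.
ovs-appctl dpctl/dump-flows
```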

> You can also not only use the logical switch that have a local port
> bound. Assume the following topology:
> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> vm1 and vm2 are both running on the same hypervisor. Creating only local
> logical switches would mean only ls1 and ls3 are available on that
> hypervisor. This would break the connection between the two vms which
> would in the current implementation just traverse the two logical
> routers.
> I guess we would need to create bridges for each locally reachable
> logical switch. I am concerned about the potentially significant
> increase in bridges and openflow connections this brings.

That is one of the concerns I raised in the last point. In my opinion
this is a trade off. You remove centralization and require more local
processing. But overall, the processing cost should remain equivalent.

> > Use multicast for overlay networks
> > ==
[snip]
> > - 24bit VNI allows for more than 16 million logical switches. No need
> >   for extended GENEVE tunnel options.
> Note that using vxlan at the moment significantly reduces the ovn
> featureset. This is because the geneve header options are currently used
> for data that would not fit into the vxlan vni.
>
> From ovn-architecture.7.xml:
> ```
> The maximum number of networks is reduced to 4096.
> The maximum number of ports per network is reduced to 2048.
> ACLs matching against logical ingress port identifiers are not supported.
> OVN interconnection feature is not supported.
> ```

In my understanding, the main reason why GENEVE replaced VXLAN is that
Openstack uses full mesh point to point tunnels and that the sender
needs to know behind which chassis any mac address is located in order
to send it into the correct tunnel. GENEVE allowed reducing the lookup
time both on the sender and receiver thanks to ingress/egress port
metadata.

https://blog.russellbryant.net/2017/05/30/ovn-geneve-vs-vxlan-does-it-matter/
https://dani.foroselectronica.es/ovn-geneve-encapsulation-541/

If VXLAN + multicast and address learning were used, the "correct"
tunnel would be established ad-hoc and both sender and receiver lookups
would only be simple mac forwarding with learning. The ingress pipeline
would probably cost a little more.

Maybe multicast + address learning could be implemented for GENEVE as
well. But it would not be interoperable with other VTEPs.

> > - Limited and scoped "flooding" with IGMP/MLD snooping enabled in
> >   top-of-rack switches. Multicast is only used for BUM traffic.
> > - Only one VXLAN output port per implemented logical switch on a given
> >   chassis.
>
> Would this actually work with one VXLAN output port? Would you not need
> one port per target node to send unicast traffic (as you otherwise flood
> all packets to all participating nodes)?

You would need one VXLAN output port per implemented logical switch on
a given chassis. The port would have a VNI (unique per logical switch)
and an associated multicast IP address. Any chassis that implements this
logical switch would subscribe to that multicast group. The flooding
would be limited to first packets and broadcast/multicast traffic (ARP
requests, mostly). Once the receiver node replies, all communication
will happen with unicast.

https://networkdirection.net/articles/routingandswitching/vxlanoverview/vxlanaddresslearning/#BUM_Traffic
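
The Linux kernel vxlan driver already implements this exact model and
can serve as a reference; a minimal sketch (device names and addresses
are made up, reusing the group derived from VNI 4097 earlier in the
thread):

```
# One vxlan device per logical switch. BUM traffic goes to the multicast
# group; unicast destinations are learned from the outer source IP of
# incoming packets (RFC 7348 section 4.2).
ip link add vxlan4097 type vxlan id 4097 group 239.0.16.1 \
    dstport 4789 dev eth0
ip link set vxlan4097 up
bridge fdb show dev vxlan4097   # inspect learned MAC -> VTEP entries
```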

> > Cons:
> >
> > - OVS does not support VXLAN address learning yet.
> > - The number of usable multicast groups in a fabric network may be
> >   limited?
> > - How to manage seamless upgrades and interoperability with older OVN
> >   versions?

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Vladislav Odintsov via discuss
Hi Robin,

Please, see inline.

regards,
Vladislav Odintsov

> On 29 Sep 2023, at 18:14, Robin Jarry via discuss 
>  wrote:
> 
> Felix Huettner, Sep 29, 2023 at 15:23:
>>> Distributed mac learning
>>> 
> [snip]
>>> 
>>> Cons:
>>> 
>>> - How to manage seamless upgrades?
>>> - Requires ovn-controller to move/plug ports in the correct bridge.
>>> - Multiple openflow connections (one per managed bridge).
>>> - Requires ovn-trace to be reimplemented differently (maybe other tools
>>>  as well).
>> 
>> - No central information anymore on mac bindings. All nodes need to
>>  update their data individually
>> - Each bridge generates also a linux network interface. I do not know if
>>  there is some kind of limit to the linux interfaces or the ovs bridges
>>  somewhere.
> 
> That's a good point. However, only the bridges related to one
> implemented logical network would need to be created on a single
> chassis. Even with the largest OVN deployments, I doubt this would be
> a limitation.
> 
>> Would you still preprovision static mac addresses on the bridge for all
>> port_bindings we know the mac address from, or would you rather leave
>> that up for learning as well?
> 
> I would leave everything dynamic.
> 
>> I do not know if there is some kind of performance/optimization penality
>> for moving packets between different bridges.
> 
> As far as I know, once the openflow pipeline has been resolved into
> a datapath flow, there is no penalty.
> 
>> You can also not only use the logical switch that have a local port
>> bound. Assume the following topology:
>> +---+ +---+ +---+ +---+ +---+ +---+ +---+
>> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
>> +---+ +---+ +---+ +---+ +---+ +---+ +---+
>> vm1 and vm2 are both running on the same hypervisor. Creating only local
>> logical switches would mean only ls1 and ls3 are available on that
>> hypervisor. This would break the connection between the two vms which
>> would in the current implementation just traverse the two logical
>> routers.
>> I guess we would need to create bridges for each locally reachable
>> logical switch. I am concerned about the potentially significant
>> increase in bridges and openflow connections this brings.
> 
> That is one of the concerns I raised in the last point. In my opinion
> this is a trade off. You remove centralization and require more local
> processing. But overall, the processing cost should remain equivalent.

Just want to clarify.
For the topology described by Felix above, you propose to create 2 OVS
bridges, right? How will the packet traverse from vm1 to vm2?

Currently when the packet enters OVS, all the logical switching and routing
openflow calculation is done with no packet re-entering OVS, and this results
in one DP flow match to deliver this packet from vm1 to vm2 (if no conntrack
is used, which could introduce recirculations).
Do I understand correctly that in this proposal OVS needs to receive the
packet from the “ls1” bridge, next run through the lrouter “lr1” OpenFlow
pipelines, then output the packet to the “ls2” OVS bridge for mac learning
between logical routers (should we have an OF flow with a learn action here?),
then send the packet again to OVS, calculate the “lr2” OpenFlow pipeline and
finally reach the destination OVS bridge “ls3” to send the packet to vm2?

Also, will such behavior be compatible with HW offload to
smartnics/DPUs?

> 
>>> Use multicast for overlay networks
>>> ==
> [snip]
>>> - 24bit VNI allows for more than 16 million logical switches. No need
>>>  for extended GENEVE tunnel options.
>> Note that using vxlan at the moment significantly reduces the ovn
>> featureset. This is because the geneve header options are currently used
>> for data that would not fit into the vxlan vni.
>> 
>> From ovn-architecture.7.xml:
>> ```
>> The maximum number of networks is reduced to 4096.
>> The maximum number of ports per network is reduced to 2048.
>> ACLs matching against logical ingress port identifiers are not supported.
>> OVN interconnection feature is not supported.
>> ```
> 
> In my understanding, the main reason why GENEVE replaced VXLAN is
> because Openstack uses full mesh point to point tunnels and that the
> sender needs to know behind which chassis any mac address is to send it
> into the correct tunnel. GENEVE allowed to reduce the lookup time both
> on the sender and receiver thanks to ingress/egress port metadata.
> 
> https://blog.russellbryant.net/2017/05/30/ovn-geneve-vs-vxlan-does-it-matter/
> https://dani.foroselectronica.es/ovn-geneve-encapsulation-541/
> 
> If VXLAN + multicast and address learning was used, the "correct" tunnel
> would be established ad-hoc and both sender and receiver lookups would
> only be a simple mac forwarding with learning. The ingress pipeline
> would probably cost a little more.
> 
> Maybe multicast + address learning could be implemented for GENEVE as
> well. But it would not be interoperable with other VTEPs.
> 
>>> - Limited 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Frode Nordahl via discuss
Thanks a lot for starting this discussion.

On Sat, Sep 30, 2023 at 9:43 AM Vladislav Odintsov via discuss
 wrote:
>
> Hi Robin,
>
> Please, see inline.
>
> regards,
> Vladislav Odintsov
>
> > On 29 Sep 2023, at 18:14, Robin Jarry via discuss 
> >  wrote:
> >
> > Felix Huettner, Sep 29, 2023 at 15:23:
> >>> Distributed mac learning
> >>> 
> > [snip]
> >>>
> >>> Cons:
> >>>
> >>> - How to manage seamless upgrades?
> >>> - Requires ovn-controller to move/plug ports in the correct bridge.
> >>> - Multiple openflow connections (one per managed bridge).
> >>> - Requires ovn-trace to be reimplemented differently (maybe other tools
> >>>  as well).
> >>
> >> - No central information anymore on mac bindings. All nodes need to
> >>  update their data individually
> >> - Each bridge generates also a linux network interface. I do not know if
> >>  there is some kind of limit to the linux interfaces or the ovs bridges
> >>  somewhere.
> >
> > That's a good point. However, only the bridges related to one
> > implemented logical network would need to be created on a single
> > chassis. Even with the largest OVN deployments, I doubt this would be
> > a limitation.
> >
> >> Would you still preprovision static mac addresses on the bridge for all
> >> port_bindings we know the mac address from, or would you rather leave
> >> that up for learning as well?
> >
> > I would leave everything dynamic.
> >
> >> I do not know if there is some kind of performance/optimization penality
> >> for moving packets between different bridges.
> >
> > As far as I know, once the openflow pipeline has been resolved into
> > a datapath flow, there is no penalty.
> >
> >> You can also not only use the logical switch that have a local port
> >> bound. Assume the following topology:
> >> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> >> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> >> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> >> vm1 and vm2 are both running on the same hypervisor. Creating only local
> >> logical switches would mean only ls1 and ls3 are available on that
> >> hypervisor. This would break the connection between the two vms which
> >> would in the current implementation just traverse the two logical
> >> routers.
> >> I guess we would need to create bridges for each locally reachable
> >> logical switch. I am concerned about the potentially significant
> >> increase in bridges and openflow connections this brings.
> >
> > That is one of the concerns I raised in the last point. In my opinion
> > this is a trade off. You remove centralization and require more local
> > processing. But overall, the processing cost should remain equivalent.
>
> Just want to clarify.
> For topology described by Felix above, you propose to create 2 OVS bridges, 
> right? How will the packet traverse from vm1 to vm2?
>
> Currently when the packet enters OVS all the logical switching and routing 
> openflow calculation is done with no packet re-entering OVS, and this results 
> in one DP flow match to deliver this packet from vm1 to vm2 (if no conntrack 
> used, which could introduce recirculations).
> Do I understand correctly, that in this proposal OVS needs to receive packet 
> from “ls1” bridge, next run through lrouter “lr1” OpenFlow pipelines, then 
> output packet to “ls2” OVS bridge for mac learning between logical routers 
> (should we have here OF flow with learn action?), then send packet again to 
> OVS, calculate “lr2” OpenFlow pipeline and finally reach destination OVS 
> bridge “ls3” to send packet to a vm2?
>
> Also, will such behavior be compatible with HW-offload-capable to 
> smartnics/DPUs?

I am also a bit concerned about this: what would be the typical number
of bridges supported by hardware?

> >
> >>> Use multicast for overlay networks
> >>> ==
> > [snip]
> >>> - 24bit VNI allows for more than 16 million logical switches. No need
> >>>  for extended GENEVE tunnel options.
> >> Note that using vxlan at the moment significantly reduces the ovn
> >> featureset. This is because the geneve header options are currently used
> >> for data that would not fit into the vxlan vni.
> >>
> >> From ovn-architecture.7.xml:
> >> ```
> >> The maximum number of networks is reduced to 4096.
> >> The maximum number of ports per network is reduced to 2048.
> >> ACLs matching against logical ingress port identifiers are not supported.
> >> OVN interconnection feature is not supported.
> >> ```
> >
> > In my understanding, the main reason why GENEVE replaced VXLAN is
> > because Openstack uses full mesh point to point tunnels and that the
> > sender needs to know behind which chassis any mac address is to send it
> > into the correct tunnel. GENEVE allowed to reduce the lookup time both
> > on the sender and receiver thanks to ingress/egress port metadata.
> >
> > https://blog.russellbryant.net/2017/05/30/ovn-geneve-vs-vxlan-does-it-matter/
> > https://dani.foroselectronica.es/ovn-geneve-encapsulation-541/

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Robin Jarry via discuss
Hi Vladislav, Frode,

Thanks for your replies.

Frode Nordahl, Sep 30, 2023 at 10:55:
> On Sat, Sep 30, 2023 at 9:43 AM Vladislav Odintsov via discuss
>  wrote:
> > > On 29 Sep 2023, at 18:14, Robin Jarry via discuss 
> > >  wrote:
> > >
> > > Felix Huettner, Sep 29, 2023 at 15:23:
> > >>> Distributed mac learning
> > >>> 
> > > [snip]
> > >>>
> > >>> Cons:
> > >>>
> > >>> - How to manage seamless upgrades?
> > >>> - Requires ovn-controller to move/plug ports in the correct bridge.
> > >>> - Multiple openflow connections (one per managed bridge).
> > >>> - Requires ovn-trace to be reimplemented differently (maybe other tools
> > >>>  as well).
> > >>
> > >> - No central information anymore on mac bindings. All nodes need to
> > >>  update their data individually
> > >> - Each bridge generates also a linux network interface. I do not know if
> > >>  there is some kind of limit to the linux interfaces or the ovs bridges
> > >>  somewhere.
> > >
> > > That's a good point. However, only the bridges related to one
> > > implemented logical network would need to be created on a single
> > > chassis. Even with the largest OVN deployments, I doubt this would be
> > > a limitation.
> > >
> > >> Would you still preprovision static mac addresses on the bridge for all
> > >> port_bindings we know the mac address from, or would you rather leave
> > >> that up for learning as well?
> > >
> > > I would leave everything dynamic.
> > >
> > >> I do not know if there is some kind of performance/optimization penality
> > >> for moving packets between different bridges.
> > >
> > > As far as I know, once the openflow pipeline has been resolved into
> > > a datapath flow, there is no penalty.
> > >
> > >> You can also not only use the logical switch that have a local port
> > >> bound. Assume the following topology:
> > >> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> > >> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> > >> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> > >> vm1 and vm2 are both running on the same hypervisor. Creating only local
> > >> logical switches would mean only ls1 and ls3 are available on that
> > >> hypervisor. This would break the connection between the two vms which
> > >> would in the current implementation just traverse the two logical
> > >> routers.
> > >> I guess we would need to create bridges for each locally reachable
> > >> logical switch. I am concerned about the potentially significant
> > >> increase in bridges and openflow connections this brings.
> > >
> > > That is one of the concerns I raised in the last point. In my opinion
> > > this is a trade off. You remove centralization and require more local
> > > processing. But overall, the processing cost should remain equivalent.
> >
> > Just want to clarify.
> >
> > For topology described by Felix above, you propose to create 2 OVS
> > bridges, right? How will the packet traverse from vm1 to vm2?

In this particular case, there would be 3 OVS bridges, one for each
logical switch.

> > Currently when the packet enters OVS all the logical switching and
> > routing openflow calculation is done with no packet re-entering OVS,
> > and this results in one DP flow match to deliver this packet from
> > vm1 to vm2 (if no conntrack used, which could introduce
> > recirculations).
> >
> > Do I understand correctly, that in this proposal OVS needs to
> > receive packet from “ls1” bridge, next run through lrouter “lr1”
> > OpenFlow pipelines, then output packet to “ls2” OVS bridge for mac
> > learning between logical routers (should we have here OF flow with
> > learn action?), then send packet again to OVS, calculate “lr2”
> > OpenFlow pipeline and finally reach destination OVS bridge “ls3” to
> > send packet to a vm2?

What I am proposing is to implement the northbound L2 network intent
with actual OVS bridges and builtin OVS mac learning. The L3/L4 network
constructs and ACLs would require patch ports and specific OF pipelines.

We could even think of adding more advanced L3 capabilities (RIB) into
OVS to simplify the OF pipelines.
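
To make this concrete, the L2/L3 glue could look roughly as follows
(a sketch only; names are hypothetical and the router pipeline is
assumed to live in br-int):

```
# Connect the "ls1" logical switch bridge to the router pipeline with an
# OVS patch port pair. L2 switching stays native to ls1; only the router
# hop needs openflow rules.
ovs-vsctl add-port ls1 ls1-to-lr1 \
    -- set interface ls1-to-lr1 type=patch options:peer=lr1-to-ls1
ovs-vsctl add-port br-int lr1-to-ls1 \
    -- set interface lr1-to-ls1 type=patch options:peer=ls1-to-lr1
```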

> > Also, will such behavior be compatible with HW-offload-capable to
> > smartnics/DPUs?
>
> I am also a bit concerned about this, what would be the typical number
> of bridges supported by hardware?

As far as I understand, only the datapath flows are offloaded to
hardware. The OF pipeline is only parsed when there is an upcall for the
first packet. Once resolved, the datapath flow is reused. OVS bridges
are only logical constructs, they are neither reflected in the datapath
nor in hardware.
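
This is easy to observe: no matter how many bridges are configured, they
all map onto a single kernel datapath:

```
# Bridges are an openflow-level construct only; the kernel sees one
# datapath for all of them.
ovs-appctl dpctl/dump-dps   # typically prints just: system@ovs-system
ovs-vsctl list-br           # ...however many bridges exist
```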

> > >>> Use multicast for overlay networks
> > >>> ==
> > > [snip]
> > >>> - 24bit VNI allows for more than 16 million logical switches. No need
> > >>>  for extended GENEVE tunnel options.
> > >> Note that using vxlan at the moment significantly reduces the ovn
> > >> featureset. This is because the geneve header options are currently used
> > >> for data that would not fit into the vxlan vni.

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Daniel Alvarez via discuss
Hi all,


> On 30 Sep 2023, at 15:51, Robin Jarry via discuss 
>  wrote:
> 
> Hi Vladislav, Frode,
> 
> Thanks for your replies.
> 
> Frode Nordahl, Sep 30, 2023 at 10:55:
>> On Sat, Sep 30, 2023 at 9:43 AM Vladislav Odintsov via discuss
>>  wrote:
 On 29 Sep 2023, at 18:14, Robin Jarry via discuss 
  wrote:
 
 Felix Huettner, Sep 29, 2023 at 15:23:
>> Distributed mac learning
>> 
 [snip]
>> 
>> Cons:
>> 
>> - How to manage seamless upgrades?
>> - Requires ovn-controller to move/plug ports in the correct bridge.
>> - Multiple openflow connections (one per managed bridge).
>> - Requires ovn-trace to be reimplemented differently (maybe other tools
>> as well).
> 
> - No central information anymore on mac bindings. All nodes need to
> update their data individually
> - Each bridge generates also a linux network interface. I do not know if
> there is some kind of limit to the linux interfaces or the ovs bridges
> somewhere.
 
 That's a good point. However, only the bridges related to one
 implemented logical network would need to be created on a single
 chassis. Even with the largest OVN deployments, I doubt this would be
 a limitation.
 
> Would you still preprovision static mac addresses on the bridge for all
> port_bindings we know the mac address from, or would you rather leave
> that up for learning as well?
 
 I would leave everything dynamic.
 
> I do not know if there is some kind of performance/optimization penality
> for moving packets between different bridges.
 
 As far as I know, once the openflow pipeline has been resolved into
 a datapath flow, there is no penalty.
 
> You can also not only use the logical switch that have a local port
> bound. Assume the following topology:
> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> vm1 and vm2 are both running on the same hypervisor. Creating only local
> logical switches would mean only ls1 and ls3 are available on that
> hypervisor. This would break the connection between the two vms which
> would in the current implementation just traverse the two logical
> routers.
> I guess we would need to create bridges for each locally reachable
> logical switch. I am concerned about the potentially significant
> increase in bridges and openflow connections this brings.
 
 That is one of the concerns I raised in the last point. In my opinion
 this is a trade off. You remove centralization and require more local
 processing. But overall, the processing cost should remain equivalent.
>>> 
>>> Just want to clarify.
>>> 
>>> For topology described by Felix above, you propose to create 2 OVS
>>> bridges, right? How will the packet traverse from vm1 to vm2?
> 
> In this particular case, there would be 3 OVS bridges, one for each
> logical switch.
> 
>>> Currently when the packet enters OVS all the logical switching and
>>> routing openflow calculation is done with no packet re-entering OVS,
>>> and this results in one DP flow match to deliver this packet from
>>> vm1 to vm2 (if no conntrack used, which could introduce
>>> recirculations).
>>> 
>>> Do I understand correctly, that in this proposal OVS needs to
>>> receive packet from “ls1” bridge, next run through lrouter “lr1”
>>> OpenFlow pipelines, then output packet to “ls2” OVS bridge for mac
>>> learning between logical routers (should we have here OF flow with
>>> learn action?), then send packet again to OVS, calculate “lr2”
>>> OpenFlow pipeline and finally reach destination OVS bridge “ls3” to
>>> send packet to a vm2?
> 
> What I am proposing is to implement the northbound L2 network intent
> with actual OVS bridges and builtin OVS mac learning. The L3/L4 network
> constructs and ACLs would require patch ports and specific OF pipelines.
> 
> We could even think of adding more advanced L3 capabilities (RIB) into
> OVS to simplify the OF pipelines.
> 
>>> Also, will such behavior be compatible with HW-offload-capable to
>>> smartnics/DPUs?
>> 
>> I am also a bit concerned about this, what would be the typical number
>> of bridges supported by hardware?
> 
> As far as I understand, only the datapath flows are offloaded to
> hardware. The OF pipeline is only parsed when there is an upcall for the
> first packet. Once resolved, the datapath flow is reused. OVS bridges
> are only logical constructs, they are neither reflected in the datapath
> nor in hardware.
> 
>> Use multicast for overlay networks
>> ==
 [snip]
>> - 24bit VNI allows for more than 16 million logical switches. No need
>> for extended GENEVE tunnel options.
> Note that using vxlan at the moment significantly reduces the ovn
> featureset. This is because the geneve header options are currently used
> for data that would not fit into the vxlan vni.

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Vladislav Odintsov via discuss


regards,
Vladislav Odintsov

> On 30 Sep 2023, at 16:50, Robin Jarry  wrote:
> 
> Hi Vladislav, Frode,
> 
> Thanks for your replies.
> 
> Frode Nordahl, Sep 30, 2023 at 10:55:
>> On Sat, Sep 30, 2023 at 9:43 AM Vladislav Odintsov via discuss
>>  wrote:
 On 29 Sep 2023, at 18:14, Robin Jarry via discuss 
  wrote:
 
 Felix Huettner, Sep 29, 2023 at 15:23:
>> Distributed mac learning
>> 
 [snip]
>> 
>> Cons:
>> 
>> - How to manage seamless upgrades?
>> - Requires ovn-controller to move/plug ports in the correct bridge.
>> - Multiple openflow connections (one per managed bridge).
>> - Requires ovn-trace to be reimplemented differently (maybe other tools
>> as well).
> 
> - No central information anymore on mac bindings. All nodes need to
> update their data individually
> - Each bridge generates also a linux network interface. I do not know if
> there is some kind of limit to the linux interfaces or the ovs bridges
> somewhere.
 
 That's a good point. However, only the bridges related to one
 implemented logical network would need to be created on a single
 chassis. Even with the largest OVN deployments, I doubt this would be
 a limitation.
 
> Would you still preprovision static mac addresses on the bridge for all
> port_bindings we know the mac address from, or would you rather leave
> that up for learning as well?
 
 I would leave everything dynamic.
 
> I do not know if there is some kind of performance/optimization penality
> for moving packets between different bridges.
 
 As far as I know, once the openflow pipeline has been resolved into
 a datapath flow, there is no penalty.
 
> You can also not only use the logical switch that have a local port
> bound. Assume the following topology:
> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> vm1 and vm2 are both running on the same hypervisor. Creating only local
> logical switches would mean only ls1 and ls3 are available on that
> hypervisor. This would break the connection between the two vms which
> would in the current implementation just traverse the two logical
> routers.
> I guess we would need to create bridges for each locally reachable
> logical switch. I am concerned about the potentially significant
> increase in bridges and openflow connections this brings.
 
 That is one of the concerns I raised in the last point. In my opinion
 this is a trade off. You remove centralization and require more local
 processing. But overall, the processing cost should remain equivalent.
>>> 
>>> Just want to clarify.
>>> 
>>> For topology described by Felix above, you propose to create 2 OVS
>>> bridges, right? How will the packet traverse from vm1 to vm2?
> 
> In this particular case, there would be 3 OVS bridges, one for each
> logical switch.

Yeah, agree, this is a typo. Below I named three bridges :).

> 
>>> Currently when the packet enters OVS all the logical switching and
>>> routing openflow calculation is done with no packet re-entering OVS,
>>> and this results in one DP flow match to deliver this packet from
>>> vm1 to vm2 (if no conntrack used, which could introduce
>>> recirculations).
>>> 
>>> Do I understand correctly, that in this proposal OVS needs to
>>> receive packet from “ls1” bridge, next run through lrouter “lr1”
>>> OpenFlow pipelines, then output packet to “ls2” OVS bridge for mac
>>> learning between logical routers (should we have here OF flow with
>>> learn action?), then send packet again to OVS, calculate “lr2”
>>> OpenFlow pipeline and finally reach destination OVS bridge “ls3” to
>>> send packet to a vm2?
> 
> What I am proposing is to implement the northbound L2 network intent
> with actual OVS bridges and builtin OVS mac learning. The L3/L4 network
> constructs and ACLs would require patch ports and specific OF pipelines.
> 
> We could even think of adding more advanced L3 capabilities (RIB) into
> OVS to simplify the OF pipelines.

But this will make OVS<->kernel interaction more complex. Even if we forget 
about dpdk environments…

> 
>>> Also, will such behavior be compatible with HW-offload-capable to
>>> smartnics/DPUs?
>> 
>> I am also a bit concerned about this, what would be the typical number
>> of bridges supported by hardware?
> 
> As far as I understand, only the datapath flows are offloaded to
> hardware. The OF pipeline is only parsed when there is an upcall for the
> first packet. Once resolved, the datapath flow is reused. OVS bridges
> are only logical constructs, they are neither reflected in the datapath
> nor in hardware.

As far as I remember from my tests against ConnectX-5/6 SmartNICs in ASAP^2
mode, HW-offload is not capable of offloading OVS patch ports. At least it
was so at the time.

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Han Zhou via discuss
On Thu, Sep 28, 2023 at 9:28 AM Robin Jarry  wrote:
>
> Hello OVN community,
>
> This is a follow up on the message I have sent today [1]. That second
> part focuses on some ideas I have to remove the limitations that were
> mentioned in the previous email.
>
> [1]
https://mail.openvswitch.org/pipermail/ovs-discuss/2023-September/052695.html
>
> If you didn't read it, my goal is to start a discussion about how we
> could improve OVN on the following topics:
>
> - Reduce the memory and CPU footprint of ovn-controller, ovn-northd.
> - Support scaling of L2 connectivity across larger clusters.
> - Simplify CMS interoperability.
> - Allow support for alternative datapath implementations.
>
> Disclaimer:
>
> This message does not mention anything about L3/L4 features of OVN.
> I didn't have time to work on these, yet. I hope we can discuss how
> these fit with my ideas.
>

Hi Robin and folks, thanks for the great discussions!
I read the replies in the two other sub-threads, but I am replying
directly here to comment on some of the original statements in this email.
I will reply in the other threads on some specific points.

> Distributed mac learning
> 
>
> Use one OVS bridge per logical switch with mac learning enabled. Only
> create the bridge if the logical switch has a port bound to the local
> chassis.
>
> Pros:
>
> - Minimal openflow rules required in each bridge (ACLs and NAT mostly).
> - No central mac binding table required.

Firstly to clarify the terminology of "mac binding" to avoid confusion, the
mac_binding table currently in SB DB has nothing to do with L2 MAC
learning. It is actually the ARP/Neighbor table of distributed logical
routers. We should probably call it IP_MAC_binding table, or just Neighbor
table.
Here what you mean is actually L2 MAC learning, which today is implemented
by the FDB table in SB DB, and it is only for uncommon use cases when the
NB doesn't have the knowledge of a MAC address of a VIF.

> - Mac table aging comes for free.
> - Zero access to southbound DB for learned addresses nor for aging.
>
> Cons:
>
> - How to manage seamless upgrades?
> - Requires ovn-controller to move/plug ports in the correct bridge.
> - Multiple openflow connections (one per managed bridge).
> - Requires ovn-trace to be reimplemented differently (maybe other tools
>   as well).
>

The purpose of this proposal is clear - to avoid using a central table in
DB for L2 information but instead using L2 MAC learning to populate such
information on chassis, which is a reasonable alternative with pros and
cons.
However, I don't think it is necessary to use separate OVS bridges for this
purpose. L2 MAC learning can be easily implemented in the br-int bridge
with OVS flows, which is much simpler than managing a dynamic number of OVS
bridges just for the purpose of using the builtin OVS mac-learning.

Now back to the distributed MAC learning idea itself. Essentially for two
VMs/pods to communicate on L2, say, VM1@Chassis1 needs to send a packet to
VM2@chassis2, assuming VM1 already has VM2's MAC address (we will discuss
this later), Chassis1 needs to know that VM2's MAC is located on Chassis2.
In OVN today this information is conveyed through:
- MAC and LSP mapping (NB -> northd -> SB -> Chassis)
- LSP and Chassis mapping (Chassis -> SB -> Chassis)

In your proposal:
- MAC and Chassis mapping (can be learned through initial L2
broadcast/flood)

This indeed would avoid the control plane cost through the centralized
components (for this L2 binding part). Given that today's SB OVSDB is a
bottleneck, this idea may sound attractive. But please also take into
consideration the below improvement that could mitigate the OVN central
scale issue:
- For MAC and LSP mapping, northd is now capable of incrementally
processing VIF related L2/L3 changes, so the cost of NB -> northd -> SB is
very small. For SB -> Chassis, a more scalable DB deployment, such as the
OVSDB relays, may largely help.
- For LSP and Chassis mapping, the round trip through a central DB
obviously costs higher than a direct L2 broadcast (the targets are the
same). But this can be optimized if the MAC and Chassis are known by the CMS
system (which is true for most openstack/k8s env I believe). Instead of
updating the binding from each Chassis, CMS can tell this information
through the same NB -> northd -> SB -> Chassis path, and the Chassis can
just read the SB without updating it.
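
For reference, this CMS-driven path is already expressible today; the
MAC (and optionally IP) is simply declared up front on the logical
switch port (values hypothetical):

```
# The CMS declares the address; chassis then receive the MAC/LSP mapping
# via NB -> northd -> SB without ever writing to the SB themselves.
ovn-nbctl lsp-set-addresses vm2-port "52:54:00:12:34:56 10.0.0.12"
```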

On the other hand, the dynamic MAC learning approach has its own drawbacks.
- It is simple to consider L2 only, but if considering more SDN features, a
central DB is more flexible to extend and implement new features than a
network protocol based approach.
- It is more predictable and easier to debug with pre-populated information
through CMS than states learned dynamically in data-plane.
- With the DB approach we can suppress most of L2 broadcast/flood, while
with the distributed MAC learning broadcast/flood can't be avoided.
Although it may happen mostly when a new workload is launched, it can
also happen when aging. The cost of broadcast in large L2 is also a
potential threat to scale.

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Han Zhou via discuss
On Sat, Sep 30, 2023 at 9:56 AM Vladislav Odintsov 
wrote:
>
>
>
> regards,
> Vladislav Odintsov
>
> > On 30 Sep 2023, at 16:50, Robin Jarry  wrote:
> >
> > Hi Vladislav, Frode,
> >
> > Thanks for your replies.
> >
> > Frode Nordahl, Sep 30, 2023 at 10:55:
> >> On Sat, Sep 30, 2023 at 9:43 AM Vladislav Odintsov via discuss
> >>  wrote:
>  On 29 Sep 2023, at 18:14, Robin Jarry via discuss <
ovs-discuss@openvswitch.org> wrote:
> 
>  Felix Huettner, Sep 29, 2023 at 15:23:
> >> Distributed mac learning
> >> 
>  [snip]
> >>
> >> Cons:
> >>
> >> - How to manage seamless upgrades?
> >> - Requires ovn-controller to move/plug ports in the correct bridge.
> >> - Multiple openflow connections (one per managed bridge).
> >> - Requires ovn-trace to be reimplemented differently (maybe other
tools
> >> as well).
> >
> > - No central information anymore on mac bindings. All nodes need to
> > update their data individually
> > - Each bridge generates also a linux network interface. I do not
know if
> > there is some kind of limit to the linux interfaces or the ovs
bridges
> > somewhere.
> 
>  That's a good point. However, only the bridges related to one
>  implemented logical network would need to be created on a single
>  chassis. Even with the largest OVN deployments, I doubt this would be
>  a limitation.
> 
> > Would you still preprovision static mac addresses on the bridge for
all
> > port_bindings we know the mac address from, or would you rather
leave
> > that up for learning as well?
> 
>  I would leave everything dynamic.
> 
> > I do not know if there is some kind of performance/optimization
penality
> > for moving packets between different bridges.
> 
>  As far as I know, once the openflow pipeline has been resolved into
>  a datapath flow, there is no penalty.
> 
> > You can also not only use the logical switch that have a local port
> > bound. Assume the following topology:
> > +---+ +---+ +---+ +---+ +---+ +---+ +---+
> > |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> > +---+ +---+ +---+ +---+ +---+ +---+ +---+
> > vm1 and vm2 are both running on the same hypervisor. Creating only
local
> > logical switches would mean only ls1 and ls3 are available on that
> > hypervisor. This would break the connection between the two vms
which
> > would in the current implementation just traverse the two logical
> > routers.
> > I guess we would need to create bridges for each locally reachable
> > logical switch. I am concerned about the potentially significant
> > increase in bridges and openflow connections this brings.
> 
>  That is one of the concerns I raised in the last point. In my opinion
>  this is a trade off. You remove centralization and require more local
>  processing. But overall, the processing cost should remain
equivalent.
> >>>
> >>> Just want to clarify.
> >>>
> >>> For topology described by Felix above, you propose to create 2 OVS
> >>> bridges, right? How will the packet traverse from vm1 to vm2?
> >
> > In this particular case, there would be 3 OVS bridges, one for each
> > logical switch.
>
> Yeah, agree, this is typo. Below I named three bridges :).
>
> >
> >>> Currently when the packet enters OVS all the logical switching and
> >>> routing openflow calculation is done with no packet re-entering OVS,
> >>> and this results in one DP flow match to deliver this packet from
> >>> vm1 to vm2 (if no conntrack used, which could introduce
> >>> recirculations).
> >>>
> >>> Do I understand correctly, that in this proposal OVS needs to
> >>> receive packet from “ls1” bridge, next run through lrouter “lr1”
> >>> OpenFlow pipelines, then output packet to “ls2” OVS bridge for mac
> >>> learning between logical routers (should we have here OF flow with
> >>> learn action?), then send packet again to OVS, calculate “lr2”
> >>> OpenFlow pipeline and finally reach destination OVS bridge “ls3” to
> >>> send packet to a vm2?
> >
> > What I am proposing is to implement the northbound L2 network intent
> > with actual OVS bridges and builtin OVS mac learning. The L3/L4 network
> > constructs and ACLs would require patch ports and specific OF pipelines.
> >
> > We could even think of adding more advanced L3 capabilities (RIB) into
> > OVS to simplify the OF pipelines.
>
> But this will make OVS<->kernel interaction more complex. Even if we
forget about dpdk environments…
>
> >
> >>> Also, will such behavior be compatible with HW-offload-capable to
> >>> smartnics/DPUs?
> >>
> >> I am also a bit concerned about this, what would be the typical number
> >> of bridges supported by hardware?
> >
> > As far as I understand, only the datapath flows are offloaded to
> > hardware. The OF pipeline is only parsed when there is an upcall for the
> > first packet. Once resolved, the datapath flow is reused. OVS bridges
> > are only logical constructs, they are neither reflected in the datapath
> > nor in hardware.


Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-01 Thread Robin Jarry via discuss
Hi Han,

Please see my comments/questions inline.

Han Zhou, Sep 30, 2023 at 21:59:
> > Distributed mac learning
> > 
> >
> > Use one OVS bridge per logical switch with mac learning enabled. Only
> > create the bridge if the logical switch has a port bound to the local
> > chassis.
> >
> > Pros:
> >
> > - Minimal openflow rules required in each bridge (ACLs and NAT mostly).
> > - No central mac binding table required.
>
> Firstly to clarify the terminology of "mac binding" to avoid confusion, the
> mac_binding table currently in SB DB has nothing to do with L2 MAC
> learning. It is actually the ARP/Neighbor table of distributed logical
> routers. We should probably call it IP_MAC_binding table, or just Neighbor
> table.

Yes sorry about the confusion. I actually meant the FDB table.

> Here what you mean is actually L2 MAC learning, which today is implemented
> by the FDB table in SB DB, and it is only for uncommon use cases when the
> NB doesn't have the knowledge of a MAC address of a VIF.

This is not that uncommon in telco use cases where VNFs can send packets
from mac addresses unknown to OVN.
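
Concretely, such ports are declared with the special "unknown" address,
which is what makes OVN fall back to FDB learning for them (port name
hypothetical):

```
# Allow the VIF to send from MACs that OVN does not know in advance, in
# addition to its declared address.
ovn-nbctl lsp-set-addresses vnf-port "fa:16:3e:aa:bb:cc 192.0.2.10" unknown
```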

> The purpose of this proposal is clear - to avoid using a central table in
> DB for L2 information but instead using L2 MAC learning to populate such
> information on chassis, which is a reasonable alternative with pros and
> cons.
> However, I don't think it is necessary to use separate OVS bridges for this
> purpose. L2 MAC learning can be easily implemented in the br-int bridge
> with OVS flows, which is much simpler than managing dynamic number of OVS
> bridges just for the purpose of using the builtin OVS mac-learning.

I agree that this could also be implemented with VLAN tags on the
appropriate ports. But since OVS does not support trunk ports, it may
require complicated OF pipelines. My intent with this idea was twofold:

1) Avoid a central point of failure for mac learning/aging.
2) Simplify the OF pipeline by making all FDB operations dynamic.

> Now back to the distributed MAC learning idea itself. Essentially for two
> VMs/pods to communicate on L2, say, VM1@Chassis1 needs to send a packet to
> VM2@chassis2, assuming VM1 already has VM2's MAC address (we will discuss
> this later), Chassis1 needs to know that VM2's MAC is located on Chassis2.
>
> In OVN today this information is conveyed through:
>
> - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> - LSP and Chassis mapping (Chassis -> SB -> Chassis)
>
> In your proposal:
>
> - MAC and Chassis mapping (can be learned through initial L2
>   broadcast/flood)
>
> This indeed would avoid the control plane cost through the centralized
> components (for this L2 binding part). Given that today's SB OVSDB is a
> bottleneck, this idea may sound attractive. But please also take into
> consideration the below improvement that could mitigate the OVN central
> scale issue:
>
> - For MAC and LSP mapping, northd is now capable of incrementally
>   processing VIF related L2/L3 changes, so the cost of NB -> northd ->
>   SB is very small. For SB -> Chassis, a more scalable DB deployment,
>   such as the OVSDB relays, may largely help.

But using relays will only help with read-only operations (SB ->
chassis). Write operations (from dynamically learned mac addresses) will
be equivalent.

> - For LSP and Chassis mapping, the round trip through a central DB
>   obviously costs higher than a direct L2 broadcast (the targets are
>   the same). But this can be optimized if the MAC and Chassis is known
>   by the CMS system (which is true for most openstack/k8s env
>   I believe). Instead of updating the binding from each Chassis, CMS
>   can tell this information through the same NB -> northd -> SB ->
>   Chassis path, and the Chassis can just read the SB without updating
>   it.

This is only applicable for known L2 addresses.  Maybe telco use cases
are very specific, but being able to have ports that send packets from
unknown addresses is a strong requirement.

> On the other hand, the dynamic MAC learning approach has its own drawbacks.
>
> - It is simple to consider L2 only, but if considering more SDB
>   features, a central DB is more flexible to extend and implement new
>   features than a network protocol based approach.
> - It is more predictable and easier to debug with pre-populated
>   information through CMS than states learned dynamically in
>   data-plane.
> - With the DB approach we can suppress most of L2 broadcast/flood,
>   while with the distributed MAC learning broadcast/flood can't be
>   avoided. Although it may happen mostly when a new workload is
>   launched, it can also happen when aging. The cost of broadcast in
>   large L2 is also a potential threat to scale.

I may lack the field experience of operating large datacenter networks
but I was not aware of any scaling issues because of ARP and/or other L2
broadcasts.  Is this an actual problem that was reported by cloud/telco
operators and which influenced the centralized design?

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-01 Thread Han Zhou via discuss
On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry  wrote:
>
> Hi Han,
>
> Please see my comments/questions inline.
>
> Han Zhou, Sep 30, 2023 at 21:59:
> > > Distributed mac learning
> > > 
> > >
> > > Use one OVS bridge per logical switch with mac learning enabled. Only
> > > create the bridge if the logical switch has a port bound to the local
> > > chassis.
> > >
> > > Pros:
> > >
> > > - Minimal openflow rules required in each bridge (ACLs and NAT
mostly).
> > > - No central mac binding table required.
> >
> > Firstly to clarify the terminology of "mac binding" to avoid confusion,
the
> > mac_binding table currently in SB DB has nothing to do with L2 MAC
> > learning. It is actually the ARP/Neighbor table of distributed logical
> > routers. We should probably call it IP_MAC_binding table, or just
Neighbor
> > table.
>
> Yes sorry about the confusion. I actually meant the FDB table.
>
> > Here what you mean is actually L2 MAC learning, which today is
implemented
> > by the FDB table in SB DB, and it is only for uncommon use cases when
the
> > NB doesn't have the knowledge of a MAC address of a VIF.
>
> This is not that uncommon in telco use cases where VNFs can send packets
> from mac addresses unknown to OVN.
>
Understood, but VNFs contribute a very small portion of the workloads,
right? Maybe I should rephrase that: it is uncommon to have "unknown"
addresses for the majority of ports in a large scale cloud. Is this
understanding correct?

> > The purpose of this proposal is clear - to avoid using a central table
in
> > DB for L2 information but instead using L2 MAC learning to populate such
> > information on chassis, which is a reasonable alternative with pros and
> > cons.
> > However, I don't think it is necessary to use separate OVS bridges for
this
> > purpose. L2 MAC learning can be easily implemented in the br-int bridge
> > with OVS flows, which is much simpler than managing dynamic number of
OVS
> > bridges just for the purpose of using the builtin OVS mac-learning.
>
> I agree that this could also be implemented with VLAN tags on the
> appropriate ports. But since OVS does not support trunk ports, it may
> require complicated OF pipelines. My intent with this idea was twofold:
>
> 1) Avoid a central point of failure for mac learning/aging.
> 2) Simplify the OF pipeline by making all FDB operations dynamic.

IMHO, the L2 pipeline is not really complex. It is probably the simplest
part (compared with other features for L3, NAT, ACL, LB, etc.).
Adding dynamic learning to this part probably makes it *a little* more
complex, but should still be straightforward. We don't need any VLAN tag
because the incoming packet has the geneve VNI in the metadata. We just
need a flow that resubmits to look up a MAC-tunnelSrc mapping table and
injects a new flow (with the related tunnel endpoint information) if the
src MAC is not found, with the help of the "learn" action. The entries
are per-logical_switch (VNI). This would serve your purpose of avoiding
a central DB for L2. At least this looks much simpler to me than managing
a dynamic number of OVS bridges and the patch pairs between them.
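
To make that concrete, here is a minimal, untested sketch of such
a learn-based FDB expressed as OpenFlow rules. The table numbers, the
300 second idle timeout, and the $TUNNEL_OFPORT placeholder are
illustrative assumptions, not existing OVN pipeline details:

```
# Sketch only. Assumes tun_id already carries the logical switch VNI at
# lookup time (for packets from local VIFs the pipeline would have to set
# it first). Packets arriving over the tunnel teach table 10 a flow that
# maps (VNI, eth_dst) back to the sender's tunnel endpoint and port.
ovs-ofctl add-flow br-int \
    "table=0, priority=100, in_port=$TUNNEL_OFPORT, \
     actions=learn(table=10, idle_timeout=300, \
                   NXM_NX_TUN_ID[], \
                   NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[], \
                   load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[], \
                   output:NXM_OF_IN_PORT[]), \
     resubmit(,10)"

# Destinations not yet learned fall through to a flood rule (however BUM
# replication ends up being implemented for the logical switch).
ovs-ofctl add-flow br-int "table=10, priority=0, actions=flood"
```

Aging would then come from the idle_timeout on the learned flows, with no
SB DB involvement.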

>
> > Now back to the distributed MAC learning idea itself. Essentially,
> > for two VMs/pods to communicate on L2, say, VM1@Chassis1 needs to
> > send a packet to VM2@chassis2, assuming VM1 already has VM2's MAC
> > address (we will discuss this later), Chassis1 needs to know that
> > VM2's MAC is located on Chassis2.
> >
> > In OVN today this information is conveyed through:
> >
> > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > - LSP and Chassis mapping (Chassis -> SB -> Chassis)
> >
> > In your proposal:
> >
> > - MAC and Chassis mapping (can be learned through initial L2
> >   broadcast/flood)
> >
> > This indeed would avoid the control plane cost through the centralized
> > components (for this L2 binding part). Given that today's SB OVSDB is a
> > bottleneck, this idea may sound attractive. But please also take into
> > consideration the below improvement that could mitigate the OVN central
> > scale issue:
> >
> > - For MAC and LSP mapping, northd is now capable of incrementally
> >   processing VIF related L2/L3 changes, so the cost of NB -> northd ->
> >   SB is very small. For SB -> Chassis, a more scalable DB deployment,
> >   such as the OVSDB relays, may largely help.
>
> But using relays will only help with read-only operations (SB ->
> chassis). Write operations (from dynamically learned mac addresses) will
> cost the same with or without relays.
>
OVSDB relay supports write operations, too. It scales better because each
ovsdb-server process handles a smaller number of clients/connections. It
may still perform worse when there are too many write operations from many
clients, but I think it should scale better than without relays. This is
only based on my knowledge of the ovsdb-server relay, and I haven't tested
it at scale yet. People who actually deployed it may comment more.
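
For reference, a relay is just another ovsdb-server process pointed at the
central cluster. A rough sketch with made-up addresses (10.0.0.1-3 for the
central SB cluster, 10.0.1.1 for the relay):

```
# Serve the SB DB to nearby chassis, forwarding writes to the central cluster.
ovsdb-server --remote=ptcp:6642 \
    relay:OVN_Southbound:tcp:10.0.0.1:6642,tcp:10.0.0.2:6642,tcp:10.0.0.3:6642

# On each chassis, point ovn-controller at its relay instead of the cluster.
ovs-vsctl set open . external-ids:ovn-remote="tcp:10.0.1.1:6642"
```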

> > - For LSP and Chassis mapping, the 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-02 Thread Felix Huettner via discuss
Hi everyone,

just want to add my experience below
On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry  wrote:
> >
> > Hi Han,
> >
> > Please see my comments/questions inline.
> >
> > Han Zhou, Sep 30, 2023 at 21:59:
> > > > Distributed mac learning
> > > > 
> > > >
> > > > Use one OVS bridge per logical switch with mac learning enabled. Only
> > > > create the bridge if the logical switch has a port bound to the local
> > > > chassis.
> > > >
> > > > Pros:
> > > >
> > > > - Minimal openflow rules required in each bridge (ACLs and NAT
> > > >   mostly).
> > > > - No central mac binding table required.
> > >
> > > Firstly to clarify the terminology of "mac binding" to avoid confusion:
> > > the mac_binding table currently in the SB DB has nothing to do with L2
> > > MAC learning. It is actually the ARP/Neighbor table of distributed
> > > logical routers. We should probably call it IP_MAC_binding table, or
> > > just Neighbor table.
> >
> > Yes, sorry about the confusion. I actually meant the FDB table.
> >
> > > Here what you mean is actually L2 MAC learning, which today is
> > > implemented by the FDB table in the SB DB, and it is only for uncommon
> > > use cases when the NB doesn't have the knowledge of a MAC address of
> > > a VIF.
> >
> > This is not that uncommon in telco use cases where VNFs can send packets
> > from mac addresses unknown to OVN.
> >
> Understand, but VNFs contribute a very small portion of the workloads,
> right? Maybe I should rephrase that: it is uncommon to have "unknown"
> addresses for the majority of ports in a large scale cloud. Is this
> understanding correct?

I can only share numbers for our use case. With ~650 chassis we have the
following distribution of "unknown" in the `addresses` field of
Logical_Switch_Port:
* 23000 with a mac address + ip and without "unknown"
* 250 with a mac address + ip and with "unknown"
* 30 with just "unknown"

The use case is a generic public cloud and we do not have any
telco-related workloads.
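
For anyone who wants to pull the same breakdown from their own deployment,
a rough sketch (assuming ovn-nbctl can reach the NB DB):

```
# LSPs whose addresses column contains "unknown".
ovn-nbctl --format=csv --no-headings --columns=addresses \
    list Logical_Switch_Port | grep -c unknown

# Total LSP count, for comparison.
ovn-nbctl --format=csv --no-headings --columns=_uuid \
    list Logical_Switch_Port | wc -l
```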

>
> > > The purpose of this proposal is clear - to avoid using a central
> > > table in the DB for L2 information but instead using L2 MAC learning
> > > to populate such information on chassis, which is a reasonable
> > > alternative with pros and cons.
> > > However, I don't think it is necessary to use separate OVS bridges
> > > for this purpose. L2 MAC learning can be easily implemented in the
> > > br-int bridge with OVS flows, which is much simpler than managing
> > > a dynamic number of OVS bridges just for the purpose of using the
> > > builtin OVS mac-learning.
> >
> > I agree that this could also be implemented with VLAN tags on the
> > appropriate ports. But since OVS does not support trunk ports, it may
> > require complicated OF pipelines. My intent with this idea was twofold:
> >
> > 1) Avoid a central point of failure for mac learning/aging.
> > 2) Simplify the OF pipeline by making all FDB operations dynamic.
>
> IMHO, the L2 pipeline is not really complex. It is probably the simplest
> part (compared with other features for L3, NAT, ACL, LB, etc.).
> Adding dynamic learning to this part probably makes it *a little* more
> complex, but should still be straightforward. We don't need any VLAN tag
> because the incoming packet has the geneve VNI in the metadata. We just
> need a flow that resubmits to look up a MAC-tunnelSrc mapping table and
> injects a new flow (with the related tunnel endpoint information) if the
> src MAC is not found, with the help of the "learn" action. The entries
> are per-logical_switch (VNI). This would serve your purpose of avoiding
> a central DB for L2. At least this looks much simpler to me than managing
> a dynamic number of OVS bridges and the patch pairs between them.
>
> >
> > > Now back to the distributed MAC learning idea itself. Essentially,
> > > for two VMs/pods to communicate on L2, say, VM1@Chassis1 needs to
> > > send a packet to VM2@chassis2, assuming VM1 already has VM2's MAC
> > > address (we will discuss this later), Chassis1 needs to know that
> > > VM2's MAC is located on Chassis2.
> > >
> > > In OVN today this information is conveyed through:
> > >
> > > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > > - LSP and Chassis mapping (Chassis -> SB -> Chassis)
> > >
> > > In your proposal:
> > >
> > > - MAC and Chassis mapping (can be learned through initial L2
> > >   broadcast/flood)
> > >
> > > This indeed would avoid the control plane cost through the centralized
> > > components (for this L2 binding part). Given that today's SB OVSDB is a
> > > bottleneck, this idea may sound attractive. But please also take into
> > > consideration the below improvement that could mitigate the OVN central
> > > scale issue:
> > >
> > > - For MAC and LSP mapping, northd is now capable of incrementally
> > >   processing VIF related L2/L3 changes, so the cost of NB -> northd ->
> > >   SB is very small. For SB -> Chassis, a more scalable DB deployment,
> > 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-02 Thread Frode Nordahl via discuss
On Sat, Sep 30, 2023 at 3:50 PM Robin Jarry  wrote:

[ snip ]

> > > Also, will such behavior be compatible with HW-offload-capable to
> > > smartnics/DPUs?
> >
> > I am also a bit concerned about this: what would be the typical number
> > of bridges supported by hardware?
>
> As far as I understand, only the datapath flows are offloaded to
> hardware. The OF pipeline is only parsed when there is an upcall for the
> first packet. Once resolved, the datapath flow is reused. OVS bridges
> are only logical constructs; they are neither reflected in the datapath
> nor in hardware.

True, but you never know what odd bugs might pop out when doing things
like this, hence my concern :)
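
As a side note, this split is observable on an offload-capable system,
since the datapath flows that were actually pushed down to hardware can be
dumped separately:

```
# Datapath flows currently offloaded to hardware...
ovs-appctl dpctl/dump-flows type=offloaded
# ...versus the full datapath flow cache.
ovs-appctl dpctl/dump-flows
```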

> > > >>> Use multicast for overlay networks
> > > >>> ==
> > > > [snip]
> > > >>> - 24bit VNI allows for more than 16 million logical switches. No need
> > > >>>  for extended GENEVE tunnel options.
> > > >> Note that using vxlan at the moment significantly reduces the ovn
> > > >> featureset. This is because the geneve header options are
> > > >> currently used for data that would not fit into the vxlan vni.
> > > >>
> > > >> From ovn-architecture.7.xml:
> > > >> ```
> > > >> The maximum number of networks is reduced to 4096.
> > > >> The maximum number of ports per network is reduced to 2048.
> > > >> ACLs matching against logical ingress port identifiers are not 
> > > >> supported.
> > > >> OVN interconnection feature is not supported.
> > > >> ```
> > > >
> > > > In my understanding, the main reason why GENEVE replaced VXLAN is
> > > > that OpenStack uses full-mesh point-to-point tunnels and that the
> > > > sender needs to know behind which chassis any mac address is, to
> > > > send it into the correct tunnel. GENEVE made it possible to reduce
> > > > the lookup time both on the sender and the receiver thanks to
> > > > ingress/egress port metadata.
> > > >
> > > > https://blog.russellbryant.net/2017/05/30/ovn-geneve-vs-vxlan-does-it-matter/
> > > > https://dani.foroselectronica.es/ovn-geneve-encapsulation-541/
> > > >
> > > > If VXLAN + multicast and address learning were used, the "correct"
> > > > tunnel would be established ad hoc, and both sender and receiver
> > > > lookups would only be simple mac forwarding with learning. The
> > > > ingress pipeline would probably cost a little more.
> > > >
> > > > Maybe multicast + address learning could be implemented for GENEVE as
> > > > well. But it would not be interoperable with other VTEPs.
> >
> > While it is true that it takes time before switch hardware picks up
> > support for emerging protocols, I do not think it is a valid argument
> > for limiting the development of OVN. Most hardware offload capable
> > NICs already have GENEVE support, and if you survey recent or upcoming
> > releases from top of rack switch vendors you will also find that they
> > have added support for using GENEVE for hardware VTEPs. The fact that
> > SDNs with a large customer footprint (such as NSX and OVN) make use of
> > GENEVE is most likely a deciding factor for their adoption, and I see
> > no reason why we should stop defining the edge of development in this
> > space.
>
> GENEVE would be perfectly suitable for a multicast-based control plane
> that establishes ad-hoc tunnels without any centralized involvement.
>
> I was only proposing VXLAN since this multicast group system was part of
> the original RFC (supported in Linux since 3.12).
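
For reference, this is the stock Linux multicast VXLAN mode being referred
to; the VNI 1234, group 239.1.1.1 and device eth0 below are arbitrary
example values:

```
# A learning VXLAN device bound to a multicast group: BUM traffic goes to
# the group, unicast uses dynamically learned FDB entries.
ip link add vxlan1234 type vxlan id 1234 group 239.1.1.1 dev eth0 dstport 4789
ip link set vxlan1234 up
```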
>
> > > >>> - Limited and scoped "flooding" with IGMP/MLD snooping enabled in
> > > >>>  top-of-rack switches. Multicast is only used for BUM traffic.
> > > >>> - Only one VXLAN output port per implemented logical switch on a given
> > > >>>  chassis.
> > > >>
> > > >> Would this actually work with one VXLAN output port? Would you not need
> > > >> one port per target node to send unicast traffic (as you otherwise 
> > > >> flood
> > > >> all packets to all participating nodes)?
> > > >
> > > > You would need one VXLAN output port per implemented logical switch on
> > > > a given chassis. The port would have a VNI (unique per logical switch)
> > > > and an associated multicast IP address. Any chassis that implement this
> > > > logical switch would subscribe to that multicast group. The flooding
> > > > would be limited to first packets and broadcast/multicast traffic (ARP
> > > > requests, mostly). Once the receiver node replies, all communication
> > > > will happen with unicast.
> > > >
> > > > https://networkdirection.net/articles/routingandswitching/vxlanoverview/vxlanaddresslearning/#BUM_Traffic
> > > >
> > > >>> Cons:
> > > >>>
> > > >>> - OVS does not support VXLAN address learning yet.
> > > >>> - The number of usable multicast groups in a fabric network may be
> > > >>>  limited?
> > > >>> - How to manage seamless upgrades and interoperability with older OVN
> > > >>>  versions?
> > > >> - This pushes all logic related to chassis management to the
> > > >>  underlying networking fabric. It thereby places additional
> > > >>

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-03 Thread Robin Jarry via discuss
Hi all,

Felix Huettner, Oct 02, 2023 at 09:35:
> Hi everyone,
>
> just want to add my experience below
> On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> > On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry  wrote:
> > >
> > > Hi Han,
> > >
> > > Please see my comments/questions inline.
> > >
> > > Han Zhou, Sep 30, 2023 at 21:59:
> > > > > Distributed mac learning
> > > > > 
> > > > >
> > > > > Use one OVS bridge per logical switch with mac learning
> > > > > enabled. Only create the bridge if the logical switch has
> > > > > a port bound to the local chassis.
> > > > >
> > > > > Pros:
> > > > >
> > > > > - Minimal openflow rules required in each bridge (ACLs and NAT
> > > > >   mostly).
> > > > > - No central mac binding table required.
> > > >
> > > > Firstly to clarify the terminology of "mac binding" to avoid
> > > > confusion, the mac_binding table currently in SB DB has nothing
> > > > to do with L2 MAC learning. It is actually the ARP/Neighbor
> > > > table of distributed logical routers. We should probably call it
> > > > IP_MAC_binding table, or just Neighbor table.
> > >
> > > Yes sorry about the confusion. I actually meant the FDB table.
> > >
> > > > Here what you mean is actually L2 MAC learning, which today is
> > > > implemented by the FDB table in SB DB, and it is only for
> > > > uncommon use cases when the NB doesn't have the knowledge of
> > > > a MAC address of a VIF.
> > >
> > > This is not that uncommon in telco use cases where VNFs can send
> > > packets from mac addresses unknown to OVN.
> > >
> > Understand, but VNFs contribute a very small portion of the
> > workloads, right? Maybe I should rephrase that: it is uncommon to
> > have "unknown" addresses for the majority of ports in a large scale
> > cloud. Is this understanding correct?
>
> I can only share numbers for our use case. With ~650 chassis we have the
> following distribution of "unknown" in the `addresses` field of
> Logical_Switch_Port:
> * 23000 with a mac address + ip and without "unknown"
> * 250 with a mac address + ip and with "unknown"
> * 30 with just "unknown"
>
> The usecase is a generic public cloud and we do not have any telco
> related things.

I don't have any numbers from telco deployments at hand but I will poke
around.

> > > > The purpose of this proposal is clear - to avoid using a central
> > > > table in DB for L2 information but instead using L2 MAC learning
> > > > to populate such information on chassis, which is a reasonable
> > > > alternative with pros and cons.
> > > > However, I don't think it is necessary to use separate OVS
> > > > bridges for this purpose. L2 MAC learning can be easily
> > > > implemented in the br-int bridge with OVS flows, which is much
> > > > simpler than managing a dynamic number of OVS bridges just for the
> > > > purpose of using the builtin OVS mac-learning.
> > >
> > > I agree that this could also be implemented with VLAN tags on the
> > > appropriate ports. But since OVS does not support trunk ports, it
> > > may require complicated OF pipelines. My intent with this idea was
> > > twofold:
> > >
> > > 1) Avoid a central point of failure for mac learning/aging.
> > > 2) Simplify the OF pipeline by making all FDB operations dynamic.
> >
> > IMHO, the L2 pipeline is not really complex. It is probably the
> > simplest part (compared with other features for L3, NAT, ACL, LB,
> > etc.). Adding dynamic learning to this part probably makes it *a
> > little* more complex, but should still be straightforward. We don't
> > need any VLAN tag because the incoming packet has the geneve VNI in
> > the metadata. We just need a flow that resubmits to look up
> > a MAC-tunnelSrc mapping table and injects a new flow (with the
> > related tunnel endpoint information) if the src MAC is not found,
> > with the help of the "learn" action. The entries are per-logical_switch
> > (VNI). This would serve your purpose of avoiding a central DB for
> > L2. At least this looks much simpler to me than managing a dynamic
> > number of OVS bridges and the patch pairs between them.

Would that work for non-GENEVE networks (localnet) when there is no VNI?
Does that apply as well?


> >
> > >
> > > > Now back to the distributed MAC learning idea itself.
> > > > Essentially for two VMs/pods to communicate on L2, say,
> > > > VM1@Chassis1 needs to send a packet to VM2@chassis2, assuming
> > > > VM1 already has VM2's MAC address (we will discuss this later),
> > > > Chassis1 needs to know that VM2's MAC is located on Chassis2.
> > > >
> > > > In OVN today this information is conveyed through:
> > > >
> > > > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > > > - LSP and Chassis mapping (Chassis -> SB -> Chassis)
> > > >
> > > > In your proposal:
> > > >
> > > > - MAC and Chassis mapping (can be learned through initial L2
> > > >   broadcast/flood)
> > > >
> > > > This indeed would avoid the control plane cost through the
> > > > centralized components (for this L2 binding part)

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-03 Thread Robin Jarry via discuss
Frode Nordahl, Oct 02, 2023 at 10:05:
> > I must admit I don't have enough field experience operating large
> > network fabrics to state what issues multicast can cause with these.
> > This is why I raised this in the cons list :)
> >
> > What specific issues did you have in mind?
>
> It has been a while since I was overseeing the operations of large
> metro networks, but I have vivid memories of multicast routing being a
> recurring issue. A fabric supporting 10k computes would most likely
> not be one large L2; there would be L3 routing involved, and as a
> consequence your proposal imposes configuration and scale requirements
> on the fabric.
>
> Datapoints that suggest other people see this as an issue too can be
> found in the fact that popular top-of-rack vendors have chosen
> control-plane-based MAC learning for their EVPN implementations
> (RFC 7432). There are also multiple papers discussing the scaling
> issues of multicast.

Thanks, I will try to educate myself better about this :)


Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-04 Thread Felix Huettner via discuss
Hi Robin,

i'll try to answer what i can.

On Tue, Oct 03, 2023 at 09:22:53AM +0200, Robin Jarry via discuss wrote:
> Hi all,
>
> Felix Huettner, Oct 02, 2023 at 09:35:
> > Hi everyone,
> >
> > just want to add my experience below
> > On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> > > On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry  wrote:
> > > >
> > > > Hi Han,
> > > >
> > > > Please see my comments/questions inline.
> > > >
> > > > Han Zhou, Sep 30, 2023 at 21:59:
> > > > > > Distributed mac learning
> > > > > > 
> > > > > >
> > > > > > Use one OVS bridge per logical switch with mac learning
> > > > > > enabled. Only create the bridge if the logical switch has
> > > > > > a port bound to the local chassis.
> > > > > >
> > > > > > Pros:
> > > > > >
> > > > > > - Minimal openflow rules required in each bridge (ACLs and NAT
> > > > > >   mostly).
> > > > > > - No central mac binding table required.
> > > > >
> > > > > Firstly to clarify the terminology of "mac binding" to avoid
> > > > > confusion, the mac_binding table currently in SB DB has nothing
> > > > > to do with L2 MAC learning. It is actually the ARP/Neighbor
> > > > > table of distributed logical routers. We should probably call it
> > > > > IP_MAC_binding table, or just Neighbor table.
> > > >
> > > > Yes sorry about the confusion. I actually meant the FDB table.
> > > >
> > > > > Here what you mean is actually L2 MAC learning, which today is
> > > > > implemented by the FDB table in SB DB, and it is only for
> > > > > uncommon use cases when the NB doesn't have the knowledge of
> > > > > a MAC address of a VIF.
> > > >
> > > > This is not that uncommon in telco use cases where VNFs can send
> > > > packets from mac addresses unknown to OVN.
> > > >
> > > Understand, but VNFs contribute a very small portion of the
> > > workloads, right? Maybe I should rephrase that: it is uncommon to
> > > have "unknown" addresses for the majority of ports in a large scale
> > > cloud. Is this understanding correct?
> >
> > I can only share numbers for our use case. With ~650 chassis we have the
> > following distribution of "unknown" in the `addresses` field of
> > Logical_Switch_Port:
> > * 23000 with a mac address + ip and without "unknown"
> > * 250 with a mac address + ip and with "unknown"
> > * 30 with just "unknown"
> >
> > The usecase is a generic public cloud and we do not have any telco
> > related things.
>
> I don't have any numbers from telco deployments at hand but I will poke
> around.
>
> > > > > The purpose of this proposal is clear - to avoid using a central
> > > > > table in DB for L2 information but instead using L2 MAC learning
> > > > > to populate such information on chassis, which is a reasonable
> > > > > alternative with pros and cons.
> > > > > However, I don't think it is necessary to use separate OVS
> > > > > bridges for this purpose. L2 MAC learning can be easily
> > > > > implemented in the br-int bridge with OVS flows, which is much
> > > > > simpler than managing a dynamic number of OVS bridges just for the
> > > > > purpose of using the builtin OVS mac-learning.
> > > >
> > > > I agree that this could also be implemented with VLAN tags on the
> > > > appropriate ports. But since OVS does not support trunk ports, it
> > > > may require complicated OF pipelines. My intent with this idea was
> > > > twofold:
> > > >
> > > > 1) Avoid a central point of failure for mac learning/aging.
> > > > 2) Simplify the OF pipeline by making all FDB operations dynamic.
> > >
> > > IMHO, the L2 pipeline is not really complex. It is probably the
> > > simplest part (compared with other features for L3, NAT, ACL, LB,
> > > etc.). Adding dynamic learning to this part probably makes it *a
> > > little* more complex, but should still be straightforward. We don't
> > > need any VLAN tag because the incoming packet has the geneve VNI in
> > > the metadata. We just need a flow that resubmits to look up
> > > a MAC-tunnelSrc mapping table and injects a new flow (with the
> > > related tunnel endpoint information) if the src MAC is not found,
> > > with the help of the "learn" action. The entries are per-logical_switch
> > > (VNI). This would serve your purpose of avoiding a central DB for
> > > L2. At least this looks much simpler to me than managing a dynamic
> > > number of OVS bridges and the patch pairs between them.
>
> Would that work for non-GENEVE networks (localnet) when there is no VNI?
> Does that apply as well?
>
>
> > >
> > > >
> > > > > Now back to the distributed MAC learning idea itself.
> > > > > Essentially for two VMs/pods to communicate on L2, say,
> > > > > VM1@Chassis1 needs to send a packet to VM2@chassis2, assuming
> > > > > VM1 already has VM2's MAC address (we will discuss this later),
> > > > > Chassis1 needs to know that VM2's MAC is located on Chassis2.
> > > > >
> > > > > In OVN today this information is conveyed through:
> > > > >
> > > > > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > > >

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-04 Thread Robin Jarry via discuss
Hi Felix,

Felix Huettner, Oct 04, 2023 at 09:24:
> Hi Robin,
>
> i'll try to answer what i can.
>
> On Tue, Oct 03, 2023 at 09:22:53AM +0200, Robin Jarry via discuss wrote:
> > Hi all,
> >
> > Felix Huettner, Oct 02, 2023 at 09:35:
> > > Hi everyone,
> > >
> > > just want to add my experience below
> > > On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> > > > On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry  wrote:
> > > > >
> > > > > Hi Han,
> > > > >
> > > > > Please see my comments/questions inline.
> > > > >
> > > > > Han Zhou, Sep 30, 2023 at 21:59:
> > > > > > > Distributed mac learning
> > > > > > > 
> > > > > > >
> > > > > > > Use one OVS bridge per logical switch with mac learning
> > > > > > > enabled. Only create the bridge if the logical switch has
> > > > > > > a port bound to the local chassis.
> > > > > > >
> > > > > > > Pros:
> > > > > > >
> > > > > > > - Minimal openflow rules required in each bridge (ACLs and NAT
> > > > > > >   mostly).
> > > > > > > - No central mac binding table required.
> > > > > >
> > > > > > Firstly to clarify the terminology of "mac binding" to avoid
> > > > > > confusion, the mac_binding table currently in SB DB has nothing
> > > > > > to do with L2 MAC learning. It is actually the ARP/Neighbor
> > > > > > table of distributed logical routers. We should probably call it
> > > > > > IP_MAC_binding table, or just Neighbor table.
> > > > >
> > > > > Yes sorry about the confusion. I actually meant the FDB table.
> > > > >
> > > > > > Here what you mean is actually L2 MAC learning, which today is
> > > > > > implemented by the FDB table in SB DB, and it is only for
> > > > > > uncommon use cases when the NB doesn't have the knowledge of
> > > > > > a MAC address of a VIF.
> > > > >
> > > > > This is not that uncommon in telco use cases where VNFs can send
> > > > > packets from mac addresses unknown to OVN.
> > > > >
> > > > Understand, but VNFs contribute a very small portion of the
> > > > workloads, right? Maybe I should rephrase that: it is uncommon to
> > > > have "unknown" addresses for the majority of ports in a large scale
> > > > cloud. Is this understanding correct?
> > >
> > > I can only share numbers for our use case. With ~650 chassis we have the
> > > following distribution of "unknown" in the `addresses` field of
> > > Logical_Switch_Port:
> > > * 23000 with a mac address + ip and without "unknown"
> > > * 250 with a mac address + ip and with "unknown"
> > > * 30 with just "unknown"
> > >
> > > The usecase is a generic public cloud and we do not have any telco
> > > related things.
> >
> > I don't have any numbers from telco deployments at hand but I will poke
> > around.
> >
> > > > > > The purpose of this proposal is clear - to avoid using a central
> > > > > > table in DB for L2 information but instead using L2 MAC learning
> > > > > > to populate such information on chassis, which is a reasonable
> > > > > > alternative with pros and cons.
> > > > > > However, I don't think it is necessary to use separate OVS
> > > > > > bridges for this purpose. L2 MAC learning can be easily
> > > > > > implemented in the br-int bridge with OVS flows, which is much
> > > > > > simpler than managing a dynamic number of OVS bridges just for the
> > > > > > purpose of using the builtin OVS mac-learning.
> > > > >
> > > > > I agree that this could also be implemented with VLAN tags on the
> > > > > appropriate ports. But since OVS does not support trunk ports, it
> > > > > may require complicated OF pipelines. My intent with this idea was
> > > > > twofold:
> > > > >
> > > > > 1) Avoid a central point of failure for mac learning/aging.
> > > > > 2) Simplify the OF pipeline by making all FDB operations dynamic.
> > > >
> > > > IMHO, the L2 pipeline is not really complex. It is probably the
> > > > simplest part (compared with other features for L3, NAT, ACL, LB,
> > > > etc.). Adding dynamic learning to this part probably makes it *a
> > > > little* more complex, but should still be straightforward. We don't
> > > > need any VLAN tag because the incoming packet has the geneve VNI in
> > > > the metadata. We just need a flow that resubmits to look up
> > > > a MAC-tunnelSrc mapping table and injects a new flow (with the
> > > > related tunnel endpoint information) if the src MAC is not found,
> > > > with the help of the "learn" action. The entries are per-logical_switch
> > > > (VNI). This would serve your purpose of avoiding a central DB for
> > > > L2. At least this looks much simpler to me than managing a dynamic
> > > > number of OVS bridges and the patch pairs between them.
> >
> > Would that work for non-GENEVE networks (localnet) when there is no VNI?
> > Does that apply as well?
> >
> >
> > > >
> > > > >
> > > > > > Now back to the distributed MAC learning idea itself.
> > > > > > Essentially for two VMs/pods to communicate on L2, say,
> > > > > > VM1@Chassis1 needs to send a packet to VM2@chassis2, assuming
> > > > > > VM1 already has VM2's MAC 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-23 Thread Felix Huettner via discuss
On Wed, Oct 04, 2023 at 04:41:31PM +0200, Robin Jarry via discuss wrote:
> Hi Felix,
>
> Felix Huettner, Oct 04, 2023 at 09:24:
> > Hi Robin,
> >
> > i'll try to answer what i can.
> >
> > On Tue, Oct 03, 2023 at 09:22:53AM +0200, Robin Jarry via discuss wrote:
> > > Hi all,
> > >
> > > Felix Huettner, Oct 02, 2023 at 09:35:
> > > > Hi everyone,
> > > >
> > > > just want to add my experience below
> > > > On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> > > > > On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry  wrote:
> > > > > >
> > > > > > Hi Han,
> > > > > >
> > > > > > Please see my comments/questions inline.
> > > > > >
> > > > > > Han Zhou, Sep 30, 2023 at 21:59:
> > > > > > > > Distributed mac learning
> > > > > > > > 
> > > > > > > >
> > > > > > > > Use one OVS bridge per logical switch with mac learning
> > > > > > > > enabled. Only create the bridge if the logical switch has
> > > > > > > > a port bound to the local chassis.
> > > > > > > >
> > > > > > > > Pros:
> > > > > > > >
> > > > > > > > - Minimal openflow rules required in each bridge (ACLs and NAT
> > > > > > > >   mostly).
> > > > > > > > - No central mac binding table required.
> > > > > > >
> > > > > > > Firstly to clarify the terminology of "mac binding" to avoid
> > > > > > > confusion, the mac_binding table currently in SB DB has nothing
> > > > > > > to do with L2 MAC learning. It is actually the ARP/Neighbor
> > > > > > > table of distributed logical routers. We should probably call it
> > > > > > > IP_MAC_binding table, or just Neighbor table.
> > > > > >
> > > > > > Yes sorry about the confusion. I actually meant the FDB table.
> > > > > >
> > > > > > > Here what you mean is actually L2 MAC learning, which today is
> > > > > > > implemented by the FDB table in SB DB, and it is only for
> > > > > > > uncommon use cases when the NB doesn't have the knowledge of
> > > > > > > a MAC address of a VIF.
> > > > > >
> > > > > > This is not that uncommon in telco use cases where VNFs can send
> > > > > > packets from mac addresses unknown to OVN.
> > > > > >
> > > > > Understand, but VNFs contribute a very small portion of the
> > > > > workloads, right? Maybe I should rephrase that: it is uncommon to
> > > > > have "unknown" addresses for the majority of ports in a large scale
> > > > > cloud. Is this understanding correct?
> > > >
> > > > I can only share numbers for our use case. With ~650 chassis we have the
> > > > following distribution of "unknown" in the `addresses` field of
> > > > Logical_Switch_Port:
> > > > * 23000 with a mac address + ip and without "unknown"
> > > > * 250 with a mac address + ip and with "unknown"
> > > > * 30 with just "unknown"
> > > >
> > > > The usecase is a generic public cloud and we do not have any telco
> > > > related things.
> > >
> > > I don't have any numbers from telco deployments at hand but I will poke
> > > around.
> > >
> > > > > > > The purpose of this proposal is clear - to avoid using a central
> > > > > > > table in DB for L2 information but instead using L2 MAC learning
> > > > > > > to populate such information on chassis, which is a reasonable
> > > > > > > alternative with pros and cons.
> > > > > > > However, I don't think it is necessary to use separate OVS
> > > > > > > bridges for this purpose. L2 MAC learning can be easily
> > > > > > > implemented in the br-int bridge with OVS flows, which is much
> > > > > > > simpler than managing a dynamic number of OVS bridges just for the
> > > > > > > purpose of using the builtin OVS mac-learning.
> > > > > >
> > > > > > I agree that this could also be implemented with VLAN tags on the
> > > > > > appropriate ports. But since OVS does not support trunk ports, it
> > > > > > may require complicated OF pipelines. My intent with this idea was
> > > > > > twofold:
> > > > > >
> > > > > > 1) Avoid a central point of failure for mac learning/aging.
> > > > > > 2) Simplify the OF pipeline by making all FDB operations dynamic.
> > > > >
> > > > > IMHO, the L2 pipeline is not really complex. It is probably the
> > > > > simplest part (compared with other features for L3, NAT, ACL, LB,
> > > > > etc.). Adding dynamic learning to this part probably makes it *a
> > > > > little* more complex, but should still be straightforward. We don't
> > > > > need any VLAN tag because the incoming packet has the geneve VNI in
> > > > > the metadata. We just need a flow that resubmits to look up
> > > > > a MAC-tunnelSrc mapping table and injects a new flow (with the
> > > > > related tunnel endpoint information) if the src MAC is not found,
> > > > > with the help of the "learn" action. The entries are per-logical_switch
> > > > > (VNI). This would serve your purpose of avoiding a central DB for
> > > > > L2. At least this looks much simpler to me than managing a dynamic
> > > > > number of OVS bridges and the patch pairs between them.
> > >
> > > Would that work for non-GENEVE networks (localnet) when there is no VNI?
> > > Does that apply as well?