Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-10-04 Thread Robin Jarry via discuss

Hi Mark,

Mark Michelson, Oct 03, 2023 at 23:09:

Hi Robin,

Thanks a bunch for putting these two emails together. I've read through 
them and the replies.


I think there's one major issue: a lack of data.


That's my concern as well... The problem is that it is very hard to get 
reliable and actionable data at that level of scale. I have been trying 
to collect such data and put together realistic scenarios, but without 
success so far.


I think the four bullet points you listed below are admirable goals. The 
problem is that I think we're putting the cart before the horse with 
both the issues and proposals.


In other words, before being able to properly evaluate these emails, we 
need to see a scenario that

1) Has clear goals for what scalability metrics are desired.
2) Shows evidence that these scalability goals are not being met.
3) Shows evidence that one or more of the issues listed in this email 
are the cause for the scalability issues in the scenario.
4) Shows evidence that the proposed changes would fix the scalability 
issues in the scenario.


I hope that the ongoing work on ovn-heater will help in that regard.

I listed them in this order because without a failing scenario, we can't 
claim the scalability is poor. Then if we have a failing scenario, it's 
possible that the problem and its solution are much simpler than any of 
the issues or proposals that have been brought up here. It's also 
possible that only a subset of the issues listed in this email are 
contributing to the failure. Even if the issues identified here are 
directly causing the scenario to fail, there may still be simpler 
solutions than what has been proposed. And finally, it's possible that 
the proposed solutions don't actually result in the expected scale increase.


I want to make sure my tone is coming across clearly here. I don't think 
the current OVN architecture is perfect, and I don't want to be 
dismissive of the issues you've raised. If there are changes we can make 
to simplify OVN and scale better at the same time, I'm all for it. The 
problem is that, as you pointed out in your proposal email, most of 
these proposals result in difficulties for upgrades/downgrades, as well 
as code maintenance. Therefore, if we are going to do any of these, we 
need to first be certain that we aren't scaling as well as we would 
like, and that there are not simpler paths to reach our scalability targets.


I get your point, and this is specifically why I split the conversation 
in two. I did not want my proposals to be mixed up with the issues.


I will see if I can gather hard data that demonstrates my claims.

Thanks!



Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-04 Thread Robin Jarry via discuss
Hi Felix,

Felix Huettner, Oct 04, 2023 at 09:24:
> Hi Robin,
>
> i'll try to answer what i can.
>
> On Tue, Oct 03, 2023 at 09:22:53AM +0200, Robin Jarry via discuss wrote:
> > Hi all,
> >
> > Felix Huettner, Oct 02, 2023 at 09:35:
> > > Hi everyone,
> > >
> > > just want to add my experience below
> > > On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> > > > On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry  wrote:
> > > > >
> > > > > Hi Han,
> > > > >
> > > > > Please see my comments/questions inline.
> > > > >
> > > > > Han Zhou, Sep 30, 2023 at 21:59:
> > > > > > > Distributed mac learning
> > > > > > > 
> > > > > > >
> > > > > > > Use one OVS bridge per logical switch with mac learning
> > > > > > > enabled. Only create the bridge if the logical switch has
> > > > > > > a port bound to the local chassis.
> > > > > > >
> > > > > > > Pros:
> > > > > > >
> > > > > > > - Minimal openflow rules required in each bridge (ACLs and NAT
> > > > > > >   mostly).
> > > > > > > - No central mac binding table required.
> > > > > >
> > > > > > Firstly to clarify the terminology of "mac binding" to avoid
> > > > > > confusion, the mac_binding table currently in SB DB has nothing
> > > > > > to do with L2 MAC learning. It is actually the ARP/Neighbor
> > > > > > table of distributed logical routers. We should probably call it
> > > > > > IP_MAC_binding table, or just Neighbor table.
> > > > >
> > > > > Yes sorry about the confusion. I actually meant the FDB table.
> > > > >
> > > > > > Here what you mean is actually L2 MAC learning, which today is
> > > > > > implemented by the FDB table in SB DB, and it is only for
> > > > > > uncommon use cases when the NB doesn't have the knowledge of
> > > > > > a MAC address of a VIF.
> > > > >
> > > > > This is not that uncommon in telco use cases where VNFs can send
> > > > > packets from mac addresses unknown to OVN.
> > > > >
> > > > Understand, but VNFs contributes a very small portion of the
> > > > workloads, right? Maybe I should rephrase that: it is uncommon to
> > > > have "unknown" addresses for the majority of ports in a large scale
> > > > cloud. Is this understanding correct?
> > >
> > > I can only share numbers for our use case: with ~650 chassis we have
> > > the following distribution of "unknown" in the `addresses` field of
> > > Logical_Switch_Port:
> > > * 23000 with a mac address + ip and without "unknown"
> > > * 250 with a mac address + ip and with "unknown"
> > > * 30 with just "unknown"
> > >
> > > The usecase is a generic public cloud and we do not have any telco
> > > related things.
> >
> > I don't have any numbers from telco deployments at hand but I will poke
> > around.
> >
> > > > > > The purpose of this proposal is clear - to avoid using a central
> > > > > > table in DB for L2 information but instead using L2 MAC learning
> > > > > > to populate such information on chassis, which is a reasonable
> > > > > > alternative with pros and cons.
> > > > > > However, I don't think it is necessary to use separate OVS
> > > > > > bridges for this purpose. L2 MAC learning can be easily
> > > > > > implemented in the br-int bridge with OVS flows, which is much
> > > > > > simpler than managing dynamic number of OVS bridges just for the
> > > > > > purpose of using the builtin OVS mac-learning.
> > > > >
> > > > > I agree that this could also be implemented with VLAN tags on the
> > > > > appropriate ports. But since OVS does not support trunk ports, it
> > > > > may require complicated OF pipelines. My intent with this idea was
> > > > > two fold:
> > > > >
> > > > > 1) Avoid a central point of failure for mac learning/aging.
> > > > > 2) Simplify the OF pipeline by making all FDB operations dynamic.
> > > >
> > > > IM

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-03 Thread Robin Jarry via discuss
Frode Nordahl, Oct 02, 2023 at 10:05:
> > I must admit I don't have enough field experience operating large
> > network fabrics to state what issues multicast can cause with these.
> > This is why I raised this in the cons list :)
> >
> > What specific issues did you have in mind?
>
> It has been a while since I was overseeing the operations of large
> metro networks, but I have vivid memories of multicast routing being a
> recurring issue. A fabric supporting 10k computes would most likely
> not be one large L2, there would be L3 routing involved and as a
> consequence your proposal imposes configuration and scale requirements
> on the fabric.
>
> Datapoints that suggest other people see this as an issue too can be
> found in the fact that popular top of rack vendors have chosen control
> plane based MAC learning for their EVPN implementations (RFC 7432).
> There are also multiple papers discussing the scaling issues of
> Multicast.

Thanks, I will try to educate myself better about this :)



Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-03 Thread Robin Jarry via discuss
Hi all,

Felix Huettner, Oct 02, 2023 at 09:35:
> Hi everyone,
>
> just want to add my experience below
> On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> > On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry  wrote:
> > >
> > > Hi Han,
> > >
> > > Please see my comments/questions inline.
> > >
> > > Han Zhou, Sep 30, 2023 at 21:59:
> > > > > Distributed mac learning
> > > > > 
> > > > >
> > > > > Use one OVS bridge per logical switch with mac learning
> > > > > enabled. Only create the bridge if the logical switch has
> > > > > a port bound to the local chassis.
> > > > >
> > > > > Pros:
> > > > >
> > > > > - Minimal openflow rules required in each bridge (ACLs and NAT
> > > > >   mostly).
> > > > > - No central mac binding table required.
> > > >
> > > > Firstly to clarify the terminology of "mac binding" to avoid
> > > > confusion, the mac_binding table currently in SB DB has nothing
> > > > to do with L2 MAC learning. It is actually the ARP/Neighbor
> > > > table of distributed logical routers. We should probably call it
> > > > IP_MAC_binding table, or just Neighbor table.
> > >
> > > Yes sorry about the confusion. I actually meant the FDB table.
> > >
> > > > Here what you mean is actually L2 MAC learning, which today is
> > > > implemented by the FDB table in SB DB, and it is only for
> > > > uncommon use cases when the NB doesn't have the knowledge of
> > > > a MAC address of a VIF.
> > >
> > > This is not that uncommon in telco use cases where VNFs can send
> > > packets from mac addresses unknown to OVN.
> > >
> > Understand, but VNFs contributes a very small portion of the
> > workloads, right? Maybe I should rephrase that: it is uncommon to
> > have "unknown" addresses for the majority of ports in a large scale
> > cloud. Is this understanding correct?
>
> I can only share numbers for our use case: with ~650 chassis we have the
> following distribution of "unknown" in the `addresses` field of
> Logical_Switch_Port:
> * 23000 with a mac address + ip and without "unknown"
> * 250 with a mac address + ip and with "unknown"
> * 30 with just "unknown"
>
> The usecase is a generic public cloud and we do not have any telco
> related things.

I don't have any numbers from telco deployments at hand but I will poke
around.
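
When I do, something along these lines should extract comparable
numbers from the NB DB (a rough, untested sketch):

```
# ports that have "unknown" anywhere in their addresses column
ovn-nbctl --bare --columns=addresses list Logical_Switch_Port | grep -c unknown
# ports whose addresses column is exactly "unknown"
ovn-nbctl --bare --columns=addresses list Logical_Switch_Port | grep -cx unknown
```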

> > > > The purpose of this proposal is clear - to avoid using a central
> > > > table in DB for L2 information but instead using L2 MAC learning
> > > > to populate such information on chassis, which is a reasonable
> > > > alternative with pros and cons.
> > > > However, I don't think it is necessary to use separate OVS
> > > > bridges for this purpose. L2 MAC learning can be easily
> > > > implemented in the br-int bridge with OVS flows, which is much
> > > > simpler than managing dynamic number of OVS bridges just for the
> > > > purpose of using the builtin OVS mac-learning.
> > >
> > > I agree that this could also be implemented with VLAN tags on the
> > > appropriate ports. But since OVS does not support trunk ports, it
> > > may require complicated OF pipelines. My intent with this idea was
> > > two fold:
> > >
> > > 1) Avoid a central point of failure for mac learning/aging.
> > > 2) Simplify the OF pipeline by making all FDB operations dynamic.
> >
> > IMHO, the L2 pipeline is not really complex. It is probably the
> > simplest part (compared with other features for L3, NAT, ACL, LB,
> > etc.). Adding dynamic learning to this part probably makes it *a
> > little* more complex, but should still be straightforward. We don't
> > need any VLAN tag because the incoming packet has geneve VNI in the
> > metadata. We just need a flow that resubmits to lookup
> > a MAC-tunnelSrc mapping table, and inject a new flow (with related
> > tunnel endpoint information) if the src MAC is not found, with the
> > help of the "learn" action. The entries are per-logical_switch
> > (VNI). This would serve your purpose of avoiding a central DB for
> > L2. At least this looks much simpler to me than managing dynamic
> > number of OVS bridges and the patch pairs between them.
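
If I understand the idea correctly, such learned flows would look
roughly like the sketch below (table numbers, timeouts and the single
flow-based tunnel port are my own assumptions, untested):

```
# packets arriving over the tunnel: remember "eth_dst -> remote chassis
# IP", scoped per logical switch via the tunnel key, then continue
table=30, priority=100, in_port=<tunnel-port>,
    actions=learn(table=31, hard_timeout=300,
                  NXM_NX_TUN_ID[],
                  NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],
                  load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[],
                  output:NXM_OF_IN_PORT[]),
            resubmit(,31)
# table 31 holds the learned unicast entries; fall back to flooding the
# logical switch (local ports + all tunnels) when the MAC is unknown
table=31, priority=0, actions=<flood logical switch>
```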

Would that work for non-GENEVE networks (localnet) where there is no
VNI? Does the same approach apply there as well?


> >
> > >
> > > > Now back to the distributed MAC learning idea itself.
> > > > Essentially for two VMs/pods to communicate on L2, say,
> > > > VM1@Chassis1 needs to send a packet to VM2@chassis2, assuming
> > > > VM1 already has VM2's MAC address (we will discuss this later),
> > > > Chassis1 needs to know that VM2's MAC is located on Chassis2.
> > > >
> > > > In OVN today this information is conveyed through:
> > > >
> > > > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > > > - LSP and Chassis mapping (Chassis -> SB -> Chassis)
> > > >
> > > > In your proposal:
> > > >
> > > > - MAC and Chassis mapping (can be learned through initial L2
> > > >   broadcast/flood)
> > > >
> > > > This indeed would avoid the control plane cost through the
> > > > centralized components (for this L2 binding 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-10-02 Thread Robin Jarry via discuss
Han Zhou, Oct 01, 2023 at 21:30:
> Please note that tunnels are needed not only between nodes related to same
> logical switches, but also when they are related to different logical
> switches connected by logical routers (even multiple LR+LS hops away).

Yep.

> To clarify a little more, openstack deployment can have different logical
> topologies. So to evaluate the impact of monitor_all settings there should
> be different test cases to capture different types of deployment, e.g.
> full-mesh topology (monitor_all=true is better) vs. "small islands"
> topology (monitor_all=false is reasonable).

This is one thing to note for the recent ovn-heater work that adds
openstack test cases.

> FDB and MAC_binding tables are used by ovn-controllers. They are
> essentially the central storage for MAC tables of the distributed logical
> switches (FDB) and ARP/Neighbour tables for distributed logical routers
> (MAC_binding). A record can be populated by one chassis and consumed by many
> other chassis.
>
> monitor_all should work the same way for these tables: if monitor_all =
> false, only rows related to "local datapaths" should be downloaded to the
> chassis. However, for FDB table, the condition is not set for now (which
> may have been a miss in the initial implementation). Perhaps this is not
> noticed because MAC learning is not a very widely used feature and no scale
> impact noticed, but I just proposed a patch to enable the conditional
> monitoring:
> https://patchwork.ozlabs.org/project/ovn/patch/20231001192658.1012806-1-hz...@ovn.org/

Ok thanks!



Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-01 Thread Robin Jarry via discuss
Hi Han,

Please see my comments/questions inline.

Han Zhou, Sep 30, 2023 at 21:59:
> > Distributed mac learning
> > 
> >
> > Use one OVS bridge per logical switch with mac learning enabled. Only
> > create the bridge if the logical switch has a port bound to the local
> > chassis.
> >
> > Pros:
> >
> > - Minimal openflow rules required in each bridge (ACLs and NAT mostly).
> > - No central mac binding table required.
>
> Firstly to clarify the terminology of "mac binding" to avoid confusion, the
> mac_binding table currently in SB DB has nothing to do with L2 MAC
> learning. It is actually the ARP/Neighbor table of distributed logical
> routers. We should probably call it IP_MAC_binding table, or just Neighbor
> table.

Yes sorry about the confusion. I actually meant the FDB table.

> Here what you mean is actually L2 MAC learning, which today is implemented
> by the FDB table in SB DB, and it is only for uncommon use cases when the
> NB doesn't have the knowledge of a MAC address of a VIF.

This is not that uncommon in telco use cases where VNFs can send packets
from mac addresses unknown to OVN.

> The purpose of this proposal is clear - to avoid using a central table in
> DB for L2 information but instead using L2 MAC learning to populate such
> information on chassis, which is a reasonable alternative with pros and
> cons.
> However, I don't think it is necessary to use separate OVS bridges for this
> purpose. L2 MAC learning can be easily implemented in the br-int bridge
> with OVS flows, which is much simpler than managing dynamic number of OVS
> bridges just for the purpose of using the builtin OVS mac-learning.

I agree that this could also be implemented with VLAN tags on the
appropriate ports. But since OVS does not support trunk ports, it may
require complicated OF pipelines. My intent with this idea was twofold:

1) Avoid a central point of failure for mac learning/aging.
2) Simplify the OF pipeline by making all FDB operations dynamic.

> Now back to the distributed MAC learning idea itself. Essentially for two
> VMs/pods to communicate on L2, say, VM1@Chassis1 needs to send a packet to
> VM2@chassis2, assuming VM1 already has VM2's MAC address (we will discuss
> this later), Chassis1 needs to know that VM2's MAC is located on Chassis2.
>
> In OVN today this information is conveyed through:
>
> - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> - LSP and Chassis mapping (Chassis -> SB -> Chassis)
>
> In your proposal:
>
> - MAC and Chassis mapping (can be learned through initial L2
>   broadcast/flood)
>
> This indeed would avoid the control plane cost through the centralized
> components (for this L2 binding part). Given that today's SB OVSDB is a
> bottleneck, this idea may sound attractive. But please also take into
> consideration the below improvement that could mitigate the OVN central
> scale issue:
>
> - For MAC and LSP mapping, northd is now capable of incrementally
>   processing VIF related L2/L3 changes, so the cost of NB -> northd ->
>   SB is very small. For SB -> Chassis, a more scalable DB deployment,
>   such as the OVSDB relays, may largely help.

But using relays will only help with read-only operations (SB ->
chassis). Write operations (from dynamically learned mac addresses)
would still have to go through the central servers.
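
For reference, a relay is (roughly) just another ovsdb-server pointed
at the main cluster, something like the sketch below (address
invented). It fans out read/monitor traffic but forwards write
transactions to the backing servers, so dynamically learned MACs would
still hit the central DB:

```
# rough sketch: run a southbound relay on a dedicated node
ovsdb-server --remote=ptcp:6642 relay:OVN_Southbound:tcp:192.0.2.10:6642
```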

> - For LSP and Chassis mapping, the round trip through a central DB
>   obviously costs higher than a direct L2 broadcast (the targets are
>   the same). But this can be optimized if the MAC and Chassis is known
>   by the CMS system (which is true for most openstack/k8s env
>   I believe). Instead of updating the binding from each Chassis, CMS
>   can tell this information through the same NB -> northd -> SB ->
>   Chassis path, and the Chassis can just read the SB without updating
>   it.

This is only applicable for known L2 addresses.  Maybe telco use cases
are very specific, but being able to have ports that send packets from
unknown addresses is a strong requirement.

> On the other hand, the dynamic MAC learning approach has its own drawbacks.
>
> - It is simple to consider L2 only, but if considering more SDB
>   features, a central DB is more flexible to extend and implement new
>   features than a network protocol based approach.
> - It is more predictable and easier to debug with pre-populated
>   information through CMS than states learned dynamically in
>   data-plane.
> - With the DB approach we can suppress most of L2 broadcast/flood,
>   while with the distributed MAC learning broadcast/flood can't be
>   avoided. Although it may happen mostly when a new workload is
>   launched, it can also happen when aging. The cost of broadcast in
>   large L2 is also a potential threat to scale.

I may lack the field experience of operating large datacenter networks
but I was not aware of any scaling issues because of ARP and/or other L2
broadcasts.  Is this an actual problem that was reported by cloud/telco
operators and which influenced the 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-10-01 Thread Robin Jarry via discuss
Hi Han,

thanks a lot for your detailed answer.

Han Zhou, Sep 30, 2023 at 01:03:
> > I think ovn-controller only consumes the logical flows. The chassis and
> > port bindings tables are used by northd to update these logical flows.
>
> Felix was right. For example, port-binding is firstly a configuration from
> north-bound, but the states such as its physical location (the chassis
> column) are populated by ovn-controller of the owning chassis and consumed
> by other ovn-controllers that are interested in that port-binding.

I was not aware of this. Thanks.

> > Exactly, but was the signaling between the nodes ever an issue?
>
> I am not an expert of BGP, but at least for what I am aware of, there are
> scaling issues in things like BGP full mesh signaling, and there are
> solutions such as route reflector (which is again centralized) to solve
> such issues.

I am not familiar with BGP full mesh signaling. But from what I can
tell, it looks like the same concept as the full mesh of GENEVE tunnels,
except that the tunnels are only used when the same logical switch is
implemented on two nodes.

> > So you have enabled monitor_all=true as well? Or did you test at scale
> > with monitor_all=false?
> >
> We do use monitor_all=false, primarily to reduce memory footprint (and also
> CPU cost of IDL processing) on each chassis. There are trade-offs to the SB
> DB server performance:
>
> - On one hand it increases the cost of conditional monitoring, which
>   is expensive for sure
> - On the other hand, it reduces the total amount of data for the
>   server to propagate to clients
>
> It really depends on your topology for making the choice. If most of the
> nodes would anyway monitor most of the DB data (something similar to a
> full-mesh), it is more reasonable to use monitor_all=true. Otherwise, in
> topology like ovn-kubernetes where each node has its dedicated part of the
> data, or in topologies where you have lots of small "island" such as a
> cloud with many small tenants that never talks to each other, using
> monitor_all=false could make sense (but still need to be carefully
> evaluated and tested for your own use cases).

I haven't seen recent scale testing for openstack, but in past testing
we had to set monitor_all=true because the CPU usage of the SB
ovsdb-server was a bottleneck.

> > The memory usage would be reduced but I don't know to which point. One
> > of the main consumers is the logical flows table which is required
> > everywhere. Unless there is a way to only sync a portion of this table
> > depending on the chassis, disabling monitor_all would save syncing the
> > unneeded tables for ovn-controller: chassis, port bindings, etc.
>
> Probably it wasn't what you meant, but I'd like to clarify that it is not
> about unneeded tables, but unneeded rows in those tables (mainly
> logical_flow and port_binding).
> It indeed syncs only a portion of the tables. It is not depending directly
> on chassis, but depending on what port-bindings are on the chassis and what
> logical connectivity those port-bindings have. So, again, the choice really
> depends on your use cases.

What about the FDB (mac-port) and MAC_Binding (ip-mac) tables? I thought
ovn-controller did not need them, and that by default these whole tables
(not only some of their rows) were excluded from the synchronized data.
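
For what it's worth, the size of both tables is easy to check on a live
deployment, which could help quantify how much data is at stake (rough
sketch):

```
# number of learned L2 entries (FDB) and router neighbour entries
# (MAC_Binding) currently stored in the southbound DB
ovn-sbctl --columns=_uuid list FDB | grep -c _uuid
ovn-sbctl --columns=_uuid list MAC_Binding | grep -c _uuid
```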

Thanks!



Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-30 Thread Robin Jarry via discuss
Hi Vladislav, Frode,

Thanks for your replies.

Frode Nordahl, Sep 30, 2023 at 10:55:
> On Sat, Sep 30, 2023 at 9:43 AM Vladislav Odintsov via discuss
>  wrote:
> > > On 29 Sep 2023, at 18:14, Robin Jarry via discuss 
> > >  wrote:
> > >
> > > Felix Huettner, Sep 29, 2023 at 15:23:
> > >>> Distributed mac learning
> > >>> 
> > > [snip]
> > >>>
> > >>> Cons:
> > >>>
> > >>> - How to manage seamless upgrades?
> > >>> - Requires ovn-controller to move/plug ports in the correct bridge.
> > >>> - Multiple openflow connections (one per managed bridge).
> > >>> - Requires ovn-trace to be reimplemented differently (maybe other tools
> > >>>  as well).
> > >>
> > >> - No central information anymore on mac bindings. All nodes need to
> > >>  update their data individually
> > >> - Each bridge generates also a linux network interface. I do not know if
> > >>  there is some kind of limit to the linux interfaces or the ovs bridges
> > >>  somewhere.
> > >
> > > That's a good point. However, only the bridges related to one
> > > implemented logical network would need to be created on a single
> > > chassis. Even with the largest OVN deployments, I doubt this would be
> > > a limitation.
> > >
> > >> Would you still preprovision static mac addresses on the bridge for all
> > >> port_bindings we know the mac address from, or would you rather leave
> > >> that up for learning as well?
> > >
> > > I would leave everything dynamic.
> > >
> > >> I do not know if there is some kind of performance/optimization penalty
> > >> for moving packets between different bridges.
> > >
> > > As far as I know, once the openflow pipeline has been resolved into
> > > a datapath flow, there is no penalty.
> > >
> > >> You can also not only use the logical switch that have a local port
> > >> bound. Assume the following topology:
> > >> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> > >> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> > >> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> > >> vm1 and vm2 are both running on the same hypervisor. Creating only local
> > >> logical switches would mean only ls1 and ls3 are available on that
> > >> hypervisor. This would break the connection between the two vms which
> > >> would in the current implementation just traverse the two logical
> > >> routers.
> > >> I guess we would need to create bridges for each locally reachable
> > >> logical switch. I am concerned about the potentially significant
> > >> increase in bridges and openflow connections this brings.
> > >
> > > That is one of the concerns I raised in the last point. In my opinion
> > > this is a trade off. You remove centralization and require more local
> > > processing. But overall, the processing cost should remain equivalent.
> >
> > Just want to clarify.
> >
> > For topology described by Felix above, you propose to create 2 OVS
> > bridges, right? How will the packet traverse from vm1 to vm2?

In this particular case, there would be 3 OVS bridges, one for each
logical switch.

> > Currently when the packet enters OVS all the logical switching and
> > routing openflow calculation is done with no packet re-entering OVS,
> > and this results in one DP flow match to deliver this packet from
> > vm1 to vm2 (if no conntrack used, which could introduce
> > recirculations).
> >
> > Do I understand correctly, that in this proposal OVS needs to
> > receive packet from “ls1” bridge, next run through lrouter “lr1”
> > OpenFlow pipelines, then output packet to “ls2” OVS bridge for mac
> > learning between logical routers (should we have here OF flow with
> > learn action?), then send packet again to OVS, calculate “lr2”
> > OpenFlow pipeline and finally reach destination OVS bridge “ls3” to
> > send packet to a vm2?

What I am proposing is to implement the northbound L2 network intent
with actual OVS bridges and builtin OVS mac learning. The L3/L4 network
constructs and ACLs would require patch ports and specific OF pipelines.

We could even think of adding more advanced L3 capabilities (RIB) into
OVS to simplify the OF pipelines.
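
To make this more concrete, here is roughly what ovn-controller could
provision on the hypervisor for the vm1-ls1-lr1-ls2-lr2-ls3-vm2 example
above (bridge/port names invented, router OF pipelines omitted):

```
# one OVS bridge per locally implemented logical switch
for ls in ls1 ls2 ls3; do
    ovs-vsctl -- --may-exist add-br "br-$ls"
done
# logical routers become patch port pairs between the switch bridges,
# with their L3/ACL OpenFlow pipelines attached at those ports
ovs-vsctl add-port br-ls1 ls1-lr1 \
    -- set interface ls1-lr1 type=patch options:peer=lr1-ls2
ovs-vsctl add-port br-ls2 lr1-ls2 \
    -- set interface lr1-ls2 type=patch options:peer=ls1-lr1
# same pattern for lr2 between br-ls2 and br-ls3; the vm1/vm2 taps are
# plugged into br-ls1 and br-ls3 respectively
```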

> > Also, will such behavior be compatible with HW-offload-capable to
> > smartnics/DPUs?
>
> I am also a bit concerned about this, what woul

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-29 Thread Robin Jarry via discuss
Felix Huettner, Sep 29, 2023 at 15:23:
> > Distributed mac learning
> > 
[snip]
> >
> > Cons:
> >
> > - How to manage seamless upgrades?
> > - Requires ovn-controller to move/plug ports in the correct bridge.
> > - Multiple openflow connections (one per managed bridge).
> > - Requires ovn-trace to be reimplemented differently (maybe other tools
> >   as well).
>
> - No central information anymore on mac bindings. All nodes need to
>   update their data individually
> - Each bridge generates also a linux network interface. I do not know if
>   there is some kind of limit to the linux interfaces or the ovs bridges
>   somewhere.

That's a good point. However, a chassis only needs to create bridges for
the logical switches that are actually implemented locally. Even with
the largest OVN deployments, I doubt this would be a limitation.

> Would you still preprovision static mac addresses on the bridge for all
> port_bindings we know the mac address from, or would you rather leave
> that up for learning as well?

I would leave everything dynamic.

> I do not know if there is some kind of performance/optimization penalty
> for moving packets between different bridges.

As far as I know, once the openflow pipeline has been resolved into
a datapath flow, there is no penalty.

> You can also not only use the logical switch that have a local port
> bound. Assume the following topology:
> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> vm1 and vm2 are both running on the same hypervisor. Creating only local
> logical switches would mean only ls1 and ls3 are available on that
> hypervisor. This would break the connection between the two vms which
> would in the current implementation just traverse the two logical
> routers.
> I guess we would need to create bridges for each locally reachable
> logical switch. I am concerned about the potentially significant
> increase in bridges and openflow connections this brings.

That is one of the concerns I raised in the last point. In my opinion
this is a trade-off: you remove centralization and require more local
processing, but overall the processing cost should remain equivalent.

> > Use multicast for overlay networks
> > ==
[snip]
> > - 24bit VNI allows for more than 16 million logical switches. No need
> >   for extended GENEVE tunnel options.
> Note that using vxlan at the moment significantly reduces the ovn
> featureset. This is because the geneve header options are currently used
> for data that would not fit into the vxlan vni.
>
> From ovn-architecture.7.xml:
> ```
> The maximum number of networks is reduced to 4096.
> The maximum number of ports per network is reduced to 2048.
> ACLs matching against logical ingress port identifiers are not supported.
> OVN interconnection feature is not supported.
> ```

In my understanding, the main reason why GENEVE replaced VXLAN is that
Openstack uses full mesh point-to-point tunnels and the sender needs to
know behind which chassis every mac address resides in order to send it
into the correct tunnel. GENEVE allowed reducing the lookup time both on
the sender and the receiver thanks to ingress/egress port metadata.

https://blog.russellbryant.net/2017/05/30/ovn-geneve-vs-vxlan-does-it-matter/
https://dani.foroselectronica.es/ovn-geneve-encapsulation-541/

If VXLAN + multicast and address learning were used, the "correct"
tunnel would be established ad-hoc and both sender and receiver lookups
would be reduced to simple mac forwarding with learning. The ingress
pipeline would probably cost a little more.

Maybe multicast + address learning could be implemented for GENEVE as
well. But it would not be interoperable with other VTEPs.

> > - Limited and scoped "flooding" with IGMP/MLD snooping enabled in
> >   top-of-rack switches. Multicast is only used for BUM traffic.
> > - Only one VXLAN output port per implemented logical switch on a given
> >   chassis.
>
> Would this actually work with one VXLAN output port? Would you not need
> one port per target node to send unicast traffic (as you otherwise flood
> all packets to all participating nodes)?

You would need one VXLAN output port per implemented logical switch on
a given chassis. The port would have a VNI (unique per logical switch)
and an associated multicast IP address. Any chassis that implements this
logical switch would subscribe to that multicast group. The flooding
would be limited to first packets and broadcast/multicast traffic (ARP
requests, mostly). Once the receiver node replies, all communication
would happen over unicast.

https://networkdirection.net/articles/routingandswitching/vxlanoverview/vxlanaddresslearning/#BUM_Traffic

> > Cons:
> >
> > - OVS does not support VXLAN address learning yet.
> > - The number of usable multicast groups in a fabric network may be
> >   limited?
> > - How to manage seamless upgrades and 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-09-29 Thread Robin Jarry via discuss
Hi Felix,

Thanks a lot for your message.

Felix Huettner, Sep 29, 2023 at 14:35:
> I can get that when running 10k ovn-controllers the benefits of
> optimizing cpu and memory load are quite significant. However i am
> unsure about reducing the footprint of ovn-northd.
> When running so many nodes i would have assumed that having an
> additional (or maybe two) dedicated machines for ovn-northd would
> be completely acceptable, as long as it can still actually do what
> it should in a reasonable timeframe.
> Would the goal for ovn-northd be more like "Reduce the full/incremental
> recompute time" then?

The main goal of this thread is to get a consensus on the actual issues
that prevent scaling at the moment. We can discuss solutions in the
other thread.

> > * Allow support for alternative datapath implementations.
>
> Does this mean ovs datapths (e.g. dpdk) or something different?

See the other thread.

> > Southbound Design
> > =
...
> Note that also ovn-controller consumes the "state" of other chassis to
> e.g build the tunnels to other chassis. To visualize my understanding
>
> +----------------+---------------+------------+
> |                | configuration |   state    |
> +----------------+---------------+------------+
> |   ovn-northd   |  write-only   | read-only  |
> +----------------+---------------+------------+
> | ovn-controller |   read-only   | read-write |
> +----------------+---------------+------------+
> |    some cms    |  no access?   | read-only  |
> +----------------+---------------+------------+

I think ovn-controller only consumes the logical flows. The chassis and
port bindings tables are used by northd to update these logical flows.

> > Centralized decisions
> > =
> >
> > Every chassis needs to be "aware" of all other chassis in the cluster.
>
> I think we need to accept this as fundamental truth. Indepentent if you
> look at centralized designs like ovn or the neutron-l2 implementation
> or if you look at decentralized designs like bgp or spanning tree. In
> all cases if we need some kind of organized communication we need to
> know all relevant peers.
> Designs might diverge if you need to be "aware" of all peers or just
> some of them, but that is just a tradeoff between data size and options
> you have to forward data.
>
> > This requirement mainly comes from overlay networks that are implemented
> > over a full-mesh of point-to-point GENEVE tunnels (or VXLAN with some
> > limitations). It is not a scaling issue by itself, but it implies
> > a centralized decision which in turn puts pressure on the central node
> > at scale.
>
> +1. On the other hand it removes signaling needs between the nodes (like
> you would have with bgp).

Exactly, but was the signaling between the nodes ever an issue?

> > Due to ovsdb monitoring and caching, any change in the southbound DB
> > (either by northd or by any of the chassis controllers) is replicated on
> > every chassis. The monitor_all option is often enabled on large clusters
> > to avoid the conditional monitoring CPU cost on the central node.
>
> This is, i guess, something that should be possible to fix. We have also
> enabled this setting as it gave us stability improvements and we do not
> yet see performance issues with it

So you have enabled monitor_all=true as well? Or did you test at scale
with monitor_all=false?

What I am saying is that without monitor_all=true, the southbound
ovsdb-server needs to do checks to determine what updates to send to
which client. Since the server is single-threaded, it becomes an issue
at scale. I know that there were some significant improvements made
recently but they will only push the limit further. I don't have hard
data to prove my point yet, unfortunately.
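
(For clarity, this is the per-chassis knob I am referring to;
ovn-controller reads it from the local Open_vSwitch table:)

```
# request full SB replication instead of conditional monitoring
ovs-vsctl set open_vswitch . external_ids:ovn-monitor-all=true
```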

> > This leads to high memory usage on all chassis, control plane traffic
> > and possible disruptions in the ovs-vswitchd datapath flow cache.
> > Unfortunately, I don't have any hard data to back this claim. This is
> > mainly coming from discussions I had with neutron contributors and from
> > brainstorming sessions with colleagues.
>
> Could you maybe elaborate on the datapath flow cache issue, as it sounds
> like it might affect actual live traffic and i am not aware of details
> there.

I may have had a wrong understanding of the mechanisms of OVS here.
I was under the impression that any update of the openflow rules would
invalidate all datapath flows. It is far more subtle than this [1].
So unless there is an actual change in the packet pipeline, live traffic
should not be affected.

[1] 
https://developers.redhat.com/articles/2022/10/19/open-vswitch-revalidator-process-explained

> The memory usage and the traffic would be fixed by not having to rely on
> monitor_all, right?

The memory usage would be reduced but I don't know to what extent. One
of the main consumers is the logical flows table which is required
everywhere. Unless there is a way to only sync a portion of this table
depending on the chassis, disabling monitor_all would save syncing the
unneeded tables for ovn-controller: chassis, port bindings, etc.

[ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-09-28 Thread Robin Jarry via discuss
Hello OVN community,

This is a follow-up on the message I sent earlier today [1]. This second
part focuses on some ideas I have to remove the limitations that were
mentioned in the previous email.

[1] 
https://mail.openvswitch.org/pipermail/ovs-discuss/2023-September/052695.html

If you didn't read it, my goal is to start a discussion about how we
could improve OVN on the following topics:

- Reduce the memory and CPU footprint of ovn-controller, ovn-northd.
- Support scaling of L2 connectivity across larger clusters.
- Simplify CMS interoperability.
- Allow support for alternative datapath implementations.

Disclaimer:

This message does not mention anything about L3/L4 features of OVN.
I didn't have time to work on these, yet. I hope we can discuss how
these fit with my ideas.

Distributed mac learning
========================

Use one OVS bridge per logical switch with mac learning enabled. Only
create the bridge if the logical switch has a port bound to the local
chassis.
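
As a rough illustration (bridge and interface names made up), this is
what ovn-controller would do when the first port of a logical switch is
bound locally:

```
# dedicated bridge for logical switch "ls1", relying on the normal OVS
# mac learning and aging mechanisms
ovs-vsctl -- --may-exist add-br br-ls1 \
    -- set bridge br-ls1 other-config:mac-aging-time=300 \
    -- set bridge br-ls1 other-config:mac-table-size=8192
# plug the local VIF into that bridge instead of br-int
ovs-vsctl add-port br-ls1 tap-vm1
```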

Pros:

- Minimal openflow rules required in each bridge (ACLs and NAT mostly).
- No central mac binding table required.
- Mac table aging comes for free.
- Zero access to the southbound DB for learned addresses or for aging.

Cons:

- How to manage seamless upgrades?
- Requires ovn-controller to move/plug ports in the correct bridge.
- Multiple openflow connections (one per managed bridge).
- Requires ovn-trace to be reimplemented differently (maybe other tools
  as well).

Use multicast for overlay networks
==================================

Use a unique 24bit VNI per overlay network. Derive a multicast group
address from that VNI. Use VXLAN address learning [2] to remove the need
for ovn-controller to know the destination chassis for every mac address
in advance.

[2] https://datatracker.ietf.org/doc/html/rfc7348#section-4.2
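
For reference, this is essentially the model that the Linux kernel
VXLAN device already implements with a multicast group (sketch with
made-up values); the proposal would be for OVS to support the same
natively:

```
# VNI 1234 mapped to a multicast group derived from it (239.0.0.0 + VNI)
ip link add vxlan1234 type vxlan id 1234 group 239.0.4.210 \
    dstport 4789 dev eth0
```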

Pros:

- Nodes do not need to know about others in advance. The control plane
  load is distributed across the cluster.
- 24bit VNI allows for more than 16 million logical switches. No need
  for extended GENEVE tunnel options.
- Limited and scoped "flooding" with IGMP/MLD snooping enabled in
  top-of-rack switches. Multicast is only used for BUM traffic.
- Only one VXLAN output port per implemented logical switch on a given
  chassis.

Cons:

- OVS does not support VXLAN address learning yet.
- The number of usable multicast groups in a fabric network may be
  limited?
- How to manage seamless upgrades and interoperability with older OVN
  versions?

Connect ovn-controller to the northbound DB
===========================================

This idea extends on a previous proposal to migrate the logical flows
creation in ovn-controller [3].

[3] 
https://patchwork.ozlabs.org/project/ovn/patch/20210625233130.3347463-1-num...@ovn.org/

If the first two proposals are implemented, the southbound database can
be removed from the picture. ovn-controller can directly translate the
northbound schema into OVS configuration bridges, ports and flow rules.

For other components that require access to the southbound DB (e.g.
neutron metadata agent), ovn-controller should provide an interface to
expose state and configuration data for local consumption.

All state information present in the NB DB should be moved to a separate
state database [4] for CMS consumption.

[4] https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403675.html

For those who like visuals, I have started working on basic use cases
and how they would be implemented without a southbound database [5].

[5] https://link.excalidraw.com/p/readonly/jwZgJlPe4zhGF8lE5yY3

Pros:

- The northbound DB is smaller by design: reduced network bandwidth and
  memory usage in all chassis.
- If we keep the northbound read-only for ovn-controller, it removes
  scaling issues when one controller updates one row that needs to be
  replicated everywhere.
- The northbound schema knows nothing about flows. We could introduce
  alternative dataplane backends configured by ovn-controller via
  plugins. I have done a minimal PoC to check if it could work with the
  linux network stack [6].

[6] https://github.com/rjarry/ovn-nb-agent/blob/main/backend/linux/bridge.go

Cons:

- This would be a serious API breakage for systems that depend on the
  southbound DB.
- Can all OVN constructs be implemented without a southbound DB?
- Is the community interested in alternative datapaths?

Closing thoughts
================

I mainly focused on OpenStack use cases for now, but I think these
propositions could benefit Kubernetes as well.

I hope I didn't bore everyone to death. Let me know what you think.

Cheers!

-- 
Robin Jarry
Red Hat, Telco/NFV



[ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-09-28 Thread Robin Jarry via discuss
Hello OVN community,

I'm glad the subject of this message has caught your attention :-)

I would like to start a discussion about how we could improve OVN on the
following topics:

* Reduce the memory and CPU footprint of ovn-controller, ovn-northd.
* Support scaling of L2 connectivity across larger clusters.
* Simplify CMS interoperability.
* Allow support for alternative datapath implementations.

This first email will focus on the current issues that (in my view) are
preventing OVN from scaling L2 networks on larger clusters. I will send
another message with some change proposals to remove or fix these
issues.

Disclaimer:

I am fairly new to this project and my perception and understanding may
be incorrect in some aspects. Please forgive me in advance if I use the
wrong terms and/or make invalid statements. My intent is only to make
things better and not to put the blame on anyone for the current design
choices.

Southbound Design
=================

In the current architecture, both databases contain a mix of state and
configuration. While this does not seem to cause any scaling issues for
the northbound DB, it can become a bottleneck for the southbound with
large numbers of chassis and logical network constructs.

The southbound database contains a mix of configuration (logical flows
transformed from the logical network topology) and state (chassis, port
bindings, mac bindings, FDB entries, etc.).

The "configuration" part is consumed by ovn-controller to implement the
network on every chassis and the "state" part is consumed by ovn-northd
to update the northbound "state" entries and to update logical flows.
Some CMS's [1] also depend on the southbound "state" in order to
function properly.

[1] 
https://opendev.org/openstack/neutron/src/tag/22.0.0/neutron/agent/ovn/metadata/ovsdb.py#L39-L40

Centralized decisions
=====================

Every chassis needs to be "aware" of all other chassis in the cluster.
This requirement mainly comes from overlay networks that are implemented
over a full-mesh of point-to-point GENEVE tunnels (or VXLAN with some
limitations). It is not a scaling issue by itself, but it implies
a centralized decision which in turn puts pressure on the central node
at scale.
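
Concretely, ovn-controller maintains one tunnel port per remote chassis
on the integration bridge, roughly equivalent to the following (names
and addresses invented), which means ~10k such ports per chassis at the
scale discussed here:

```
# one entry like this per remote chassis
ovs-vsctl add-port br-int ovn-chassis2-0 -- set interface ovn-chassis2-0 \
    type=geneve options:remote_ip=192.0.2.12 options:key=flow
```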

Due to ovsdb monitoring and caching, any change in the southbound DB
(either by northd or by any of the chassis controllers) is replicated on
every chassis. The monitor_all option is often enabled on large clusters
to avoid the conditional monitoring CPU cost on the central node.

This leads to high memory usage on all chassis, control plane traffic
and possible disruptions in the ovs-vswitchd datapath flow cache.
Unfortunately, I don't have any hard data to back this claim. This is
mainly coming from discussions I had with neutron contributors and from
brainstorming sessions with colleagues.

I hope that the current work on OVN heater to integrate openstack
support [2] will allow getting more insight.

[2] https://github.com/ovn-org/ovn-heater/pull/179

Dynamic mac learning
====================

Logical switch ports on a given chassis are all connected to the same
OVS bridge, in the same VLAN. This prevents using local mac address
learning and shifts the responsibility to a centralized ovn-northd,
which must create all the required logical flows to properly segment the
network.

When using ports with "unknown" addresses, centralized mac learning is
enabled: when a new address is seen entering a port, OVS sends it to the
local controller, which updates the FDB table and recomputes flow rules
accordingly. With logical switches spanning a large number of chassis,
this centralized mac address learning and aging can have an impact on
control plane and dataplane performance.
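
For example, this is how such a port is declared in the NB DB and where
the dynamically learned addresses end up (sketch, names invented):

```
# the CMS declares that the port may send from addresses OVN does not
# know about
ovn-nbctl lsp-set-addresses vnf-port1 "fa:16:3e:12:34:56 10.0.0.5" unknown
# addresses learned in the dataplane are stored centrally in the SB FDB
ovn-sbctl list FDB
```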

Closing thoughts
================

My understanding of the L3 and L4 capabilities of OVN is too limited to
discuss whether there are other issues that would prevent scaling to
thousands of nodes. My point was mainly focused on L2 network scaling.

I would love to get other opinions on these statements.

Cheers!

-- 
Robin Jarry
Red Hat, Telco/NFV
