Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - issues

2023-10-04 Thread Robin Jarry via discuss

Hi Mark,

Mark Michelson, Oct 03, 2023 at 23:09:

> Hi Robin,
>
> Thanks a bunch for putting these two emails together. I've read through
> them and the replies.
>
> I think there's one major issue: a lack of data.

That's my concern as well... The problem is, it is very hard to get
reliable and actionable data when it comes to that level of scale.
I have been trying to collect such data and put together realistic
scenarios, but have not succeeded so far.

> I think the four bullet points you listed below are admirable goals. The
> problem is that I think we're putting the cart before the horse with
> both the issues and proposals.
>
> In other words, before being able to properly evaluate these emails, we
> need to see a scenario that
>
> 1) Has clear goals for what scalability metrics are desired.
> 2) Shows evidence that these scalability goals are not being met.
> 3) Shows evidence that one or more of the issues listed in this email
> are the cause for the scalability issues in the scenario.
> 4) Shows evidence that the proposed changes would fix the scalability
> issues in the scenario.

I hope that the ongoing work on ovn-heater will help in that regard.

> I listed them in this order because without a failing scenario, we can't
> claim the scalability is poor. Then if we have a failing scenario, it's
> possible that the problem and solution are much simpler than any of the
> issues or proposals that have been brought up here. Then, it's also
> possible that maybe only a subset of the issues listed in this email are
> contributing to the failure. Even if the issues identified here are
> directly causing the scenario to fail, there may still be simpler
> solutions than what has been proposed. And finally, it's possible that
> the proposed solutions don't actually result in the expected scale increase.
>
> I want to make sure my tone is coming across clearly here. I don't think
> the current OVN architecture is perfect, and I don't want to be
> dismissive of the issues you've raised. If there are changes we can make
> to simplify OVN and scale better at the same time, I'm all for it. The
> problem is that, as you pointed out in your proposal email, most of
> these proposals result in difficulties for upgrades/downgrades, as well
> as code maintenance. Therefore, if we are going to do any of these, we
> need to first be certain that we aren't scaling as well as we would
> like, and that there are not simpler paths to reach our scalability targets.


I get your point and this is specifically why I did split the 
conversation in two. I did not want my proposals to be mixed up with 
the issues.


I will see if I can get hard data that can demonstrate what I claim.

Thanks!

___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-04 Thread Robin Jarry via discuss
Hi Felix,

Felix Huettner, Oct 04, 2023 at 09:24:
> Hi Robin,
>
> I'll try to answer what I can.
>
> On Tue, Oct 03, 2023 at 09:22:53AM +0200, Robin Jarry via discuss wrote:
> > Hi all,
> >
> > Felix Huettner, Oct 02, 2023 at 09:35:
> > > Hi everyone,
> > >
> > > just want to add my experience below
> > > On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> > > > On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry wrote:
> > > > >
> > > > > Hi Han,
> > > > >
> > > > > Please see my comments/questions inline.
> > > > >
> > > > > Han Zhou, Sep 30, 2023 at 21:59:
> > > > > > > Distributed mac learning
> > > > > > > 
> > > > > > >
> > > > > > > Use one OVS bridge per logical switch with mac learning
> > > > > > > enabled. Only create the bridge if the logical switch has
> > > > > > > a port bound to the local chassis.
> > > > > > >
> > > > > > > Pros:
> > > > > > >
> > > > > > > - Minimal openflow rules required in each bridge (ACLs and NAT
> > > > > > >   mostly).
> > > > > > > - No central mac binding table required.
> > > > > >
> > > > > > Firstly to clarify the terminology of "mac binding" to avoid
> > > > > > confusion, the mac_binding table currently in SB DB has nothing
> > > > > > to do with L2 MAC learning. It is actually the ARP/Neighbor
> > > > > > table of distributed logical routers. We should probably call it
> > > > > > IP_MAC_binding table, or just Neighbor table.
> > > > >
> > > > > Yes sorry about the confusion. I actually meant the FDB table.
> > > > >
> > > > > > Here what you mean is actually L2 MAC learning, which today is
> > > > > > implemented by the FDB table in SB DB, and it is only for
> > > > > > uncommon use cases when the NB doesn't have the knowledge of
> > > > > > a MAC address of a VIF.
> > > > >
> > > > > This is not that uncommon in telco use cases where VNFs can send
> > > > > packets from mac addresses unknown to OVN.
> > > > >
> > > > Understood, but VNFs contribute a very small portion of the
> > > > workloads, right? Maybe I should rephrase that: it is uncommon to
> > > > have "unknown" addresses for the majority of ports in a large scale
> > > > cloud. Is this understanding correct?
> > >
> > > I can only share numbers for our use case. With ~650 chassis, we have
> > > the following distribution of "unknown" in the `addresses` field of
> > > Logical_Switch_Port:
> > > * 23000 with a mac address + ip and without "unknown"
> > > * 250 with a mac address + ip and with "unknown"
> > > * 30 with just "unknown"
> > >
> > > The use case is a generic public cloud and we do not have any telco
> > > related things.
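
To put the counts above in proportion, here is a small Python sketch that
computes the share of ports carrying "unknown". The three counts come
straight from the list; the dictionary labels and variable names are only
illustrative shorthand, not OVN identifiers.

```python
# Port counts as reported above for a ~650 chassis deployment.
# The labels are illustrative shorthand, not OVN identifiers.
port_counts = {
    "mac+ip, no unknown": 23000,
    "mac+ip and unknown": 250,
    "unknown only": 30,
}

total_ports = sum(port_counts.values())

# Any port whose addresses include "unknown", with or without a MAC + IP.
ports_with_unknown = (
    port_counts["mac+ip and unknown"] + port_counts["unknown only"]
)
unknown_share = ports_with_unknown / total_ports

print(f"{ports_with_unknown} of {total_ports} ports "
      f"({unknown_share:.1%}) include \"unknown\"")
```

For this one deployment that works out to roughly 1.2% of ports. It says
nothing about telco deployments, for which no numbers are given in this
thread.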
> >
> > I don't have any numbers from telco deployments at hand but I will poke
> > around.
> >
> > > > > > The purpose of this proposal is clear - to avoid using a central
> > > > > > table in DB for L2 information but instead using L2 MAC learning
> > > > > > to populate such information on chassis, which is a reasonable
> > > > > > alternative with pros and cons.
> > > > > > However, I don't think it is necessary to use separate OVS
> > > > > > bridges for this purpose. L2 MAC learning can be easily
> > > > > > implemented in the br-int bridge with OVS flows, which is much
> > > > > > simpler than managing a dynamic number of OVS bridges just for the
> > > > > > purpose of using the builtin OVS mac-learning.
> > > > >
> > > > > I agree that this could also be implemented with VLAN tags on the
> > > > > appropriate ports. But since OVS does not support trunk ports, it
> > > > > may require complicated OF pipelines. My intent with this idea was
> > > > > twofold:
> > > > >
> > > > > 1) Avoid a central point of failure for mac learning/aging.
> > > > > 2) Simplify the OF pipeline by making all FDB operations dynamic.
> > > >
> > > > IMHO, the L2 pipeline is not really complex. It is probably the
> > > > simplest part (compared with other features for L3, NAT, ACL, LB,
> > > > etc.). Adding dynamic learning to this part probably makes it *a
> > > > little* more complex, but should still be straightforward. We don't
> > > > need any VLAN tag because the incoming packet has geneve VNI in the
> > > > metadata. We just need a flow that resubmits to look up
> > > > a MAC-tunnelSrc mapping table, and injects a new flow (with related
> > > > tunnel endpoint information) if the src MAC is not found, with the
> > > > help of the "learn" action. The entries are per-logical_switch
> > > > (VNI). This would serve your purpose of avoiding a central DB for
> > > > L2. At least this looks much simpler to me than managing a dynamic
> > > > number of OVS bridges and the patch pairs between them.
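
The learning scheme described in the paragraph above can be modelled in
a few lines of plain Python. This is only a sketch of the idea, not OVS
flow syntax and not how OVN implements anything today: the class name,
method names and the FLOOD marker are hypothetical, entry aging is left
out, and flooding on a lookup miss is an assumption borrowed from
ordinary L2 learning rather than something stated in the thread.

```python
from dataclasses import dataclass, field

# Hypothetical marker for "destination unknown, send to all endpoints".
FLOOD = "flood"


@dataclass
class PerSwitchMacTable:
    """Toy model of a MAC -> tunnel source table keyed per logical switch."""

    # (vni, mac) -> IP of the tunnel endpoint where that MAC was last seen
    entries: dict = field(default_factory=dict)

    def learn(self, vni: int, src_mac: str, tunnel_src: str) -> None:
        # Plays the role of the flow injected by the "learn" action:
        # remember (or refresh) which endpoint this source MAC sits behind.
        self.entries[(vni, src_mac)] = tunnel_src

    def lookup(self, vni: int, dst_mac: str) -> str:
        # Entries are scoped by VNI, so the same MAC on another logical
        # switch does not match.
        return self.entries.get((vni, dst_mac), FLOOD)

    def receive(self, vni: int, src_mac: str, dst_mac: str,
                tunnel_src: str) -> str:
        # A packet arriving from a tunnel first teaches us about its
        # sender, then gets forwarded based on what is known so far.
        self.learn(vni, src_mac, tunnel_src)
        return self.lookup(vni, dst_mac)


table = PerSwitchMacTable()
first = table.receive(10, "00:00:00:00:00:01", "00:00:00:00:00:02",
                      "192.0.2.1")
reply = table.receive(10, "00:00:00:00:00:02", "00:00:00:00:00:01",
                      "192.0.2.2")
print(first, reply)
```

The first packet is flooded because nothing is known about its
destination yet; the reply goes straight to the endpoint learned from
the first packet. Keying on the VNI is what keeps the entries
per logical switch without needing a separate bridge or VLAN tag.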
> >
> > Would that work for non GENEVE networks (localnet) when there is no VNI?
> > Does that apply as well?
> >
> >
> > > >
> > > > >
> > > > > > Now back to the distributed MAC learning idea itself.
> > > > > > Essentially for two VMs/pods to communicate on L2, say,
> > > > > > VM1@Chassis1 needs to send a packet to VM2@chassis2, assuming
> > > > > > VM1 already has VM2's MAC 

Re: [ovs-discuss] OVN: scaling L2 networks beyond 10k chassis - proposals

2023-10-04 Thread Felix Huettner via discuss
Hi Robin,

I'll try to answer what I can.

On Tue, Oct 03, 2023 at 09:22:53AM +0200, Robin Jarry via discuss wrote:
> Hi all,
>
> Felix Huettner, Oct 02, 2023 at 09:35:
> > Hi everyone,
> >
> > just want to add my experience below
> > On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> > > On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry wrote:
> > > >
> > > > Hi Han,
> > > >
> > > > Please see my comments/questions inline.
> > > >
> > > > Han Zhou, Sep 30, 2023 at 21:59:
> > > > > > Distributed mac learning
> > > > > > 
> > > > > >
> > > > > > Use one OVS bridge per logical switch with mac learning
> > > > > > enabled. Only create the bridge if the logical switch has
> > > > > > a port bound to the local chassis.
> > > > > >
> > > > > > Pros:
> > > > > >
> > > > > > - Minimal openflow rules required in each bridge (ACLs and NAT
> > > > > >   mostly).
> > > > > > - No central mac binding table required.
> > > > >
> > > > > Firstly to clarify the terminology of "mac binding" to avoid
> > > > > confusion, the mac_binding table currently in SB DB has nothing
> > > > > to do with L2 MAC learning. It is actually the ARP/Neighbor
> > > > > table of distributed logical routers. We should probably call it
> > > > > IP_MAC_binding table, or just Neighbor table.
> > > >
> > > > Yes sorry about the confusion. I actually meant the FDB table.
> > > >
> > > > > Here what you mean is actually L2 MAC learning, which today is
> > > > > implemented by the FDB table in SB DB, and it is only for
> > > > > uncommon use cases when the NB doesn't have the knowledge of
> > > > > a MAC address of a VIF.
> > > >
> > > > This is not that uncommon in telco use cases where VNFs can send
> > > > packets from mac addresses unknown to OVN.
> > > >
> > > Understood, but VNFs contribute a very small portion of the
> > > workloads, right? Maybe I should rephrase that: it is uncommon to
> > > have "unknown" addresses for the majority of ports in a large scale
> > > cloud. Is this understanding correct?
> >
> > I can only share numbers for our use case. With ~650 chassis, we have
> > the following distribution of "unknown" in the `addresses` field of
> > Logical_Switch_Port:
> > * 23000 with a mac address + ip and without "unknown"
> > * 250 with a mac address + ip and with "unknown"
> > * 30 with just "unknown"
> >
> > The use case is a generic public cloud and we do not have any telco
> > related things.
>
> I don't have any numbers from telco deployments at hand but I will poke
> around.
>
> > > > > The purpose of this proposal is clear - to avoid using a central
> > > > > table in DB for L2 information but instead using L2 MAC learning
> > > > > to populate such information on chassis, which is a reasonable
> > > > > alternative with pros and cons.
> > > > > However, I don't think it is necessary to use separate OVS
> > > > > bridges for this purpose. L2 MAC learning can be easily
> > > > > implemented in the br-int bridge with OVS flows, which is much
> > > > > simpler than managing a dynamic number of OVS bridges just for the
> > > > > purpose of using the builtin OVS mac-learning.
> > > >
> > > > I agree that this could also be implemented with VLAN tags on the
> > > > appropriate ports. But since OVS does not support trunk ports, it
> > > > may require complicated OF pipelines. My intent with this idea was
> > > > twofold:
> > > >
> > > > 1) Avoid a central point of failure for mac learning/aging.
> > > > 2) Simplify the OF pipeline by making all FDB operations dynamic.
> > >
> > > IMHO, the L2 pipeline is not really complex. It is probably the
> > > simplest part (compared with other features for L3, NAT, ACL, LB,
> > > etc.). Adding dynamic learning to this part probably makes it *a
> > > little* more complex, but should still be straightforward. We don't
> > > need any VLAN tag because the incoming packet has geneve VNI in the
> > > metadata. We just need a flow that resubmits to look up
> > > a MAC-tunnelSrc mapping table, and injects a new flow (with related
> > > tunnel endpoint information) if the src MAC is not found, with the
> > > help of the "learn" action. The entries are per-logical_switch
> > > (VNI). This would serve your purpose of avoiding a central DB for
> > > L2. At least this looks much simpler to me than managing a dynamic
> > > number of OVS bridges and the patch pairs between them.
>
> Would that work for non GENEVE networks (localnet) when there is no VNI?
> Does that apply as well?
>
>
> > >
> > > >
> > > > > Now back to the distributed MAC learning idea itself.
> > > > > Essentially for two VMs/pods to communicate on L2, say,
> > > > > VM1@Chassis1 needs to send a packet to VM2@chassis2, assuming
> > > > > VM1 already has VM2's MAC address (we will discuss this later),
> > > > > Chassis1 needs to know that VM2's MAC is located on Chassis2.
> > > > >
> > > > > In OVN today this information is conveyed through:
> > > > >
> > > > > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > >