On Wed, Jul 5, 2023 at 9:59 AM Terry Wilson <twil...@redhat.com> wrote:
>
> On Fri, Jun 30, 2023 at 7:09 PM Han Zhou via discuss <ovs-discuss@openvswitch.org> wrote:
> >
> > On Wed, May 24, 2023 at 12:26 AM Felix Huettner via discuss <ovs-discuss@openvswitch.org> wrote:
> > >
> > > Hi Ilya,
> > >
> > > thank you for the detailed reply
> > >
> > > On Tue, May 23, 2023 at 05:25:49PM +0200, Ilya Maximets wrote:
> > > > On 5/23/23 15:59, Felix Hüttner via discuss wrote:
> > > > > Hi everyone,
> > > >
> > > > Hi, Felix.
> > > >
> > > > > we are currently running an OVN deployment with 450 nodes. We run a 3-node cluster for the northbound database and a 3-node cluster for the southbound database.
> > > > > Between the southbound cluster and the ovn-controllers we have a layer of 24 ovsdb relays.
> > > > > The setup is using TLS for all connections, however the TLS server is handled by a traefik reverse proxy to offload this from the ovsdb
> > > >
> > > > The very important part of the system description is: what versions of OVS and OVN are you using in this setup? If it's not the latest 3.1 and 23.03, then it's hard to talk about what/if performance improvements are actually needed.
> > >
> > > We are currently running ovs 3.1 and ovn 22.12 (in the process of upgrading to 23.03). `monitor-all` is currently disabled, but we want to try that as well.
> >
> > Hi Felix, did you try upgrading and enabling "monitor-all"? How does it look now?
> >
> > > > > Northd and Neutron are connecting directly to the north- and southbound databases without the relays.
> > > >
> > > > One of the big things that is annoying is that Neutron connects to the Southbound database at all. There are some reasons to do that, but ideally that should be avoided. I know that in the past limiting the number of metadata agents was one of the mitigation strategies for scaling issues. Also, why can't it connect to relays? There shouldn't be too many transactions flowing towards the Southbound DB from Neutron.
> > >
> > > Thanks for that suggestion, that definitely makes sense.
> >
> > Does this make a big difference? How many Neutron - SB connections are there?
> > What rings a bell is that Neutron is using the python OVSDB library, which hasn't implemented the fast-resync feature (if I remember correctly).
>
> python-ovs has supported monitor_cond_since since v2.17.0 (though there may have been a bug that was fixed in 2.17.1). If fast resync isn't happening, then it should be considered a bug. With that said, I remember that when I looked at it a year or two ago, ovsdb-server didn't really use fast resync/monitor_cond_since unless it was running in raft cluster mode (it would reply, but with the last-txn-id as 0, IIRC?). Does the ovsdb-relay code actually return the last-txn-id? I can set up an environment and run some tests, but maybe someone else already knows.
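
For reference, the fast-resync handling lives entirely inside python-ovs, so there is nothing for a client to enable; a bare-bones IDL client looks roughly like this (just a sketch for illustration -- the schema path and remote below are made up, and monitor_cond_since is negotiated internally by the library on each (re)connect):

    # Minimal python-ovs IDL client sketch (illustrative only).
    import ovs.db.idl
    import ovs.poller

    SCHEMA = "/usr/share/ovn/ovn-sb.ovsschema"  # illustrative path
    REMOTE = "tcp:127.0.0.1:6642"               # illustrative remote

    helper = ovs.db.idl.SchemaHelper(SCHEMA)
    helper.register_all()                       # monitor every table

    idl = ovs.db.idl.Idl(REMOTE, helper)
    poller = ovs.poller.Poller()
    seqno = idl.change_seqno
    while True:
        idl.run()                               # processes updates, reconnects as needed
        if idl.change_seqno != seqno:
            seqno = idl.change_seqno            # local replica changed
        idl.wait(poller)
        poller.block()

Watching a client like that reconnect should make it easy to see whether the server answers with a real last-txn-id or falls back to a full monitor reply.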
Looks like ovsdb-relay does support last-txn-id now: https://github.com/openvswitch/ovs/commit/a3e97b1af1bdcaa802c6caa9e73087df7077d2b1, but only in v3.0+.

> > At the same time, there is the leader-transfer-for-snapshot feature, which automatically transfers the leader whenever a snapshot is to be written, which would happen frequently if your environment is very active.
>
> I believe snapshots should only be happening "no less frequently than 24 hours, with snapshots if there are more than 100 log entries and the log size has doubled, but no more frequently than every 10 mins", or something pretty close to that. So it seems like once the system got up to its expected size, you would just see snapshots every 24 hours, since you obviously can't double in size forever. But it's possible I'm reading that wrong.
>
> > When a leader transfer happens, if Neutron sets the option "leader-only" (only connect to the leader) for the SB DB (could someone confirm?), then all Neutron workers would reconnect to the new leader. With fast-resync, like what's implemented in the C IDL and Go, a client that has cached the data would only request the delta when reconnecting. But since the python lib doesn't have this, the Neutron server would re-download the full data when reconnecting ...
> > This is speculation based on the information I have, and the assumptions need to be confirmed.
> >
> > > > > We needed to increase various timeouts on the ovsdb-server and client side to get this to a mostly stable state:
> > > > > * inactivity probes of 60 seconds (for all connections between ovsdb-server, relay and clients)
> > > > > * cluster election time of 50 seconds
> > > > >
> > > > > As long as none of the relays restarts, the environment is quite stable.
> > > > > However, we quite regularly see the "Unreasonably long xxx ms poll interval" messages, ranging from 1000 ms up to 40000 ms.
> > > >
> > > > With the latest versions of OVS/OVN, the CPU usage on Southbound DB servers without relays in our weekly 500-node ovn-heater runs stays below 10% during the test phase. No large poll intervals are getting registered.
> > > >
> > > > Do you have more details on the circumstances under which these large poll intervals occur?
> > >
> > > It seems to mostly happen on the initial connection of some client to the ovsdb. From the few times we ran perf there, it looks like the time is spent in creating a monitor and, during that, sending out the updates to the client side.
> >
> > This is one of the worst-case scenarios for OVSDB: many clients initializing connections to it at the same time, while the size of the data downloaded by each client is big.
> > OVSDB relay, from what I understand, should greatly help with this. You have 24 relay nodes, which are supposed to share the burden. Are the SB DB and the relay instances running with sufficient CPU resources?
> > Is it clear which clients' initial connections (ovn-controller or Neutron) are causing this? If it is Neutron, the above speculation about the lack of fast-resync in the Neutron workers may be worth checking.
> >
> > > If it is of interest I can try and get a perf report once this occurs again.
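
On the "leader-only" question: in python-ovs this looks like a plain constructor argument on the Idl class, alongside the client-side inactivity probe (at least in recent releases -- please double-check the exact keywords against the python-ovs version actually in use, and note that Neutron goes through ovsdbapp, which may expose its own knobs instead). A read-mostly worker could, in principle, avoid being pinned to the leader and use a longer probe, along these lines:

    # Hypothetical sketch of a non-leader-only SB client with a 60 s probe;
    # remotes and schema path are made up for illustration.
    import ovs.db.idl

    SCHEMA = "/usr/share/ovn/ovn-sb.ovsschema"
    REMOTE = "tcp:10.0.0.1:6642,tcp:10.0.0.2:6642,tcp:10.0.0.3:6642"

    helper = ovs.db.idl.SchemaHelper(SCHEMA)
    helper.register_all()

    idl = ovs.db.idl.Idl(
        REMOTE, helper,
        probe_interval=60000,   # client-side inactivity probe, in milliseconds
        leader_only=False,      # don't drop the connection on a leader transfer
    )

Whether that is appropriate for Neutron depends on which of its connections actually need to follow the leader; this is only meant to show where the setting lives on the client side.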
> > > > > If a large number of relays restart simultaneously, they can also cause the ovsdb cluster to fail, as the poll interval exceeds the cluster election time.
> > > > > This happens with the relays already syncing the data from all 3 ovsdb servers.
> > > >
> > > > There was a performance issue with upgrades and simultaneous reconnections, but it should be mostly fixed on the current master branch, i.e. in the upcoming 3.2 release:
> > > > https://patchwork.ozlabs.org/project/openvswitch/list/?series=348259&state=*
> > >
> > > That sounds like it might be similar to when our issue occurs. I'll see if we can try this out.
> > >
> > > > > We would like to improve this significantly, to ensure on the one hand that our ovsdb clusters will survive unplanned load without issues, and on the other hand to keep the poll intervals short.
> > > > > We would like to ensure a short poll interval to allow us to act on distributed-gateway-port failovers and failovers of virtual ports in a timely manner (ideally below 1 second).
> > > >
> > > > These are good goals. But are you sure they are not already addressed with the most recent versions of OVS/OVN?
> > >
> > > I was not sure, but all your feedback helped clarify that.
> > >
> > > > > To do this we found the following solutions that were discussed in the past:
> > > > > 1. Implementing multithreading for ovsdb
> > > > > https://patchwork.ozlabs.org/project/openvswitch/list/?series=&submitter=&state=*&q=multithreading&archive=&delegate=
> > > >
> > > > We moved the compaction process to a separate thread in 3.0. This partially addressed the multi-threading topic. General handling of client requests/updates in separate threads will require significant changes in the internal architecture, AFAICT. So, I'd like to avoid doing that unless necessary. So far we were able to overcome almost all the performance challenges with simple algorithmic changes instead.
> > >
> > > I definitely get that, since that would be quite a complex change to do. The only benefit I would see in having clients in separate threads is that it reduces the impact of performance challenges.
> > > E.g. it would still allow the cluster to work together healthily and make progress, but individual reconnects would be slow.
> > > That benefit would be quite significant from my perspective, as it makes the solution more resilient. But I'm not sure if it's worth the additional complexity.
> >
> > Multithreading for general OVSDB tasks (transactions, monitoring) seems more complex to implement, and the outcome should be very similar to OVSDB relay (which is multi-process instead of multi-threading), except that multi-threading may have a smaller memory footprint.
> > Multithreading for RAFT cluster RPC may help keep the cluster healthy when server load is high, but the same can be achieved by setting a longer election timer. I agree there is a subtle difference when you want fast failure detection for things like a node crash but can tolerate overloaded servers that can barely respond to clients.
> >
> > Looking forward to hearing back from you regarding the situation.
> >
> > Thanks,
> > Han
> >
> > > > > 2. Changing the storage backend of OVN to an alternative (e.g. etcd)
> > > > > https://mail.openvswitch.org/pipermail/ovs-discuss/2016-July/041733.html
> > > >
> > > > There was an ovsdb-etcd project, but it didn't manage to provide better performance in comparison with ovsdb-server. So it was ultimately abandoned: https://github.com/IBM/ovsdb-etcd
> > >
> > > > > Both of these discussions are from 2016; I'm not sure if more up-to-date ones exist.
> > > > > I would like to ask if there are already existing discussions on scaling ovsdb further/faster?
> > > >
> > > > This again comes down to the question of what versions you're using. I'm currently not aware of any major performance issues for ovsdb-server on the most recent code, besides the conditional monitoring, which is not entirely OVSDB server's issue. And it is also likely to become a bit better in 3.2:
> > > > https://patchwork.ozlabs.org/project/openvswitch/patch/20230518121425.550048-1-i.maxim...@ovn.org/
> > >
> > > That also sounds like a quite interesting change that might help us here.
> > >
> > > > > From my perspective, whatever such a solution might be, it would no longer require relays and would allow the ovsdb servers to handle load gracefully.
> > > > > I personally see multithreading for ovsdb as quite promising, as that would allow us to separate the raft/cluster communication from the client connections.
> > > > > This should allow us to keep the cluster healthy even under significant pressure from clients.
> > > >
> > > > Again, good goals. I'm just not sure if we actually need to do something or if they are already achievable with the most recent code.
> > > >
> > > > I understand that testing on prod is not an option, so it's unlikely we'll have an accurate test. But maybe you can participate in the initiative [1] for the creation of ovn-heater OpenStack scenarios that might be close to the workloads you have? This way upstream will be able to test your use-cases or at least something similar.
> > > >
> > > > Most of our current efforts are focused on the ovn-kubernetes use-case, because we don't have many details on what high-scale OpenStack deployments look like.
> > > >
> > > > [1] https://mail.openvswitch.org/pipermail/ovs-dev/2023-May/404488.html
> > >
> > > That looks very interesting and would also help us run scale tests. I'll get in contact with whoever is working on that to help out as well.
> > >
> > > > Best regards, Ilya Maximets.
> > > >
> > > > > Thank you
> > > > >
> > > > > --
> > > > > Felix Huettner
> > >
> > > Thanks for all of the detailed insights.
> > > Felix
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss