On 12/13/23 23:19, Frode Nordahl wrote:
> Since the topic of clustered database schema conversion issues was
> first brought up [0], there has been several performance
> improvements which has greatly improved the situation.
> 
> However, we cannot get away from the fact that the process remains
> unpredictable, when faced with a cluster that has grown to a scale
> no longer supported by its configuration.
> 
> The OVSDB Server is mostly single threaded, and for clustered
> databases the schema conversion is done server side in the main
> thread.  While conversion is ongoing the server will not service
> critical cluster communication, such as Raft elections.
> 
> In an ideal world operators would regularly re-assess their
> deployment configurations with real data to ensure parameters such
> as the OVSDB Raft election-timer is appropriately configured for
> all cluster operations, including schema conversions.
> 
> The reality is that this does not happen, and operators are instead
> discovering the hard facts of managing a OVSDB cluster at critical
> points in time such as during database upgrades.
> 
> To alleviate this situation we introduce time keeping in the
> ovsdb_convert functions for clustered databases.
> 
> A timeout is deduced from the configured election timer using a
> divisor of 4 to compute buffer time.  Buffer time is the time
> required for writing the result of the conversion to the raft log
> and initiating command to replace data on peers before an election
> is held for the next term.
> 
> For conversions initiated by a client (such as ovsdb-client),
> the process will be aborted and an error will be returned if the
> schema conversion process overruns the timeout.  Conversions
> initiated from storage will not abort, a warning indicating the election
> timer being set too low will be logged instead.
> 
> This will allow a human or automatic operator to take necessary
> action, such as increasing the election timer, before attempting
> the schema conversion again.
> 
> 0: 
> https://mail.openvswitch.org/pipermail/ovs-discuss/2022-December/052140.html
> Signed-off-by: Frode Nordahl <frode.nord...@canonical.com>
> ---

Hi, Frode.  Thanks for the patch!

It reminds me of the previous unfinished work from Anton:
  
https://patchwork.ozlabs.org/project/openvswitch/patch/20210820091630.7334-1-anton.iva...@cambridgegreys.com/

Looks like the same general idea.  There are a few issues with this
kind of functionality though:

- The task is not finished, i.e. we have to give up and fail, which
  is not nice for the user.

- Leaky abstraction.  This patch requires ovsdb/file module to know
  about RAFT, which is largely irrelevant to that module.  Can be fixed
  by handing off the configuration from top to bottom (see the Anton's
  implementation), but in this case will require all the intermediate
  layers to pass the information along, piercing the whole hierarchy.

- The issue appears to not be fully addressed.  In my previous testing,
  IIRC, sending back all the initial monitor replies after conversion
  was a heavier operation than a conversion itself.  So, even if
  conversion doesn't time out, jsonrpc server may cause the timeout.
  So, we kind of need both parts of the puzzle.

- Also, only timing the conversion is a little arbitrary.  The timer/4
  is chosen, but there is no guarantee that 3/4 of the timer are enough
  for everything else to run.  Anton's solution was trying to keep track
  of time for every operation and subtract from the budget, but that
  becomes a trade off between how deep in the call stack we want to
  be able to yield (foreshadowing :) ) vs at how many levels we need to
  implement time budget tracking.


So, I got another idea to discuss (It might have been mentioned somewhere
in the past years, but I do not remember).

Can we make use of some cooperative multitasking here?
I'm not talking about full-fledged co-routines, but something simpler,
though similar in nature.

Let's say we create a library with 2 functions:

1. cooperative_multitasking_register(callback, arg, time_threshold);
2. cooperative_yield();

Different modules could register a function in this module specifying
a how much time should pass between calls.  Other modules that are
expected to have heavy processing will call the yield() function at
points where they feel comfortable to do so.  The yield() function
may look through all the registered callbacks and call ones that wasn't
called for the duration of their time threshold.  Maybe we'll need
a third function for a callback to call to update the last used time.
Might be useful if the callback is called directly at some point and
not from the yeild().

Some strict rules will be needed for such a library though.  i.e.
the module that registers itself with a callback must not use the
yield functionality inside nor it should be possible to do so via
calls to other modules.  Not sure if recursion detection is enough,
but surely needed.  And basically, the module that registers the
callback should be self-sufficient, i.e. the internal state of that
module should not matter to the outside world, at least it should
not matter for the call stack that enters the yield().
yield() must not be called from places that can loop indefinitely,
only in places that eventually end, otherwise it may give a false
impression that the server is working fine while it is stuck and
not actually doing any useful work.

RAFT module though is very well isolated from the rest of the ovsdb
functionality, so it should be safe to call raft_run() or storage_run()
from almost anywhere else outside of the RAFT module itself.  Lower
layer libraries such as jsonrpc.c or ovsdb/log should not yield(),
since they can be used by the RAFT module.

So, the RAFT module could register itself with a time threshold
derived from an election timer, and jsonrpc-server and conversion
may yield().

Advantages of such a solution:

- Less code, I hope. :)  No need for manual time tracking everywhere.

- Easier to add yield() call to other places, if necessary.

- A bit easier to track control flow.  Everything still runs to
  completion, no breaking out of the loops, no need to save states
  to revisit later.

- No need to fail the conversion!

- Unlike multi-threading, no need to worry much about data races.

- May be useful somewhere else?

Disadvantages:

- Need to be careful on which functions can yield() and which callbacks
  can be registered, to avoid modifications of data structures in use
  by the yielding function and the call chain above it.  E.g. we
  shouldn't modify lists that a yielding function iterates.


We have something similar in the python IDL code:
  
https://github.com/openvswitch/ovs/commit/d28c5ca57650d6866453d0adb9a2e048cda21a86
It's not intended for the python library itself, but for the
application that uses asyncio/eventlet to be able to run housekeeping
of these frameworks while IDL is processing large updates.

What do you think?  Does that make any sense?

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to