On Mon, Jan 9, 2023 at 3:34 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> On 1/8/23 04:51, Han Zhou wrote:
> >
> >
> > On Tue, Jan 3, 2023 at 6:07 AM Ilya Maximets via discuss <ovs-discuss@openvswitch.org> wrote:
> >>
> >> On 12/14/22 08:28, Frode Nordahl via discuss wrote:
> >> > Hello,
> >> >
> >> > When performing an online schema conversion for a clustered DB the
> >> > `ovsdb-client` connects to the current leader of the cluster and
> >> > requests it to convert the DB to a new schema.
> >> >
> >> > The main thread of the leader ovsdb-server will then parse the new
> >> > schema and copy the entire database into a new in-memory copy using
> >> > the new schema.  For a moderately sized database, let's say 650 MB
> >> > on-disk, this process can take north of 24 seconds on a modern,
> >> > adequately performant system.
> >> >
> >> > While this is happening, the ovsdb-server process will not process
> >> > any raft election events or inactivity probes, so by the time the
> >> > conversion is done and the now-former leader wants to write the
> >> > converted database to the cluster, its connection to the cluster is
> >> > dead.
> >> >
> >> > The former leader will keep repeating this process indefinitely,
> >> > until the client requesting the conversion disconnects.  No message
> >> > is passed to the client.
> >> >
> >> > Meanwhile the other nodes in the cluster have moved on with a new
> >> > leader.
> >> >
> >> > A workaround for this scenario would be to increase the election
> >> > timer to a value large enough that the conversion can succeed within
> >> > an election window.
> >> >
> >> > I don't view this as a permanent solution though, as it would be
> >> > unfair to leave the end user guessing the correct election timer
> >> > value just to make their upgrades succeed.
> >> >
> >> > Maybe we need to hand off conversion to a thread and make the main
> >> > loop only process raft requests until it is done, similar to the
> >> > recent addition of preparing snapshot JSON in a separate thread [0].
> >> >
> >> > Any other thoughts or ideas?
> >> >
> >> > 0: https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
> >> >
> >>
> >> Hi, Frode.  Thanks for starting this conversation.
> >>
> >> First of all, I'd still respectfully disagree that 650 MB is a
> >> moderately sized database. :)  ovsdb-server on its own doesn't limit
> >> how much data users can put in, but that doesn't mean there is no
> >> size at which handling the database becomes difficult or even
> >> impossible.  From my experience, 650 MB is far beyond the threshold
> >> for smooth operation.
> >>
> >> Allowing a database to grow to such a size might be considered a user
> >> error, or a CMS error.  In any case, setups should be tested at the
> >> desired scale [at least simulated], including upgrades, before
> >> deploying in a production environment, so as not to run into such
> >> issues unexpectedly.
> >>
> >> Another way out of the situation, besides bumping the election
> >> timer, might be to pin ovn-controllers, destroy the database (maybe
> >> keeping port bindings, etc.) and let northd re-create it after the
> >> conversion.  Not sure if that will actually work though, as I
> >> didn't try.
> >>
> >>
> >> For the threads, I'll re-iterate my view that throwing more
> >> cores at the problem is absolutely the last thing we should do,
> >> and only if there is no other choice.  Many parts of ovsdb-server
> >> were never optimized for performance, so there are likely many
> >> things we can improve without blindly using more resources and
> >> increasing the code complexity by adding threads.
> >>
> >>
> >> Speaking of non-optimal code, the conversion process seems very
> >> inefficient.  Let's deconstruct it.  (I'll skip the standalone
> >> case, focusing on the clustered mode.)
> >>
> >> There are a few main steps:
> >>
> >> 1. ovsdb_convert() - Creates a copy of the database, converting
> >>    each column along the way, and checks the constraints.
> >>
> >> 2. ovsdb_to_txn_json() - Converts the new database into a
> >>    transaction JSON object.
> >>
> >> 3. ovsdb_txn_propose_schema_change() - Writes the new schema
> >>    and the transaction JSON to the storage (RAFT).
> >>
> >> 4. ovsdb_destroy() - The copy of the database is destroyed.
> >>
> >>    -----
> >>
> >> 5. read_db()/parse_txn() - Reads the new schema and the
> >>    transaction JSON from the storage, replaces the current
> >>    database with an empty one and replays the transaction
> >>    that creates a new converted database.
> >>
> >> There is always a storage run between steps 4 and 5, so we generally
> >> only care that steps 1-4 and step 5 each stay below the election
> >> timer threshold separately.
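> >>
> >> To make that concrete, the leader-side flow in condensed pseudo-C
> >> looks roughly like this (a sketch only: error handling dropped and
> >> signatures approximated, not the actual code):
> >>
> >>     struct ovsdb *new_db;
> >>     ovsdb_convert(db, new_schema, &new_db);     /* Step 1: heavy copy. */
> >>     struct json *txn_json =
> >>         ovsdb_to_txn_json(new_db);              /* Step 2.             */
> >>     ovsdb_txn_propose_schema_change(db, schema_json,
> >>                                     txn_json);  /* Step 3: to RAFT.    */
> >>     ovsdb_destroy(new_db);                      /* Step 4.             */
> >>     /* ... a storage run happens here; then read_db()/parse_txn()
> >>      * performs step 5: replace with an empty database and replay. */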
> >>
> >>
> >> Now let's look closer at step 1, which is the most time-consuming
> >> step.  It has two stages - data conversion and the transaction
> >> check.  The data conversion part makes sure that we're creating all
> >> the rows in the new database with all the new columns and without
> >> the removed columns.  It also makes sure that all the datum objects
> >> are converted from the old column type to the new column type by
> >> calling ovsdb_datum_convert() for every one of them.
> >>
> >> Datum conversion is a very heavy operation, because it involves
> >> converting the datum to JSON and back.  However, in the vast majority
> >> of cases column types do not change at all, and even when they do, it
> >> only happens for a few columns, not for all of them.
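> >>
> >> For illustration, the work inside ovsdb_datum_convert() is
> >> conceptually this JSON round-trip (a sketch; variable names are made
> >> up, see lib/ovsdb-data.h for the real interfaces):
> >>
> >>     /* Serialize with the old type, then re-parse and re-validate
> >>      * with the new type.  Heavy for every single datum. */
> >>     struct ovsdb_datum converted;
> >>     struct json *json = ovsdb_datum_to_json(datum, old_type);
> >>     struct ovsdb_error *error =
> >>         ovsdb_datum_from_json(&converted, new_type, json, NULL);
> >>     json_destroy(json);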
> >>
> >> This part can be optimized by not converting columns if the type
> >> didn't change, and just creating a shallow copy of the datum with
> >> an ovsdb_datum_clone() instead.
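> >>
> >> In pseudo-code, the idea is roughly this (an untested sketch; exact
> >> helper signatures differ between versions):
> >>
> >>     if (ovsdb_type_equals(&old_column->type, &new_column->type)) {
> >>         /* Type unchanged: shallow copy, no JSON round-trip. */
> >>         ovsdb_datum_clone(&new_datum, &old_datum);
> >>     } else {
> >>         /* Type changed: fall back to the full conversion. */
> >>         error = ovsdb_datum_convert(&new_datum, &new_column->type,
> >>                                     &old_datum, &old_column->type);
> >>     }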
> >>
> >> In my preliminary testing this saves about 70% of the time spent
> >> in ovsdb_convert() and reduces the overall conversion time by
> >> about 30%.  Creation of a shallow copy also speeds up the database
> >> destruction at step 4 and saves some memory.
> >>
> >> I'll post patches I have for this a bit later after cleaning
> >> them up.
> >
> > Thanks Ilya. This is great and I acked your patches.
>
> Thanks for review!
>
> >
> >>
> >> The next part of ovsdb_convert() is ovsdb_txn_replay_commit(),
> >> which we can't really get rid of, because we have to perform all
> >> the transaction checks, including the referential integrity checks
> >> that take up most of the time.  We might look at further
> >> optimizations for the weak reference re-assessment process in the
> >> future though, as it seems to be the heaviest part.
> >>
> >>
> >>
> >> Now to steps 2 and 3.
> >> Creating a transaction JSON and writing it to the storage seems
> >> completely redundant.  I understand that it was probably done
> >> this way because conversion to JSON and back was cheaper than
> >> ovsdb_convert().  However, with the change for step 1, that will
> >> no longer be the case.
> >>
> >> Database conversion is a deterministic process.  All we need to
> >> perform it is the database content, which is already in the
> >> storage, and the new schema.  So, we should be able to drop
> >> step 2 and write only the new schema to the storage in step 3.
> >> In step 5 we'll need to read the schema and call ovsdb_convert()
> >> again, but with the above optimizations that should not be heavier
> >> than parsing the transaction from JSON and replaying it.  RAFT will
> >> ensure that we're converting the same data as in the first
> >> ovsdb_convert().
> >>
> >> Writing only the new schema to the storage may mean a backward-
> >> incompatible database file format change, but that should not
> >> be very hard to handle, taking into account that it only happens
> >> during conversion and that the next compaction will get rid of the
> >> incompatibility.
> >>
> >> I'll sketch up some patches for this as well.  I haven't tried it
> >> yet, so I don't have any performance data.
> >>
> >
> > For this part, I think your proposal is more efficient, but it is not
> > clear to me how you would implement it for RAFT.  I think steps 2, 3
> > and 5 are currently done this way probably because it is the easiest
> > way to reuse the RAFT mechanism.  In your proposal, are you thinking
> > about introducing something new in the RAFT command to trigger
> > ovsdb_convert on each node?  Could you explain more about it?
>
> From the RAFT perspective, database conversion is just a very
> heavy transaction; there is nothing special about it.  A normal
> RAFT command is created with raft_command_execute().  This
> command has a JSON object and a prerequisite eid, as usual.
> Nothing really changes there, except that the JSON object will
> contain only the schema.
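>
> Roughly like this (a sketch only; the "schema" key and the variable
> names here are made up for illustration, raft_command_execute() and
> the json helpers are the real entry points):
>
>     /* Propose the conversion; the payload carries only the new schema. */
>     struct json *payload = json_object_create();
>     json_object_put(payload, "schema", new_schema_json);
>
>     struct raft_command *cmd =
>         raft_command_execute(raft, payload, &prereq_eid, &result_eid);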
>
I see. So if the JSON includes the schema, it already conveys the
information that we are doing a schema conversion, so nothing else needs
to be introduced. This was what I missed, and it is clear now.

> Once the log entry is read back from storage, each cluster member
> will execute ovsdb_convert() + ovsdb_replace() to finalize the
> conversion, instead of the current ovsdb_replace() with an empty
> database plus parsing the transaction from the JSON and replaying it.
> This is the responsibility of ovsdb/ovsdb-server.c:parse_txn(), which
> is not part of the storage implementation.  Storage (RAFT or
> log) is largely agnostic to what data we put in it as long as it
> is in JSON format.  RAFT only ensures the order in which log
> entries are added and that all cluster members have exactly the same
> view of the log.  It doesn't care what payload is in these log
> entries.
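>
> On the read side, parse_txn() would then do something roughly like
> this (again just a sketch, not the actual implementation):
>
>     if (schema) {
>         /* Rebuild the converted database locally instead of
>          * replaying a huge transaction from the log.  This is
>          * deterministic: every member converts the same data. */
>         struct ovsdb *new_db;
>         error = ovsdb_convert(db, schema, &new_db);
>         if (!error) {
>             ovsdb_replace(db, new_db);
>         }
>     }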
>
> Does that make sense?

Yes, it makes sense. Thanks for explaining.

>
> It's possible there are some hidden issues with this approach
> that I haven't discovered yet, as I haven't really started working
> on the implementation.
>
Looking forward to it!

Thanks,
Han

> >
> > Thanks,
> > Han
> >
> >>
> >> All in all, the conversion process will be condensed to just two
> >> calls to the optimized ovsdb_convert() and two ovsdb_destroy() calls:
> >> one before writing to storage and one after reading from it.
> >> This should be enough to cover most of the use cases, I hope.
> >>
> >> Thoughts?
> >>
> >> Best regards, Ilya Maximets.
>