Re: [ovs-discuss] ovsdb: schema conversion for clustered db blocks preventing processing of raft election and inactivity probes

2023-03-27 Thread Frode Nordahl via discuss
On Mon, Mar 27, 2023 at 9:50 PM Ilya Maximets  wrote:
>
> On 1/23/23 11:44, Frode Nordahl wrote:
> > On Tue, Jan 3, 2023 at 3:07 PM Ilya Maximets  wrote:
> >>
> >> On 12/14/22 08:28, Frode Nordahl via discuss wrote:
> >>> Hello,
> >>>
> >>> When performing an online schema conversion for a clustered DB the
> >>> `ovsdb-client` connects to the current leader of the cluster and
> >>> requests it to convert the DB to a new schema.
> >>>
> >>> The main thread of the leader ovsdb-server will then parse the new
> >>> schema and copy the entire database into a new in-memory copy using
> >>> the new schema. For a moderately sized database, let's say 650MB
> >>> on-disk, this process can take north of 24 seconds on a modern
> >>> adequately performant system.
> >>>
> >>> While this is happening the ovsdb-server process will not process any
> >>> raft election events or inactivity probes, so by the time the
> >>> conversion is done and the now past leader wants to write the
> >>> converted database to the cluster, its connection to the cluster is
> >>> dead.
> >>>
> >>> The past leader will keep repeating this process indefinitely, until
> >>> the client requesting the conversion disconnects. No message is passed
> >>> to the client.
> >>>
> >>> Meanwhile the other nodes in the cluster have moved on with a new leader.
> >>>
> >>> A workaround for this scenario would be to increase the election timer
> >>> to a value great enough so that the conversion can succeed within an
> >>> election window.
> >>>
> >>> I don't view this as a permanent solution though, as it would be
> >>> unfair to leave the end user guessing the correct election timer
> >>> in order for their upgrades to succeed.
> >>>
> >>> Maybe we need to hand off conversion to a thread and make the main
> >>> loop only process raft requests until it is done, similar to the
> >>> recent addition of preparing snapshot JSON in a separate thread [0].
> >>>
> >>> Any other thoughts or ideas?
> >>>
> >>> 0: 
> >>> https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
> >>>
> >>
> >> Hi, Frode.  Thanks for starting this conversation.
> >
> > Thanks a lot for your comprehensive response!
> >
> >> First of all I'd still respectfully disagree that 650 MB is a
> >> moderately sized database. :)  ovsdb-server on its own doesn't limit
> >> users on how much data they can put in, but that doesn't mean there
> >> is no limit at which it will be difficult for it or even impossible
> >> to handle the database.  From my experience 650 MB is far beyond the
> >> threshold for smooth operation.
> >
> > I guess my comment about the moderate size was really targeted at the
> > size of the deployment the DB came from. After looking more deeply
> > into it, it is also clear that after a compaction the raw size of the
> > file is around 305MB.
> >
> >> Allowing a database to grow to such a size might be considered a user
> >> error, or a CMS error.  In any case, setups should be tested at the
> >> desired [simulated at least] scale including upgrades before
> >> deploying in a production environment to not run into such issues
> >> unexpectedly.
> >
> > I do not disagree with this; part of the problem is that the ovn-ctl
> > script currently defaults to attempting a schema upgrade on startup,
> > which would mean that this procedure is executed whenever a user
> > upgrades packages.
> >
> > The lack of visibility into the failed upgrade, and the severe side
> > effects a failed upgrade leads to, make me think we may need to change
> > this.
> >
> > The reality is, sadly, that few users do full scale tests/simulations
> > of their deployment to check whether the next upgrade would succeed.
> >
> >> Another way out from the situation, besides bumping the election
> >> timer, might be to pin ovn-controllers, destroy the database (maybe
> >> keep port bindings, etc.) and let northd re-create it after
> >> conversion.  Not sure if that will actually work though, as I
> >> didn't try.
> >
> > I believe for the incident that raised this topic to our attention the
> > user has been unblocked by emptying the southbound DB and letting northd
> > re-create it.
> >
> >> For the threads, I'll re-iterate my thought that throwing more
> >> cores at the problem is absolutely the last thing we should do.  Only
> >> if there is no other choice.  Simply because many parts of
> >> ovsdb-server were never optimized for performance and there are
> >> likely many things we can do to improve without blindly using more
> >> resources and increasing the code complexity by adding threads.
> >
> > I admire your persistence in upholding this standard, and you are
> > right: we should absolutely try every avenue before adding bloat and
> > complexity to OVS/OVN. Thank you for keeping the bar high! :)
> >
> > The main reason for thinking about a separate thread here was to avoid
> > the DB cluster leader disappearing from the cluster until the client
> > attempting the conversion disconnects. For the operator, the
> > underlying reason for this happening is non-obvious. I guess we could
> > handle that part with better messaging.

Re: [ovs-discuss] ovsdb: schema conversion for clustered db blocks preventing processing of raft election and inactivity probes

2023-03-27 Thread Ilya Maximets via discuss
On 1/23/23 11:44, Frode Nordahl wrote:
> On Tue, Jan 3, 2023 at 3:07 PM Ilya Maximets  wrote:
>>
>> On 12/14/22 08:28, Frode Nordahl via discuss wrote:
>>> Hello,
>>>
>>> When performing an online schema conversion for a clustered DB the
>>> `ovsdb-client` connects to the current leader of the cluster and
>>> requests it to convert the DB to a new schema.
>>>
>>> The main thread of the leader ovsdb-server will then parse the new
>>> schema and copy the entire database into a new in-memory copy using
>>> the new schema. For a moderately sized database, let's say 650MB
>>> on-disk, this process can take north of 24 seconds on a modern
>>> adequately performant system.
>>>
>>> While this is happening the ovsdb-server process will not process any
>>> raft election events or inactivity probes, so by the time the
>>> conversion is done and the now past leader wants to write the
>>> converted database to the cluster, its connection to the cluster is
>>> dead.
>>>
>>> The past leader will keep repeating this process indefinitely, until
>>> the client requesting the conversion disconnects. No message is passed
>>> to the client.
>>>
>>> Meanwhile the other nodes in the cluster have moved on with a new leader.
>>>
>>> A workaround for this scenario would be to increase the election timer
>>> to a value great enough so that the conversion can succeed within an
>>> election window.
>>>
>>> I don't view this as a permanent solution though, as it would be
>>> unfair to leave the end user guessing the correct election timer
>>> in order for their upgrades to succeed.
>>>
>>> Maybe we need to hand off conversion to a thread and make the main
>>> loop only process raft requests until it is done, similar to the
>>> recent addition of preparing snapshot JSON in a separate thread [0].
>>>
>>> Any other thoughts or ideas?
>>>
>>> 0: 
>>> https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
>>>
>>
>> Hi, Frode.  Thanks for starting this conversation.
> 
> Thanks a lot for your comprehensive response!
> 
>> First of all I'd still respectfully disagree that 650 MB is a
>> moderately sized database. :)  ovsdb-server on its own doesn't limit
>> users on how much data they can put in, but that doesn't mean there
>> is no limit at which it will be difficult for it or even impossible
>> to handle the database.  From my experience 650 MB is far beyond the
>> threshold for smooth operation.
> 
> I guess my comment about the moderate size was really targeted at the
> size of the deployment the DB came from. After looking more deeply
> into it, it is also clear that after a compaction the raw size of the
> file is around 305MB.
> 
>> Allowing a database to grow to such a size might be considered a user
>> error, or a CMS error.  In any case, setups should be tested at the
>> desired [simulated at least] scale including upgrades before
>> deploying in a production environment to not run into such issues
>> unexpectedly.
> 
> I do not disagree with this; part of the problem is that the ovn-ctl
> script currently defaults to attempting a schema upgrade on startup,
> which would mean that this procedure is executed whenever a user
> upgrades packages.
> 
> The lack of visibility into the failed upgrade, and the severe side
> effects a failed upgrade leads to, make me think we may need to change
> this.
> 
> The reality is, sadly, that few users do full scale tests/simulations
> of their deployment to check whether the next upgrade would succeed.
> 
>> Another way out from the situation, besides bumping the election
>> timer, might be to pin ovn-controllers, destroy the database (maybe
>> keep port bindings, etc.) and let northd re-create it after
>> conversion.  Not sure if that will actually work though, as I
>> didn't try.
> 
> I believe for the incident that raised this topic to our attention the
> user has been unblocked by emptying the southbound DB and letting northd
> re-create it.
> 
>> For the threads, I'll re-iterate my thought that throwing more
>> cores at the problem is absolutely the last thing we should do.  Only
>> if there is no other choice.  Simply because many parts of
>> ovsdb-server were never optimized for performance and there are
>> likely many things we can do to improve without blindly using more
>> resources and increasing the code complexity by adding threads.
> 
> I admire your persistence in upholding this standard, and you are
> right: we should absolutely try every avenue before adding bloat and
> complexity to OVS/OVN. Thank you for keeping the bar high! :)
> 
> The main reason for thinking about a separate thread here was to avoid
> the DB cluster leader disappearing from the cluster until the client
> attempting the conversion disconnects. For the operator, the
> underlying reason for this happening is non-obvious. I guess we could
> handle that part with better messaging.
> 
>> Speaking of suboptimal code, the conversion process seems very
>> inefficient.  Let's deconstruct it.  (I'll skip the standalone
>> case, focusing on the clustered mode.)

Re: [ovs-discuss] ovsdb: schema conversion for clustered db blocks preventing processing of raft election and inactivity probes

2023-01-23 Thread Frode Nordahl via discuss
On Tue, Jan 3, 2023 at 3:07 PM Ilya Maximets  wrote:
>
> On 12/14/22 08:28, Frode Nordahl via discuss wrote:
> > Hello,
> >
> > When performing an online schema conversion for a clustered DB the
> > `ovsdb-client` connects to the current leader of the cluster and
> > requests it to convert the DB to a new schema.
> >
> > The main thread of the leader ovsdb-server will then parse the new
> > schema and copy the entire database into a new in-memory copy using
> > the new schema. For a moderately sized database, let's say 650MB
> > on-disk, this process can take north of 24 seconds on a modern
> > adequately performant system.
> >
> > While this is happening the ovsdb-server process will not process any
> > raft election events or inactivity probes, so by the time the
> > conversion is done and the now past leader wants to write the
> > converted database to the cluster, its connection to the cluster is
> > dead.
> >
> > The past leader will keep repeating this process indefinitely, until
> > the client requesting the conversion disconnects. No message is passed
> > to the client.
> >
> > Meanwhile the other nodes in the cluster have moved on with a new leader.
> >
> > A workaround for this scenario would be to increase the election timer
> > to a value great enough so that the conversion can succeed within an
> > election window.
> >
> > I don't view this as a permanent solution though, as it would be
> > unfair to leave the end user guessing the correct election timer
> > in order for their upgrades to succeed.
> >
> > Maybe we need to hand off conversion to a thread and make the main
> > loop only process raft requests until it is done, similar to the
> > recent addition of preparing snapshot JSON in a separate thread [0].
> >
> > Any other thoughts or ideas?
> >
> > 0: 
> > https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
> >
>
> Hi, Frode.  Thanks for starting this conversation.

Thanks a lot for your comprehensive response!

> First of all I'd still respectfully disagree that 650 MB is a
> moderately sized database. :)  ovsdb-server on its own doesn't limit
> users on how much data they can put in, but that doesn't mean there
> is no limit at which it will be difficult for it or even impossible
> to handle the database.  From my experience 650 MB is far beyond the
> threshold for smooth operation.

I guess my comment about the moderate size was really targeted at the
size of the deployment the DB came from. After looking more deeply
into it, it is also clear that after a compaction the raw size of the
file is around 305MB.

> Allowing a database to grow to such a size might be considered a user
> error, or a CMS error.  In any case, setups should be tested at the
> desired [simulated at least] scale including upgrades before
> deploying in a production environment to not run into such issues
> unexpectedly.

I do not disagree with this; part of the problem is that the ovn-ctl
script currently defaults to attempting a schema upgrade on startup,
which would mean that this procedure is executed whenever a user
upgrades packages.

The lack of visibility into the failed upgrade, and the severe side
effects a failed upgrade leads to, make me think we may need to change
this.

The reality is, sadly, that few users do full scale tests/simulations
of their deployment to check whether the next upgrade would succeed.

> Another way out from the situation, besides bumping the election
> timer, might be to pin ovn-controllers, destroy the database (maybe
> keep port bindings, etc.) and let northd re-create it after
> conversion.  Not sure if that will actually work though, as I
> didn't try.

I believe for the incident that raised this topic to our attention the
user has been unblocked by emptying the southbound DB and letting northd
re-create it.

> For the threads, I'll re-iterate my thought that throwing more
> cores at the problem is absolutely the last thing we should do.  Only
> if there is no other choice.  Simply because many parts of
> ovsdb-server were never optimized for performance and there are
> likely many things we can do to improve without blindly using more
> resources and increasing the code complexity by adding threads.

I admire your persistence in upholding this standard, and you are
right: we should absolutely try every avenue before adding bloat and
complexity to OVS/OVN. Thank you for keeping the bar high! :)

The main reason for thinking about a separate thread here was to avoid
the DB cluster leader disappearing from the cluster until the client
attempting the conversion disconnects. For the operator, the
underlying reason for this happening is non-obvious. I guess we could
handle that part with better messaging.

> Speaking of suboptimal code, the conversion process seems very
> inefficient.  Let's deconstruct it.  (I'll skip the standalone
> case, focusing on the clustered mode.)
>
> There are a few main steps:
>
> 1. ovsdb_convert() - Creates a copy of a database converting
>    each column along the way and checks the constraints.

Re: [ovs-discuss] ovsdb: schema conversion for clustered db blocks preventing processing of raft election and inactivity probes

2023-01-10 Thread Han Zhou via discuss
On Mon, Jan 9, 2023 at 3:34 AM Ilya Maximets  wrote:
>
> On 1/8/23 04:51, Han Zhou wrote:
> >
> >
> > On Tue, Jan 3, 2023 at 6:07 AM Ilya Maximets via discuss
> > <ovs-discuss@openvswitch.org> wrote:
> >>
> >> On 12/14/22 08:28, Frode Nordahl via discuss wrote:
> >> > Hello,
> >> >
> >> > When performing an online schema conversion for a clustered DB the
> >> > `ovsdb-client` connects to the current leader of the cluster and
> >> > requests it to convert the DB to a new schema.
> >> >
> >> > The main thread of the leader ovsdb-server will then parse the new
> >> > schema and copy the entire database into a new in-memory copy using
> >> > the new schema. For a moderately sized database, let's say 650MB
> >> > on-disk, this process can take north of 24 seconds on a modern
> >> > adequately performant system.
> >> >
> >> > While this is happening the ovsdb-server process will not process any
> >> > raft election events or inactivity probes, so by the time the
> >> > conversion is done and the now past leader wants to write the
> >> > converted database to the cluster, its connection to the cluster is
> >> > dead.
> >> >
> >> > The past leader will keep repeating this process indefinitely, until
> >> > the client requesting the conversion disconnects. No message is
> >> > passed to the client.
> >> >
> >> > Meanwhile the other nodes in the cluster have moved on with a new
> >> > leader.
> >> >
> >> > A workaround for this scenario would be to increase the election
> >> > timer to a value great enough so that the conversion can succeed
> >> > within an election window.
> >> >
> >> > I don't view this as a permanent solution though, as it would be
> >> > unfair to leave the end user guessing the correct election timer
> >> > in order for their upgrades to succeed.
> >> >
> >> > Maybe we need to hand off conversion to a thread and make the main
> >> > loop only process raft requests until it is done, similar to the
> >> > recent addition of preparing snapshot JSON in a separate thread [0].
> >> >
> >> > Any other thoughts or ideas?
> >> >
> >> > 0:
> >> > https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
> >> >
> >>
> >> Hi, Frode.  Thanks for starting this conversation.
> >>
> >> First of all I'd still respectfully disagree that 650 MB is a
> >> moderately sized database. :)  ovsdb-server on its own doesn't limit
> >> users on how much data they can put in, but that doesn't mean there
> >> is no limit at which it will be difficult for it or even impossible
> >> to handle the database.  From my experience 650 MB is far beyond the
> >> threshold for smooth operation.
> >>
> >> Allowing a database to grow to such a size might be considered a user
> >> error, or a CMS error.  In any case, setups should be tested at the
> >> desired [simulated at least] scale including upgrades before
> >> deploying in a production environment to not run into such issues
> >> unexpectedly.
> >>
> >> Another way out from the situation, besides bumping the election
> >> timer, might be to pin ovn-controllers, destroy the database (maybe
> >> keep port bindings, etc.) and let northd re-create it after
> >> conversion.  Not sure if that will actually work though, as I
> >> didn't try.
> >>
> >>
> >> For the threads, I'll re-iterate my thought that throwing more
> >> cores at the problem is absolutely the last thing we should do.  Only
> >> if there is no other choice.  Simply because many parts of
> >> ovsdb-server were never optimized for performance and there are
> >> likely many things we can do to improve without blindly using more
> >> resources and increasing the code complexity by adding threads.
> >>
> >>
> >> Speaking of suboptimal code, the conversion process seems very
> >> inefficient.  Let's deconstruct it.  (I'll skip the standalone
> >> case, focusing on the clustered mode.)
> >>
> >> There are a few main steps:
> >>
> >> 1. ovsdb_convert() - Creates a copy of a database converting
> >>    each column along the way and checks the constraints.
> >>
> >> 2. ovsdb_to_txn_json() - Converts the new database into a
> >>    transaction JSON object.
> >>
> >> 3. ovsdb_txn_propose_schema_change() - Writes the new schema
> >>    and the transaction JSON to the storage (RAFT).
> >>
> >> 4. ovsdb_destroy() - Copy of a database is destroyed.
> >>
> >>    -
> >>
> >> 5. read_db()/parse_txn() - Reads the new schema and the
> >>    transaction JSON from the storage, replaces the current
> >>    database with an empty one and replays the transaction
> >>    that creates a new converted database.
> >>
> >> There is always a storage run between steps 4 and 5, so we generally
> >> care only that steps 1-4 and step 5 are below the election timer
> >> threshold separately.
> >>
> >>
> >> Now, looking closer at step 1, which is the most time-consuming
> >> step.  It has two stages - data conversion and the transaction
> >> check.

Re: [ovs-discuss] ovsdb: schema conversion for clustered db blocks preventing processing of raft election and inactivity probes

2023-01-09 Thread Ilya Maximets via discuss
On 1/8/23 04:51, Han Zhou wrote:
> 
> 
> On Tue, Jan 3, 2023 at 6:07 AM Ilya Maximets via discuss
> <ovs-discuss@openvswitch.org> wrote:
>>
>> On 12/14/22 08:28, Frode Nordahl via discuss wrote:
>> > Hello,
>> >
>> > When performing an online schema conversion for a clustered DB the
>> > `ovsdb-client` connects to the current leader of the cluster and
>> > requests it to convert the DB to a new schema.
>> >
>> > The main thread of the leader ovsdb-server will then parse the new
>> > schema and copy the entire database into a new in-memory copy using
>> > the new schema. For a moderately sized database, let's say 650MB
>> > on-disk, this process can take north of 24 seconds on a modern
>> > adequately performant system.
>> >
>> > While this is happening the ovsdb-server process will not process any
>> > raft election events or inactivity probes, so by the time the
>> > conversion is done and the now past leader wants to write the
>> > converted database to the cluster, its connection to the cluster is
>> > dead.
>> >
>> > The past leader will keep repeating this process indefinitely, until
>> > the client requesting the conversion disconnects. No message is passed
>> > to the client.
>> >
>> > Meanwhile the other nodes in the cluster have moved on with a new leader.
>> >
>> > A workaround for this scenario would be to increase the election timer
>> > to a value great enough so that the conversion can succeed within an
>> > election window.
>> >
>> > I don't view this as a permanent solution though, as it would be
>> > unfair to leave the end user guessing the correct election timer
>> > in order for their upgrades to succeed.
>> >
>> > Maybe we need to hand off conversion to a thread and make the main
>> > loop only process raft requests until it is done, similar to the
>> > recent addition of preparing snapshot JSON in a separate thread [0].
>> >
>> > Any other thoughts or ideas?
>> >
>> > 0: 
>> > https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
>> >
>>
>> Hi, Frode.  Thanks for starting this conversation.
>>
>> First of all I'd still respectfully disagree that 650 MB is a
>> moderately sized database. :)  ovsdb-server on its own doesn't limit
>> users on how much data they can put in, but that doesn't mean there
>> is no limit at which it will be difficult for it or even impossible
>> to handle the database.  From my experience 650 MB is far beyond the
>> threshold for smooth operation.
>>
>> Allowing a database to grow to such a size might be considered a user
>> error, or a CMS error.  In any case, setups should be tested at the
>> desired [simulated at least] scale including upgrades before
>> deploying in a production environment to not run into such issues
>> unexpectedly.
>>
>> Another way out from the situation, besides bumping the election
>> timer, might be to pin ovn-controllers, destroy the database (maybe
>> keep port bindings, etc.) and let northd re-create it after
>> conversion.  Not sure if that will actually work though, as I
>> didn't try.
>>
>>
>> For the threads, I'll re-iterate my thought that throwing more
>> cores at the problem is absolutely the last thing we should do.  Only
>> if there is no other choice.  Simply because many parts of
>> ovsdb-server were never optimized for performance and there are
>> likely many things we can do to improve without blindly using more
>> resources and increasing the code complexity by adding threads.
>>
>>
>> Speaking of suboptimal code, the conversion process seems very
>> inefficient.  Let's deconstruct it.  (I'll skip the standalone
>> case, focusing on the clustered mode.)
>>
>> There are a few main steps:
>>
>> 1. ovsdb_convert() - Creates a copy of a database converting
>>    each column along the way and checks the constraints.
>>
>> 2. ovsdb_to_txn_json() - Converts the new database into a
>>    transaction JSON object.
>>
>> 3. ovsdb_txn_propose_schema_change() - Writes the new schema
>>    and the transaction JSON to the storage (RAFT).
>>
>> 4. ovsdb_destroy() - Copy of a database is destroyed.
>>
>>    -
>>
>> 5. read_db()/parse_txn() - Reads the new schema and the
>>    transaction JSON from the storage, replaces the current
>>    database with an empty one and replays the transaction
>>    that creates a new converted database.
>>
>> There is always a storage run between steps 4 and 5, so we generally
>> care only that steps 1-4 and step 5 are below the election timer
>> threshold separately.
>>
>>
>> Now, looking closer at step 1, which is the most time-consuming
>> step.  It has two stages - data conversion and the transaction
>> check.  The data conversion part makes sure that we're creating all
>> the rows in a new database with all the new columns and without
>> removed columns.  It also makes sure that all the datum objects
>> are converted from the old column type to the new column type b

Re: [ovs-discuss] ovsdb: schema conversion for clustered db blocks preventing processing of raft election and inactivity probes

2023-01-07 Thread Han Zhou via discuss
On Tue, Jan 3, 2023 at 6:07 AM Ilya Maximets via discuss
<ovs-discuss@openvswitch.org> wrote:
>
> On 12/14/22 08:28, Frode Nordahl via discuss wrote:
> > Hello,
> >
> > When performing an online schema conversion for a clustered DB the
> > `ovsdb-client` connects to the current leader of the cluster and
> > requests it to convert the DB to a new schema.
> >
> > The main thread of the leader ovsdb-server will then parse the new
> > schema and copy the entire database into a new in-memory copy using
> > the new schema. For a moderately sized database, let's say 650MB
> > on-disk, this process can take north of 24 seconds on a modern
> > adequately performant system.
> >
> > While this is happening the ovsdb-server process will not process any
> > raft election events or inactivity probes, so by the time the
> > conversion is done and the now past leader wants to write the
> > converted database to the cluster, its connection to the cluster is
> > dead.
> >
> > The past leader will keep repeating this process indefinitely, until
> > the client requesting the conversion disconnects. No message is passed
> > to the client.
> >
> > Meanwhile the other nodes in the cluster have moved on with a new
> > leader.
> >
> > A workaround for this scenario would be to increase the election
> > timer to a value great enough so that the conversion can succeed
> > within an election window.
> > I don't view this as a permanent solution though, as it would be
> > unfair to leave the end user guessing the correct election timer
> > in order for their upgrades to succeed.
> >
> > Maybe we need to hand off conversion to a thread and make the main
> > loop only process raft requests until it is done, similar to the
> > recent addition of preparing snapshot JSON in a separate thread [0].
> >
> > Any other thoughts or ideas?
> >
> > 0:
> > https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
> >
>
> Hi, Frode.  Thanks for starting this conversation.
>
> First of all I'd still respectfully disagree that 650 MB is a
> moderately sized database. :)  ovsdb-server on its own doesn't limit
> users on how much data they can put in, but that doesn't mean there
> is no limit at which it will be difficult for it or even impossible
> to handle the database.  From my experience 650 MB is far beyond the
> threshold for smooth operation.
>
> Allowing a database to grow to such a size might be considered a user
> error, or a CMS error.  In any case, setups should be tested at the
> desired [simulated at least] scale including upgrades before
> deploying in a production environment to not run into such issues
> unexpectedly.
>
> Another way out from the situation, besides bumping the election
> timer, might be to pin ovn-controllers, destroy the database (maybe
> keep port bindings, etc.) and let northd re-create it after
> conversion.  Not sure if that will actually work though, as I
> didn't try.
>
>
> For the threads, I'll re-iterate my thought that throwing more
> cores at the problem is absolutely the last thing we should do.  Only
> if there is no other choice.  Simply because many parts of
> ovsdb-server were never optimized for performance and there are
> likely many things we can do to improve without blindly using more
> resources and increasing the code complexity by adding threads.
>
>
> Speaking of suboptimal code, the conversion process seems very
> inefficient.  Let's deconstruct it.  (I'll skip the standalone
> case, focusing on the clustered mode.)
>
> There are a few main steps:
>
> 1. ovsdb_convert() - Creates a copy of a database converting
>    each column along the way and checks the constraints.
>
> 2. ovsdb_to_txn_json() - Converts the new database into a
>    transaction JSON object.
>
> 3. ovsdb_txn_propose_schema_change() - Writes the new schema
>    and the transaction JSON to the storage (RAFT).
>
> 4. ovsdb_destroy() - Copy of a database is destroyed.
>
>    -
>
> 5. read_db()/parse_txn() - Reads the new schema and the
>    transaction JSON from the storage, replaces the current
>    database with an empty one and replays the transaction
>    that creates a new converted database.
>
> There is always a storage run between steps 4 and 5, so we generally
> care only that steps 1-4 and step 5 are below the election timer
> threshold separately.
>
>
> Now, looking closer at step 1, which is the most time-consuming
> step.  It has two stages - data conversion and the transaction
> check.  The data conversion part makes sure that we're creating all
> the rows in a new database with all the new columns and without
> removed columns.  It also makes sure that all the datum objects
> are converted from the old column type to the new column type by
> calling ovsdb_datum_convert() for every one of them.
>
> Datum conversion is a very heavy operation, because it involves
> converting it to JSON and back.  However, in the vast majority of cases
> column types do not change at all, and even if they do, it only
> happens for a few columns, not for all of them.

Re: [ovs-discuss] ovsdb: schema conversion for clustered db blocks preventing processing of raft election and inactivity probes

2023-01-03 Thread Ilya Maximets via discuss
On 12/14/22 08:28, Frode Nordahl via discuss wrote:
> Hello,
> 
> When performing an online schema conversion for a clustered DB the
> `ovsdb-client` connects to the current leader of the cluster and
> requests it to convert the DB to a new schema.
> 
> The main thread of the leader ovsdb-server will then parse the new
> schema and copy the entire database into a new in-memory copy using
> the new schema. For a moderately sized database, let's say 650MB
> on-disk, this process can take north of 24 seconds on a modern
> adequately performant system.
> 
> While this is happening the ovsdb-server process will not process any
> raft election events or inactivity probes, so by the time the
> conversion is done and the now past leader wants to write the
> converted database to the cluster, its connection to the cluster is
> dead.
> 
> The past leader will keep repeating this process indefinitely, until
> the client requesting the conversion disconnects. No message is passed
> to the client.
> 
> Meanwhile the other nodes in the cluster have moved on with a new leader.
> 
> A workaround for this scenario would be to increase the election timer
> to a value great enough so that the conversion can succeed within an
> election window.
> 
> I don't view this as a permanent solution though, as it would be
> unfair to leave the end user guessing the correct election timer
> in order for their upgrades to succeed.
> 
> Maybe we need to hand off conversion to a thread and make the main
> loop only process raft requests until it is done, similar to the
> recent addition of preparing snapshot JSON in a separate thread [0].
> 
> Any other thoughts or ideas?
> 
> 0: 
> https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790
> 

Hi, Frode.  Thanks for starting this conversation.

First of all I'd still respectfully disagree that 650 MB is a
moderately sized database. :)  ovsdb-server on its own doesn't limit
users on how much data they can put in, but that doesn't mean there
is no limit at which it will be difficult for it or even impossible
to handle the database.  From my experience 650 MB is far beyond the
threshold for smooth operation.

Allowing a database to grow to such a size might be considered a user
error, or a CMS error.  In any case, setups should be tested at the
desired [simulated at least] scale including upgrades before
deploying in a production environment to not run into such issues
unexpectedly.

Another way out from the situation, besides bumping the election
timer, might be to pin ovn-controllers, destroy the database (maybe
keep port bindings, etc.) and let northd re-create it after
conversion.  Not sure if that will actually work though, as I
didn't try.


For the threads, I'll re-iterate my thought that throwing more
cores at the problem is absolutely the last thing we should do.  Only
if there is no other choice.  Simply because many parts of
ovsdb-server were never optimized for performance and there are
likely many things we can do to improve without blindly using more
resources and increasing the code complexity by adding threads.


Speaking of suboptimal code, the conversion process seems very
inefficient.  Let's deconstruct it.  (I'll skip the standalone
case, focusing on the clustered mode.)

There are a few main steps:

1. ovsdb_convert() - Creates a copy of a database converting
   each column along the way and checks the constraints.

2. ovsdb_to_txn_json() - Converts the new database into a
   transaction JSON object.

3. ovsdb_txn_propose_schema_change() - Writes the new schema
   and the transaction JSON to the storage (RAFT).

4. ovsdb_destroy() - Copy of a database is destroyed.

   -

5. read_db()/parse_txn() - Reads the new schema and the
   transaction JSON from the storage, replaces the current
   database with an empty one and replays the transaction
   that creates a new converted database.

There is always a storage run between steps 4 and 5, so we generally
care only that steps 1-4 and step 5 are below the election timer
threshold separately.
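
To make that sequence concrete, here is a minimal, self-contained C
sketch of the control flow; the function names mirror the steps above,
but every type and signature is a simplified stand-in, not the actual
ovsdb-server API:

    /* Toy model of the clustered conversion flow; all types and
     * signatures are simplified stand-ins, not the ovsdb-server API. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct db   { char *schema; char *rows; };
    struct json { char *text; };

    /* Step 1: copy the whole database, converting every column to
     * the new schema and checking constraints (the expensive part). */
    static struct db *
    ovsdb_convert(const struct db *old, const char *new_schema)
    {
        struct db *copy = malloc(sizeof *copy);
        copy->schema = strdup(new_schema);
        copy->rows = strdup(old->rows);  /* per-column conversion elided */
        return copy;
    }

    /* Step 2: serialize the converted copy as one big transaction. */
    static struct json *
    ovsdb_to_txn_json(const struct db *db)
    {
        struct json *txn = malloc(sizeof *txn);
        txn->text = strdup(db->rows);
        return txn;
    }

    /* Step 3: hand the new schema and the transaction to RAFT. */
    static void
    ovsdb_txn_propose_schema_change(const struct db *db,
                                    const struct json *txn)
    {
        printf("RAFT append: schema=%s txn=%s\n", db->schema, txn->text);
    }

    /* Step 4: the in-memory copy is thrown away again. */
    static void
    ovsdb_destroy(struct db *db)
    {
        free(db->schema);
        free(db->rows);
        free(db);
    }

    int
    main(void)
    {
        struct db old = { "old-schema", "rows" };
        struct db *copy = ovsdb_convert(&old, "new-schema");  /* 1 */
        struct json *txn = ovsdb_to_txn_json(copy);           /* 2 */
        ovsdb_txn_propose_schema_change(copy, txn);           /* 3 */
        ovsdb_destroy(copy);                                  /* 4 */
        /* Step 5 happens on the read side: read_db()/parse_txn()
         * later replays the stored transaction into a fresh DB. */
        free(txn->text);
        free(txn);
        return 0;
    }

Since steps 1-4 run back to back on the leader's main thread, it is
their combined duration that has to stay below the election timer.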


Now, looking closer at step 1, which is the most time-consuming
step.  It has two stages - data conversion and the transaction
check.  The data conversion part makes sure that we're creating all
the rows in a new database with all the new columns and without
removed columns.  It also makes sure that all the datum objects
are converted from the old column type to the new column type by
calling ovsdb_datum_convert() for every one of them.

Datum conversion is a very heavy operation, because it involves
converting it to JSON and back.  However, in the vast majority of cases
column types do not change at all, and even if they do, it only
happens for a few columns, not for all of them.

This part can be optimized by not converting columns if the type
didn't change, and just creating a shallow copy of the datum with
an ovsdb_datum_clone() instead.

In my preliminary testing this saves about 70% of the time spent
in ovsdb_convert().
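
As a sketch, that fast path could look roughly like the following;
ovsdb_datum_convert() and ovsdb_datum_clone() are the helpers named
above, but the struct layouts, the prototypes and the type-comparison
helper are simplified stand-ins, not the real ovsdb ones:

    /* Sketch of the proposed fast path: only round-trip a datum
     * through JSON when the column type actually changed; otherwise
     * take a cheap shallow copy. */
    #include <stdbool.h>

    struct ovsdb_type  { int key_type; int value_type; };
    struct ovsdb_datum { unsigned int n; void *keys; void *values; };

    /* Stand-in prototypes for the helpers named above. */
    void ovsdb_datum_clone(struct ovsdb_datum *dst,
                           const struct ovsdb_datum *src);
    void ovsdb_datum_convert(struct ovsdb_datum *dst,
                             const struct ovsdb_type *dst_type,
                             const struct ovsdb_datum *src,
                             const struct ovsdb_type *src_type);

    static bool
    ovsdb_type_equals(const struct ovsdb_type *a,
                      const struct ovsdb_type *b)
    {
        return a->key_type == b->key_type
               && a->value_type == b->value_type;
    }

    void
    convert_column(struct ovsdb_datum *dst, const struct ovsdb_datum *src,
                   const struct ovsdb_type *old_type,
                   const struct ovsdb_type *new_type)
    {
        if (ovsdb_type_equals(old_type, new_type)) {
            /* Type unchanged: shallow copy, no JSON round trip. */
            ovsdb_datum_clone(dst, src);
        } else {
            /* Type changed: datum -> JSON -> datum under new type. */
            ovsdb_datum_convert(dst, new_type, src, old_type);
        }
    }

The point is simply that the expensive JSON round trip is taken only
for the few columns whose type actually changed.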

[ovs-discuss] ovsdb: schema conversion for clustered db blocks preventing processing of raft election and inactivity probes

2022-12-13 Thread Frode Nordahl via discuss
Hello,

When performing an online schema conversion for a clustered DB the
`ovsdb-client` connects to the current leader of the cluster and
requests it to convert the DB to a new schema.

The main thread of the leader ovsdb-server will then parse the new
schema and copy the entire database into a new in-memory copy using
the new schema. For a moderately sized database, let's say 650MB
on-disk, this process can take north of 24 seconds on a modern
adequately performant system.

While this is happening the ovsdb-server process will not process any
raft election events or inactivity probes, so by the time the
conversion is done and the now past leader wants to write the
converted database to the cluster, its connection to the cluster is
dead.

The past leader will keep repeating this process indefinitely, until
the client requesting the conversion disconnects. No message is passed
to the client.

Meanwhile the other nodes in the cluster have moved on with a new leader.

A workaround for this scenario would be to increase the election timer
to a value great enough so that the conversion can succeed within an
election window.

I don't view this as a permanent solution though, as it would be
unfair to leave the end user guessing the correct election timer
in order for their upgrades to succeed.

Maybe we need to hand off conversion to a thread and make the main
loop only process raft requests until it is done, similar to the
recent addition of preparing snapshot JSON in a separate thread [0].
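
To illustrate the idea, here is a minimal standalone sketch of that
hand-off pattern (all names are illustrative; none of this is actual
ovsdb-server code):

    /* Minimal sketch of the hand-off pattern: the long-running
     * conversion moves to a worker thread while the main loop stays
     * responsive to time-critical work such as RAFT elections and
     * inactivity probes. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    static atomic_bool conversion_done;

    static void *
    convert_worker(void *db)
    {
        (void) db;    /* stands in for the database to convert */
        sleep(1);     /* stands in for the 24+ second conversion */
        atomic_store(&conversion_done, true);
        return NULL;
    }

    int
    main(void)
    {
        pthread_t worker;

        pthread_create(&worker, NULL, convert_worker, NULL);
        while (!atomic_load(&conversion_done)) {
            /* Keep servicing raft election events and inactivity
             * probes while the conversion runs. */
            printf("main loop: processing raft events\n");
            usleep(100 * 1000);
        }
        pthread_join(worker, NULL);
        printf("conversion done; propose result to the cluster\n");
        return 0;
    }

The property this buys is that the main loop stays responsive for the
entire time the worker runs, which is exactly what the conversion
currently lacks.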

Any other thoughts or ideas?

0: 
https://github.com/openvswitch/ovs/commit/3cd2cbd684e023682d04dd11d2640b53e4725790

-- 
Frode Nordahl
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss