On 6/7/23 12:23, Ilya Maximets wrote:
> On 6/7/23 00:48, Han Zhou wrote:
>>
>>
>> On Tue, Jun 6, 2023 at 3:08 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>
>>> On 6/6/23 22:12, Han Zhou wrote:
>>>>
>>>>
>>>> On Tue, Jun 6, 2023 at 12:10 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>>>
>>>>> On 3/27/23 21:43, Ilya Maximets wrote:
>>>>>> Change sets in the OVSDB monitor store all the changes that happened
>>>>>> between a particular transaction ID and now.  The initial change set
>>>>>> basically contains all the data.
>>>>>>
>>>>>> On each monitor request a new initial change set is created by creating
>>>>>> an empty change set and adding all the database rows.  Then it is
>>>>>> converted into JSON reply and immediately untracked and destroyed.
>>>>>>
>>>>>> This is causing significant performance issues if many clients are
>>>>>> requesting new monitors at the same time.  For example, that is
>>>>>> happening after database schema conversion, because conversion triggers
>>>>>> cancellation of all monitors.  After cancellation, every client sends
>>>>>> a new monitor request.  The server then creates a new initial change
>>>>>> set, sends a reply, destroys initial change set and repeats that for
>>>>>> each client.  On a system with 200 MB database and 500 clients,
>>>>>> cluster of 3 servers spends 20 minutes replying to all the clients
>>>>>> (200 MB x 500 = 100 GB):
>>>>>>
>>>>>>   timeval|WARN|Unreasonably long 1201525ms poll interval
>>>>>>
>>>>>> Of course, all the clients are already disconnected due to inactivity
>>>>>> at this point.  When they are re-connecting back, server accepts new
>>>>>> connections one at a time, so inactivity probes will not be triggered
>>>>>> anymore, but it still takes another 20 minutes to handle all the
>>>>>> incoming connections.
>>>>>>
>>>>>> Let's keep the initial change set around for as long as the monitor
>>>>>> itself exists.  This will allow us to not construct a new change set
>>>>>> on each new monitor request and even utilize the JSON cache in some
>>>>>> cases.  All that at a relatively small maintenance cost, since we'll
>>>>>> need to commit changes to one extra change set on every transaction.
>>>>>> Measured memory usage increase due to keeping around a shallow copy
>>>>>> of a database is about 10%.  Measured CPU usage difference during
>>>>>> normal operation is negligible.
>>>>>>
>>>>>> With this change it takes only 30 seconds to send out all the monitor
>>>>>> replies in the example above.  So, it's a 40x performance improvement.
>>>>>> On a more reasonable setup with 250 nodes, the process takes up to
>>>>>> 8-10 seconds instead of 4-5 minutes.
>>>>>>
>>>>>> Conditional monitoring will benefit from this change as well, however
>>>>>> results might be less impressive due to lack of JSON cache.
>>>>>>
>>>>>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
>>>>>> ---
>>>>>>  ovsdb/monitor.c | 5 ++++-
>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/ovsdb/monitor.c b/ovsdb/monitor.c
>>>>>> index 191befcae..3cdd03b20 100644
>>>>>> --- a/ovsdb/monitor.c
>>>>>> +++ b/ovsdb/monitor.c
>>>>>> @@ -609,7 +609,10 @@ ovsdb_monitor_untrack_change_set(struct ovsdb_monitor *dbmon,
>>>>>>      ovs_assert(mcs);
>>>>>>      if (--mcs->n_refs == 0) {
>>>>>>          if (mcs == dbmon->init_change_set) {
>>>>>> -            dbmon->init_change_set = NULL;
>>>>>> +            /* The initial change set should exist as long as the
>>>>>> +             * monitor itself. */
>>>>>> +            mcs->n_refs++;
>>>>>> +            return;
>>>>>>          } else if (mcs == dbmon->new_change_set) {
>>>>>>              dbmon->new_change_set = NULL;
>>>>>>          }
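The keep-alive trick in the hunk above can be sketched with a toy refcounted change set.  This is a minimal sketch with hypothetical names, not the actual OVSDB structures:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the real ovsdb_monitor structures. */
struct change_set {
    int n_refs;
};

struct monitor {
    struct change_set *init_change_set;
    struct change_set *new_change_set;
};

/* Mirrors the patched ovsdb_monitor_untrack_change_set(): instead of
 * destroying the initial change set when its last user goes away, pin
 * it with an extra reference so later monitor requests can reuse it. */
static void
untrack_change_set(struct monitor *m, struct change_set *cs)
{
    assert(cs);
    if (--cs->n_refs == 0) {
        if (cs == m->init_change_set) {
            /* Keep the initial change set alive as long as the monitor. */
            cs->n_refs++;
            return;
        }
        if (cs == m->new_change_set) {
            m->new_change_set = NULL;
        }
        /* The real code would free the change set here. */
    }
}
```

After a client consumes the initial change set, its reference count drops back to 1 instead of 0, so the next monitor request can reuse the already-built set (and, potentially, the cached JSON reply) instead of rebuilding it from the whole database.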
>>>>>
>>>>> Hi.  I'm thinking about backporting this one patch to 3.1.  It's our
>>>>> latest stable version, and it's starting to get used at scale.  Without
>>>>> this change it's really hard to manage simultaneous re-connections in
>>>>> large (250+ node) OVN clusters.
>>>>>
>>>>> The change itself is fairly simple, so it shouldn't cause any issues in
>>>>> 3.1.  We've been running all our scale tests with it since it was
>>>>> accepted, without any issues.
>>>>>
>>>>> Backporting to 2.17 is probably not advisable, because 2.17 doesn't have
>>>>> the lazy-copy mechanism for database objects.
>>>>>
>>>>> Any thoughts on this?
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>
>>>> Sorry for missing this in the original code review.  I think there is a
>>>> problem with this patch.  (It was one of my TODOs years ago, and this is
>>>> what I found in my notes. :))
>>>>
>>>> It may cause problems because clients monitoring the same tables and
>>>> columns can have different conditions.  If a new client arrives with a
>>>> condition on a column that doesn't exist in the monitor, the new column
>>>> will be added to the monitor in ovsdb_monitor_condition_bind().  However,
>>>> the initial change set generated before that would not have such columns,
>>>> so evaluating conditions for the new client against the initial change
>>>> set would crash.  The solution is to always include all columns in the
>>>> initial change set if we want to keep it.  Maybe we should have a test
>>>> case for this.
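A rough sketch of the hazard described above, with hypothetical names (the real code paths involved are ovsdb_monitor_condition_bind() and the change-set condition evaluation): a change-set row only carries the columns the monitor tracked when the set was built, so a condition on a later-added column finds nothing to compare against.

```c
#include <stddef.h>
#include <string.h>

/* Toy model of a change-set row: it only carries the columns that the
 * monitor tracked at the time the change set was created. */
struct cs_row {
    const char **columns;   /* names of columns present in this row */
    const char **values;    /* corresponding values */
    size_t n;
};

/* Look up a column's value in a change-set row; returns NULL if the
 * column was not part of the monitor when the set was built. */
static const char *
cs_row_get(const struct cs_row *row, const char *column)
{
    for (size_t i = 0; i < row->n; i++) {
        if (!strcmp(row->columns[i], column)) {
            return row->values[i];
        }
    }
    return NULL;
}

/* Condition evaluation must tolerate a missing column; blindly
 * dereferencing the lookup result is the crash described above. */
static int
condition_matches(const struct cs_row *row, const char *column,
                  const char *expected)
{
    const char *value = cs_row_get(row, column);
    return value && !strcmp(value, expected);
}
```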
>>>
>>>
>>> Hmm.  So, it's a case where a new monitor has a condition on a column
>>> that it doesn't monitor.  Right?
>>>
>>> It should not be hard to write a test for this scenario.
>>>
>>> I guess another solution would be to destroy the initial change set
>>> whenever a new column is added to the monitor.  That should not really
>>> happen in real-world applications; I don't think OVN, for example, has
>>> conditions on columns that it doesn't monitor.  So, destruction will not
>>> be triggered in normal cases, and we will avoid the side effects, like
>>> increased memory/CPU usage, that adding all the columns would bring.
>>>
>> Having conditions on columns that a client doesn't monitor is probably
>> not abnormal, although I can't immediately recall any such use case in OVN.
> 
> Me neither, but one simple use case would be a condition like
> '<column B> equals A': there is no need to monitor column B, because
> every row received from the server will have the same value in that
> column.
> 
>> However, I like your approach because, at scale, a large number of clients
>> should have condition columns in common, so destruction should happen only
>> in very rare cases, even if many of them have conditions on columns that
>> they don't monitor.
> 
> Yeah, also columns never seem to be removed from the monitor,
> so the total number of cases where the change set needs to be
> destroyed is limited by the number of columns in the table.
> 
>>
>>> Maybe something like this (not tested):
>>>
>>> diff --git a/ovsdb/monitor.c b/ovsdb/monitor.c
>>> index 04dcd2298..247f6f615 100644
>>> --- a/ovsdb/monitor.c
>>> +++ b/ovsdb/monitor.c
>>> @@ -478,6 +478,7 @@ ovsdb_monitor_add_column(struct ovsdb_monitor *dbmon,
>>>                           enum ovsdb_monitor_selection select,
>>>                           bool monitored)
>>>  {
>>> +    struct ovsdb_monitor_change_set *mcs;
>>>      struct ovsdb_monitor_table *mt;
>>>      struct ovsdb_monitor_column *c;
>>>
>>> @@ -488,6 +489,18 @@ ovsdb_monitor_add_column(struct ovsdb_monitor *dbmon,
>>>          return column->name;
>>>      }
>>>
>>> +    mcs = dbmon->init_change_set;
>>> +    if (mcs) {
>>> +        /* A new column is going to be added to the monitor.  Existing
>>> +         * initial change set doesn't have it, so can no longer be used.
>>> +         * Initial change set is never used by more than one session at
>>> +         * the same time, so it's safe to destroy it here. */
>>> +        ovs_assert(mcs->n_refs == 1);
>>> +        ovsdb_monitor_json_cache_destroy(dbmon, mcs);
>>> +        ovsdb_monitor_change_set_destroy(mcs);
>>> +        dbmon->init_change_set = NULL;
>>> +    }
>>> +
>>>      if (mt->n_columns >= mt->allocated_columns) {
>>>          mt->columns = x2nrealloc(mt->columns, &mt->allocated_columns,
>>>                                   sizeof *mt->columns);
>>> ---
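The same invalidation idea in miniature, using toy structures rather than the real ovsdb_monitor ones: adding a column the monitor doesn't know yet throws away the cached initial change set, since that set was built without the new column.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy monitor: caches an initial change set built from the current
 * column list. */
struct toy_monitor {
    struct toy_change_set { int n_refs; } *init_change_set;
    size_t n_columns;
};

/* Mirrors the proposed hunk in ovsdb_monitor_add_column(): a column
 * unknown to the monitor invalidates the cached initial change set.
 * The initial set is never used by more than one session at a time,
 * so a single reference is expected. */
static void
toy_monitor_add_column(struct toy_monitor *m)
{
    if (m->init_change_set) {
        assert(m->init_change_set->n_refs == 1);
        free(m->init_change_set);
        m->init_change_set = NULL;
    }
    m->n_columns++;
}
```

Since columns are only ever added, each table can trigger this at most once per column, so the cost of occasionally rebuilding the initial change set stays bounded.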
>>>
>>> What do you think?
>>>
>>
>> At a quick glance, it looks good to me.
> 
> Ack.  I'll work on a unit test for this and post the formal patch.
> Thanks!
> 
>>
>> Thanks,
>> Han
>>
>>>>
>>>> With this addressed, I have no objection to backporting to 3.1.

I backported the patch and the fix to 3.1 now.  Thanks!

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
