On 8/29/25 4:02 PM, Dumitru Ceara wrote:
> On 8/29/25 12:22 PM, Ilya Maximets wrote:
>> On 8/29/25 9:27 AM, Dumitru Ceara wrote:
>>> On 8/28/25 11:55 PM, Ilya Maximets wrote:
>>>> On 8/28/25 11:41 PM, Ilya Maximets wrote:
>>>>> On 8/13/25 7:20 PM, Mark Michelson via dev wrote:
>>>>>> Prior to this commit, if you wanted to find the corresponding northbound
>>>>>> UUID for a southbound Logical Datapath, you could find it in one of two
>>>>>> places:
>>>>>>
>>>>>> For logical switches, it was in external-ids:logical-switch.
>>>>>> For logical routers, it was in external-ids:logical-router.
>>>>>>
>>>>>> With this commit, we are separating the type and UUID into separate
>>>>>> fields. This way, no matter the type of the datapath, you can find the
>>>>>> UUID. And you can find the type of the datapath without having to
>>>>>> potentially check multiple external-ids keys to do so.
>>>>>>
>>>>>> These fields are going to be used pretty heavily by northd in upcoming
>>>>>> patches, so instead of making them external-ids, these are now
>>>>>> full-fledged columns on the southbound Datapath_Binding. The UUID of the
>>>>>> northbound logical datapath is in a column called "nb_uuid" and the type
>>>>>> of the logical datapath is stored in an enumerated column called "type".
>>>>>>
>>>>>> Note that there is a seemingly-unrelated change in the check packet
>>>>>> length test in tests/ovn.at. This test started failing because its use
>>>>>> of `grep` was picking up the new "nb_uuid" column accidentally. The
>>>>>> change to the test uses `fetch_column` since it is more precise.
>>>>>>
>>>>>> Signed-off-by: Mark Michelson <[email protected]>
>>>>>> Acked-by: Ales Musil <[email protected]>
>>>>>> ---
>>>>>> V17:
>>>>>> - cherry picked from v15
>>>>>> ---
>>>>>>  controller/local_data.c |  2 +-
>>>>>>  ic/ovn-ic.c             |  5 +++--
>>>>>>  lib/ovn-util.c          | 45 +++++++++++++++++++++++++++++++++++++++++
>>>>>>  lib/ovn-util.h          |  8 ++++++++
>>>>>>  northd/northd.c         | 38 +++++++++++++++++-----------------
>>>>>>  northd/northd.h         |  5 +++--
>>>>>>  ovn-sb.ovsschema        | 11 ++++++++--
>>>>>>  ovn-sb.xml              | 21 ++++++++++---------
>>>>>>  tests/ovn-controller.at |  4 ++++
>>>>>>  tests/ovn-northd.at     |  6 +++---
>>>>>>  tests/ovn.at            |  6 ++----
>>>>>>  utilities/ovn-sbctl.c   |  4 ++--
>>>>>>  utilities/ovn-trace.c   |  3 +--
>>>>>>  13 files changed, 112 insertions(+), 46 deletions(-)
>>>>>
>>>>> <snip>
>>>>>
>>>>>>  /* Returns an "enum ovn_stage" built from the arguments.
>>>>>> diff --git a/ovn-sb.ovsschema b/ovn-sb.ovsschema
>>>>>> index 4c24f5b51..8bc7bbba3 100644
>>>>>> --- a/ovn-sb.ovsschema
>>>>>> +++ b/ovn-sb.ovsschema
>>>>>> @@ -1,7 +1,7 @@
>>>>>>  {
>>>>>>      "name": "OVN_Southbound",
>>>>>> -    "version": "21.2.0",
>>>>>> -    "cksum": "29145795 34859",
>>>>>> +    "version": "21.3.0",
>>>>>> +    "cksum": "3179330761 35286",
>>>>>>      "tables": {
>>>>>>          "SB_Global": {
>>>>>>              "columns": {
>>>>>> @@ -196,6 +196,13 @@
>>>>>>                  "load_balancers": {"type": {"key": {"type": "uuid"},
>>>>>>                                              "min": 0,
>>>>>>                                              "max": "unlimited"}},
>>>>>> +                "type": {"type": {"key": {"type": "string",
>>>>>> +                                          "enum": ["set",
>>>>>> +                                                      ["logical-switch",
>>>>>> +                                                       
>>>>>> "logical-router"]]}}},
>>>>>
>>>>> Hi, Mark, Dumitru, Ales.
>>>>>
>>>
>>> Hi Ilya,
>>>
>>>>> This change breaks upgrades, at least in active-backup database 
>>>>> configuration.
>>>>>
>>>>> When the old database is converted to the new schema, the default value 
>>>>> for
>>>>> the column is an empty string.  But this is against the schema definition.
>>>>> This is not a problem for the database itself, as it doesn't check these
>>>>> constraints, but if we have the backup database connected, it will check 
>>>>> the
>>>>> constraints upon receiving the new data and will fail:
>>>>>
>>>>> |replication|INFO|OVN_Southbound: Monitor reply received. Resetting the 
>>>>> database.
>>>>> |00045|ovsdb_error|ERR|unexpected ovsdb error: constraint violation: "" 
>>>>> is not
>>>>>                        one of the allowed values ([logical-router, 
>>>>> logical-switch])
>>>>>
>>>>> At this point the backup server will disconnect and turn into active 
>>>>> server
>>>>> as ti thinks that the origin is broken.  We probably need to allow this
>>>>> column to not have a value specified.
>>>>>
>>>>> It's fairly easy to reporoduce - get an older database and start a sandbox
>>>>> with it.  The sb2 will error out and disconnect.
>>>>>
>>>
>>> Nice catch!  I managed to reproduce the problem as following your
>>> suggestion locally: I started a 25.03 sandbox, ran ./ovn-setup.sh, saved
>>> the NB/SB and then started a sandbox on main with the saved DBs.
>>>
>>>>> Also, northd and ovn-controller should be able to handle this case where
>>>>> the column is not yet set after upgrade.  I didn't check if that works or
>>>>> not.
>>>
>>> Yes, that part is handled correctly, see datapath_get_nb_uuid_and_type().
>>>
>>> Are you planning to post a patch to fix this?  I assume we just need to
>>> make the column value optional in the SB schema.
>>
>> I'll probably not have time for this today, so feel free to work on it.
>> Otherwise, I can find some time early next week.
>>
> 
> No worries, I posted a patch here:
> https://patchwork.ozlabs.org/project/ovn/patch/[email protected]/
> 
>>>
>>>>
>>>> And it seems like northd is in the infinite lop of re-writing entire sb.
>>>> Not sure if it is related or some other issue.
>>>>
>>>
>>> Could it be that you didn't have the Static_Mac_Binding fix in your
>>> local main branch?  I don't see the infinite loop in my reproducer.
>>>
>>> https://github.com/ovn-org/ovn/commit/e8a0ade
>>
>> I saw this on the latest main:
>> https://github.com/ovn-org/ovn/commit/b6ee938b9
>>
>> So, both of the static MAC binding changes are there.
>>
> 
> Upon further debugging there is a problem with static MAC bindings but
> that's not what is causing the loop.  Ales just posted a separate patch
> for that.
> 
>> It feels like northd removes datapaths with all the acssociated resources
>> and then re-adds them on the next iteration.  E.g. I see the same static
>> mac binding deleted in one transaction and added back in the next and
>> so on in the loop, same with the logical flows and some other stuff.
>>
>>>
>>> If not, it might be good to share the DBs if possible.
>>
>> I saw this in one of the OpenShift DBs from OVN 24.03.  I can share more
>> details on this particular database privately.
>>
> 
> Thanks for the databases, I managed to reproduce this locally.
> Debugging further, I'll update once I know more.

Hi Ilya,

I turned the datapath schema fix patch into a series which now includes
all the issues that we found with the databases you spotted the bug on
initially:

https://patchwork.ozlabs.org/project/ovn/list/?series=471679&state=*

Thanks again for the bug report, great catch!

Regards,
Dumitru

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to