On 4/11/24 19:10, Chris Riches wrote:
> On 11/04/2024 17:10, Dumitru Ceara wrote:
>> On 4/11/24 15:43, Chris Riches wrote:
>>> On 11/04/2024 14:24, Ilya Maximets wrote:
>>>> On 4/11/24 10:59, Chris Riches wrote:
>> Hi Chris, Ilya,
>>
>>>>>   From what we know so far, the DB was full of stale
>>>>> connection-tracking
>>>>> information such as the following:
>>>>>
>>>>> [...]
>>>>>
>>>>> Once the host was recovered by putting in the timeout increase,
>>>>> ovsdb-server successfully started and GCed the database down from 2.4
>>>>> *GB* to 29 *KB*. Had this happened before the host restart, we would
>>>>> have never seen this problem. But since it seems possible to end up
>>>>> booting with such a large DB, we figured a timeout increase was a
>>>>> sensible measure to take.
>>>> Uff.  Sounds like ovn-controller went off the rails.
>>>>
>>>> Normally, ovsdb-server compacts the database once in 10-20 minutes,
>>>> if the database doubles the size since the previous check.  If all
>>>> the transactions are that small, it would mean ovn-controller made
>>>> about 10K transactions per second in the 10-20 minutes before the
>>>> restart.  That's huge.
>>>>
>>>> I wonder if this can be addressed with a better compaction strategy.
>>>> Something like forcing compaction if "the database is more than 10 MB
>>>> and increased 10x" regardless of the time.
>>> I'm not sure exactly what the test was doing when this was observed, so
>>> I don't know whether that transaction volume is within the realm of
>>> possibility or if we're looking at a failure to perform compaction on
>>> time. It would be nice to have an enhanced safety-net for DB size, as we
>>> were only a few hundred MB away from hitting filesystem space issues as
>>> well.
>>>
>> To rule out any known issues, what OVN version is running on that setup?
> This was during an upgrade test. We started with OVN 20.9, and this
> produced the massive DB. We then upgraded to 21.9 and rebooted, which
> failed to come up as described due to the massive DB.
> 

Neither 20.09 nor 21.09 has been supported for a while now.  The currently
supported releases are 24.03, 23.09, 23.06 and 22.03:

https://www.ovn.org/en/releases/all_releases/

> Our networking team doing the RCA think that the system was rapidly
> flapping external ports between two configurations, hence the excessive
> DB transactions. The root cause of flapping is yet to be determined but
> these transactions were being done from OVN itself. They raised the

Maybe you're missing these commits (it's hard to say without knowing the
exact version you're running - "21.09" is vague, we need the z version
too, e.g. 21.09.0 or 21.09.1):

https://github.com/ovn-org/ovn/commit/d4bca93c08
https://github.com/ovn-org/ovn/commit/6fb87aad8c

> theory that the flapping was so intense that ovsdb didn't actually get a
> chance to compact at all - is this a possibility?
> 

It doesn't sound possible to me but I'll let Ilya comment on this.
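
For reference, here is a rough sketch of the two compaction rules mentioned
earlier in the thread, plus the arithmetic behind the "10K transactions per
second" estimate.  This is not ovsdb-server code: the function names, the
600-second interval (the low end of "10-20 minutes") and the use of decimal
GB are all assumptions made only for illustration.

```python
MB = 1024 * 1024


def periodic_rule(current_size, size_after_last_compaction,
                  seconds_since_last_check, min_interval_s=600):
    """Rule as described in the thread: compact at most once per interval,
    and only if the database has at least doubled since the last check.
    min_interval_s=600 is an assumed value, not taken from ovsdb-server."""
    doubled = current_size >= 2 * size_after_last_compaction
    return seconds_since_last_check >= min_interval_s and doubled


def proposed_rule(current_size, size_after_last_compaction,
                  seconds_since_last_check, min_interval_s=600):
    """Proposed safety net: also force compaction, regardless of time, when
    the database is larger than 10 MB and has grown 10x."""
    forced = (current_size > 10 * MB
              and current_size >= 10 * size_after_last_compaction)
    return forced or periodic_rule(current_size, size_after_last_compaction,
                                   seconds_since_last_check, min_interval_s)


def implied_bytes_per_txn(db_bytes, txn_per_second, window_seconds):
    """Average transaction size needed to reach db_bytes within the window."""
    return db_bytes / (txn_per_second * window_seconds)


if __name__ == "__main__":
    big, small = 2_400_000_000, 29_000  # ~2.4 GB and ~29 KB (decimal units)
    # 100 seconds after the last check, only the proposed rule would fire.
    print(periodic_rule(big, small, 100), proposed_rule(big, small, 100))
    # Bytes per transaction implied by 10K txn/s over 10 and 20 minutes.
    print(implied_bytes_per_txn(big, 10_000, 600),
          implied_bytes_per_txn(big, 10_000, 1200))
```

With these assumed numbers, filling 2.4 GB between two checks at 10K
transactions per second implies a few hundred bytes per transaction.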

> I've CCed Priyankar who is in charge of the RCA.
> 

I've CCed Numan in case he has more ideas about what could cause this.

Regards,
Dumitru
