> On Apr 11, 2024, at 9:43 AM, Chris Riches <chris.ric...@nutanix.com> wrote:
> 
> On 11/04/2024 14:24, Ilya Maximets wrote:
>> On 4/11/24 10:59, Chris Riches wrote:
>>> From what we know so far, the DB was full of stale connection-tracking
>>> information such as the following:
>>> 
>>> [...]
>>> 
>>> Once the host was recovered by putting in the timeout increase,
>>> ovsdb-server successfully started and GCed the database down from 2.4
>>> *GB* to 29 *KB*. Had this happened before the host restart, we would
>>> have never seen this problem. But since it seems possible to end up
>>> booting with such a large DB, we figured a timeout increase was a
>>> sensible measure to take.
>> Uff.  Sounds like ovn-controller went off the rails.
>> 
>> Normally, ovsdb-server compacts the database once in 10-20 minutes,
>> if the database doubles the size since the previous check.  If all
>> the transactions are that small, it would mean ovn-controller made
>> about 10K transactions per second in the 10-20 minutes before the
>> restart.  That's huge.
>> 
>> I wonder if this can be addressed with a better compaction strategy.
>> Something like forcing compaction if "the database is more than 10 MB
>> and increased 10x" regardless of the time.
> 
> I'm not sure exactly what the test was doing when this was observed, so I 
> don't know whether that transaction volume is within the realm of possibility 
> or if we're looking at a failure to perform compaction on time. It would be 
> nice to have an enhanced safety-net for DB size, as we were only a few 
> hundred MB away from hitting filesystem space issues as well.
> 
>> Normally, ovsdb-server compacts the database once in 10-20 minutes, if the 
>> database doubles the size since the previous check.
> 
> I presume you mean if it doubled in size since the previous *compaction*? If 
> we only compact when it doubles since the last *check*, then it would be easy 
> for it to slightly-less-than-double every 10-20 minutes and never trigger the 
> compaction while still growing exponentially.
> 
> I'm happy to discuss compaction approaches (though my expertise is very much 
> in host service management and not OVS itself), but do you think there's 
> merit in having this extended timeout as a backstop too?

FWIW, I think we should do both: extend the timeout and tune the
compaction. A service that can get stuck in an endless loop because it
takes too long, for whatever reason, is problematic. Addressing the
root cause (compaction, too many calls, some other bug(s), etc.) is
good, but extending the timeout seems like an easy backstop.
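
To make the compaction idea concrete, here is a rough sketch of the two
rules discussed in this thread. It is an illustration only, not
ovsdb-server's actual logic: the function name, parameter names, and
default values are hypothetical. It assumes growth is measured against
the size right after the last compaction (Chris's reading), and it
takes the lower bound of the "10-20 minutes" window as the regular
interval.

```python
# Hypothetical sketch of the compaction rules discussed in this thread.
# Names and thresholds are illustrative; this is not ovsdb-server code.
MB = 1024 * 1024


def should_compact(size_now, size_after_last_compaction,
                   seconds_since_compaction,
                   min_interval=10 * 60,     # assumed: lower bound of "10-20 minutes"
                   force_min_size=10 * MB,   # "more than 10 MB"
                   force_factor=10):         # "increased 10x"
    """Return True if the database file should be compacted now."""
    # Proposed safety net: large and grown 10x, regardless of elapsed time.
    if (size_now > force_min_size
            and size_now >= force_factor * size_after_last_compaction):
        return True
    # Regular rule as described: interval elapsed and size at least doubled.
    # Growth is measured from the last compaction, not the last check, so
    # repeated slightly-less-than-2x growth between checks cannot slip by.
    return (seconds_since_compaction >= min_interval
            and size_now >= 2 * size_after_last_compaction)
```

With these assumed thresholds, a file that has grown from 29 KB to
2.4 GB would be compacted immediately, even one minute after the
previous compaction, instead of waiting for the regular interval.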

Jon

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
