On Thu, Apr 18, 2024 at 03:35:06PM +0100, Chris Riches wrote:
> On 15/04/2024 14:39, Jon Kohler wrote:
> > > On Apr 11, 2024, at 9:43 AM, Chris Riches <chris.ric...@nutanix.com>
> > > wrote:
> > >
> > > On 11/04/2024 14:24, Ilya Maximets wrote:
> > > > On 4/11/24 10:59, Chris Riches wrote:
> > > > > From what we know so far, the DB was full of stale
> > > > > connection-tracking information such as the following:
> > > > >
> > > > > [...]
> > > > >
> > > > > Once the host was recovered by putting in the timeout increase,
> > > > > ovsdb-server successfully started and GCed the database down from 2.4
> > > > > *GB* to 29 *KB*. Had this happened before the host restart, we would
> > > > > have never seen this problem. But since it seems possible to end up
> > > > > booting with such a large DB, we figured a timeout increase was a
> > > > > sensible measure to take.
> > > > Uff. Sounds like ovn-controller went off the rails.
> > > >
> > > > Normally, ovsdb-server compacts the database once in 10-20 minutes,
> > > > if the database doubles the size since the previous check. If all
> > > > the transactions are that small, it would mean ovn-controller made
> > > > about 10K transactions per second in the 10-20 minutes before the
> > > > restart. That's huge.
> > > >
> > > > I wonder if this can be addressed with a better compaction strategy.
> > > > Something like forcing compaction if "the database is more than 10 MB
> > > > and increased 10x" regardless of the time.
> > > I'm not sure exactly what the test was doing when this was observed, so I
> > > don't know whether that transaction volume is within the realm of
> > > possibility or if we're looking at a failure to perform compaction on
> > > time. It would be nice to have an enhanced safety-net for DB size, as we
> > > were only a few hundred MB away from hitting filesystem space issues as
> > > well.
> > >
> > > > Normally, ovsdb-server compacts the database once in 10-20 minutes, if
> > > > the database doubles the size since the previous check.
> > > I presume you mean if it doubled in size since the previous *compaction*?
> > > If we only compact when it doubles since the last *check*, then it would
> > > be easy for it to slightly-less-than-double every 10-20 minutes and never
> > > trigger the compaction while still growing exponentially.
> > >
> > > I'm happy to discuss compaction approaches (though my expertise is very
> > > much in host service management and not OVS itself), but do you think
> > > there's merit in having this extended timeout as a backstop too?
> > FWIW, I think we should do both extending the time out and tuning up the
> > compaction, as having a situation where a service can get in an endless
> > loop if for whatever reason it takes too long is problematic. Addressing
> > the root cause (compaction, too many calls, some other bug(s) etc) is
> > good, but extending the timeout seems like an easy backstop.
>
> I agree with Jon's assessment - regardless of any action taken on compaction
> or preventing growth in the first place, we should consider the proposed
> timeout increase as a backstop against getting stuck in an infinite loop.
>
> Ilya (or another maintainer) - can I get an opinion on this?

Yes, I agree that the timeout increase is a good idea.

Acked-by: Simon Horman <ho...@ovn.org>
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
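
For illustration only, here is a rough sketch of the two compaction points raised in the quoted discussion. It is not ovsdb-server's code. The names `should_compact` and `checks_until_compaction`, the 600-second minimum interval, and the 1.9x-per-check growth rate are all assumptions made for the example. Only the doubling rule and the "more than 10 MB and increased 10x" trigger come from the thread.

```python
MB = 1024 * 1024


def should_compact(size, baseline, elapsed_s,
                   min_interval_s=600, growth=2,
                   force_min_size=10 * MB, force_growth=10):
    """Decide whether to compact (hypothetical helper, not OVS code).

    size:      current database size in bytes
    baseline:  size right after the previous compaction
    elapsed_s: seconds since the previous compaction
    """
    # Size-based safety net: fires regardless of elapsed time.
    if size > force_min_size and size >= force_growth * baseline:
        return True
    # Regular rule: only after the minimum interval, and only on doubling.
    return elapsed_s >= min_interval_s and size >= growth * baseline


def checks_until_compaction(baseline_mode, start=MB, factor=1.9, checks=10):
    """Count periodic checks until the doubling rule fires, else None.

    baseline_mode "check" compares against the size seen at the previous
    check; "compaction" compares against the size after the last compaction.
    """
    baseline = size = start
    for n in range(1, checks + 1):
        size = int(size * factor)      # assumed growth between checks
        if size >= 2 * baseline:
            return n
        if baseline_mode == "check":
            baseline = size            # baseline drifts up with every check
    return None


if __name__ == "__main__":
    print(checks_until_compaction("check"))
    print(checks_until_compaction("compaction"))
    print(should_compact(20 * MB, 1 * MB, elapsed_s=5))
```

With a per-check baseline, growth just under 2x never satisfies the doubling rule, which is the concern raised about "check" versus "compaction". With a per-compaction baseline, the same growth trips the rule on the second check. The size-based trigger is independent of the timer, so a sudden burst is caught without waiting for the interval to expire.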