On 11/04/2024 17:10, Dumitru Ceara wrote:
On 4/11/24 15:43, Chris Riches wrote:
On 11/04/2024 14:24, Ilya Maximets wrote:
On 4/11/24 10:59, Chris Riches wrote:
Hi Chris, Ilya,

From what we know so far, the DB was full of stale connection-tracking
information such as the following:

[...]

Once the host was recovered by putting in the timeout increase,
ovsdb-server successfully started and GCed the database down from 2.4
*GB* to 29 *KB*. Had this happened before the host restart, we would
have never seen this problem. But since it seems possible to end up
booting with such a large DB, we figured a timeout increase was a
sensible measure to take.
Uff.  Sounds like ovn-controller went off the rails.

Normally, ovsdb-server compacts the database once every 10-20 minutes,
provided the database has doubled in size since the previous check.  If all
the transactions are that small, it would mean ovn-controller made
about 10K transactions per second in the 10-20 minutes before the
restart.  That's huge.
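
As a rough sanity check on that estimate, the figures quoted in this
thread can be turned around to see what per-transaction size they imply.
This is only a sketch: it assumes the whole 2.4 GB (taken as decimal GB)
accumulated within a single 10-20 minute window at a constant rate, and
the function name is made up for illustration.

```python
# Figures taken from the thread; everything else is an assumption.
DB_SIZE_BYTES = 2.4e9      # "2.4 GB", interpreted as decimal gigabytes
RATE_TXN_PER_S = 10_000    # "about 10K transactions per second"


def implied_txn_size(db_size_bytes, rate_per_s, window_minutes):
    """Average bytes per transaction if the DB grew to db_size_bytes
    at a constant rate_per_s over window_minutes."""
    total_txns = rate_per_s * window_minutes * 60
    return db_size_bytes / total_txns


if __name__ == "__main__":
    for minutes in (10, 20):
        size = implied_txn_size(DB_SIZE_BYTES, RATE_TXN_PER_S, minutes)
        print(f"{minutes} min window: ~{size:.0f} bytes per transaction")
```

Under those assumptions the estimate corresponds to transactions of a
few hundred bytes each.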

I wonder if this can be addressed with a better compaction strategy.
Something like forcing compaction if "the database is more than 10 MB
and increased 10x" regardless of the time.
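
To make the proposal concrete, here is a minimal sketch of such a
policy as a pure decision function.  It is not ovsdb-server code: the
function, its parameters and the constants are hypothetical, and the
"normal" rule is modelled only as described above (a minimum interval
plus a doubling in size), using the lower end of the 10-20 minute range.

```python
# Hypothetical thresholds, taken from the numbers mentioned in this thread.
MIN_INTERVAL_S = 10 * 60            # lower end of "10-20 minutes"
FORCE_MIN_SIZE = 10 * 1024 * 1024   # "more than 10 MB"
FORCE_GROWTH = 10                   # "increased 10x"


def should_compact(elapsed_s, size_at_last_check, current_size):
    """Return True if the database should be compacted now.

    elapsed_s:           seconds since the previous compaction check
    size_at_last_check:  database size in bytes at that check
    current_size:        database size in bytes now
    """
    # Normal rule as described in the thread: enough time has passed
    # and the database has at least doubled in size.
    doubled = current_size >= 2 * size_at_last_check
    if elapsed_s >= MIN_INTERVAL_S and doubled:
        return True

    # Proposed safety net: compact regardless of elapsed time if the
    # database is large and has grown by an order of magnitude.
    grew_10x = current_size >= FORCE_GROWTH * size_at_last_check
    return current_size > FORCE_MIN_SIZE and grew_10x


if __name__ == "__main__":
    # 29 KB growing to 2.4 GB within 30 seconds triggers the safety net.
    print(should_compact(30, 29_000, 2_400_000_000))
```

The size floor keeps small databases from being compacted over and over
just because they grew 10x from a tiny starting point.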
I'm not sure exactly what the test was doing when this was observed, so
I don't know whether that transaction volume is within the realm of
possibility or if we're looking at a failure to perform compaction on
time. It would be nice to have an enhanced safety-net for DB size, as we
were only a few hundred MB away from hitting filesystem space issues as
well.

To rule out any known issues, what OVN version is running on that setup?
This was during an upgrade test. We started with OVN 20.9, and the massive DB was produced while running that version. We then upgraded to 21.9 and rebooted, and the system failed to come up, as described, because of the massive DB.

Our networking team doing the RCA think that the system was rapidly flapping external ports between two configurations, hence the excessive DB transactions. The root cause of the flapping is yet to be determined, but these transactions were being issued by OVN itself. They raised the theory that the flapping was so intense that ovsdb didn't get a chance to compact at all. Is this a possibility?

I've CCed Priyankar who is in charge of the RCA.