On 4/11/24 19:10, Chris Riches wrote:
> On 11/04/2024 17:10, Dumitru Ceara wrote:
>> On 4/11/24 15:43, Chris Riches wrote:
>>> On 11/04/2024 14:24, Ilya Maximets wrote:
>>>> On 4/11/24 10:59, Chris Riches wrote:
>> Hi Chris, Ilya,
>>
>>>>> From what we know so far, the DB was full of stale connection-tracking
>>>>> information such as the following:
>>>>>
>>>>> [...]
>>>>>
>>>>> Once the host was recovered by putting in the timeout increase,
>>>>> ovsdb-server successfully started and GCed the database down from 2.4
>>>>> *GB* to 29 *KB*. Had this happened before the host restart, we would
>>>>> have never seen this problem. But since it seems possible to end up
>>>>> booting with such a large DB, we figured a timeout increase was a
>>>>> sensible measure to take.
>>>> Uff. Sounds like ovn-controller went off the rails.
>>>>
>>>> Normally, ovsdb-server compacts the database once in 10-20 minutes,
>>>> if the database doubles the size since the previous check. If all
>>>> the transactions are that small, it would mean ovn-controller made
>>>> about 10K transactions per second in the 10-20 minutes before the
>>>> restart. That's huge.
>>>>
>>>> I wonder if this can be addressed with a better compaction strategy.
>>>> Something like forcing compaction if "the database is more than 10 MB
>>>> and increased 10x" regardless of the time.
>>> I'm not sure exactly what the test was doing when this was observed, so
>>> I don't know whether that transaction volume is within the realm of
>>> possibility or if we're looking at a failure to perform compaction on
>>> time. It would be nice to have an enhanced safety-net for DB size, as we
>>> were only a few hundred MB away from hitting filesystem space issues as
>>> well.
>>>
>> To rule out any known issues, what OVN version is running on that setup?
> This was during an upgrade test. We started with OVN 20.9, and this
> produced the massive DB. We then upgraded to 21.9 and rebooted, which
> failed to come up as described due to the massive DB.
>

Neither 20.09 nor 21.09 has been supported for a while now. Currently
supported releases are 24.03, 23.09, 23.06, 22.03:

https://www.ovn.org/en/releases/all_releases/

> Our networking team doing the RCA think that the system was rapidly
> flapping external ports between two configurations, hence the excessive
> DB transactions. The root cause of flapping is yet to be determined but
> these transactions were being done from OVN itself. They raised the

Maybe you're missing these commits (it's hard to say without knowing the
exact version you're running - "21.09" is vague, we need the z version
too, e.g. 21.09.0 or 21.09.1):

https://github.com/ovn-org/ovn/commit/d4bca93c08
https://github.com/ovn-org/ovn/commit/6fb87aad8c

> theory that the flapping was so intense that ovsdb didn't actually get a
> chance to compact at all - is this a possibility?
>

It doesn't sound possible to me but I'll let Ilya comment on this.

> I've CCed Priyankar who is in charge of the RCA.
>

I've CCed Numan in case he has more ideas about what could cause this.

Regards,
Dumitru
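
For reference, the compaction rule quoted above (compact once the 10-20
minute interval has passed and the database has doubled since the previous
check) and the proposed size-based override ("more than 10 MB and increased
10x", regardless of the time) can be sketched roughly as follows. This is
only an illustration in Python, not ovsdb-server code: DbState,
should_compact_current and should_compact_proposed are hypothetical names,
and the 10-minute interval is an assumed reading of "10-20 minutes".

```python
from dataclasses import dataclass

# Assumed values taken from the wording in the thread, not from ovsdb-server.
MIN_INTERVAL_S = 10 * 60          # lower end of the "10-20 minutes" window
FORCE_MIN_BYTES = 10 * 1024 ** 2  # "more than 10 MB"
FORCE_GROWTH = 10                 # "increased 10x"


@dataclass
class DbState:
    """Hypothetical snapshot of the values the heuristic looks at."""
    size_bytes: int                # current on-disk size
    size_at_previous_check: int    # size recorded at the previous check
    seconds_since_previous_check: float


def should_compact_current(db: DbState) -> bool:
    """Time-gated rule: the interval has elapsed and the size has doubled."""
    if db.seconds_since_previous_check < MIN_INTERVAL_S:
        return False
    return db.size_bytes >= 2 * db.size_at_previous_check


def should_compact_proposed(db: DbState) -> bool:
    """Proposed rule: also compact, regardless of time, on large 10x growth."""
    forced = (db.size_bytes > FORCE_MIN_BYTES
              and db.size_bytes >= FORCE_GROWTH * db.size_at_previous_check)
    return forced or should_compact_current(db)
```

With the sizes reported in this thread (about 2.4 GB against a compacted
size of about 29 KB), the size-based override would trigger even if only a
minute had passed since the previous check, whereas the time-gated rule
would still be waiting for the interval to elapse.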