On 12/04/2024 10:20, Dumitru Ceara wrote:
On 4/11/24 19:10, Chris Riches wrote:
On 11/04/2024 17:10, Dumitru Ceara wrote:
On 4/11/24 15:43, Chris Riches wrote:
On 11/04/2024 14:24, Ilya Maximets wrote:
On 4/11/24 10:59, Chris Riches wrote:
Hi Chris, Ilya,

From what we know so far, the DB was full of stale connection-tracking
information such as the following:

[...]

Once the host was recovered by putting in the timeout increase,
ovsdb-server successfully started and GCed the database down from 2.4
*GB* to 29 *KB*. Had this happened before the host restart, we would
never have seen this problem. But since it seems possible to end up
booting with such a large DB, we figured a timeout increase was a
sensible measure to take.
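(Incidentally, for anyone else recovering from this state: the same
compaction can be forced offline, before the server starts, with
"ovsdb-tool compact <path-to-db>". That sidesteps the startup timeout
entirely. The DB path varies by packaging, so treat that invocation as
a sketch rather than a recipe.)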
Uff.  Sounds like ovn-controller went off the rails.

Normally, ovsdb-server compacts the database once every 10-20 minutes,
if the database has doubled in size since the previous check.  If all
the transactions are that small, it would mean ovn-controller made
about 10K transactions per second in the 10-20 minutes before the
restart.  That's huge.
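(Back of the envelope: assuming each appended transaction record is a
few hundred bytes, say 200-400 B, which is an assumption rather than a
measured figure, 2.4e9 B / ~300 B per record / ~900 s works out to
roughly 10^4 transactions per second, consistent with the estimate
above.)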

I wonder if this can be addressed with a better compaction strategy.
Something like forcing compaction if "the database is more than 10 MB
and increased 10x" regardless of the time.
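As an illustrative sketch only (the names, thresholds, and structure
here are made up for discussion and are not the actual ovsdb-server
code), the combined trigger could look like:

    #include <stdbool.h>
    #include <stdint.h>

    #define COMPACT_MIN_MSEC       (10 * 60 * 1000)   /* lower bound of the 10-20 min window */
    #define FORCE_COMPACT_MIN_SIZE (10 * 1024 * 1024) /* 10 MB absolute floor */

    static bool
    should_compact(uint64_t now_msec, uint64_t last_check_msec,
                   uint64_t size_now, uint64_t size_at_last_compact)
    {
        /* Existing behavior: periodic check, compact if the log has
         * doubled since the last compaction. */
        if (now_msec - last_check_msec >= COMPACT_MIN_MSEC
            && size_now >= 2 * size_at_last_compact) {
            return true;
        }

        /* Proposed safety net: ignore the clock entirely if the
         * database is already large in absolute terms and has grown
         * 10x since the last compaction. */
        if (size_now >= FORCE_COMPACT_MIN_SIZE
            && size_now >= 10 * size_at_last_compact) {
            return true;
        }

        return false;
    }

The second clause is the new part: a runaway writer like the one
described above would trip it long before the log reached gigabytes,
regardless of how the periodic check is scheduled.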
I'm not sure exactly what the test was doing when this was observed, so
I don't know whether that transaction volume is within the realm of
possibility or if we're looking at a failure to perform compaction on
time. It would be nice to have an enhanced safety-net for DB size, as we
were only a few hundred MB away from hitting filesystem space issues as
well.

To rule out any known issues, what OVN version is running on that setup?
This was during an upgrade test. We started with OVN 20.09, and this
produced the massive DB. We then upgraded to 21.09 and rebooted, at
which point ovsdb-server failed to come up, as described, due to the
massive DB.

Both 20.09 and 21.09 have been out of support for a while now. The
currently supported releases are 24.03, 23.09, 23.06, and 22.03:

https://www.ovn.org/en/releases/all_releases/

Our networking team doing the RCA think that the system was rapidly
flapping external ports between two configurations, hence the excessive
volume of DB transactions. The root cause of the flapping is yet to be
determined, but these transactions were being made by OVN itself. They
raised the
Maybe you're missing these commits (it's hard to say without knowing the
exact version you're running - "21.09" is vague, we need the z version
too, e.g. 21.09.0 or 21.09.1):

https://github.com/ovn-org/ovn/commit/d4bca93c08
https://github.com/ovn-org/ovn/commit/6fb87aad8c
Exact versions appear to be the following before upgrade:
[root@pre-upgrade ~]# ovn-controller --version
ovn-controller 20.09.1
Open vSwitch Library 2.14.2
OpenFlow versions 0x6:0x6

And the following after:
[root@post-upgrade ~]# ovn-controller --version
ovn-controller 21.09.2
Open vSwitch Library 2.16.90
OpenFlow versions 0x6:0x6
SB DB Schema 20.21.0

I'll pass this on to the networking team, but I'd also like to take a
step back and look at the original patch proposed here, which is about
increasing the timeout. Do you think that this timeout increase is a
sensible second line of defence against oversized DBs, even if such DBs
can only arise due to a separate historical or future bug? Or do you
feel that preventing the DB from growing in the first place is
sufficient?