On 7/15/21 3:32 PM, Dumitru Ceara wrote:
> Hi Ilya,
>
> On 7/14/21 6:52 PM, Ilya Maximets wrote:
>> On 7/14/21 3:50 PM, Ilya Maximets wrote:
>>> Replication can be used to scale out read-only access to the
>>> database, but there are clients that are not read-only, but
>>> read-mostly.  One of the main examples is ovn-controller, which
>>> mostly monitors updates from the Southbound DB, but needs to claim
>>> ports by sending transactions that change some database tables.
>>>
>>> The Southbound database serves lots of connections: all connections
>>> from ovn-controllers and some service connections from the cloud
>>> infrastructure, e.g. some OpenStack agents monitoring updates.
>>> At high scale and with a large database, ovsdb-server spends too
>>> much time processing monitor updates, and it's required to move
>>> this load somewhere else.  This patch set aims to introduce the
>>> functionality required to scale out read-mostly connections by
>>> introducing a new OVSDB 'relay' service model.
>>>
>>> In this new service model, ovsdb-server connects to an existing
>>> OVSDB server and maintains an in-memory copy of the database.  It
>>> serves read-only transactions and monitor requests on its own, but
>>> forwards write transactions to the relay source.
>>>
>>> Key differences from the active-backup replication:
>>> - support for "write" transactions.
>>> - no on-disk storage (probably, faster operation).
>>> - support for multiple remotes (connect to the clustered db).
>>> - doesn't try to keep the connection alive as long as possible, but
>>>   quickly reconnects to other remotes to avoid missing updates.
>>> - no need to know the complete database schema beforehand,
>>>   only the schema name.
>>> - can be used along with other standalone and clustered databases
>>>   by the same ovsdb-server process (doesn't turn the whole
>>>   jsonrpc server into read-only mode).
>>> - supports the modern version of monitors (monitor_cond_since),
>>>   because it is based on ovsdb-cs.
>>> - could be chained, i.e. multiple relays could be connected
>>>   one to another in a row or in a tree-like form.
>>>
>>> Bringing all of the above functionality to the existing
>>> active-backup replication doesn't look right, as it would make it
>>> less reliable for the actual backup use case, and it would also be
>>> much harder from the implementation point of view, because the
>>> current replication code is not based on ovsdb-cs or idl, so all
>>> the required features would likely be duplicated, or replication
>>> would be fully re-written on top of ovsdb-cs with severe
>>> modifications of the former.
>>>
>>> Relay is somewhere in the middle between active-backup replication
>>> and the clustered model, taking a lot from both, and is therefore
>>> hard to implement on top of either of them.
>>>
>>> To run ovsdb-server in relay mode, the user simply needs to run:
>>>
>>>   ovsdb-server --remote=punix:db.sock relay:<schema-name>:<remotes>
>>>
>>> e.g.
>>>
>>>   ovsdb-server --remote=punix:db.sock \
>>>       relay:OVN_Southbound:tcp:127.0.0.1:6642
>>>
>>> More details and examples are in the documentation in the last
>>> patch of the series.
>>>
>>> I actually tried to implement transaction forwarding on top of
>>> active-backup replication in v1 of this series, but it required
>>> a lot of tricky changes, including schema format changes, in order
>>> to bring the required information to the end clients, so I decided
>>> to fully rewrite the functionality in v2 with a different approach.
>>>
>>>
>>> Testing
>>> =======
>>>
>>> Some scale tests were performed with OVSDB Relays, mimicking OVN
>>> workloads with ovn-kubernetes.
>>> Tests were performed with ovn-heater
>>> (https://github.com/dceara/ovn-heater) on the scenario
>>> ocp-120-density-heavy:
>>>
>>> https://github.com/dceara/ovn-heater/blob/master/test-scenarios/ocp-120-density-heavy.yml
>>>
>>> In short, the test gradually creates a lot of OVN resources and
>>> checks that the network is configured correctly (by pinging
>>> different namespaces).
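[As an editorial aside: the read/write split of the relay service model described in the cover letter above can be sketched in a few lines of plain Python.  This is a toy model only -- the class and method names below are invented for illustration and are not OVSDB code -- showing that a relay answers reads from its in-memory copy and forwards write transactions to the relay source.]

```python
class SourceSketch:
    """Toy stand-in for the main OVSDB server (the relay source)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value):
        # The source commits the transaction to its own database.
        self.data[key] = value


class RelaySketch:
    """Toy model of a relay: reads are served from the local in-memory
    copy; writes are forwarded to the source."""

    def __init__(self, source):
        self.source = source
        self.cache = dict(source.data)  # in-memory copy of the database

    def read(self, key):
        # Read-only requests never touch the source.
        return self.cache.get(key)

    def write(self, key, value):
        # Write transactions are forwarded to the relay source.  The
        # direct cache update below just stands in for the monitor
        # update the relay would receive back from the source.
        self.source.write(key, value)
        self.cache[key] = value


source = SourceSketch()
source.write("lsp-1", "chassis-a")

relay = RelaySketch(source)
print(relay.read("lsp-1"))       # served from the relay's copy
relay.write("lsp-1", "chassis-b")
print(source.data["lsp-1"])      # the write reached the source
```

[In the real implementation, the relay's copy is kept in sync via monitor updates received over ovsdb-cs, rather than by updating the cache directly on write as this sketch does.]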
>>> The test includes 120 chassis (created by ovn-fake-multinode),
>>> 31250 LSPs spread evenly across 120 LSes, 3 LBs with 15625 VIPs
>>> each, attached to all node LSes, etc.  The test was performed with
>>> monitor-all=true.
>>>
>>> Note 1:
>>> - Memory consumption is checked at the end of a test in the
>>>   following way: 1) check RSS; 2) compact the database; 3) check
>>>   RSS again.
>>>   It's observed that ovn-controllers in this test are fairly slow
>>>   and backlog builds up on monitors, because ovn-controllers are
>>>   not able to receive updates fast enough.  This contributes to the
>>>   RSS of the process, especially in combination with a glibc bug
>>>   (glibc doesn't free fastbins back to the system).  Memory
>>>   trimming on compaction is enabled in the test, so after
>>>   compaction we can see a more or less real value of the RSS at the
>>>   end of the test, without backlog noise.  (Compaction on a relay
>>>   in this case is just a plain malloc_trim().)
>>>
>>> Note 2:
>>> - I didn't collect memory consumption (RSS) after compaction for
>>>   the test with 10 relays, because I got the idea only after the
>>>   test was finished and another one had already started, and a run
>>>   takes a significant amount of time.  So, values marked with a
>>>   star (*) are an approximation based on results from other tests,
>>>   hence might not be fully correct.
>>>
>>> Note 3:
>>> - 'Max. poll' is the maximum of the 'long poll intervals' logged by
>>>   ovsdb-server during the test.  Poll intervals that involved
>>>   database compaction (huge disk writes) are the same in all tests
>>>   and excluded from the results.  (The Sb DB size in the test is
>>>   256MB, fully compacted.)  'Number of intervals' is just the
>>>   number of logged unreasonably long poll intervals.
>>>   Also note that ovsdb-server logs only compactions that took > 1s,
>>>   so poll intervals that involved compaction but took under 1s
>>>   cannot be reliably excluded from the test results.
>>>   'central' - main Sb DB servers.
>>>   'relay'   - relay servers connected to central ones.
>>>   'before'/'after' - RSS before and after compaction +
>>>                      malloc_trim().
>>>   'time'           - total time the process spent in the Running
>>>                      state.
>>>
>>>
>>> Baseline (3 main servers, 0 relays):
>>> ++++++++++++++++++++++++++++++++++++++++
>>>
>>>                RSS
>>> central    before     after   clients   time    Max. poll  Number of
>>>                                                            intervals
>>>           7552924   3828848     ~41    109:50     5882       1249
>>>           7342468   4109576     ~43    108:37     5717       1169
>>>           5886260   4109496     ~39     96:31     4990       1233
>>> ---------------------------------------------------------------------
>>>               20G       12G     126    314:58     5882       3651
>>>
>>> 3x3 (3 main servers, 3 relays):
>>> +++++++++++++++++++++++++++++++
>>>
>>>                RSS
>>> central    before     after   clients   time    Max. poll  Number of
>>>                                                            intervals
>>>           6228176   3542164    ~1-5     36:53     2174        358
>>>           5723920   3570616    ~1-5     24:03     2205        382
>>>           5825420   3490840    ~1-5     35:42     2214        309
>>> ---------------------------------------------------------------------
>>>             17.7G     10.6G       9     96:38     2214       1049
>>>
>>> relay      before     after   clients   time    Max. poll  Number of
>>>                                                            intervals
>>>           2174328    726576      37     69:44     5216        627
>>>           2122144    729640      32     63:52     4767        625
>>>           2824160    751384      51     89:09     5980        627
>>> ---------------------------------------------------------------------
>>>                7G      2.2G     120    222:45     5980       1879
>>>
>>> Total:
>>> =====================================================================
>>>             24.7G     12.8G     129    319:23     5980       2928
>>>
>>> 3x10 (3 main servers, 10 relays):
>>> +++++++++++++++++++++++++++++++++
>>>
>>>                RSS
>>> central    before     after   clients   time    Max. poll  Number of
>>>                                                            intervals
>>>           6190892       ---    ~1-6     42:43     2041        634
>>>           5687576       ---    ~1-5     27:09     2503        405
>>>           5958432       ---    ~1-7     40:44     2193        450
>>> ---------------------------------------------------------------------
>>>             17.8G     ~10G*      16    110:36     2503       1489
>>>
>>> relay      before     after   clients   time    Max. poll  Number of
>>>                                                            intervals
>>>           1331256       ---       9     22:58     1327        140
>>>           1218288       ---      13     28:28     1840        621
>>>           1507644       ---      19     41:44     2869        623
>>>           1257692       ---      12     27:40     1532        517
>>>           1125368       ---       9     22:23     1148        105
>>>           1380664       ---      16     35:04     2422        619
>>>           1087248       ---       6     18:18     1038          6
>>>           1277484       ---      14     34:02     2392        616
>>>           1209936       ---      10     25:31     1603        451
>>>           1293092       ---      12     29:03     2071        621
>>> ---------------------------------------------------------------------
>>>             12.6G     5-7G*     120    285:11     2869       4319
>>>
>>> Total:
>>> =====================================================================
>>>             30.4G   15-17G*     136    395:47     2869       5808
>
> This is very cool, thanks for taking the time to share all this data!
>
>>>
>>>
>>> Conclusions from the test:
>>> ==========================
>>>
>>> 1. Relays relieve a lot of pressure from the main Sb DB servers.
>>>    In my testing, total CPU time on the main servers goes down from
>>>    314 to 96-110 minutes, i.e. 3 times lower.
>>>    During the test, the number of logged 'unreasonably long poll
>>>    intervals' on the main servers goes down by 3-4 times.  At the
>>>    same time, the maximum duration of these intervals goes down by
>>>    a factor of 2.5.  The factor should be even higher with an
>>>    increased number of clients.
>>>
>>> 2. Since the number of clients is significantly lower, memory
>>>    consumption of the main Sb DB servers also goes down by ~12%.
>>>
>>> 3. For the 3x3 test, total memory consumed by all processes
>>>    increased only by 6%, and total CPU usage increased by 1.2%.
>>>    Poll intervals on relay servers are comparable to poll intervals
>>>    on main servers with no relays, but poll intervals on the main
>>>    servers are significantly better (see conclusion #1).  In
>>>    general, it seems that for this test, running 3 relays next to
>>>    the 3 main Sb DB servers significantly increases cluster
>>>    stability and responsiveness without a noticeable increase in
>>>    memory or CPU usage.
>>>
>>> 4. For the 3x10 test, total memory consumed by all processes
>>>    increased by ~50-70%*.
>>>    And total CPU usage increased by 26% compared with
>>
>> ~50-70%* should be ~25-40%*.  I miscalculated because I used 10G
>> from the 3x3 test instead of 12G from the baseline.
>>
>>>    the baseline setup.  At the same time, poll intervals on both
>>>    main and relay servers are lower by a factor of 2-4 (depending
>>>    on the particular server).  In general, the cluster with 10
>>>    relays is much more stable and responsive, with reasonably low
>>>    memory consumption and CPU time overhead.
>>>
>
> Nice!
>
>>>
>>> Future work:
>>> - Add support for transaction history (it could be just inherited
>>>   from the transaction ids received from the relay source).  This
>>>   will allow clients to utilize monitor_cond_since while working
>>>   with a relay.
>>> - Possibly try to inherit min_index from the relay source to give
>>>   clients the ability to detect relays with stale data.
>>> - Probably, add support for both of the above to standalone
>>>   databases, so relays will be able to inherit not only from
>>>   clustered ones.
>
> Nit: I don't think this should block the series, but I think the
> above should be added to ovsdb/TODO.rst in a follow-up patch.
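[Editorial aside: the correction quoted above can be sanity-checked against the "after compaction" RSS totals from the tables in this thread.  A quick check in Python, keeping in mind that the 15-17G figure for the 3x10 test is the starred approximation:]

```python
# RSS totals after compaction + malloc_trim(), from the tables above.
baseline_after_gb = 12.0          # 3 main servers, 0 relays
relays10_after_gb = (15.0, 17.0)  # 3 main servers, 10 relays (approx. *)

# Corrected calculation: increase relative to the 12G baseline.
low = (relays10_after_gb[0] - baseline_after_gb) / baseline_after_gb * 100
high = (relays10_after_gb[1] - baseline_after_gb) / baseline_after_gb * 100
print(f"~{low:.0f}-{high:.0f}%")  # ~25-42%, i.e. the corrected ~25-40%*

# The original ~50-70%* came from mistakenly dividing by the 10.6G
# (~10G) figure from the 3x3 test instead of the 12G baseline:
wrong_low = (15.0 - 10.0) / 10.0 * 100   # 50%
wrong_high = (17.0 - 10.0) / 10.0 * 100  # 70%
```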
Will do.  TODO.rst also needs some clean-up, as it seems that some bits
from there are already implemented.

>
> I just acked the single patch I hadn't acked in v2 (7/9) and left a
> minor comment on 5/9 (which can be fixed at apply time).
>
> The series looks good to me.

Thanks, Mark and Dumitru!  I fixed the small comment on patch 5/9 and
applied the series to master with a minor rebase due to a memory leak
fix that got accepted in the meantime.

Best regards, Ilya Maximets.

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev