The ovsdb database is regularly compacted once it grows by a certain amount of data. Compaction reduces the size of the database by bundling all already committed transactions into a single large initial transaction, which allows the database to get rid of all changes that happened to the data in the meantime.
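To illustrate the idea, here is a toy, self-contained sketch (plain C, not ovsdb code; the key/value model and all names are made up for illustration) of what a compaction conceptually does: it replays a log of committed transactions into the final state and keeps only that state, dropping the intermediate history.

    /* Toy illustration of compaction: fold a log of committed key=value
     * transactions into the final state and emit only that state. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_KEYS 16

    struct kv { char key[16]; int value; };

    int
    main(void)
    {
        /* A log of committed transactions; "x" is overwritten twice. */
        const struct kv log[] = { {"x", 1}, {"y", 2}, {"x", 3}, {"x", 7} };
        struct kv state[MAX_KEYS];
        size_t n_state = 0;

        /* Replay the log to rebuild the current state. */
        for (size_t i = 0; i < sizeof log / sizeof log[0]; i++) {
            size_t j;
            for (j = 0; j < n_state; j++) {
                if (!strcmp(state[j].key, log[i].key)) {
                    state[j].value = log[i].value;
                    break;
                }
            }
            if (j == n_state) {
                state[n_state++] = log[i];
            }
        }

        /* The "compacted" output: one record holding only the final state. */
        printf("compacted snapshot:\n");
        for (size_t j = 0; j < n_state; j++) {
            printf("  %s = %d\n", state[j].key, state[j].value);
        }
        return 0;
    }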
To ensure consistency, only a part of the compaction could previously happen asynchronously. Some parts (like writing the actual new file) needed to happen afterwards in the main thread to ensure no new transactions would interfere with the state. As this takes some time (especially for larger databases), in a raft cluster the leader would give up its leadership to ensure transactions can still be processed. However, nothing guarantees that the new leader will not also do a compaction a few seconds later and pass the leadership around again. This means that in the worst case, for an N node cluster, there could be N leadership changes; the average would probably be N/2. Such leadership changes are not free: the raft cluster is unable to process transactions until a new leader is elected (which is generally quite fast in leadership transfers), and all clients that need to connect to the leader have to search for the new leader and reconnect to it.

The idea of this patch series is to minimize the time a compaction blocks the main thread. With the compaction time small enough, there is no need to change the raft leadership anymore. In addition, this minimizes the downtime on non-raft systems as well.

To solve the consistency needs we split the ovsdb database into multiple files. Only one of these files is ever written to for new transactions. When doing a compaction, only the other files, which are not written to, are compacted. Since they are no longer written to, there is no potential consistency issue anymore and the compaction has an immutable state to run on.

In normal operation we therefore have 2 files:

1. The base file. This is also what is passed in as the file name on the CLI. It is immutable and generally contains the last compacted state, or the initial state for a new cluster. It points to the second file using log metadata.
2. The current file. All operations are written to this file just as before.

During compaction we can now:

1. Create a new file to write new transactions to.
2. Write a pointer into the current file that points to the new file.
3. Open a temporary compaction file.
4. Spawn the asynchronous compaction to operate on the current state, i.e. everything in the base and current file. Changes to the new file are ignored by the compaction. The compaction writes its whole output to the temporary file.
5. While waiting for the compaction to complete, process transactions normally by writing them to the new file.
6. When the compaction completes:
   a. Write a pointer to the new file into the temporary compaction file.
   b. Rename the temporary compaction file to the base file (atomically).
   c. The temporary compaction file is now the new base file.
   d. The new file is now the current file.
   e. The previous base and current files are no longer needed and can be removed.

With this, the main thread at most needs to write pointers to new files and rename files, while all the heavy lifting is done in the asynchronous compaction thread (a sketch of this sequence is included further below). This works for raft as well as for a standalone ovsdb, and it means that leader changes are no longer necessary.

Note that this does not affect raft snapshot installation in any way; that still blocks the main thread for some time. In addition, snapshot installation is treated as more important than compaction, so a compaction will be aborted if a snapshot installation is requested.
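Below is a minimal sketch (plain C with POSIX threads, not the actual ovsdb implementation) of the main-thread side of the sequence above. The file names and the helpers append_pointer_record()/compact_files_into() are hypothetical and only stubbed out; rename() and pthread_create() are the real primitives the scheme relies on for atomicity and asynchrony.

    /* Sketch of the proposed online compaction sequence; helper functions are
     * hypothetical stubs so the example compiles and runs. */
    #include <pthread.h>
    #include <stdio.h>

    struct compaction_task {
        const char *base;       /* Immutable base file. */
        const char *current;    /* File that was written to so far. */
        const char *tmp;        /* Temporary compaction output. */
    };

    /* Stand-ins for the real log code. */
    static void
    append_pointer_record(const char *file, const char *points_to)
    {
        printf("append pointer record to %s: next file is %s\n", file, points_to);
    }

    static void
    compact_files_into(const char *base, const char *current, const char *tmp)
    {
        printf("compacting %s + %s into %s\n", base, current, tmp);
    }

    static void *
    compaction_thread(void *task_)
    {
        struct compaction_task *task = task_;

        /* Steps 4-5: compact the now-immutable base + current files into the
         * temporary file while the main thread keeps writing to the new file. */
        compact_files_into(task->base, task->current, task->tmp);
        return NULL;
    }

    int
    main(void)
    {
        static struct compaction_task task = {
            .base = "db.base", .current = "db.current", .tmp = "db.tmp",
        };
        const char *new_file = "db.new";
        pthread_t tid;

        /* Steps 1-3: chain a new file for incoming transactions and start the
         * asynchronous compaction. */
        append_pointer_record(task.current, new_file);
        pthread_create(&tid, NULL, compaction_thread, &task);

        /* ... the main thread keeps committing transactions to db.new ... */

        pthread_join(tid, NULL);

        /* Step 6: point the compacted file at the new file and atomically make
         * it the new base file.  (rename() fails in this stub because db.tmp
         * was never actually created.) */
        append_pointer_record(task.tmp, new_file);
        if (rename(task.tmp, task.base)) {
            perror("rename");
        }
        return 0;
    }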
I tested the performance improvements with an 833 MB database (683 MB after compaction), as reported by the log output (compiled with -O2):

|                         | Before Change | After Change |
|-------------------------|---------------|--------------|
| Init                    | 1015ms        | 1046ms       |
| Write/Commit            | 2869ms        | 137ms        |
| Total Compaction Thread | 29407ms       | 32104ms      |
| Total Main Thread       | 3884ms        | 1183ms       |

The 1 second in "Init" is nearly completely due to cloning the database for the compaction thread. If someone has an idea how to optimize this even further, I would be quite interested. But from my perspective, blocking the main thread for ~1.2s is preferable to a raft leader change, since that causes a longer downtime for all connected services.

Felix Huettner (10):
  raft: Prevent underflows.
  log: Wrap log data in json key.
  log: Cleanup locking on ovsdb_log_open.
  log: Split log and log_file.
  ovsdb: Rename snapshot to compact when compacting.
  log: Store prev_offset also for writes.
  log: Support multiple files.
  ovsdb: Support online compactions.
  raft: Allow compaction when being leader.
  ovsdb: Free more data in compaction thread.

 lib/ovs-thread.c      |    1 +
 lib/ovs-thread.h      |    1 +
 ovsdb/log.c           | 1152 +++++++++++++++++++++++++++++++++--------
 ovsdb/log.h           |   23 +-
 ovsdb/ovsdb-server.c  |   20 +-
 ovsdb/ovsdb-tool.c    |   21 +-
 ovsdb/ovsdb.c         |  171 ++++--
 ovsdb/ovsdb.h         |   29 +-
 ovsdb/raft-private.h  |    7 +
 ovsdb/raft.c          |  150 +++++-
 ovsdb/raft.h          |   20 +-
 ovsdb/storage.c       |  102 +++-
 ovsdb/storage.h       |   29 +-
 ovsdb/transaction.c   |    4 +-
 tests/ovsdb-log.at    |  203 +++++++-
 tests/ovsdb-server.at |   75 +--
 tests/ovsdb-tool.at   |   42 +-
 tests/test-ovsdb.c    |   13 +-
 18 files changed, 1623 insertions(+), 440 deletions(-)

base-commit: 270de5dfb80e3e094d072b38c8b25a9242c73958
--
2.43.0

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev