The ovsdb database is regularly compacted once it exceeds a certain amount of
data growth. The compaction reduces the size of the database by bundling all
already committed transactions into a single large initial transaction. This
allows the DB to get rid of all changes that happened to the data in the
meantime (for example, a row that was inserted, updated and later deleted
disappears from the log entirely).

To ensure consistency, only part of the compaction could previously happen
asynchronously. Some parts (like writing the actual new file) had to happen
afterwards in the main thread to ensure no new transactions would interfere
with the state. As this takes some time (especially for larger databases), in
a raft cluster the leader would give up its leadership so that transactions
could still be processed.

However, nothing guarantees that the new leader will not run a compaction a
few seconds later as well and pass the leadership on again. This means that in
the worst case an N-node cluster could see N leadership changes in a row, and
probably N/2 on average.

Such leadership changes are not free. The raft cluster is unable to process
transactions until a new leader is elected (which is generally quite fast for
leadership transfers). In addition, all clients that need to connect to the
leader have to search for the new leader and reconnect to it.

The idea of this patch series is to minimize the time a compaction blocks the
main thread. With the compaction time small enough, there is no longer any
need to change the raft leadership. In addition, this minimizes the downtime
on non-raft systems as well.

To address the consistency requirements, we split the ovsdb database into
multiple files. Only one of these files is ever written to for new
transactions. A compaction only covers the other files, which are not written
to. Since they no longer change, there is no potential consistency issue
anymore and the compaction has an immutable state to run on.

In normal operation we therefore have 2 files (see the sketch after this
list):
1. The base file. This is also what is passed in as the file name in the CLI.
   It is immutable and generally contains the last compacted state, or the
   initial state for a new cluster. It points to the second file by using
   log metadata.
2. The current file. All operations are written to this file just as normal.
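
As a rough illustration, the resulting chain could be modeled like this (the
struct and field names are hypothetical, not the actual ones from
ovsdb/log.c):

    #include <stdbool.h>

    /* One file in the on-disk chain. */
    struct ovsdb_log_segment {
        char *path;       /* e.g. the file name given on the CLI for
                           * the base file. */
        char *next_path;  /* Successor named by a log metadata record,
                           * or NULL for the current file that still
                           * receives new transactions. */
        bool immutable;   /* True for every file except the current
                           * one, so compaction may safely read it. */
    };

So the base file names the current file, and only the tail of the chain is
ever appended to.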

During compaction we can now (a condensed sketch follows the list):
1. Create a new file to write new transactions to
2. Write a pointer in the current file to point to the new file
3. Open a temporary compaction file
4. Spawn the asynchronous compaction to operate on the current state, i.e.
   everything in the base and current file. Changes to the new file are
   ignored by the compaction. The compaction writes the whole output to the
   temporary file.
5. While waiting for the compaction to complete, process transactions
   normally by writing to the new file
6. When the compaction completes:
    a. Write a pointer to the new file into the temporary compaction file
    b. Rename the temporary compaction file to the base file (atomically)
    c. The temporary compaction file is now the new base file
    d. The new file is now the current file
    e. We can get rid of the previous base and current files as they are no
       longer necessary
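
The key step is 6b: rename(2) atomically replaces the destination on POSIX
systems, so readers see either the complete old base file or the complete new
one, never a partial write. A minimal, self-contained demonstration of the
write-then-rename pattern (the file names and record content are made up; a
real implementation would also fsync() the file, and likely its directory,
around the rename):

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        const char *base = "db.base";
        const char *tmp = "db.base.tmp";

        /* Write the full compacted output, plus the pointer record
         * naming the live file (step 6a), to a temporary file. */
        FILE *f = fopen(tmp, "w");
        if (!f) {
            perror("fopen");
            return EXIT_FAILURE;
        }
        fputs("{\"snapshot\":\"...\",\"next_file\":\"db.new\"}\n", f);
        if (fclose(f)) {
            perror("fclose");
            return EXIT_FAILURE;
        }

        /* Step 6b: atomically replace the old base file. */
        if (rename(tmp, base)) {
            perror("rename");
            return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
    }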
    
With this, the main thread at most needs to write pointers to new files and
rename files, while all the heavy work is done in the async compaction thread.
This works for raft as well as for a standalone ovsdb. It also means that
leader changes are no longer necessary.
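
The division of labor can be sketched with plain pthreads (purely
illustrative; the series builds on OVS's own thread helpers in
lib/ovs-thread.h, and the snapshot here is just a string standing in for the
cloned database state):

    #include <pthread.h>
    #include <stdio.h>

    /* The worker only sees an immutable snapshot, so no locking is
     * needed between it and the main thread. */
    static void *
    compact(void *arg)
    {
        const char *snapshot = arg;
        FILE *tmp = fopen("db.tmp", "w");
        if (tmp) {
            fputs(snapshot, tmp);  /* Stand-in for the real rewrite. */
            fclose(tmp);
        }
        return NULL;
    }

    int
    main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, compact, "compacted state\n");

        /* Meanwhile the main thread keeps appending transactions to
         * the new file, which the worker never touches. */

        pthread_join(tid, NULL);
        return 0;
    }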

Note that this does not affect raft snapshot installation in any way. That
still blocks the main thread for some time. In addition, snapshot installation
is treated as more important than compaction, and a running compaction will be
aborted if a snapshot installation is requested.
    
I tested the performance improvements with an 833 MB database (683 MB after
compaction), as reported by the log output (compiled with -O2).

|                         | Before Change | After Change |
|-------------------------|---------------|--------------|
| Init                    | 1015ms        | 1046ms       |
| Write/Commit            | 2869ms        | 137ms        |
| Total Compaction Thread | 29407ms       | 32104ms      |
| Total Main Thread       | 3884ms        | 1183ms       |

The 1 second in "Init" is almost completely due to cloning the database for
the compaction thread. If someone has an idea how to optimize this even
further, I would be quite interested.

But from my perspective, blocking the main thread for ~1.2s is preferable to a
raft leader change, since that would cause a longer downtime for all connected
services.

Felix Huettner (10):
  raft: Prevent underflows.
  log: Wrap log data in json key.
  log: Cleanup locking on ovsdb_log_open.
  log: Split log and log_file.
  ovsdb: Rename snapshot to compact when compacting.
  log: Store prev_offset also for writes.
  log: Support multiple files.
  ovsdb: Support online compactions.
  raft: Allow compaction when being leader.
  ovsdb: Free more data in compaction thread.

 lib/ovs-thread.c      |    1 +
 lib/ovs-thread.h      |    1 +
 ovsdb/log.c           | 1152 +++++++++++++++++++++++++++++++++--------
 ovsdb/log.h           |   23 +-
 ovsdb/ovsdb-server.c  |   20 +-
 ovsdb/ovsdb-tool.c    |   21 +-
 ovsdb/ovsdb.c         |  171 ++++--
 ovsdb/ovsdb.h         |   29 +-
 ovsdb/raft-private.h  |    7 +
 ovsdb/raft.c          |  150 +++++-
 ovsdb/raft.h          |   20 +-
 ovsdb/storage.c       |  102 +++-
 ovsdb/storage.h       |   29 +-
 ovsdb/transaction.c   |    4 +-
 tests/ovsdb-log.at    |  203 +++++++-
 tests/ovsdb-server.at |   75 +--
 tests/ovsdb-tool.at   |   42 +-
 tests/test-ovsdb.c    |   13 +-
 18 files changed, 1623 insertions(+), 440 deletions(-)


base-commit: 270de5dfb80e3e094d072b38c8b25a9242c73958
-- 
2.43.0

