The current implementation of leadership transfer just shoots the
leadership in the general direction of the first stable server in the
configuration.  It doesn't check whether the server was active
recently or even whether the connection is established.  This may
result in sending leadership to a disconnected or otherwise
unavailable server.

Such behavior should not cause log truncation or any other correctness
issue, because the destination server will either have all the append
requests queued up for it or have its connection dropped by the
leader.  In the worst case, the cluster will be leaderless until the
next election timer fires.  Other servers will notice the absence of
the leader and will trigger a new leader election normally.  However,
the potential wait for the election timer is a problem, since
real-world setups may have high election timeouts configured.

Fix that by trying to transfer leadership to servers that we know have
applied the most changes, i.e., have the highest 'match_index'.  Such
servers replied to the most recent append requests, so they have the
highest chance of being healthy.  Choose a random starting point in
the list of such servers so we don't transfer to the same server every
single time.  This slightly improves load distribution but, more
importantly, increases the robustness of our test suite by making it
cover more cases.  Also check that the message was actually sent
without an immediate failure.
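
The core of the selection is just a random-offset scan over an array
of candidates.  A minimal standalone sketch of the idea (the
'struct candidate' type and the pick_candidate() helper below are
hypothetical illustrations, and rand() stands in for the OVS
random_range() helper used by the actual patch):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    struct candidate {
        uint64_t match_index;
        bool healthy;  /* Stands in for "connected and send succeeded". */
    };

    /* Scans 'c[0..n-1]' starting from a random offset, wrapping around,
     * and returns the index of the first healthy candidate whose
     * match_index is at or above 'threshold', or -1 if there is none.
     * The random starting offset is what avoids always transferring to
     * the same server. */
    static int
    pick_candidate(const struct candidate *c, size_t n, uint64_t threshold)
    {
        size_t start = n ? (size_t) rand() % n : 0;

        for (size_t i = 0; i < n; i++) {
            size_t idx = (start + i) % n;

            if (c[idx].match_index >= threshold && c[idx].healthy) {
                return (int) idx;
            }
        }
        return -1;
    }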

If we fail to transfer to any server with the highest index, try to
transfer to any other server that is not behind the majority, and then
to any other server that is connected.  We did actually send them all
the updates (if the connection is open); they just haven't replied yet
for one reason or another.  That should still be better than leaving
the cluster without a leader.
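
In other words, the candidate threshold relaxes in at most two steps:
from the highest known 'match_index' down to the commit index, and
then down to zero.  A compilable sketch of just that control flow
(transfer_with_fallback() and the try_at_or_above() callback are
hypothetical stand-ins for the retry loop in the diff below):

    #include <stdbool.h>
    #include <stdint.h>

    /* Calls 'try_at_or_above(threshold)' with progressively lower
     * thresholds: the highest match_index first, then the commit index
     * (servers not behind the majority), then zero (any connected
     * server).  Returns false if no transfer message could be sent. */
    static bool
    transfer_with_fallback(uint64_t highest_match, uint64_t commit_index,
                           bool (*try_at_or_above)(uint64_t))
    {
        uint64_t threshold = highest_match;

        for (;;) {
            if (try_at_or_above(threshold)) {
                return true;    /* Transfer message was sent. */
            }
            if (threshold > commit_index) {
                threshold = commit_index;
            } else if (threshold) {
                threshold = 0;
            } else {
                return false;   /* Leave it to the election timer. */
            }
        }
    }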

Note that there is always a chance that the transfer will fail, since
we're not waiting for it to be acknowledged (and must not wait).  In
that case, a normal election will be triggered after the election
timer fires.

Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
---

CC: Felix Huettner <felix.huettner@mail.schwarz>

 ovsdb/raft.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index f463afcb3..b171da345 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -1261,10 +1261,30 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
         return;
     }
 
-    struct raft_server *s;
+    struct raft_server **servers, *s;
+    uint64_t threshold = 0;
+    size_t n = 0, start, i;
+
+    servers = xmalloc(hmap_count(&raft->servers) * sizeof *servers);
+
     HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
-        if (!uuid_equals(&raft->sid, &s->sid)
-            && s->phase == RAFT_PHASE_STABLE) {
+        if (uuid_equals(&raft->sid, &s->sid)
+            || s->phase != RAFT_PHASE_STABLE) {
+            continue;
+        }
+        if (s->match_index > threshold) {
+            threshold = s->match_index;
+        }
+        servers[n++] = s;
+    }
+
+    start = n ? random_range(n) : 0;
+
+retry:
+    for (i = 0; i < n; i++) {
+        s = servers[(start + i) % n];
+
+        if (s->match_index >= threshold) {
             struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
             if (!conn) {
                 continue;
@@ -1280,7 +1300,10 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
                     .term = raft->term,
                 }
             };
-            raft_send_to_conn(raft, &rpc, conn);
+
+            if (!raft_send_to_conn(raft, &rpc, conn)) {
+                continue;
+            }
 
             raft_record_note(raft, "transfer leadership",
                              "transferring leadership to %s because %s",
@@ -1288,6 +1311,23 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
             break;
         }
     }
+
+    if (n && i == n && threshold) {
+        if (threshold > raft->commit_index) {
+            /* Failed to transfer to servers with the highest 'match_index'.
+             * Try other servers that are not behind the majority. */
+            threshold = raft->commit_index;
+        } else {
+            /* Try any other server.  It is safe, because they either have all
+             * the append requests queued up for them before the leadership
+             * transfer message or their connection is broken and we will not
+             * transfer anyway. */
+            threshold = 0;
+        }
+        goto retry;
+    }
+
+    free(servers);
 }
 
 /* Send a RemoveServerRequest to the rest of the servers in the cluster.
-- 
2.44.0
