The current implementation of leadership transfer just shoots the leadership in the general direction of the first stable server in the configuration. It doesn't check whether that server was active recently or even whether the connection is established. This may result in sending leadership to a disconnected or otherwise unavailable server.
Such behavior should not cause log truncation or any other correctness issues, because the destination server either has all the append requests queued up or its connection will be dropped by the leader. In the worst case, we will have a leaderless cluster until the next election timer fires. Other servers will notice the absence of the leader and will trigger a new leader election normally. However, the potential wait for the election timer is not good, as real-world setups may have high values configured.

Fix that by trying to transfer to servers that we know have applied the most changes, i.e., have the highest 'match_index'. Such servers replied to the most recent append requests, so they have the highest chance of being healthy. Choose a random starting point in the list of such servers, so we don't transfer to the same server every single time. This slightly improves load distribution but, most importantly, increases the robustness of our test suite by making it cover more cases. Also check that the message was actually sent without an immediate failure.

If we fail to transfer to any server with the highest index, try to transfer to any other server that is not behind the majority, and then to any other server that is connected. We did actually send them all the updates (if the connection is open); they just haven't replied yet for one reason or another. That should still be better than leaving the cluster without a leader.

Note that there is always a chance that the transfer will fail, since we're not waiting for it to be acknowledged (and must not wait). In that case, a normal election will be triggered after the election timer fires.
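The three-stage fallback described above can be sketched as a standalone helper. This is a minimal illustration only: the function and parameter names are hypothetical, and the real patch works over the 'servers' hmap in ovsdb/raft.c and calls raft_send_to_conn() rather than checking a precomputed 'sendable' flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scan candidates starting from a random offset 'start' and return the
 * index of the first server whose 'match_index' reaches 'threshold' and
 * to which a message can be sent.  Returns -1 if none qualifies, in which
 * case the caller lowers the threshold and retries. */
static int
pick_candidate(const uint64_t *match_index, const bool *sendable,
               size_t n, size_t start, uint64_t threshold)
{
    for (size_t i = 0; i < n; i++) {
        size_t idx = (start + i) % n;

        if (match_index[idx] >= threshold && sendable[idx]) {
            return (int) idx;
        }
    }
    return -1;
}

/* Full selection: prefer servers with the highest 'match_index', then
 * servers that are not behind the majority ('commit_index'), then any
 * server we can still send to. */
static int
pick_with_fallback(const uint64_t *match_index, const bool *sendable,
                   size_t n, size_t start, uint64_t commit_index)
{
    uint64_t highest = 0;

    for (size_t i = 0; i < n; i++) {
        if (match_index[i] > highest) {
            highest = match_index[i];
        }
    }

    int c = pick_candidate(match_index, sendable, n, start, highest);
    if (c < 0 && highest > commit_index) {
        /* Fall back to servers that are not behind the majority. */
        c = pick_candidate(match_index, sendable, n, start, commit_index);
    }
    if (c < 0) {
        /* Last resort: any server the message can be sent to. */
        c = pick_candidate(match_index, sendable, n, start, 0);
    }
    return c;
}
```

The random 'start' only rotates the scan order; within one call the same threshold is applied to every candidate, so the preference for up-to-date servers is preserved while the tie-break among equally good servers varies between calls.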
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
---
CC: Felix Huettner <felix.huettner@mail.schwarz>

 ovsdb/raft.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index f463afcb3..b171da345 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -1261,10 +1261,30 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
         return;
     }

-    struct raft_server *s;
+    struct raft_server **servers, *s;
+    uint64_t threshold = 0;
+    size_t n = 0, start, i;
+
+    servers = xmalloc(hmap_count(&raft->servers) * sizeof *servers);
+
     HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
-        if (!uuid_equals(&raft->sid, &s->sid)
-            && s->phase == RAFT_PHASE_STABLE) {
+        if (uuid_equals(&raft->sid, &s->sid)
+            || s->phase != RAFT_PHASE_STABLE) {
+            continue;
+        }
+        if (s->match_index > threshold) {
+            threshold = s->match_index;
+        }
+        servers[n++] = s;
+    }
+
+    start = n ? random_range(n) : 0;
+
+retry:
+    for (i = 0; i < n; i++) {
+        s = servers[(start + i) % n];
+
+        if (s->match_index >= threshold) {
             struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
             if (!conn) {
                 continue;
@@ -1280,7 +1300,10 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
                     .term = raft->term,
                 }
             };
-            raft_send_to_conn(raft, &rpc, conn);
+
+            if (!raft_send_to_conn(raft, &rpc, conn)) {
+                continue;
+            }

             raft_record_note(raft, "transfer leadership",
                              "transferring leadership to %s because %s",
@@ -1288,6 +1311,23 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
             break;
         }
     }
+
+    if (n && i == n && threshold) {
+        if (threshold > raft->commit_index) {
+            /* Failed to transfer to servers with the highest 'match_index'.
+             * Try other servers that are not behind the majority. */
+            threshold = raft->commit_index;
+        } else {
+            /* Try any other server.  It is safe, because they either have all
+             * the append requests queued up for them before the leadership
+             * transfer message or their connection is broken and we will not
+             * transfer anyway. */
+            threshold = 0;
+        }
+        goto retry;
+    }
+
+    free(servers);
 }

 /* Send a RemoveServerRequest to the rest of the servers in the cluster. */
--
2.44.0