On 3/26/24 21:10, Han Zhou wrote:
> On Tue, Mar 26, 2024 at 10:26 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> The current implementation of the leadership transfer just shoots the
>> leadership in the general direction of the first stable server in the
>> configuration.  It doesn't check whether the server was active recently
>> or even that the connection is established.  This may result in sending
>> leadership to a disconnected or otherwise unavailable server.
>>
>> Such behavior should not cause log truncation or any other correctness
>> issues, because the destination server would have all the append
>> requests queued up, or the connection will be dropped by the leader.
>> In the worst case, we will have a leader-less cluster until the next
>> election timer fires up.  Other servers will notice the absence of the
>> leader and will trigger a new leader election normally.
>> However, the potential wait for the election timer is not good, as
>> real-world setups may have high values configured.
>>
>> Fix that by trying to transfer to servers that we know have applied
>> the most changes, i.e., have the highest 'match_index'.  Such servers
>> replied to the most recent append requests, so they have the highest
>> chance of being healthy.  Choose a random starting point in the list
>> of such servers, so we don't transfer to the same server every single
>> time.  This slightly improves load distribution and, more importantly,
>> increases the robustness of our test suite, making it cover more
>> cases.  Also check that the message was actually sent without an
>> immediate failure.
>>
>> If we fail to transfer to any server with the highest index, try to
>> transfer to any other server that is not behind the majority, and then
>> to any other server that is connected.  We did actually send them all
>> the updates (if the connection is open); they just didn't reply
>> yet for one reason or another.
>> It should be better than leaving the cluster without a leader.
>>
>> Note that there is always a chance that the transfer will fail, since
>> we're not waiting for it to be acknowledged (and must not wait).
>> In this case, a normal election will be triggered after the election
>> timer fires up.
>>
>> Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
>> ---
>>
>> CC: Felix Huettner <felix.huettner@mail.schwarz>
>>
>>  ovsdb/raft.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 44 insertions(+), 4 deletions(-)
>>
>> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> index f463afcb3..b171da345 100644
>> --- a/ovsdb/raft.c
>> +++ b/ovsdb/raft.c
>> @@ -1261,10 +1261,30 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>>          return;
>>      }
>>
>> -    struct raft_server *s;
>> +    struct raft_server **servers, *s;
>> +    uint64_t threshold = 0;
>> +    size_t n = 0, start, i;
>> +
>> +    servers = xmalloc(hmap_count(&raft->servers) * sizeof *servers);
>> +
>>      HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
>> -        if (!uuid_equals(&raft->sid, &s->sid)
>> -            && s->phase == RAFT_PHASE_STABLE) {
>> +        if (uuid_equals(&raft->sid, &s->sid)
>> +            || s->phase != RAFT_PHASE_STABLE) {
>> +            continue;
>> +        }
>> +        if (s->match_index > threshold) {
>> +            threshold = s->match_index;
>> +        }
>> +        servers[n++] = s;
>> +    }
>> +
>> +    start = n ? random_range(n) : 0;
>> +
>> +retry:
>> +    for (i = 0; i < n; i++) {
>> +        s = servers[(start + i) % n];
>> +
>> +        if (s->match_index >= threshold) {
>>              struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
>>              if (!conn) {
>>                  continue;
>> @@ -1280,7 +1300,10 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>>                      .term = raft->term,
>>                  }
>>              };
>> -            raft_send_to_conn(raft, &rpc, conn);
>> +
>> +            if (!raft_send_to_conn(raft, &rpc, conn)) {
>> +                continue;
>> +            }
>>
>>              raft_record_note(raft, "transfer leadership",
>>                               "transferring leadership to %s because %s",
>> @@ -1288,6 +1311,23 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>>              break;
>>          }
>>      }
>> +
>> +    if (n && i == n && threshold) {
>> +        if (threshold > raft->commit_index) {
>> +            /* Failed to transfer to servers with the highest 'match_index'.
>> +             * Try other servers that are not behind the majority. */
>> +            threshold = raft->commit_index;
>> +        } else {
>> +            /* Try any other server.  It is safe, because they either have all
>> +             * the append requests queued up for them before the leadership
>> +             * transfer message or their connection is broken and we will not
>> +             * transfer anyway. */
>> +            threshold = 0;
>> +        }
>> +        goto retry;
>
> Thanks Ilya.  It seems the retry could try the earlier failed servers (e.g.,
> the ones for which raft_send_to_conn() returned false) one or two more times,
> but that should be fine because the number of servers is very small anyway.
> So this looks good to me.
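[For readers following the retry discussion: the selection and fallback logic of the patch can be modeled as a small standalone function.  Note that `struct server` and `choose_transfer_target()` below are hypothetical inventions for illustration only; the real code works on `struct raft_server` and live connections, and picks `start` with `random_range()`.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for 'struct raft_server' plus connection state. */
struct server {
    uint64_t match_index;   /* Last log index known to be replicated. */
    bool stable;            /* Models phase == RAFT_PHASE_STABLE. */
    bool send_ok;           /* Models raft_send_to_conn() succeeding. */
};

/* Returns the index of the chosen transfer target, or -1 if no server is
 * reachable.  'start' replaces the random starting point so the function
 * is deterministic. */
static int
choose_transfer_target(const struct server *servers, size_t n,
                       uint64_t commit_index, size_t start)
{
    uint64_t threshold = 0;
    size_t i;

    /* First aim for the highest 'match_index' among stable servers. */
    for (i = 0; i < n; i++) {
        if (servers[i].stable && servers[i].match_index > threshold) {
            threshold = servers[i].match_index;
        }
    }

retry:
    for (i = 0; i < n; i++) {
        const struct server *s = &servers[(start + i) % n];

        if (s->stable && s->match_index >= threshold && s->send_ok) {
            return (int) ((start + i) % n);
        }
    }
    if (threshold) {
        /* Lower the bar: first to servers that are not behind the
         * majority ('commit_index'), then to any connected server. */
        threshold = threshold > commit_index ? commit_index : 0;
        goto retry;
    }
    return -1;
}
```

This shows why the extra retries Han mentions are cheap: each retry is just another pass over the (small) server array with a lower threshold, and as Ilya notes below, in the common case the highest match_index equals commit_index, so the first fallback already drops the threshold to zero.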
Yeah, most of the time this array will have just 2 elements and commit_index
will be equal to the highest match_index, so we will likely only retry once.

> Acked-by: Han Zhou <hz...@ovn.org>
>
>> +    }
>> +
>> +    free(servers);
>>  }
>>
>>  /* Send a RemoveServerRequest to the rest of the servers in the cluster.
>> --
>> 2.44.0
>>
> _______________________________________________
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev