On Tue, Mar 26, 2024 at 10:26 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> The current implementation of leadership transfer just shoots the
> leadership in the general direction of the first stable server in the
> configuration.  It doesn't check whether the server was active recently,
> or even whether the connection is established.  This may result in
> sending leadership to a disconnected or otherwise unavailable server.
>
> Such behavior should not cause log truncation or any other correctness
> issues, because the destination server would have all the append
> requests queued up, or the connection will be dropped by the leader.
> In the worst case, we will have a leaderless cluster until the next
> election timer fires.  Other servers will notice the absence of the
> leader and will trigger a new leader election normally.  However, the
> potential wait for the election timer is not good, as real-world setups
> may have high values configured.
>
> Fix that by trying to transfer to servers that we know have applied
> the most changes, i.e., that have the highest 'match_index'.  Such
> servers replied to the most recent append requests, so they have the
> best chance of being healthy.  Choose a random starting point in the
> list of such servers so that we don't transfer to the same server
> every single time.  This slightly improves load distribution but, most
> importantly, increases the robustness of our test suite by making it
> cover more cases.  Also check that the message was actually sent
> without an immediate failure.
>
> If we fail to transfer to any server with the highest index, try to
> transfer to any other server that is not behind the majority, and then
> to any other server that is connected.  We did actually send them all
> the updates (if the connection is open); they just haven't replied yet
> for one reason or another.  This should be better than leaving the
> cluster without a leader.
>
> Note that there is always a chance that the transfer will fail, since
> we're not waiting for it to be acknowledged (and must not wait).
> In this case, a normal election will be triggered after the election
> timer fires.
>
> Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
> ---
>
> CC: Felix Huettner <felix.huettner@mail.schwarz>
>
>  ovsdb/raft.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 44 insertions(+), 4 deletions(-)
>
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index f463afcb3..b171da345 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -1261,10 +1261,30 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>          return;
>      }
>
> -    struct raft_server *s;
> +    struct raft_server **servers, *s;
> +    uint64_t threshold = 0;
> +    size_t n = 0, start, i;
> +
> +    servers = xmalloc(hmap_count(&raft->servers) * sizeof *servers);
> +
>      HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
> -        if (!uuid_equals(&raft->sid, &s->sid)
> -            && s->phase == RAFT_PHASE_STABLE) {
> +        if (uuid_equals(&raft->sid, &s->sid)
> +            || s->phase != RAFT_PHASE_STABLE) {
> +            continue;
> +        }
> +        if (s->match_index > threshold) {
> +            threshold = s->match_index;
> +        }
> +        servers[n++] = s;
> +    }
> +
> +    start = n ? random_range(n) : 0;
> +
> +retry:
> +    for (i = 0; i < n; i++) {
> +        s = servers[(start + i) % n];
> +
> +        if (s->match_index >= threshold) {
>              struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
>              if (!conn) {
>                  continue;
> @@ -1280,7 +1300,10 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>                      .term = raft->term,
>                  }
>              };
> -            raft_send_to_conn(raft, &rpc, conn);
> +
> +            if (!raft_send_to_conn(raft, &rpc, conn)) {
> +                continue;
> +            }
>
>              raft_record_note(raft, "transfer leadership",
>                               "transferring leadership to %s because %s",
> @@ -1288,6 +1311,23 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>              break;
>          }
>      }
> +
> +    if (n && i == n && threshold) {
> +        if (threshold > raft->commit_index) {
> +            /* Failed to transfer to servers with the highest 'match_index'.
> +             * Try other servers that are not behind the majority. */
> +            threshold = raft->commit_index;
> +        } else {
> +            /* Try any other server.  It is safe, because they either have all
> +             * the append requests queued up for them before the leadership
> +             * transfer message or their connection is broken and we will not
> +             * transfer anyway. */
> +            threshold = 0;
> +        }
> +        goto retry;
Thanks, Ilya.  It seems the retry could try an earlier failed server
(e.g. one for which raft_send_to_conn() returned false) one or two more
times, but that should be fine because the number of servers is very
small anyway.  So this looks good to me.

Acked-by: Han Zhou <hz...@ovn.org>

> +    }
> +
> +    free(servers);
>  }
>
>  /* Send a RemoveServerRequest to the rest of the servers in the cluster.
> --
> 2.44.0
>

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
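For readers following the thread, the selection strategy the commit message describes (highest 'match_index' first, then servers not behind the majority's commit index, then anyone reachable, scanning from a random starting point) can be sketched in isolation.  This is a hedged approximation, not OVS code: 'struct server', 'pick_transfer_target()', and the 'send_ok' flag are invented for the example, with 'send_ok' standing in for raft_find_conn_by_sid() finding a connection and raft_send_to_conn() succeeding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct raft_server: just the two things the
 * selection logic looks at. */
struct server {
    uint64_t match_index;   /* Highest log index this server acknowledged. */
    bool send_ok;           /* Stand-in for raft_send_to_conn() succeeding. */
};

/* Pick a leadership-transfer target the way the patch does: prefer servers
 * at the highest 'match_index'; if none of those is reachable, relax the
 * threshold to 'commit_index' (servers not behind the majority); finally
 * accept any reachable server.  'start' models random_range(n).  Returns
 * the chosen index, or -1 when nobody is reachable, in which case a normal
 * election after the election timer would resolve leadership. */
int
pick_transfer_target(const struct server *servers, size_t n,
                     uint64_t commit_index, size_t start)
{
    uint64_t threshold = 0;

    for (size_t i = 0; i < n; i++) {
        if (servers[i].match_index > threshold) {
            threshold = servers[i].match_index;
        }
    }

    for (;;) {
        for (size_t i = 0; i < n; i++) {
            size_t j = (start + i) % n;   /* Random starting point. */

            if (servers[j].match_index >= threshold && servers[j].send_ok) {
                return (int) j;
            }
        }
        if (!threshold) {
            return -1;   /* Already tried everyone; give up. */
        }
        /* Same two-step relaxation as the patch: first drop the threshold
         * to the majority's commit index, then accept any server. */
        threshold = threshold > commit_index ? commit_index : 0;
    }
}
```

For example, with servers at match indexes { 10, 10, 5 } and commit_index 7, the function returns a reachable index-10 server regardless of the starting point; if both index-10 servers are unreachable, the index-5 server (which is behind the majority) is chosen only after the threshold relaxes past commit_index to zero, mirroring the patch's retry path.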