On Tue, Mar 26, 2024 at 10:26 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> Current implementation of the leadership transfer just shoots the
> leadership in the general direction of the first stable server in the
> configuration.  It doesn't check if the server was active recently or
> even that the connection is established.  This may result in sending
> leadership to a disconnected or otherwise unavailable server.
>
> Such behavior should not cause log truncation or any other correctness
> issues because the destination server would have all the append
> requests queued up or the connection will be dropped by the leader.
> In the worst case we will have a leader-less cluster until the next
> election timer fires.  Other servers will notice the absence of the
> leader and will trigger a new leader election normally.
> However, the potential wait for the election timer is not good, as
> real-world setups may have high values configured.
>
> Fix that by trying to transfer to servers that we know have applied
> the most changes, i.e., have the highest 'match_index'.  Such servers
> replied to the most recent append requests, so they have the highest
> chance of being healthy.  Choose a random starting point in the list
> of such servers so we don't transfer to the same server every single
> time.  This slightly improves load distribution and, more importantly,
> increases the robustness of our test suite by making it cover more
> cases.  Also check that the message was actually sent without
> immediate failure.
>
> If we fail to transfer to any server with the highest index, try to
> transfer to any other server that is not behind the majority, and then
> to any other server that is connected.  We did actually send them
> all the updates (if the connection is open); they just haven't replied
> yet for one reason or another.  It should be better than leaving the
> cluster without a leader.
>
> Note that there is always a chance that transfer will fail, since
> we're not waiting for it to be acknowledged (and must not wait).
> In this case, normal election will be triggered after the election
> timer fires up.
>
> Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
> ---
>
> CC: Felix Huettner <felix.huettner@mail.schwarz>
>
>  ovsdb/raft.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 44 insertions(+), 4 deletions(-)
>
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index f463afcb3..b171da345 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -1261,10 +1261,30 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>          return;
>      }
>
> -    struct raft_server *s;
> +    struct raft_server **servers, *s;
> +    uint64_t threshold = 0;
> +    size_t n = 0, start, i;
> +
> +    servers = xmalloc(hmap_count(&raft->servers) * sizeof *servers);
> +
>      HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
> -        if (!uuid_equals(&raft->sid, &s->sid)
> -            && s->phase == RAFT_PHASE_STABLE) {
> +        if (uuid_equals(&raft->sid, &s->sid)
> +            || s->phase != RAFT_PHASE_STABLE) {
> +            continue;
> +        }
> +        if (s->match_index > threshold) {
> +            threshold = s->match_index;
> +        }
> +        servers[n++] = s;
> +    }
> +
> +    start = n ? random_range(n) : 0;
> +
> +retry:
> +    for (i = 0; i < n; i++) {
> +        s = servers[(start + i) % n];
> +
> +        if (s->match_index >= threshold) {
>              struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
>              if (!conn) {
>                  continue;
> @@ -1280,7 +1300,10 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>                      .term = raft->term,
>                  }
>              };
> -            raft_send_to_conn(raft, &rpc, conn);
> +
> +            if (!raft_send_to_conn(raft, &rpc, conn)) {
> +                continue;
> +            }
>
>              raft_record_note(raft, "transfer leadership",
>                               "transferring leadership to %s because %s",
> @@ -1288,6 +1311,23 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>              break;
>          }
>      }
> +
> +    if (n && i == n && threshold) {
> +        if (threshold > raft->commit_index) {
> +            /* Failed to transfer to servers with the highest 'match_index'.
> +             * Try other servers that are not behind the majority. */
> +            threshold = raft->commit_index;
> +        } else {
> +            /* Try any other server.  It is safe, because they either have all
> +             * the append requests queued up for them before the leadership
> +             * transfer message or their connection is broken and we will not
> +             * transfer anyway. */
> +            threshold = 0;
> +        }
> +        goto retry;

Thanks Ilya. It seems the retry could try an earlier failed server (e.g.
one for which raft_send_to_conn() returned false) one or two more times,
but that should be fine because the number of servers is very small anyway.
So this looks good to me.

Acked-by: Han Zhou <hz...@ovn.org>

> +    }
> +
> +    free(servers);
>  }
>
>  /* Send a RemoveServerRequest to the rest of the servers in the cluster.
> --
> 2.44.0
>
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
