On 3/26/24 21:10, Han Zhou wrote:
> 
> 
> On Tue, Mar 26, 2024 at 10:26 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>
>> The current implementation of the leadership transfer just shoots the
>> leadership in the general direction of the first stable server in the
>> configuration.  It doesn't check whether the server was active recently
>> or even that the connection is established.  This may result in sending
>> leadership to a disconnected or otherwise unavailable server.
>>
>> Such behavior should not cause log truncation or any other correctness
>> issue, because the destination server either has all the append
>> requests queued up or its connection will be dropped by the leader.
>> In the worst case, we will have a leaderless cluster until the next
>> election timer fires.  Other servers will notice the absence of the
>> leader and will trigger a new leader election normally.
>> However, the potential wait for the election timer is not good as
>> real-world setups may have high values configured.
>>
>> Fix that by trying to transfer to servers that we know have applied
>> the most changes, i.e., have the highest 'match_index'.  Such servers
>> replied to the most recent append requests, so they have the highest
>> chance of being healthy.  Choose a random starting point in the list
>> of such servers, so we don't transfer to the same server every single
>> time.  This slightly improves load distribution but, most importantly,
>> increases the robustness of our test suite by making it cover more
>> cases.  Also check that the message was actually sent without
>> immediate failure.
>>
>> If we fail to transfer to any server with the highest index, try to
>> just transfer to any other server that is not behind majority and then
>> just any other server that is connected.  We did actually send them
>> all the updates (if the connection is open); they just didn't reply
>> yet for one reason or another.  It should be better than leaving the
>> cluster without a leader.
>>
>> Note that there is always a chance that transfer will fail, since
>> we're not waiting for it to be acknowledged (and must not wait).
>> In this case, normal election will be triggered after the election
>> timer fires up.
>>
>> Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
>> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
>> ---
>>
>> CC: Felix Huettner <felix.huettner@mail.schwarz>
>>
>>  ovsdb/raft.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 44 insertions(+), 4 deletions(-)
>>
>> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> index f463afcb3..b171da345 100644
>> --- a/ovsdb/raft.c
>> +++ b/ovsdb/raft.c
>> @@ -1261,10 +1261,30 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>>          return;
>>      }
>>
>> -    struct raft_server *s;
>> +    struct raft_server **servers, *s;
>> +    uint64_t threshold = 0;
>> +    size_t n = 0, start, i;
>> +
>> +    servers = xmalloc(hmap_count(&raft->servers) * sizeof *servers);
>> +
>>      HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
>> -        if (!uuid_equals(&raft->sid, &s->sid)
>> -            && s->phase == RAFT_PHASE_STABLE) {
>> +        if (uuid_equals(&raft->sid, &s->sid)
>> +            || s->phase != RAFT_PHASE_STABLE) {
>> +            continue;
>> +        }
>> +        if (s->match_index > threshold) {
>> +            threshold = s->match_index;
>> +        }
>> +        servers[n++] = s;
>> +    }
>> +
>> +    start = n ? random_range(n) : 0;
>> +
>> +retry:
>> +    for (i = 0; i < n; i++) {
>> +        s = servers[(start + i) % n];
>> +
>> +        if (s->match_index >= threshold) {
>>              struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
>>              if (!conn) {
>>                  continue;
>> @@ -1280,7 +1300,10 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>>                      .term = raft->term,
>>                  }
>>              };
>> -            raft_send_to_conn(raft, &rpc, conn);
>> +
>> +            if (!raft_send_to_conn(raft, &rpc, conn)) {
>> +                continue;
>> +            }
>>
>>              raft_record_note(raft, "transfer leadership",
>>                               "transferring leadership to %s because %s",
>> @@ -1288,6 +1311,23 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>>              break;
>>          }
>>      }
>> +
>> +    if (n && i == n && threshold) {
>> +        if (threshold > raft->commit_index) {
>> +            /* Failed to transfer to servers with the highest 'match_index'.
>> +             * Try other servers that are not behind the majority. */
>> +            threshold = raft->commit_index;
>> +        } else {
>> +            /* Try any other server.  It is safe, because they either have all
>> +             * the append requests queued up for them before the leadership
>> +             * transfer message or their connection is broken and we will not
>> +             * transfer anyway. */
>> +            threshold = 0;
>> +        }
>> +        goto retry;
> 
> Thanks Ilya. It seems the retry could try the earlier failed servers (e.g. the
> ones for which raft_send_to_conn() returned false) one or two more times, but
> it should be fine because the number of servers is very small anyway. So this
> looks good to me.

Yeah, most of the time this array will have just 2 elements
and commit_index will be equal to the highest match_index,
so we will likely only retry once.

> 
> Acked-by: Han Zhou <hz...@ovn.org>
> 
>> +    }
>> +
>> +    free(servers);
>>  }
>>
>>  /* Send a RemoveServerRequest to the rest of the servers in the cluster.
>> --
>> 2.44.0
>>
> 

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev