On Mon, Mar 18, 2024 at 05:52:12PM +0100, Ilya Maximets wrote:
> On 3/18/24 17:15, Felix Huettner wrote:
> > On Fri, Mar 15, 2024 at 09:14:49PM +0100, Ilya Maximets wrote:
> >> Each cluster member typically always transfers leadership to the same
> >> other member, which is the first in their list of servers.  This may
> >> result in two servers in a 3-node cluster transferring leadership to
> >> each other and never to the third one.
> >>
> >> Randomize the selection to make the load more evenly distributed.
> > 
> > Hi Ilya,
> 
> Hi, Felix.  Thanks for the comments!
> 
> > 
> > Just out of curiosity: since basically only one of the 3 members is
> > active at any point in time, is balancing the load even relevant?  It
> > will always be on only one of the 3 members anyway.
> It is not very important, I agree.  What I observed in practice is
> that sometimes, if compactions happen at approximately the same time,
> the server we transfer the leadership to may send it right back while
> the first server is still busy compacting.  This is less of a problem
> today, since we have parallel compaction, but it may still be annoying
> if it happens every time.
> 
> I'm mostly making this patch for the purpose of better testing below.
> 
> > 
> >>
> >> This also makes cluster failure tests cover more scenarios, as servers
> >> will transfer leadership to servers they didn't transfer to before.
> >> This is especially important for cluster joining tests.
> >>
> >> Ideally, we would transfer to a random server with the highest apply
> >> index, but we're not trying to implement that for now.
> >>
> >> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
> >> ---
> >>  ovsdb/raft.c | 6 +++++-
> >>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> >> index f463afcb3..25f462431 100644
> >> --- a/ovsdb/raft.c
> >> +++ b/ovsdb/raft.c
> >> @@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
> >>          return;
> >>      }
> >>  
> >> +    size_t n = hmap_count(&raft->servers) * 3;
> >>      struct raft_server *s;
> >> -    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
> >> +
> >> +    while (n--) {
> >> +        s = CONTAINER_OF(hmap_random_node(&raft->servers),
> >> +                         struct raft_server, hmap_node);
> >>          if (!uuid_equals(&raft->sid, &s->sid)
> >>              && s->phase == RAFT_PHASE_STABLE) {
> >>              struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
> > 
> > I think this has the risk of never selecting one server out of the list of
> > cluster members. Suppose you have a 3-node cluster where one of the
> > members is down. In this case there is only a single member the
> > leadership can be transferred to.
> > This means a single iteration of the while loop has a 2/3 chance of
> > selecting a member that cannot be used. Over the 9 iterations this
> > would give a chance of (2/3)^9 = 512/19683 ≈ 0.026, i.e. roughly a
> > 2.6% chance of never selecting the single appropriate target member.
> > 
> > Could we instead start iterating the hmap at some random offset?
> > That should give a similar result while guaranteeing that we visit
> > each member exactly once.
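
For illustration, a rough sketch of what that random-offset iteration
might look like (not part of the patch; it assumes the existing OVS
helpers hmap_count(), HMAP_FOR_EACH and random_range(), and the same
eligibility check the diff above uses):

    size_t offset = random_range(hmap_count(&raft->servers));
    struct raft_server *s, *first_eligible = NULL, *chosen = NULL;
    size_t i = 0;

    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
        bool eligible = !uuid_equals(&raft->sid, &s->sid)
                        && s->phase == RAFT_PHASE_STABLE;
        if (!eligible) {
            i++;
            continue;
        }
        if (!first_eligible) {
            first_eligible = s;   /* First eligible server overall. */
        }
        if (i++ >= offset) {
            chosen = s;           /* First eligible server at or after
                                   * the random offset. */
            break;
        }
    }
    if (!chosen) {
        chosen = first_eligible;  /* Wrap around to the beginning. */
    }

Every server is visited at most once, so a single remaining live member
is always found, while the starting point still varies between calls.
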
> 
> I don't think it's very important, because we're transferring leadership
> without verifying whether the other side is alive, and we're not checking
> whether the transfer actually succeeded.  So, the retries are basically
> just to avoid picking the server itself or servers that haven't joined
> yet.  We will transfer to the first other server that has already joined
> regardless of the iteration method.
> 
> The way to mostly fix the issue with dead servers is, as I mentioned, to
> transfer only to the servers with the highest apply index, i.e. the
> servers that acknowledged the largest number of changes.  This will
> ensure that we at least heard something from the server in the recent
> past.  But that's a separate feature to implement.
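
As an aside, a purely hypothetical sketch of that "highest apply index"
selection, assuming each struct raft_server keeps something like a
match_index for the highest log index it has acknowledged (the real
field name and the way the transfer is sent may differ):

    uint64_t best = 0;
    struct raft_server *s;

    /* Pass 1: find the highest acknowledged log index among the servers
     * that are eligible to receive leadership. */
    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
        if (!uuid_equals(&raft->sid, &s->sid)
            && s->phase == RAFT_PHASE_STABLE
            && s->match_index > best) {
            best = s->match_index;
        }
    }

    /* Pass 2: transfer to an eligible server (the first one here, or a
     * random one) whose acknowledged index equals that maximum. */
    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
        if (!uuid_equals(&raft->sid, &s->sid)
            && s->phase == RAFT_PHASE_STABLE
            && s->match_index == best) {
            /* ... send the transfer-leadership message to 's' ... */
            break;
        }
    }

That would at least guarantee the target has acknowledged appends in the
recent past, so it is very likely still alive.
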
> 
> Also, the leadership transfer is just an optimization to speed up elections,
> so it's not necessary for correctness that this operation succeeds.
> If we fail to transfer, or transfer to a dead server, the rest of the
> cluster will notice the absence of the leader and initiate an election
> after the election timeout.
> 
> Best regards, Ilya Maximets.

Hi Ilya,

thanks for the clarifications.
Then I guess the ~2.6% chance is not too relevant.

Thanks
Felix
