[ovs-dev] [PATCH 1/5] ovsdb: raft: Randomize leadership transfer.

2024-03-15 Thread Ilya Maximets
Each cluster member typically always transfers leadership to the same
other member, which is the first in their list of servers.  This may
result in two servers in a 3-node cluster transferring leadership to
each other and never to the third one.

Randomize the selection to distribute the load more evenly.

This also makes cluster failure tests cover more scenarios, as servers
will transfer leadership to servers they didn't transfer to before.
This is especially important for cluster joining tests.

Ideally, we would transfer to a random server with the highest apply
index, but that is not implemented for now.

Signed-off-by: Ilya Maximets 
---
 ovsdb/raft.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/ovsdb/raft.c b/ovsdb/raft.c
index f463afcb3..25f462431 100644
--- a/ovsdb/raft.c
+++ b/ovsdb/raft.c
@@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
         return;
     }
 
+    size_t n = hmap_count(&raft->servers) * 3;
     struct raft_server *s;
-    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
+
+    while (n--) {
+        s = CONTAINER_OF(hmap_random_node(&raft->servers),
+                         struct raft_server, hmap_node);
         if (!uuid_equals(&raft->sid, &s->sid)
             && s->phase == RAFT_PHASE_STABLE) {
             struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
-- 
2.43.0
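
For readers who want to see the selection strategy in isolation, below is a
minimal, self-contained C sketch of the bounded random-retry idea the hunk
above introduces.  It deliberately uses a plain array and toy types rather
than OVS's hmap and struct raft_server, so names such as toy_server and
pick_transfer_target are illustrative only, not OVS APIs.

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum phase { PHASE_JOINING, PHASE_STABLE };

struct toy_server {
    int id;            /* Stands in for the raft server UUID. */
    enum phase phase;  /* Only stable (joined) servers may get leadership. */
};

/* Picks a transfer target at random, retrying at most 3 * n_servers times,
 * mirroring the bounded retry loop in the patch.  Returns NULL if every
 * attempt landed on ourselves or on a server that has not joined yet. */
static struct toy_server *
pick_transfer_target(struct toy_server *servers, size_t n_servers, int self_id)
{
    for (size_t attempts = n_servers * 3; attempts > 0; attempts--) {
        struct toy_server *s = &servers[(size_t) rand() % n_servers];

        if (s->id != self_id && s->phase == PHASE_STABLE) {
            return s;
        }
    }
    return NULL;
}

int
main(void)
{
    struct toy_server cluster[] = {
        { .id = 1, .phase = PHASE_STABLE },   /* "self", the current leader */
        { .id = 2, .phase = PHASE_STABLE },
        { .id = 3, .phase = PHASE_JOINING },  /* still joining, not eligible */
    };

    srand((unsigned) time(NULL));

    struct toy_server *target = pick_transfer_target(cluster, 3, 1);
    if (target) {
        printf("transfer leadership to server %d\n", target->id);
    } else {
        printf("no eligible target found; fall back to the election timeout\n");
    }
    return 0;
}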



Re: [ovs-dev] [PATCH 1/5] ovsdb: raft: Randomize leadership transfer.

2024-03-18 Thread Felix Huettner via dev
On Fri, Mar 15, 2024 at 09:14:49PM +0100, Ilya Maximets wrote:
> Each cluster member typically always transfers leadership to the same
> other member, which is the first in their list of servers.  This may
> result in two servers in a 3-node cluster to transfer leadership to
> each other and never to the third one.
> 
> Randomizing the selection to make the load more evenly distributed.

Hi Ilya,

just out of curiosity: since basically only one of the 3 members is
active at any point in time, is balancing the load even relevant? It
will always be on only one of the 3 members anyway.

> 
> This also makes cluster failure tests cover more scenarios as servers
> will transfer leadership to servers they didn't before.  This is
> important especially for cluster joining tests.
> 
> Ideally, we would transfer to a random server with a highest apply
> index, but not trying to implement this for now.
> 
> Signed-off-by: Ilya Maximets 
> ---
>  ovsdb/raft.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> index f463afcb3..25f462431 100644
> --- a/ovsdb/raft.c
> +++ b/ovsdb/raft.c
> @@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>  return;
>  }
>  
> +size_t n = hmap_count(&raft->servers) * 3;
>  struct raft_server *s;
> -HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
> +
> +while (n--) {
> +s = CONTAINER_OF(hmap_random_node(&raft->servers),
> + struct raft_server, hmap_node);
>  if (!uuid_equals(&raft->sid, &s->sid)
>  && s->phase == RAFT_PHASE_STABLE) {
>  struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);

I think this has the risk of never selecting one server out of the list of
cluster members. Suppose you have a 3-node cluster where one of the
members is down. In this case there is a single member the leadership
can be transferred to.
This means a single iteration of the while loop has a 2/3 chance of
selecting a member that cannot be used. Over the 9 iterations this
would give a chance of (2/3)^9 of always choosing an inappropriate
member, i.e. roughly a 2.6% chance of never selecting the single
appropriate target member.

Could we instead start iterating the hmap at some random offset?
That should give a similar result while guaranteeing that we
visit each member once.
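
As a point of comparison, a minimal sketch of the random-offset iteration
described above could look like the following.  It mirrors the toy types of
the previous sketch (toy_server and pick_from_random_offset are illustrative
names, not the OVS hmap API); the real code would need an equivalent
single-pass traversal of the servers hmap.

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum phase { PHASE_JOINING, PHASE_STABLE };

struct toy_server {
    int id;
    enum phase phase;
};

/* Start at a random offset and walk the whole list exactly once, so an
 * eligible server is always found whenever at least one exists. */
static struct toy_server *
pick_from_random_offset(struct toy_server *servers, size_t n_servers,
                        int self_id)
{
    size_t start = (size_t) rand() % n_servers;

    for (size_t i = 0; i < n_servers; i++) {
        struct toy_server *s = &servers[(start + i) % n_servers];

        if (s->id != self_id && s->phase == PHASE_STABLE) {
            return s;
        }
    }
    return NULL;   /* No other stable server exists at all. */
}

int
main(void)
{
    struct toy_server cluster[] = {
        { .id = 1, .phase = PHASE_STABLE },   /* "self" */
        { .id = 2, .phase = PHASE_JOINING },
        { .id = 3, .phase = PHASE_STABLE },
    };

    srand((unsigned) time(NULL));

    struct toy_server *target = pick_from_random_offset(cluster, 3, 1);
    if (target) {
        printf("transfer leadership to server %d\n", target->id);
    } else {
        printf("no other stable server exists\n");
    }
    return 0;
}

Unlike the bounded random-retry loop, this visits each member at most once,
so it cannot miss the only eligible server.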

Thanks
Felix


Re: [ovs-dev] [PATCH 1/5] ovsdb: raft: Randomize leadership transfer.

2024-03-18 Thread Ilya Maximets
On 3/18/24 17:15, Felix Huettner wrote:
> On Fri, Mar 15, 2024 at 09:14:49PM +0100, Ilya Maximets wrote:
>> Each cluster member typically always transfers leadership to the same
>> other member, which is the first in their list of servers.  This may
>> result in two servers in a 3-node cluster to transfer leadership to
>> each other and never to the third one.
>>
>> Randomizing the selection to make the load more evenly distributed.
> 
> Hi Ilya,

Hi, Felix.  Thanks for the comments!

> 
> just out of curiosity: since basically only one of the 3 members is
> active at any point in time, is balancing the load even relevant. It
> will always only be on one of the 3 members anyway.

It is not very important, I agree.  What I observed in practice is
that sometimes, if, for example, compactions happen at approximately
the same time, the server we transfer the leadership to may send it
right back while the first server is still busy compacting.  This is
less of a problem today, since we have parallel compaction, but it
may still be annoying if that happens every time.

I'm mostly making this patch for the purpose of better testing below.

> 
>>
>> This also makes cluster failure tests cover more scenarios as servers
>> will transfer leadership to servers they didn't before.  This is
>> important especially for cluster joining tests.
>>
>> Ideally, we would transfer to a random server with a highest apply
>> index, but not trying to implement this for now.
>>
>> Signed-off-by: Ilya Maximets 
>> ---
>>  ovsdb/raft.c | 6 +-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> index f463afcb3..25f462431 100644
>> --- a/ovsdb/raft.c
>> +++ b/ovsdb/raft.c
>> @@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>>  return;
>>  }
>>  
>> +size_t n = hmap_count(&raft->servers) * 3;
>>  struct raft_server *s;
>> -HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
>> +
>> +while (n--) {
>> +s = CONTAINER_OF(hmap_random_node(&raft->servers),
>> + struct raft_server, hmap_node);
>>  if (!uuid_equals(&raft->sid, &s->sid)
>>  && s->phase == RAFT_PHASE_STABLE) {
>>  struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
> 
> i think this has the risk of never selecting one server out of the list of
> cluster members. Suppose you have a 3 node cluster where one of them
> members is down. In this case there is a single member the leadership
> can be transfered to.
> This means for a single iteration of the while loop has a chance of 2/3
> to select a member that can not be used. Over the 9 iterations this
> would give a chance of (2/3)^9 to always choose an
> inappropriate member. This equals to a chance of about 2.6% of never
> selecting the single appropriate target member.
> 
> Could we instead rather start iterating the hmap at some random offset?
> That should give a similar result while giving the guarantee that we
> visit each member once.

I don't think it's very important, because we're transferring leadership
without verifying if the other side is alive and we're not checking if we
actually transferred it or not.  So, retries are basically just for the
server not hitting itself or servers that didn't join yet.  We will transfer
to the first other server that already joined regardless of the iteration
method.

The way to mostly fix the issue with dead servers is, as I mentioned, to
transfer only to the servers with the highest apply index, i.e. the servers
that acknowledged the most changes.  This will ensure that we at least
heard something from the server in the recent past.  But that's a
separate feature to implement.
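
A rough sketch of that not-yet-implemented idea, assuming a per-server apply
index is tracked (the toy_server struct, its apply_index field, and the
pick_highest_apply_index helper below are hypothetical, not OVS code):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct toy_server {
    int id;
    bool stable;            /* Joined the cluster (RAFT_PHASE_STABLE). */
    uint64_t apply_index;   /* Highest log index this server acknowledged. */
};

/* Picks uniformly at random among the eligible servers that share the
 * highest apply index, using reservoir sampling to break ties. */
static struct toy_server *
pick_highest_apply_index(struct toy_server *servers, size_t n, int self_id)
{
    struct toy_server *choice = NULL;
    uint64_t best = 0;
    size_t n_best = 0;

    for (size_t i = 0; i < n; i++) {
        struct toy_server *s = &servers[i];

        if (s->id == self_id || !s->stable) {
            continue;
        }
        if (!n_best || s->apply_index > best) {
            best = s->apply_index;
            n_best = 1;
            choice = s;
        } else if (s->apply_index == best) {
            /* Keep each of the n_best tied servers with equal probability. */
            n_best++;
            if ((size_t) rand() % n_best == 0) {
                choice = s;
            }
        }
    }
    return choice;
}

int
main(void)
{
    struct toy_server cluster[] = {
        { .id = 1, .stable = true, .apply_index = 120 },  /* "self" */
        { .id = 2, .stable = true, .apply_index = 120 },
        { .id = 3, .stable = true, .apply_index = 95 },   /* lagging behind */
    };

    srand((unsigned) time(NULL));

    struct toy_server *target = pick_highest_apply_index(cluster, 3, 1);
    if (target) {
        printf("transfer leadership to server %d\n", target->id);
    }
    return 0;
}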

Also, the leadership transfer is just an optimization to speed up elections,
so this operation doesn't need to succeed for correctness.  If we fail to
transfer, or transfer to a dead server, the rest of the cluster will notice
the absence of the leader and initiate an election once the election timer
expires.
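
For completeness, a toy illustration of that fallback (the struct, field
names, and millisecond bookkeeping below are hypothetical, not OVS
internals):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct toy_follower {
    uint64_t last_heard_from_leader_ms;  /* Last append/heartbeat received. */
    uint64_t election_timeout_ms;        /* Randomized per Raft. */
};

/* A follower that stops hearing from the leader becomes a candidate once
 * its election timer fires, which is what eventually recovers from a
 * failed or misdirected leadership transfer. */
static bool
should_start_election(const struct toy_follower *f, uint64_t now_ms)
{
    return now_ms - f->last_heard_from_leader_ms >= f->election_timeout_ms;
}

int
main(void)
{
    struct toy_follower f = { .last_heard_from_leader_ms = 0,
                              .election_timeout_ms = 1000 };

    /* No heartbeat for 1500 ms -> time to start an election. */
    printf("start election: %s\n",
           should_start_election(&f, 1500) ? "yes" : "no");
    return 0;
}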

Best regards, Ilya Maximets.


Re: [ovs-dev] [PATCH 1/5] ovsdb: raft: Randomize leadership transfer.

2024-03-19 Thread Felix Huettner via dev
On Mon, Mar 18, 2024 at 05:52:12PM +0100, Ilya Maximets wrote:
> On 3/18/24 17:15, Felix Huettner wrote:
> > On Fri, Mar 15, 2024 at 09:14:49PM +0100, Ilya Maximets wrote:
> >> Each cluster member typically always transfers leadership to the same
> >> other member, which is the first in their list of servers.  This may
> >> result in two servers in a 3-node cluster to transfer leadership to
> >> each other and never to the third one.
> >>
> >> Randomizing the selection to make the load more evenly distributed.
> > 
> > Hi Ilya,
> 
> Hi, Felix.  Thanks for the comments!
> 
> > 
> > just out of curiosity: since basically only one of the 3 members is
> > active at any point in time, is balancing the load even relevant. It
> > will always only be on one of the 3 members anyway.
> It is not very important, I agree.  What I observed in practice is
> that sometimes if, for example, compactions happen in approximately
> similar time, the server we transfer the leadership to may send it
> right back, while the first server is busy compacting.  This is
> less of a problem today as well, since we have parallel compaction,
> but it may still be annoying if that happens every time.
> 
> I'm mostly making this patch for the purpose of better testing below.
> 
> > 
> >>
> >> This also makes cluster failure tests cover more scenarios as servers
> >> will transfer leadership to servers they didn't before.  This is
> >> important especially for cluster joining tests.
> >>
> >> Ideally, we would transfer to a random server with a highest apply
> >> index, but not trying to implement this for now.
> >>
> >> Signed-off-by: Ilya Maximets 
> >> ---
> >>  ovsdb/raft.c | 6 +-
> >>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> >> index f463afcb3..25f462431 100644
> >> --- a/ovsdb/raft.c
> >> +++ b/ovsdb/raft.c
> >> @@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
> >>  return;
> >>  }
> >>  
> >> +size_t n = hmap_count(&raft->servers) * 3;
> >>  struct raft_server *s;
> >> -HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
> >> +
> >> +while (n--) {
> >> +s = CONTAINER_OF(hmap_random_node(&raft->servers),
> >> + struct raft_server, hmap_node);
> >>  if (!uuid_equals(&raft->sid, &s->sid)
> >>  && s->phase == RAFT_PHASE_STABLE) {
> >>  struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
> > 
> > i think this has the risk of never selecting one server out of the list of
> > cluster members. Suppose you have a 3 node cluster where one of them
> > members is down. In this case there is a single member the leadership
> > can be transfered to.
> > This means for a single iteration of the while loop has a chance of 2/3
> > to select a member that can not be used. Over the 9 iterations this
> > would give a chance of (2/3)^9 to always choose an
> > inappropriate member. This equals to a chance of about 2.6% of never
> > selecting the single appropriate target member.
> > 
> > Could we instead rather start iterating the hmap at some random offset?
> > That should give a similar result while giving the guarantee that we
> > visit each member once.
> 
> I don't think it's very important, because we're transferring leadership
> without verifying if the other side is alive and we're not checking if we
> actually transferred it or not.  So, retries are basically just for the
> server not hitting itself or servers that didn't join yet.  We will transfer
> to the first other server that already joined regardless of the iteration
> method.
> 
> The way to mostly fix the issue with dead servers is, as I mentioned, to
> transfer only to the servers with the highest apply index, i.e. the servers
> that acknowledged the most amount of changes.  This will ensure that we
> at least heard something from the server in the recent past.  But that's
> a separate feature to implement.
> 
> Also, the leadership transfer is just an optimization to speed up elections,
> so it's not necessary for correctness for this operation to be successful.
> If we fail to transfer or transfer to the dead server, the rest of the
> cluster will notice the absence of the leader and initiate election by
> the timeout.
> 
> Best regards, Ilya Maximets.

Hi Ilya,

Thanks for the clarifications.
Then I guess the 2.6% chance is not too relevant.

Thanks
Felix


Re: [ovs-dev] [PATCH 1/5] ovsdb: raft: Randomize leadership transfer.

2024-03-25 Thread Han Zhou
On Tue, Mar 19, 2024 at 12:05 AM Felix Huettner via dev <ovs-dev@openvswitch.org> wrote:
>
> On Mon, Mar 18, 2024 at 05:52:12PM +0100, Ilya Maximets wrote:
> > On 3/18/24 17:15, Felix Huettner wrote:
> > > On Fri, Mar 15, 2024 at 09:14:49PM +0100, Ilya Maximets wrote:
> > >> Each cluster member typically always transfers leadership to the same
> > >> other member, which is the first in their list of servers.  This may
> > >> result in two servers in a 3-node cluster to transfer leadership to
> > >> each other and never to the third one.
> > >>
> > >> Randomizing the selection to make the load more evenly distributed.
> > >
> > > Hi Ilya,
> >
> > Hi, Felix.  Thanks for the comments!
> >
> > >
> > > just out of curiosity: since basically only one of the 3 members is
> > > active at any point in time, is balancing the load even relevant. It
> > > will always only be on one of the 3 members anyway.
> > It is not very important, I agree.  What I observed in practice is
> > that sometimes if, for example, compactions happen in approximately
> > similar time, the server we transfer the leadership to may send it
> > right back, while the first server is busy compacting.  This is
> > less of a problem today as well, since we have parallel compaction,
> > but it may still be annoying if that happens every time.
> >
> > I'm mostly making this patch for the purpose of better testing below.
> >
> > >
> > >>
> > >> This also makes cluster failure tests cover more scenarios as servers
> > >> will transfer leadership to servers they didn't before.  This is
> > >> important especially for cluster joining tests.
> > >>
> > >> Ideally, we would transfer to a random server with a highest apply
> > >> index, but not trying to implement this for now.
> > >>
> > >> Signed-off-by: Ilya Maximets 
> > >> ---
> > >>  ovsdb/raft.c | 6 +-
> > >>  1 file changed, 5 insertions(+), 1 deletion(-)
> > >>
> > >> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> > >> index f463afcb3..25f462431 100644
> > >> --- a/ovsdb/raft.c
> > >> +++ b/ovsdb/raft.c
> > >> @@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
> > >>  return;
> > >>  }
> > >>
> > >> +size_t n = hmap_count(&raft->servers) * 3;
> > >>  struct raft_server *s;
> > >> -HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
> > >> +
> > >> +while (n--) {
> > >> +s = CONTAINER_OF(hmap_random_node(&raft->servers),
> > >> + struct raft_server, hmap_node);
> > >>  if (!uuid_equals(&raft->sid, &s->sid)
> > >>  && s->phase == RAFT_PHASE_STABLE) {
> > >>  struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
> > >
> > > i think this has the risk of never selecting one server out of the list of
> > > cluster members. Suppose you have a 3 node cluster where one of them
> > > members is down. In this case there is a single member the leadership
> > > can be transfered to.
> > > This means for a single iteration of the while loop has a chance of 2/3
> > > to select a member that can not be used. Over the 9 iterations this
> > > would give a chance of (2/3)^9 to always choose an
> > > inappropriate member. This equals to a chance of about 2.6% of never
> > > selecting the single appropriate target member.
> > >
> > > Could we instead rather start iterating the hmap at some random offset?
> > > That should give a similar result while giving the guarantee that we
> > > visit each member once.
> >
> > I don't think it's very important, because we're transferring leadership
> > without verifying if the other side is alive and we're not checking if we
> > actually transferred it or not.  So, retries are basically just for the
> > server not hitting itself or servers that didn't join yet.  We will transfer
> > to the first other server that already joined regardless of the iteration
> > method.
> >
> > The way to mostly fix the issue with dead servers is, as I mentioned, to
> > transfer only to the servers with the highest apply index, i.e. the servers
> > that acknowledged the most amount of changes.  This will ensure that we
> > at least heard something from the server in the recent past.  But that's
> > a separate feature to implement.
> >
> > Also, the leadership transfer is just an optimization to speed up elections,
> > so it's not necessary for correctness for this operation to be successful.
> > If we fail to transfer or transfer to the dead server, the rest of the
> > cluster will notice the absence of the leader and initiate election by
> > the timeout.
> >
> > Best regards, Ilya Maximets.
>
> Hi Ilya,
>
> thanks for the clarifications.
> Then I guess the 2.6% chance is not too relevant.
>

Thanks for the patch and the discussion.
I still have a concern regarding the retry logic. Maybe it is not a
critical problem if the function fails to select any server to transfer
to, but I'd still prefer this function to be more predictable. For
example, we can store 

Re: [ovs-dev] [PATCH 1/5] ovsdb: raft: Randomize leadership transfer.

2024-03-25 Thread Ilya Maximets
On 3/25/24 08:19, Han Zhou wrote:
> 
> 
> On Tue, Mar 19, 2024 at 12:05 AM Felix Huettner via dev <ovs-dev@openvswitch.org> wrote:
>>
>> On Mon, Mar 18, 2024 at 05:52:12PM +0100, Ilya Maximets wrote:
>> > On 3/18/24 17:15, Felix Huettner wrote:
>> > > On Fri, Mar 15, 2024 at 09:14:49PM +0100, Ilya Maximets wrote:
>> > >> Each cluster member typically always transfers leadership to the same
>> > >> other member, which is the first in their list of servers.  This may
>> > >> result in two servers in a 3-node cluster to transfer leadership to
>> > >> each other and never to the third one.
>> > >>
>> > >> Randomizing the selection to make the load more evenly distributed.
>> > >
>> > > Hi Ilya,
>> >
>> > Hi, Felix.  Thanks for the comments!
>> >
>> > >
>> > > just out of curiosity: since basically only one of the 3 members is
>> > > active at any point in time, is balancing the load even relevant. It
>> > > will always only be on one of the 3 members anyway.
>> > It is not very important, I agree.  What I observed in practice is
>> > that sometimes if, for example, compactions happen in approximately
>> > similar time, the server we transfer the leadership to may send it
>> > right back, while the first server is busy compacting.  This is
>> > less of a problem today as well, since we have parallel compaction,
>> > but it may still be annoying if that happens every time.
>> >
>> > I'm mostly making this patch for the purpose of better testing below.
>> >
>> > >
>> > >>
>> > >> This also makes cluster failure tests cover more scenarios as servers
>> > >> will transfer leadership to servers they didn't before.  This is
>> > >> important especially for cluster joining tests.
>> > >>
>> > >> Ideally, we would transfer to a random server with a highest apply
>> > >> index, but not trying to implement this for now.
>> > >>
>> > >> Signed-off-by: Ilya Maximets 
>> > >> ---
>> > >>  ovsdb/raft.c | 6 +-
>> > >>  1 file changed, 5 insertions(+), 1 deletion(-)
>> > >>
>> > >> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> > >> index f463afcb3..25f462431 100644
>> > >> --- a/ovsdb/raft.c
>> > >> +++ b/ovsdb/raft.c
>> > >> @@ -1261,8 +1261,12 @@ raft_transfer_leadership(struct raft *raft, const char *reason)
>> > >>          return;
>> > >>      }
>> > >>
>> > >> +    size_t n = hmap_count(&raft->servers) * 3;
>> > >>      struct raft_server *s;
>> > >> -    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
>> > >> +
>> > >> +    while (n--) {
>> > >> +        s = CONTAINER_OF(hmap_random_node(&raft->servers),
>> > >> +                         struct raft_server, hmap_node);
>> > >>          if (!uuid_equals(&raft->sid, &s->sid)
>> > >>              && s->phase == RAFT_PHASE_STABLE) {
>> > >>              struct raft_conn *conn = raft_find_conn_by_sid(raft, &s->sid);
>> > >
>> > > i think this has the risk of never selecting one server out of the list of
>> > > cluster members. Suppose you have a 3 node cluster where one of them
>> > > members is down. In this case there is a single member the leadership
>> > > can be transfered to.
>> > > This means for a single iteration of the while loop has a chance of 2/3
>> > > to select a member that can not be used. Over the 9 iterations this
>> > > would give a chance of (2/3)^9 to always choose an
>> > > inappropriate member. This equals to a chance of about 2.6% of never
>> > > selecting the single appropriate target member.
>> > >
>> > > Could we instead rather start iterating the hmap at some random offset?
>> > > That should give a similar result while giving the guarantee that we
>> > > visit each member once.
>> >
>> > I don't think it's very important, because we're transferring leadership
>> > without verifying if the other side is alive and we're not checking if we
>> > actually transferred it or not.  So, retries are basically just for the
>> > server not hitting itself or servers that didn't join yet.  We will transfer
>> > to the first other server that already joined regardless of the iteration
>> > method.
>> >
>> > The way to mostly fix the issue with dead servers is, as I mentioned, to
>> > transfer only to the servers with the highest apply index, i.e. the servers
>> > that acknowledged the most amount of changes.  This will ensure that we
>> > at least heard something from the server in the recent past.  But that's
>> > a separate feature to implement.
>> >
>> > Also, the leadership transfer is just an optimization to speed up elections,
>> > so it's not necessary for correctness for this operation to be successful.
>> > If we fail to transfer or transfer to the dead server, the rest of the
>> > cluster will notice the absence of the leader and initiate election by
>> > the timeout.
>> >
>> > Best regards, Ilya Maximets.
>>
>> Hi Ilya,
>>
>> thanks for the clarifications.
>> Then I guess the 2.6% chance is not too relevant.
>>
> 
> Thanks for the patch and the discussion