[jira] [Updated] (IGNITE-15705) Investigate raft client timeouts

2022-01-26 Thread Vyacheslav Koptilin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vyacheslav Koptilin updated IGNITE-15705:
-
Reviewer: Alexey Scherbakov

> Investigate raft client timeouts
> 
>
> Key: IGNITE-15705
> URL: https://issues.apache.org/jira/browse/IGNITE-15705
> Project: Ignite
> Issue Type: Task
> Reporter: Kirill Gusakov
> Assignee: Mirza Aliev
> Priority: Critical
> Labels: ignite-3
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> h3. Problem
> The Raft client timeout should be large enough for an operation to complete 
> even if it spans several consecutive rounds of electing a new leader of the 
> raft group. Most JRaft timeouts are derived from electionTimeoutMs.
> {code:java}
> // A follower would become a candidate if it doesn't receive any message
> // from the leader in |election_timeout_ms| milliseconds
> // Default: 1000 (1s)
> private int electionTimeoutMs = 1000; // follower to candidate timeout
> {code}
>  For example, both voteTimer and electionTimer use the exact value of 
> getElectionTimeoutMs() (1000 ms by default):
> {code:java}
> String name = "JRaft-VoteTimer-" + suffix;
> this.voteTimer = new RepeatedTimer(name, 
> options.getElectionTimeoutMs(), timerFactory.getVoteTimer(name)) {...};
> name = "JRaft-ElectionTimer-" + suffix;
> electionTimer = new RepeatedTimer(name, 
> options.getElectionTimeoutMs(), timerFactory.getElectionTimer(name)) {...};
> {code}
> Going back to the client timeout, it seems that it should be greater than 
> reasonableAmountOfElectionRounds * (electionTime + networkTimeoutToRetrieveAcks).
> So we should check the value of “networkTimeoutToRetrieveAcks” and set the 
> client timeout accordingly.
> It may or may not be a good idea, but let's also consider making the raft 
> client timeout a derivative of the leader election timeout, not only 
> semantically but also in code:
> {code:java}
> private static final int TIMEOUT = 10 * leaderElectionTimeout;{code}
> UPD: 
> It was decided to implement an election timeout auto-adjusting mechanism. The 
> main idea is that in a stable cluster the election timeout should be 
> relatively small, but when something prevents elections from completing, such 
> as an unstable network or long GC pauses, we don't want to trigger a lot of 
> elections, so the election timeout is increased.
> Hence, the upper bound of the adjustment is a value that is large enough to 
> elect a leader or to ride out the problems that prevent a successful leader 
> election. 
> The election timeout is reset to its initial value after a leader is 
> successfully elected.
> In our case, the upper bound of the adjustment is greater than the timeout 
> after which the membership protocol removes a failed node from the cluster. 
> So we may assume that 11s is enough, since 11s is greater than the suspicion 
> timeout (log_2(1000) * 500ms * 1, roughly 5s) for a 1000-node cluster with a 
> ping interval of 500ms.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-15705) Investigate raft client timeouts

2021-12-28 Thread Mirza Aliev (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mirza Aliev updated IGNITE-15705:
-
Description: 
h3. Problem

The Raft client timeout should be large enough for an operation to complete 
even if it spans several consecutive rounds of electing a new leader of the 
raft group. Most JRaft timeouts are derived from electionTimeoutMs.
{code:java}
// A follower would become a candidate if it doesn't receive any message
// from the leader in |election_timeout_ms| milliseconds
// Default: 1000 (1s)
private int electionTimeoutMs = 1000; // follower to candidate timeout

{code}
 For example, both voteTimer and electionTimer use the exact value of 
getElectionTimeoutMs() (1000 ms by default):
{code:java}
String name = "JRaft-VoteTimer-" + suffix;
this.voteTimer = new RepeatedTimer(name, 
options.getElectionTimeoutMs(), timerFactory.getVoteTimer(name)) {...};

name = "JRaft-ElectionTimer-" + suffix;
electionTimer = new RepeatedTimer(name, options.getElectionTimeoutMs(), 
timerFactory.getElectionTimer(name)) {...};
{code}
Going back to the client timeout, it seems that it should be greater than 
reasonableAmountOfElectionRounds * (electionTime + networkTimeoutToRetrieveAcks).

So we should check the value of “networkTimeoutToRetrieveAcks” and set the 
client timeout accordingly.

It may or may not be a good idea, but let's also consider making the raft 
client timeout a derivative of the leader election timeout, not only 
semantically but also in code:
{code:java}
private static final int TIMEOUT = 10 * leaderElectionTimeout;{code}
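
A minimal sketch of how the client timeout could be derived from the election timeout in code, assuming the expression above is a product of the number of election rounds and the per-round cost. The class name, the helper clientTimeout and the 200 ms value for networkTimeoutToRetrieveAcks are illustrative assumptions, not names or values taken from JRaft or Ignite.

```java
import java.time.Duration;

public class RaftClientTimeouts {
    /**
     * Hypothetical helper: a client timeout that covers the given number of
     * election rounds, each costing one election timeout plus the time
     * needed to collect acks over the network.
     */
    public static Duration clientTimeout(int electionRounds,
                                         Duration electionTimeout,
                                         Duration networkAckTimeout) {
        if (electionRounds <= 0) {
            throw new IllegalArgumentException("electionRounds must be positive");
        }
        return electionTimeout.plus(networkAckTimeout).multipliedBy(electionRounds);
    }

    public static void main(String[] args) {
        // 1000 ms is the JRaft default quoted above; 200 ms is an assumed ack timeout.
        Duration timeout = clientTimeout(10, Duration.ofMillis(1000), Duration.ofMillis(200));
        System.out.println(timeout.toMillis()); // prints 12000
    }
}
```

With a zero ack timeout this reduces to the 10 * leaderElectionTimeout constant proposed above.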

UPD: 
It was decided to implement an election timeout auto-adjusting mechanism. The 
main idea is that in a stable cluster the election timeout should be relatively 
small, but when something prevents elections from completing, such as an 
unstable network or long GC pauses, we don't want to trigger a lot of 
elections, so the election timeout is increased.
Hence, the upper bound of the adjustment is a value that is large enough to 
elect a leader or to ride out the problems that prevent a successful leader 
election. 
The election timeout is reset to its initial value after a leader is 
successfully elected.
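
The adjustment described above could look roughly like the following sketch. The description does not say how the timeout grows, so doubling on every failed election round is purely an assumption made for illustration, as are the class and method names. The 1000 ms initial value is the JRaft default quoted earlier, and 11000 ms is the upper bound discussed in this issue.

```java
public class ElectionTimeoutAdjuster {
    private final int initialMs;
    private final int maxMs;
    private int currentMs;

    public ElectionTimeoutAdjuster(int initialMs, int maxMs) {
        if (initialMs <= 0 || maxMs < initialMs) {
            throw new IllegalArgumentException("require 0 < initialMs <= maxMs");
        }
        this.initialMs = initialMs;
        this.maxMs = maxMs;
        this.currentMs = initialMs;
    }

    public int currentMs() {
        return currentMs;
    }

    /** An election round ended without a leader: grow the timeout, capped at maxMs. */
    public int onElectionFailed() {
        // Doubling is an assumed policy; the cap is the upper bound from the text.
        currentMs = (int) Math.min((long) currentMs * 2, maxMs);
        return currentMs;
    }

    /** A leader was elected: fall back to the small initial timeout. */
    public int onLeaderElected() {
        currentMs = initialMs;
        return currentMs;
    }

    public static void main(String[] args) {
        ElectionTimeoutAdjuster adjuster = new ElectionTimeoutAdjuster(1000, 11000);
        for (int i = 0; i < 5; i++) {
            System.out.println(adjuster.onElectionFailed()); // 2000, 4000, 8000, 11000, 11000
        }
        System.out.println(adjuster.onLeaderElected()); // 1000
    }
}
```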

In our case, the upper bound of the adjustment is greater than the timeout 
after which the membership protocol removes a failed node from the cluster. So 
we may assume that 11s is enough, since 11s is greater than the suspicion 
timeout (log_2(1000) * 500ms * 1, roughly 5s) for a 1000-node cluster with a 
ping interval of 500ms.
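
The arithmetic behind the 11s bound can be checked with a small sketch. It evaluates the expression from the paragraph above, log_2(clusterSize) * pingInterval * multiplier; reading the trailing "* 1" as a suspicion multiplier is an assumption, and the class and method names are hypothetical.

```java
public class SuspicionTimeout {
    /**
     * Evaluates log2(clusterSize) * pingIntervalMs * suspicionMultiplier,
     * rounded up to a whole millisecond.
     */
    public static long suspicionTimeoutMs(int clusterSize, long pingIntervalMs, int suspicionMultiplier) {
        if (clusterSize < 1 || pingIntervalMs < 0 || suspicionMultiplier < 0) {
            throw new IllegalArgumentException("invalid arguments");
        }
        double log2 = Math.log(clusterSize) / Math.log(2);
        return (long) Math.ceil(log2 * pingIntervalMs * suspicionMultiplier);
    }

    public static void main(String[] args) {
        long suspicionMs = suspicionTimeoutMs(1000, 500, 1);
        System.out.println(suspicionMs);         // prints 4983
        System.out.println(suspicionMs < 11000); // prints true
    }
}
```

So for 1000 nodes and a 500ms ping interval the suspicion timeout is about 5s, which leaves roughly 6s of headroom under the 11s upper bound.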


[jira] [Updated] (IGNITE-15705) Investigate raft client timeouts

2021-12-28 Thread Mirza Aliev (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mirza Aliev updated IGNITE-15705:
-
Description: 
h3. Problem

The Raft client timeout should be large enough for an operation to complete 
even if it spans several consecutive rounds of electing a new leader of the 
raft group. Most JRaft timeouts are derived from electionTimeoutMs.
{code:java}
// A follower would become a candidate if it doesn't receive any message
// from the leader in |election_timeout_ms| milliseconds
// Default: 1000 (1s)
private int electionTimeoutMs = 1000; // follower to candidate timeout

{code}
 For example, both voteTimer and electionTimer use the exact value of 
getElectionTimeoutMs() (1000 ms by default):
{code:java}
String name = "JRaft-VoteTimer-" + suffix;
this.voteTimer = new RepeatedTimer(name, 
options.getElectionTimeoutMs(), timerFactory.getVoteTimer(name)) {...};

name = "JRaft-ElectionTimer-" + suffix;
electionTimer = new RepeatedTimer(name, options.getElectionTimeoutMs(), 
timerFactory.getElectionTimer(name)) {...};
{code}
Going back to the client timeout, it seems that it should be greater than 
reasonableAmountOfElectionRounds * (electionTime + networkTimeoutToRetrieveAcks).

So we should check the value of “networkTimeoutToRetrieveAcks” and set the 
client timeout accordingly.

It may or may not be a good idea, but let's also consider making the raft 
client timeout a derivative of the leader election timeout, not only 
semantically but also in code:
{code:java}
private static final int TIMEOUT = 10 * leaderElectionTimeout;{code}

UPD: 
It was decided to implement an election timeout auto-adjusting mechanism. The 
main idea is that in a stable cluster the election timeout should be relatively 
small, but when something prevents elections from completing, such as an 
unstable network or long GC pauses, we don't want to trigger a lot of 
elections, so the election timeout is increased.
Hence, the upper bound of the adjustment is a value that is large enough to 
elect a leader or to ride out the problems that prevent a successful leader 
election. 
The election timeout is reset to its initial value after a leader is 
successfully elected.

In our case, the upper bound of the adjustment is greater than the timeout 
after which the membership protocol removes a failed node from the cluster. So 
we may assume that 11s is enough, since 11s is greater than the suspicion 
timeout (log_2(1000) * 500ms) for a 1000-node cluster with a ping interval of 
500ms.



> Investigate raft client timeouts
> 
>
> Key: IGNITE-15705
> URL: https://issues.apache.org/jira/browse/IGNITE-15705
> Project: Ignite
> Issue Type: Task
> Reporter: Kirill Gusakov
> Assignee: Mirza Aliev
> Priority: Critical
> Labels: ignite-3
> Time Spent: 0.5h
> Remaining Estimate: 0h
>

[jira] [Updated] (IGNITE-15705) Investigate raft client timeouts

2021-10-18 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-15705:
-
Description: 
h3. Problem

The Raft client timeout should be large enough for an operation to complete 
even if it spans several consecutive rounds of electing a new leader of the 
raft group. Most JRaft timeouts are derived from electionTimeoutMs.
{code:java}
// A follower would become a candidate if it doesn't receive any message
// from the leader in |election_timeout_ms| milliseconds
// Default: 1000 (1s)
private int electionTimeoutMs = 1000; // follower to candidate timeout

{code}
 For example, both voteTimer and electionTimer use the exact value of 
getElectionTimeoutMs() (1000 ms by default):
{code:java}
String name = "JRaft-VoteTimer-" + suffix;
this.voteTimer = new RepeatedTimer(name, 
options.getElectionTimeoutMs(), timerFactory.getVoteTimer(name)) {...};

name = "JRaft-ElectionTimer-" + suffix;
electionTimer = new RepeatedTimer(name, options.getElectionTimeoutMs(), 
timerFactory.getElectionTimer(name)) {...};
{code}
Going back to the client timeout, it seems that it should be greater than 
reasonableAmountOfElectionRounds * (electionTime + networkTimeoutToRetrieveAcks).

So we should check the value of “networkTimeoutToRetrieveAcks” and set the 
client timeout accordingly.

It may or may not be a good idea, but let's also consider making the raft 
client timeout a derivative of the leader election timeout, not only 
semantically but also in code:
{code:java}
private static final int TIMEOUT = 10 * leaderElectionTimeout;{code}



> Investigate raft client timeouts
> 
>
> Key: IGNITE-15705
> URL: https://issues.apache.org/jira/browse/IGNITE-15705
> Project: Ignite
> Issue Type: Task
> Reporter: Kirill Gusakov
> Priority: Critical
> Labels: ignite-3
>

[jira] [Updated] (IGNITE-15705) Investigate raft client timeouts

2021-10-18 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-15705:
-
Description: 
h3. Problem

The Raft client timeout should be large enough for an operation to complete 
even if it spans several consecutive rounds of electing a new leader of the 
raft group. Most JRaft timeouts are derived from electionTimeoutMs.
{code:java}
// A follower would become a candidate if it doesn't receive any message
// from the leader in |election_timeout_ms| milliseconds
// Default: 1000 (1s)
private int electionTimeoutMs = 1000; // follower to candidate timeout

{code}
 For example, both voteTimer and electionTimer use the exact value of 
getElectionTimeoutMs() (1000 ms by default):
{code:java}
String name = "JRaft-VoteTimer-" + suffix;
this.voteTimer = new RepeatedTimer(name, 
options.getElectionTimeoutMs(), timerFactory.getVoteTimer(name)) {...};

name = "JRaft-ElectionTimer-" + suffix;
electionTimer = new RepeatedTimer(name, options.getElectionTimeoutMs(), 
timerFactory.getElectionTimer(name)) {...};
{code}
Going back to the client timeout, it seems that it should be greater than 
reasonableAmountOfElectionRounds * (electionTime + networkTimeoutToRetrieveAcks).

So we should check the value of “networkTimeoutToRetrieveAcks” and set the 
client timeout accordingly.

It may or may not be a good idea, but let's also consider making the raft 
client timeout a derivative of the leader election timeout, not only 
semantically but also in code:
{code:java}
private static final int TIMEOUT = 10 * leaderElectionTimeout;{code}

  was:TODO


> Investigate raft client timeouts
> 
>
> Key: IGNITE-15705
> URL: https://issues.apache.org/jira/browse/IGNITE-15705
> Project: Ignite
> Issue Type: Task
> Reporter: Kirill Gusakov
> Priority: Critical
> Labels: ignite-3
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (IGNITE-15705) Investigate raft client timeouts

2021-10-12 Thread Vyacheslav Koptilin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vyacheslav Koptilin updated IGNITE-15705:
-
Labels: ignite-3  (was: )

> Investigate raft client timeouts
> 
>
> Key: IGNITE-15705
> URL: https://issues.apache.org/jira/browse/IGNITE-15705
> Project: Ignite
> Issue Type: Task
> Reporter: Kirill Gusakov
> Priority: Critical
> Labels: ignite-3
>
> TODO





[jira] [Updated] (IGNITE-15705) Investigate raft client timeouts

2021-10-08 Thread Vyacheslav Koptilin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vyacheslav Koptilin updated IGNITE-15705:
-
Priority: Critical  (was: Major)

> Investigate raft client timeouts
> 
>
> Key: IGNITE-15705
> URL: https://issues.apache.org/jira/browse/IGNITE-15705
> Project: Ignite
> Issue Type: Task
> Reporter: Kirill Gusakov
> Priority: Critical
>
> TODO


