[ 
https://issues.apache.org/jira/browse/KUDU-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Burkert updated KUDU-2472:
------------------------------
    Description: 
Currently {{master-stress-test}} is 5-7% flaky, failing during a create table 
operation:
{code:java}
F0611 20:58:01.335697 23508 master-stress-test.cc:217] Check failed: _s.ok() 
Bad status: Invalid argument: Error creating table 
default.table_6473953088b54f90af982172a0471cf6 on the master: not enough live 
tablet servers to create a table with the requested replication factor 3; 2 
tablet servers are alive{code}
Because the test induces frequent master failovers, CREATE TABLE operations 
fail when the current leader master, which was likely just started and quickly 
elected, does not yet know of enough live tablet servers.

In this case the master returns an InvalidArgument status to the client, which 
the client does not retry.  This indicates a real issue that could occur in a 
production cluster if the leader master were restarted and quickly regained 
leadership.  I'm not sure yet what the right fix is, but I can think of at 
least a few options:
 * Change the return status to be ServiceUnavailable. The client will retry up 
to the timeout.  The downside is that in legitimate scenarios where there 
aren't enough tablet servers the operation will take the full timeout to fail, 
and probably have a less useful error status type.  Perhaps we could have a 
heuristic which says that if the leader hasn't been active for at least {{n * 
heartbeat_interval}} (where n is a small integer), then ServiceUnavailable is 
used.
 * Change master-stress-test to use tables with replication factor 1. This 
makes the race much less likely, although it's still possible.  This also 
doesn't fix the underlying issue.
 * Introduce a special case in the table-creating thread of master-stress-test 
to retry the specific InvalidArgument status.  This also doesn't fix the 
underlying issue.
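
To make the heuristic in the first option concrete, here is a minimal sketch 
of how the status could be chosen. All names here ({{CreateTableError}}, 
{{ErrorForTooFewTabletServers}}, {{leader_uptime}}) are hypothetical and for 
illustration only; this is not Kudu's actual Status class or master code, and 
it assumes the master can tell how long it has been the active leader.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical status codes for this sketch; not Kudu's actual Status class.
enum class CreateTableError {
  kInvalidArgument,     // the client does not retry this
  kServiceUnavailable,  // the client retries this up to its timeout
};

// Picks the error to return when too few tablet servers are known to be alive.
//   leader_uptime:      how long this master has been the active leader
//                       (assumed to be available to the caller).
//   heartbeat_interval: how often tablet servers heartbeat to the master.
//   n:                  the small integer multiplier from the heuristic.
CreateTableError ErrorForTooFewTabletServers(
    std::chrono::milliseconds leader_uptime,
    std::chrono::milliseconds heartbeat_interval,
    int n) {
  assert(n > 0);
  // A freshly elected leader may simply not have heard from every tablet
  // server yet, so report a retriable error during the grace period.
  if (leader_uptime < n * heartbeat_interval) {
    return CreateTableError::kServiceUnavailable;
  }
  // After the grace period, assume the shortage is real and fail fast.
  return CreateTableError::kInvalidArgument;
}
```

With an example heartbeat interval of 1000 ms and n = 3, a leader that has 
been active for 500 ms would return the retriable error, while one active for 
3000 ms or longer would return InvalidArgument immediately, so legitimate 
shortages would still fail fast once the grace period has passed.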


> master-stress-test flaky with failure to create table due to not enough 
> tservers
> --------------------------------------------------------------------------------
>
>                 Key: KUDU-2472
>                 URL: https://issues.apache.org/jira/browse/KUDU-2472
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Dan Burkert
>            Priority: Major
>
> Currently {{master-stress-test}} is 5-7% flaky, failing during a create table 
> operation:
> {code:java}
> F0611 20:58:01.335697 23508 master-stress-test.cc:217] Check failed: _s.ok() 
> Bad status: Invalid argument: Error creating table 
> default.table_6473953088b54f90af982172a0471cf6 on the master: not enough live 
> tablet servers to create a table with the requested replication factor 3; 2 
> tablet servers are alive{code}
> Because the test induces frequent master failovers, CREATE TABLE operations 
> fail when the current leader master, which was likely just started and 
> quickly elected, does not yet know of enough live tablet servers.
> In this case the master returns an InvalidArgument status to the client, 
> which the client does not retry.  This indicates a real issue that could 
> occur in a production cluster if the leader master were restarted and 
> quickly regained leadership.  I'm not sure yet what the right fix is, but I 
> can think of at least a few options:
>  * Change the return status to be ServiceUnavailable. The client will retry 
> up to the timeout.  The downside is that in legitimate scenarios where there 
> aren't enough tablet servers the operation will take the full timeout to 
> fail, and probably have a less useful error status type.  Perhaps we could 
> have a heuristic which says that if the leader hasn't been active for at 
> least {{n * heartbeat_interval}} (where n is a small integer), then 
> ServiceUnavailable is used.
>  * Change master-stress-test to use tables with replication factor 1. This 
> makes the race much less likely, although it's still possible.  This also 
> doesn't fix the underlying issue.
>  * Introduce a special case in the table-creating thread of 
> master-stress-test to retry the specific InvalidArgument status.  This also 
> doesn't fix the underlying issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
