[ 
https://issues.apache.org/jira/browse/KUDU-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong resolved KUDU-2472.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.8.0

Dan merged this as c1c15ad, noting a 1% flaky rate due to another issue.

> master-stress-test flaky with failure to create table due to not enough 
> tservers
> --------------------------------------------------------------------------------
>
>                 Key: KUDU-2472
>                 URL: https://issues.apache.org/jira/browse/KUDU-2472
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Dan Burkert
>            Assignee: Dan Burkert
>            Priority: Major
>             Fix For: 1.8.0
>
>
> Currently {{master-stress-test}} is 5-7% flaky, failing during a create table 
> operation:
> {code:java}
> F0611 20:58:01.335697 23508 master-stress-test.cc:217] Check failed: _s.ok() 
> Bad status: Invalid argument: Error creating table 
> default.table_6473953088b54f90af982172a0471cf6 on the master: not enough live 
> tablet servers to create a table with the requested replication factor 3; 2 
> tablet servers are alive{code}
> Due to the frequent master failovers introduced by the test, CREATE TABLE
> operations fail because the current leader master, which was likely just
> started and quickly elected, does not yet know of enough live tablet
> servers.
> In this case the master returns an InvalidArgument status to the client,
> which the client does not retry.  This indicates a real issue that could
> occur in a production cluster if the leader master were restarted and
> quickly regained leadership.  I'm not sure yet what the right fix is, but I
> can think of at least a few options:
>  * Change the return status to ServiceUnavailable. The client will retry
> up to the timeout.  The downside is that in legitimate scenarios where there
> aren't enough tablet servers, the operation will take the full timeout to
> fail, and will probably have a less useful error status type.  Perhaps we
> could have a heuristic which says that ServiceUnavailable is used only if
> the leader hasn't been active for at least {{n * heartbeat_interval}}
> (where n is a small integer).
>  * Change master-stress-test to use tables with a replication factor of 1.
> This makes it much less likely for the race to occur, although it's still
> possible.  This also doesn't fix the underlying issue.
>  * Introduce a special case in the table-creating thread of
> master-stress-test to retry the specific {{InvalidArgument}} status.  This
> also doesn't fix the underlying issue.
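
The heuristic floated in the first option could look roughly like the sketch below. This is only an illustration under stated assumptions, not the change that was merged: the StatusCode enum, the CreateTableCheck struct, the CheckEnoughTabletServers function, and all of its parameters are hypothetical names invented for this example and are not Kudu's actual classes or flags.

```cpp
#include <chrono>
#include <string>

// Hypothetical status codes that mirror the names used in the description.
// This is not Kudu's Status class.
enum class StatusCode { kOk, kInvalidArgument, kServiceUnavailable };

// Hypothetical result type for this sketch.
struct CreateTableCheck {
  StatusCode code;
  std::string message;
};

// Decides which status to return when a CREATE TABLE request arrives.
// All parameters are assumptions made for illustration:
//   time_as_leader      how long this master has been the leader
//   heartbeat_interval  how often tablet servers heartbeat to the master
//   n                   small integer grace factor from the description
CreateTableCheck CheckEnoughTabletServers(
    int live_tservers,
    int replication_factor,
    std::chrono::milliseconds time_as_leader,
    std::chrono::milliseconds heartbeat_interval,
    int n) {
  if (live_tservers >= replication_factor) {
    return {StatusCode::kOk, ""};
  }
  const std::string msg =
      "not enough live tablet servers to create a table with the requested "
      "replication factor " + std::to_string(replication_factor) + "; " +
      std::to_string(live_tservers) + " tablet servers are alive";
  // A freshly elected leader may not have heard from every tablet server
  // yet, so report a retriable error during the grace period.
  if (time_as_leader < n * heartbeat_interval) {
    return {StatusCode::kServiceUnavailable, msg};
  }
  // The leader has been active long enough that the shortage is probably
  // real, so fail fast with the more descriptive status.
  return {StatusCode::kInvalidArgument, msg};
}
```

With this shape, a client that already retries ServiceUnavailable would ride out a master failover, while a genuine shortage of tablet servers would still fail quickly once the grace period has passed.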



