[ https://issues.apache.org/jira/browse/KUDU-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Wong resolved KUDU-2472.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.8.0

Dan merged this as c1c15ad, noting a 1% flaky rate due to another issue.

> master-stress-test flaky with failure to create table due to not enough tservers
> --------------------------------------------------------------------------------
>
>                 Key: KUDU-2472
>                 URL: https://issues.apache.org/jira/browse/KUDU-2472
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: Dan Burkert
>            Assignee: Dan Burkert
>            Priority: Major
>             Fix For: 1.8.0
>
>
> Currently {{master-stress-test}} is 5-7% flaky, failing during a create table
> operation:
> {code:java}
> F0611 20:58:01.335697 23508 master-stress-test.cc:217] Check failed: _s.ok()
> Bad status: Invalid argument: Error creating table
> default.table_6473953088b54f90af982172a0471cf6 on the master: not enough live
> tablet servers to create a table with the requested replication factor 3; 2
> tablet servers are alive{code}
> Due to the frequent master failovers introduced by the test, CREATE TABLE
> operations are failing because not enough tablet servers are known to be
> alive by the current leader master, which was likely just started and quickly
> elected.
> In this case the master returns an InvalidArgument status to the client,
> which is not retried. This indicates a real issue that could occur in a
> production cluster if the leader master were restarted and quickly regained
> leadership. I'm not sure yet what the right fix is; I can think of at least
> a few:
> * Change the return status to be ServiceUnavailable. The client will retry
> up to the timeout. The downside is that in legitimate scenarios where there
> aren't enough tablet servers the operation will take the full timeout to
> fail, and probably have a less useful error status type. Perhaps we could
> have a heuristic which says that if the leader hasn't been active for at
> least {{n * heartbeat_interval}} (where n is a small integer), then
> ServiceUnavailable is used.
> * Change master-stress-test to use replication 1 tables. This makes it much
> less likely for the race to occur, although it's still possible. This also
> doesn't fix the underlying issue.
> * Introduce a special case in the table-creating thread of
> master-stress-test to retry the specific {{InvalidArgument}} status. Also
> doesn't fix the underlying issue.
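
For illustration only, the heuristic from the first option could be sketched roughly as below. This is not the change merged as c1c15ad and it does not use Kudu's real Status class or master code; StatusCode, CreateTableCheck, and CheckLiveTservers are hypothetical names, and leader_uptime, heartbeat_interval, and n are assumed inputs that the master would have to supply.

```cpp
#include <cassert>
#include <chrono>
#include <string>

// Hypothetical status codes for this sketch; not Kudu's Status class.
enum class StatusCode { kOk, kInvalidArgument, kServiceUnavailable };

// Hypothetical result type: a code plus the error text shown to the client.
struct CreateTableCheck {
  StatusCode code;
  std::string message;
};

// Decides how a shortage of live tablet servers is reported.
//   num_live            tablet servers the leader master currently sees as alive
//   replication_factor  replication factor requested for the new table
//   leader_uptime       how long this master has been the active leader (assumed input)
//   heartbeat_interval  tablet server heartbeat period (assumed input)
//   n                   small positive integer multiplier from the heuristic
CreateTableCheck CheckLiveTservers(int num_live,
                                   int replication_factor,
                                   std::chrono::milliseconds leader_uptime,
                                   std::chrono::milliseconds heartbeat_interval,
                                   int n) {
  assert(n > 0);
  if (num_live >= replication_factor) {
    return {StatusCode::kOk, ""};
  }
  const std::string msg =
      "not enough live tablet servers to create a table with the requested "
      "replication factor " + std::to_string(replication_factor) + "; " +
      std::to_string(num_live) + " tablet servers are alive";
  // A freshly elected leader may simply not have received heartbeats yet,
  // so report a retriable status until n heartbeat intervals have passed.
  if (leader_uptime < n * heartbeat_interval) {
    return {StatusCode::kServiceUnavailable, msg};
  }
  // The leader has been active long enough to trust its view of the cluster,
  // so fail fast with the non-retriable status.
  return {StatusCode::kInvalidArgument, msg};
}
```

With n = 3 and a 1000 ms heartbeat interval, a request that sees 2 live tablet servers 500 ms after the election would get the retriable status, while the same request 3000 ms after the election would fail immediately with InvalidArgument. This keeps the fast failure in legitimate shortage scenarios, which was the stated downside of always returning ServiceUnavailable.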