Alexey Serbin has submitted this change and it was merged.

Change subject: raft_consensus-itest: fix failing 
TestMemoryRemainsConstantDespiteTwoDeadFollowers
......................................................................


raft_consensus-itest: fix failing 
TestMemoryRemainsConstantDespiteTwoDeadFollowers

This test started failing recently with a seemingly unrelated change to
the minicluster (e6758739a90adeb2e4d0c6cf76185cf90cc7d2b0). It's still a
mystery as to the relationship here, after a couple hours of debugging.

But, looking at the failures with some extra tracing added, I determined
that the issue was something like the following:

- the test sets up a scenario in which writes are expected to time out
  by killing 2/3 replicas
- due to recent client changes by Alexey, the client now considers a
  Timeout error on writes to mean that the server is dead, and marks it
  as no longer a leader in the cache
- this causes the next write on other threads to go to the master to
  lookup the "new location" assuming that a leader election probably has
  happened
- in fact, _all_ of the threads quickly go to the master to perform the
  same lookup, exhausing the "master lookup permits" counter. Any excess
  threads beyond that count end up backing off sleeping
- the master lookups in TSAN builds can sometimes take tens of
  milliseconds, especially in the GCE test environment which uses low
  core-count workers. Thus, some of the lookups to the masters
  themselves time out.
- those that do succeed, of course, return the same location information
  as we had before
- any writers which happen to wake up from back-offs to get new location
  information then try to do a write, but it's quick likely that before
  they get a chance to do so, another thread has already experienced
  another timeout and marked the replica as bad.

Essentially, the test can get into a kind of spiral where it's flooding
the master with lookups, each of which is taking longer than 50ms, and
thus cause more timeouts, which cause more lookups, etc.

As identified elsewhere, the metacache probably needs a pretty major
re-working to fix these sorts of problems, but this scenario is also
fairly contrived. So, this patch just bumps the timeout to 150ms instead
of 50ms, and changes the payload of the writes to be much larger, so the
desired backpressure kicks in faster.

Before this change, TSAN test runs on gce were failing nearly 100%. With
the change I passed 100/100.

Change-Id: Ifa7d3d7655c2ecf376e894b8a1412e2fe3df0753
Reviewed-on: http://gerrit.cloudera.org:8080/7337
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin <[email protected]>
---
M src/kudu/integration-tests/raft_consensus-itest.cc
1 file changed, 2 insertions(+), 1 deletion(-)

Approvals:
  Alexey Serbin: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/7337
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ifa7d3d7655c2ecf376e894b8a1412e2fe3df0753
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>

Reply via email to