Alexey Serbin has submitted this change and it was merged. Change subject: raft_consensus-itest: fix failing TestMemoryRemainsConstantDespiteTwoDeadFollowers ......................................................................
raft_consensus-itest: fix failing TestMemoryRemainsConstantDespiteTwoDeadFollowers This test started failing recently with a seemingly unrelated change to the minicluster (e6758739a90adeb2e4d0c6cf76185cf90cc7d2b0). It's still a mystery as to the relationship here, after a couple hours of debugging. But, looking at the failures with some extra tracing added, I determined that the issue was something like the following: - the test sets up a scenario in which writes are expected to time out by killing 2/3 replicas - due to recent client changes by Alexey, the client now considers a Timeout error on writes to mean that the server is dead, and marks it as no longer a leader in the cache - this causes the next write on other threads to go to the master to lookup the "new location" assuming that a leader election probably has happened - in fact, _all_ of the threads quickly go to the master to perform the same lookup, exhausing the "master lookup permits" counter. Any excess threads beyond that count end up backing off sleeping - the master lookups in TSAN builds can sometimes take tens of milliseconds, especially in the GCE test environment which uses low core-count workers. Thus, some of the lookups to the masters themselves time out. - those that do succeed, of course, return the same location information as we had before - any writers which happen to wake up from back-offs to get new location information then try to do a write, but it's quick likely that before they get a chance to do so, another thread has already experienced another timeout and marked the replica as bad. Essentially, the test can get into a kind of spiral where it's flooding the master with lookups, each of which is taking longer than 50ms, and thus cause more timeouts, which cause more lookups, etc. As identified elsewhere, the metacache probably needs a pretty major re-working to fix these sorts of problems, but this scenario is also fairly contrived. So, this patch just bumps the timeout to 150ms instead of 50ms, and changes the payload of the writes to be much larger, so the desired backpressure kicks in faster. Before this change, TSAN test runs on gce were failing nearly 100%. With the change I passed 100/100. Change-Id: Ifa7d3d7655c2ecf376e894b8a1412e2fe3df0753 Reviewed-on: http://gerrit.cloudera.org:8080/7337 Tested-by: Kudu Jenkins Reviewed-by: Alexey Serbin <[email protected]> --- M src/kudu/integration-tests/raft_consensus-itest.cc 1 file changed, 2 insertions(+), 1 deletion(-) Approvals: Alexey Serbin: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/7337 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ifa7d3d7655c2ecf376e894b8a1412e2fe3df0753 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd Lipcon <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]>
