Todd Lipcon has submitted this change and it was merged.

Change subject: exactly_once_rpc-test: fix gc stress test flakiness
......................................................................


exactly_once_rpc-test: fix gc stress test flakiness

This test involves two threads:

1) the 'stubborn writer' thread retries a request with the same sequence
number over and over. It expects that eventually the cached result will
go stale, and then later that the client will be entirely GCed and thus
the request will start to succeed again.

2) the 'long write' thread uses the normal RetriableRpc mechanism to
send requests, each with an increasing sequence number. Since each of
these requests is new and isn't retried once it succeeds, we expect
never to see a 'stale' response.

The test was flaky, however, because the 'stubborn writer' thread was
always sending its own sequence number as the first_incomplete sequence
number, and we also didn't ensure that it started before the 'long
write' thread. Given that, it was possible to have this interleaving:

  1) start the 'long write' thread, which is assigned seq number 1
  2) before the write is sent, the 'stubborn writer' thread assigns
     itself seq number 2, and sends a request indicating
     first_incomplete=2, so the server treats every seq number below 2
     as complete and eligible for GC.
  3) when the 'long write' thread sends its request with seq number 1,
     it immediately gets a 'stale' response, causing a test failure.

One fix would have been to make the 'stubborn writer' thread send the
first_incomplete calculated by the RequestTracker. However, that would
have involved modifying a bunch of other tests to properly update the
RequestTracker.

So instead, this patch assigns the 'stubborn writer' thread's sequence
number before starting the 'long write' thread. The number it reports
as incomplete is then lower than every sequence number the 'long write'
thread uses, which ensures that the 'stubborn writer' never explicitly
GCs a request made by the 'long write' thread.

With the patch, I looped this test 500 times and it passed every
time[1]. Without the patch, it failed 64 of 500 runs[2].

[1] http://dist-test.cloudera.org//job?job_id=todd.1480926593.3793
[2] http://dist-test.cloudera.org//job?job_id=todd.1480926999.4126

Change-Id: I30a7d06928973964c5285e5e86503e5871ea5995
Reviewed-on: http://gerrit.cloudera.org:8080/5358
Tested-by: Kudu Jenkins
Reviewed-by: David Ribeiro Alves <dral...@apache.org>
---
M src/kudu/rpc/exactly_once_rpc-test.cc
1 file changed, 23 insertions(+), 12 deletions(-)

Approvals:
  David Ribeiro Alves: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/5358

Gerrit-MessageType: merged
Gerrit-Change-Id: I30a7d06928973964c5285e5e86503e5871ea5995
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
