Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit
    http://gerrit.cloudera.org:8080/4409

to look at the new patch set (#4).

Change subject: consensus: properly truncate all state when aborting operations
......................................................................

consensus: properly truncate all state when aborting operations

This fixes a consensus bug which was causing exactly_once_writes-itest
to be slightly flaky. The issue was the following sequence:

- a node A is a follower, and has some operations appended
  (e.g. 10.5 through 10.7)
- a node B is elected for term 11, and sends node 'A' a status-only
  request containing preceding_op_id=11.6
-- node 'A' aborts operations 10.6 and 10.7
-- HOWEVER: it was not explicitly removing these operations from the
   LogCache or the Log. Removal was only happening on an actual
   operation _replacement_.
- node 'B' loses its leadership before it is able to replicate anything
  to a majority
- node 'A' gets elected for term 12
-- it calls Queue::SetLeaderMode()
-- this triggers the first requests to be sent to the peer
-- we hit a race where the first request is being constructed _before_
   the leader appends its initial NO_OP to the queue
--- because we never truncated the log cache or queue, we see
    operations 10.6 and 10.7 in the queue, and send them to a follower
-- we now append the NO_OP 12.6 which replaces the aborted 10.6.

In this case, the peer that received the faulty request from the leader
may end up committing those operations, whereas the rest of the nodes
commit operations from term 12.

The fix in this patch is to explicitly truncate the queue and the
LogCache state when we are aborting operations.

WIP because it needs a few more comments.

To test, I looped exactly_once_writes-itest --gtest_filter=\*Churny\*
1000 times before and after. Without the patch[1], I got 17 failures,
16 of which were verification errors that one of the committed op terms
did not match. With the patch[2], I got 5 failures, all of which were
checksum errors while verifying the logs.
Since seeing those failures, I fixed the verifier to run only after
shutting down the cluster.

[1] http://dist-test.cloudera.org/job?job_id=todd.1473812577.12216
[2] http://dist-test.cloudera.org/job?job_id=todd.1473811112.9830

Change-Id: I2fb95b447991b7cadc2c403bc2596fead0bd31ad
---
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/log_cache.cc
M src/kudu/consensus/log_cache.h
M src/kudu/consensus/raft_consensus-test.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/exactly_once_writes-itest.cc
A src/kudu/integration-tests/log_verifier.cc
A src/kudu/integration-tests/log_verifier.h
15 files changed, 346 insertions(+), 17 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/09/4409/4
--
To view, visit http://gerrit.cloudera.org:8080/4409
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2fb95b447991b7cadc2c403bc2596fead0bd31ad
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
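
Illustrative note on the sequence described in the commit message: the
sketch below is a toy model, not the code in this patch. It shows why
aborted operations left in a cache can leak into the first request of a
newly elected leader. All names in it (OpId as a plain struct,
ToyLogCache, TruncateOpsAfter, ReadOpsAfter, OpsSentAfterAbort) are
hypothetical stand-ins chosen for the example and should not be read as
the actual Kudu classes or signatures.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical, simplified operation id: "term.index", e.g. 10.6.
struct OpId {
  int64_t term;
  int64_t index;
};

// Toy stand-in for a cache of appended operations, keyed by index.
class ToyLogCache {
 public:
  void Append(const OpId& op) { ops_[op.index] = op; }

  // Drops every cached operation whose index is greater than 'index'.
  void TruncateOpsAfter(int64_t index) {
    ops_.erase(ops_.upper_bound(index), ops_.end());
  }

  // Returns the cached operations with index greater than 'after_index',
  // i.e. what a leader would pack into a request for a peer that has
  // everything up to and including 'after_index'.
  std::vector<OpId> ReadOpsAfter(int64_t after_index) const {
    std::vector<OpId> out;
    for (auto it = ops_.upper_bound(after_index); it != ops_.end(); ++it) {
      out.push_back(it->second);
    }
    return out;
  }

 private:
  std::map<int64_t, OpId> ops_;
};

// Models node 'A': it holds 10.5 through 10.7, aborts 10.6 and 10.7, and
// later builds its first request as leader before appending its NO_OP.
// 'truncate_on_abort' toggles between the buggy and the fixed behaviour.
std::vector<OpId> OpsSentAfterAbort(bool truncate_on_abort) {
  ToyLogCache cache;
  cache.Append({10, 5});
  cache.Append({10, 6});
  cache.Append({10, 7});

  // Abort of everything after index 5. In the buggy variant the cache
  // is left untouched, so the aborted entries remain readable.
  if (truncate_on_abort) {
    cache.TruncateOpsAfter(5);
  }

  // First request constructed before the term-12 NO_OP is appended.
  return cache.ReadOpsAfter(5);
}
```

In this model, OpsSentAfterAbort(false) returns the stale operations
10.6 and 10.7, which corresponds to the faulty request in the sequence
above, while OpsSentAfterAbort(true) returns nothing, so the next
operation the peer can receive at index 6 is the new leader's NO_OP.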