Hello Kudu Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/2300
to look at the new patch set (#4).
Change subject: WIP: KUDU-1341. Fix bootstrap to properly handle
delete/insert/update sequences
......................................................................
WIP: KUDU-1341. Fix bootstrap to properly handle delete/insert/update sequences
This adds a new test which reproduces the following bootstrap error:
- a row is written and flushed to a DRS
- the row is updated and deleted, but the DMS is not flushed
- the row is inserted again into MRS and flushed again
- the TS restarts
When bootstrap begins, we have the following tablet state:
- DRS 1: contains the row (original inserted version)
- DRS 2: contains the row (second inserted version)
and the following log:
REPLICATE 1.1: INSERT orig version
COMMIT 1.1: dms id = 1
- bootstrap correctly decides that this already flushed
REPLICATE 1.2: UPDATE
COMMIT 1.2: drs_id = 1, dms_id = 1
- bootstrap correctly decides that the update is not flushed
- it applies the update to the tablet
- the tablet MutateRow code checks the interval tree and finds
that this row could be in either DRS 1 or DRS 2
- it tries applying the update to them in whatever order the
RowSetTree returns the DiskRowSets
- if it happens to try DRS 2 first, then we incorrectly apply
the update to the new version of the row instead of the old
We haven't seen this bug in most of our testing because the RowSetTree
returns rows in a deterministic order. We only test the
insert-update-delete-insert sequence in a few test cases, and it appears
that those test cases either:
1) do not restart the tablet server, and thus don't see bootstrap issues
2) happen to result in RowSetTrees that give back the DRS in the lucky
order that doesn't trigger the bug.
This patch changes the tablet code so that, in debug mode, the results
returned by RowSetTree are shuffled before being used. This makes the
bug more likely to reproduce. It adds a specific test case which reproduces
the bug reliably in debug builds with the shuffling enabled.
Additionally, this patch moves the "fuzz test" from tablet_random-access-test
into a new 'fuzz-itest', and changes it to write to a single-TS minicluster
rather than directly to a tablet. The test disables the maintenance manager
and manually manages background operations, so that it has the same capabilities
as before, but the fact that it's writing to the "full stack" allows us to
add a new fuzz test action: restarting the tablet server. Additionally, the
improved test now supports generating writes to multiple rows, since some
errors don't show up if you have single-row rowsets.
With the improved fuzz test, I was able to generate sequences that reliably
reproduce the crash described in KUDU-1341 as well as several other bugs:
incorrect results on reads, other CHECK failures in compactions, etc.
The fuzz tester should give us good confidence in the bug fix.
The WIP fix in this patch is to pass the operation result from the COMMIT
message
in the WAL back into the tablet while replaying. The tablet code then uses
this to specifically pick the correct rowset to reapply the update.
It's WIP because the code is missing various error handling, should probably
be refactored, etc, and not 100% sure this is the best solution.
Change-Id: I6017ef67ae236021f7e6bd19d21b89310b8e6894
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/fuzz-itest.cc
M src/kudu/tablet/deltafile.cc
M src/kudu/tablet/deltafile.h
M src/kudu/tablet/row_op.cc
M src/kudu/tablet/row_op.h
M src/kudu/tablet/rowset_tree.cc
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_random_access-test.cc
M src/kudu/tserver/tablet_server-test.cc
11 files changed, 697 insertions(+), 337 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/00/2300/4
--
To view, visit http://gerrit.cloudera.org:8080/2300
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6017ef67ae236021f7e6bd19d21b89310b8e6894
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins