[kudu-CR] WIP: bug in exactly-once during tablet bootstrap

2016-12-09 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change.

Change subject: WIP: bug in exactly-once during tablet bootstrap
..


Patch Set 2:

I figured out the bug. Will post a fix soon.

-- 
To view, visit http://gerrit.cloudera.org:8080/5417
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60b3b30b0705b4f9063b0d505cb9ab1ca24e470a
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No


[kudu-CR] WIP: bug in exactly-once during tablet bootstrap

2016-12-09 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change.

Change subject: WIP: bug in exactly-once during tablet bootstrap
..


Patch Set 2:

After digging around a bit more, I think the problem is a bit more insidious
than us just not keeping the error. After fixing that bug I still got a
miscount.

For instance, this log entry:
5.24@6067477835358027776 REPLICATE WRITE_OP
Tablet: ca254b0152444e97be34aa60658961da
RequestId: client_id: "5885ac11fece4646a9a014031c6ce856" seq_no: 21 
first_incomplete_seq_no: 8 attempt_no: 21
Consistency: CLIENT_PROPAGATED
Propagated TS: 6067477833427873792
op 0: INSERT (int32 key=11, int32 int_val=1590579801, string 
string_val="hello world")
op 1: INSERT (int32 key=11, int32 int_val=1908549293, string 
string_val="hello world")
op 2: INSERT (int32 key=12, int32 int_val=1209943213, string 
string_val="hello world")
has this commit message:
COMMIT 5.24
op_type: WRITE_OP commited_op_id { term: 5 index: 24 } result { ops { 
mutated_stores { mrs_id: 0 } } ops { flushed: true } ops { flushed: true } }

Note that the second op repeats the key of the first (key=11), so it's weird
that the first op in the commit message is applied to the MRS but the second
op says flushed.
I added a few extra log statements to tablet bootstrap and am investigating
this further.
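
To make the observation about the repeated key concrete, here is a minimal
sketch that finds which ops in a batch reuse a key already seen earlier in the
same batch. RepeatedKeyOpIndices is a hypothetical helper written for this
note; it is not part of Kudu, and it only looks at the keys, not at the commit
results.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper, not a Kudu API: returns the indices of ops whose key
// was already used by an earlier op in the same batch.
std::vector<std::size_t> RepeatedKeyOpIndices(const std::vector<int>& keys) {
  std::set<int> seen;
  std::vector<std::size_t> repeated;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    // insert() reports 'false' in .second when the key is already in the set.
    if (!seen.insert(keys[i]).second) {
      repeated.push_back(i);
    }
  }
  return repeated;
}
```

For the batch in the log entry above (keys 11, 11, 12) this flags op 1 only,
which is the op whose 'flushed: true' result looks suspicious.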



[kudu-CR] WIP: bug in exactly-once during tablet bootstrap

2016-12-08 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change.

Change subject: WIP: bug in exactly-once during tablet bootstrap
..


Patch Set 2:

Oops, sorry, I ended up rebasing your patch on master.



[kudu-CR] WIP: bug in exactly-once during tablet bootstrap

2016-12-08 Thread David Ribeiro Alves (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/5417

to look at the new patch set (#2).

Change subject: WIP: bug in exactly-once during tablet bootstrap
..

WIP: bug in exactly-once during tablet bootstrap

Here's a regression test for the bug which is causing
raft_consensus-itest to occasionally think it has inserted 23 rows when
in fact it has only inserted 20.

The issue is in the rewriting of logs during bootstrap. If we do a write
which gets a duplicate key error, the first time the COMMIT message is
written, it includes the error.

When the server restarts, it writes the COMMIT message again with only
'flushed: true' in the commit message. This is enough for bootstrap to
know not to bother to replay it on subsequent restarts, but it has lost
the error messages themselves.

If the server restarts again, at this point it doesn't rebuild a proper
response, but instead puts an errorless response into the ResultTracker.

So, if an operation hits an error, and then the tablet server restarts
twice while the client is still retrying, the client will falsely think
that its operation has succeeded.
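
The sequence above can be illustrated with a toy model. This is only a sketch
under stated assumptions: OpResult, CommitRecord, RewriteOnBootstrap and
CountErrors are hypothetical names made up for illustration and do not
correspond to Kudu's actual classes or to the real bootstrap code. An empty
error string stands for "no error".

```cpp
#include <string>
#include <vector>

// Toy stand-ins for the per-op results stored in a COMMIT message.
struct OpResult {
  bool flushed = false;
  std::string error;  // empty means the op recorded no error
};

struct CommitRecord {
  std::vector<OpResult> ops;
};

// Models the lossy rewrite: every op is recorded as 'flushed: true' and the
// original per-op error is not carried over.
CommitRecord RewriteOnBootstrap(const CommitRecord& original) {
  OpResult flushed_only;
  flushed_only.flushed = true;
  CommitRecord rewritten;
  rewritten.ops.assign(original.ops.size(), flushed_only);
  return rewritten;
}

// Counts the errors that a response rebuilt from this record would contain.
int CountErrors(const CommitRecord& commit) {
  int errors = 0;
  for (const OpResult& op : commit.ops) {
    if (!op.error.empty()) ++errors;
  }
  return errors;
}

// Example record: one successful op and one duplicate-key failure.
CommitRecord MakeCommitWithOneError() {
  CommitRecord commit;
  commit.ops.resize(2);
  commit.ops[1].error = "duplicate key";
  return commit;
}
```

In this model the record from MakeCommitWithOneError() carries one error, while
the record returned by RewriteOnBootstrap() carries none. A response rebuilt
from the rewritten record on the second restart therefore looks like a success
to the retrying client.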

This includes a regression test which shows the bug, but I haven't looked
into fixing it yet.

Change-Id: I60b3b30b0705b4f9063b0d505cb9ab1ca24e470a
---
M src/kudu/integration-tests/test_workload.cc
M src/kudu/tserver/tablet_server-test.cc
2 files changed, 73 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/17/5417/2

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60b3b30b0705b4f9063b0d505cb9ab1ca24e470a
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins