[kudu-CR] WIP: bug in exactly-once during tablet bootstrap
David Ribeiro Alves has posted comments on this change.

Change subject: WIP: bug in exactly-once during tablet bootstrap
..

Patch Set 2:

I figured out the bug. Will post a fix soon.

--
To view, visit http://gerrit.cloudera.org:8080/5417
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60b3b30b0705b4f9063b0d505cb9ab1ca24e470a
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon
Gerrit-Reviewer: David Ribeiro Alves
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No
[kudu-CR] WIP: bug in exactly-once during tablet bootstrap
David Ribeiro Alves has posted comments on this change.

Change subject: WIP: bug in exactly-once during tablet bootstrap
..

Patch Set 2:

After digging around a bit more, I think the problem is more insidious than us just not keeping the error. After fixing that bug I still got a miscount.

For instance, this log entry:

    5.24@6067477835358027776 REPLICATE WRITE_OP
        Tablet: ca254b0152444e97be34aa60658961da
        RequestId: client_id: "5885ac11fece4646a9a014031c6ce856" seq_no: 21 first_incomplete_seq_no: 8 attempt_no: 21
        Consistency: CLIENT_PROPAGATED
        Propagated TS: 6067477833427873792
        op 0: INSERT (int32 key=11, int32 int_val=1590579801, string string_val="hello world")
        op 1: INSERT (int32 key=11, int32 int_val=1908549293, string string_val="hello world")
        op 2: INSERT (int32 key=12, int32 int_val=1209943213, string string_val="hello world")

has this commit message:

    COMMIT 5.24
        op_type: WRITE_OP
        commited_op_id { term: 5 index: 24 }
        result { ops { mutated_stores { mrs_id: 0 } } ops { flushed: true } ops { flushed: true } }

Note that op 1 repeats the key of op 0 (key=11), so it's weird that the first op in the commit message is applied to the MRS while the second op just says flushed.

I added a few extra log statements to tablet bootstrap and am investigating this further.
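The oddity described in the message above can be phrased as a mechanical check over a batch. The following is a minimal sketch and not Kudu code: `SuspiciousDuplicateInserts` and its inputs are hypothetical names, and it assumes that an INSERT whose key was already inserted successfully earlier in the same batch should be recorded with an error.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy check (hypothetical, not part of Kudu): given the INSERT keys of one
// batch and whether each op's recorded result carries an error, return the
// indices of ops that repeat a key already inserted without error earlier
// in the same batch, yet are themselves recorded without an error.
std::vector<std::size_t> SuspiciousDuplicateInserts(
    const std::vector<int>& keys, const std::vector<bool>& has_error) {
  assert(keys.size() == has_error.size());
  std::set<int> inserted_ok;  // keys inserted without error so far
  std::vector<std::size_t> suspicious;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    if (has_error[i]) continue;  // a recorded error is the expected outcome
    if (inserted_ok.count(keys[i]) != 0) {
      suspicious.push_back(i);   // repeated key, but no error recorded
    }
    inserted_ok.insert(keys[i]);
  }
  return suspicious;
}
```

Fed the keys from the log entry above (11, 11, 12) with no errors recorded for any op, the check flags index 1, which is the op the message calls out.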
[kudu-CR] WIP: bug in exactly-once during tablet bootstrap
David Ribeiro Alves has posted comments on this change.

Change subject: WIP: bug in exactly-once during tablet bootstrap
..

Patch Set 2:

Oops, sorry: I ended up rebasing your patch on master.
[kudu-CR] WIP: bug in exactly-once during tablet bootstrap
Hello Kudu Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/5417

to look at the new patch set (#2).

Change subject: WIP: bug in exactly-once during tablet bootstrap
..

WIP: bug in exactly-once during tablet bootstrap

Here's a regression test for the bug which is causing raft_consensus-itest to occasionally think it has inserted 23 rows when in fact it has only inserted 20.

The issue is in the rewriting of logs during bootstrap. If we do a write which gets a duplicate key error, the first time the COMMIT message is written, it includes the error. When the server restarts, it writes the COMMIT message again with only 'flushed: true' in the commit message. This is enough for bootstrap to know not to bother to replay it on subsequent restarts, but it has lost the error messages themselves. If the server restarts again, at this point it doesn't rebuild a proper response, but instead puts an errorless response into the ResultTracker.

So, if an operation hits an error, and then the tablet server restarts twice while the client is still retrying, the client will falsely think that its operation has succeeded.

This includes a regression test which shows the bug, but I haven't looked into fixing it yet.

Change-Id: I60b3b30b0705b4f9063b0d505cb9ab1ca24e470a
---
M src/kudu/integration-tests/test_workload.cc
M src/kudu/tserver/tablet_server-test.cc
2 files changed, 73 insertions(+), 0 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/17/5417/2
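The failure sequence in the commit message above can be modelled in a few lines. This is a toy sketch under stated assumptions, not Kudu's bootstrap code: `ToyOpResult`, `ToyCommit`, `RewriteLossy`, `RewritePreservingErrors`, and `CountErrors` are hypothetical names, and an empty `error` string stands in for "no error". `RewritePreservingErrors` is included only as a contrast and is not claimed to be the fix for this change.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one per-op entry of a COMMIT message.
struct ToyOpResult {
  bool flushed;       // true once bootstrap knows the op needs no replay
  std::string error;  // empty means "no error" in this toy model
};

typedef std::vector<ToyOpResult> ToyCommit;

// Models the lossy rewrite described above: every op is marked
// 'flushed: true' and any per-op error is dropped.
ToyCommit RewriteLossy(const ToyCommit& original) {
  ToyCommit out(original.size());  // elements start as {false, ""}
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i].flushed = true;
  }
  return out;
}

// Contrast case for illustration: mark ops flushed but carry errors over.
ToyCommit RewritePreservingErrors(const ToyCommit& original) {
  ToyCommit out(original);
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i].flushed = true;
  }
  return out;
}

// Number of per-op errors a response rebuilt from this commit would carry.
int CountErrors(const ToyCommit& commit) {
  int n = 0;
  for (std::size_t i = 0; i < commit.size(); ++i) {
    if (!commit[i].error.empty()) ++n;
  }
  return n;
}
```

In this model, a commit holding one duplicate-key error reports one error when a response is rebuilt from it directly (the first restart), but zero errors once it has passed through `RewriteLossy` (the second restart), which corresponds to the retrying client being told that its operation succeeded.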
[kudu-CR] WIP: bug in exactly-once during tablet bootstrap
Hello David Ribeiro Alves,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/5417

to review the following change.

Change subject: WIP: bug in exactly-once during tablet bootstrap
..

WIP: bug in exactly-once during tablet bootstrap

Here's a regression test for the bug which is causing raft_consensus-itest to occasionally think it has inserted 23 rows when in fact it has only inserted 20.

The issue is in the rewriting of logs during bootstrap. If we do a write which gets a duplicate key error, the first time the COMMIT message is written, it includes the error. When the server restarts, it writes the COMMIT message again with only 'flushed: true' in the commit message. This is enough for bootstrap to know not to bother to replay it on subsequent restarts, but it has lost the error messages themselves. If the server restarts again, at this point it doesn't rebuild a proper response, but instead puts an errorless response into the ResultTracker.

So, if an operation hits an error, and then the tablet server restarts twice while the client is still retrying, the client will falsely think that its operation has succeeded.

This includes a regression test which shows the bug, but I haven't looked into fixing it yet.

Change-Id: I60b3b30b0705b4f9063b0d505cb9ab1ca24e470a
---
M src/kudu/integration-tests/test_workload.cc
M src/kudu/tserver/tablet_server-test.cc
2 files changed, 72 insertions(+), 1 deletion(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/17/5417/1