Mike Percy has posted comments on this change. Change subject: KUDU-871. Support tombstoned voting ......................................................................
Patch Set 10: (29 comments) http://gerrit.cloudera.org:8080/#/c/6960/10/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: PS10, Line 1452: boost:: > nit: "using boost::optional" is already specified Done PS10, Line 1494: local_last_logged_opid = queue_->GetLastOpIdInLog(); > What if tombstone_last_logged_opid was provided as well? Is it some some o It's OK. I added a comment to explain. Line 1856: CHECK_EQ(kStopping, state_) << State_Name(state_); > this is already CHECKed in SetStateUnlocked, no? True, removed Line 1863: cmeta_->set_leader_uuid(""); > ClearLeaderUnlocked() ? Done http://gerrit.cloudera.org:8080/#/c/6960/10/src/kudu/consensus/raft_consensus.h File src/kudu/consensus/raft_consensus.h: PS10, Line 351: | ^ : // | | : // +--------------------+ > nit: prefer the slimmer version at L424 Done PS10, Line 372: // Raft is in the process of stopping and will not accept writes. : kStopping, > nit: is it possible to request a vote if in this state? It seems it is, bu yes; done http://gerrit.cloudera.org:8080/#/c/6960/10/src/kudu/integration-tests/cluster_itest_util.h File src/kudu/integration-tests/cluster_itest_util.h: PS10, Line 236: // Request the given replica to vote. > nit: could you document at least first 5 parameters of this method? It's just a thin wrapper around an RPC call. I added a comment referring to its definition. http://gerrit.cloudera.org:8080/#/c/6960/10/src/kudu/integration-tests/delete_tablet-itest.cc File src/kudu/integration-tests/delete_tablet-itest.cc: Line 95: // Tablet bootstrap failure should result in a SHUTDOWN tablet with an error > per discussion on slack, let's revert this part of this patch since it isn' Done http://gerrit.cloudera.org:8080/#/c/6960/10/src/kudu/integration-tests/raft_consensus-itest.cc File src/kudu/integration-tests/raft_consensus-itest.cc: PS10, Line 3044: back come > nit: come back removed this change per convo w/ Todd on slack and here in the comments http://gerrit.cloudera.org:8080/#/c/6960/10/src/kudu/integration-tests/tombstoned_voting-itest.cc File src/kudu/integration-tests/tombstoned_voting-itest.cc: Line 18: #include "kudu/consensus/metadata.pb.h" > nit: from IWYU Done, thank you PS10, Line 46: using std::string; : using std::vector; > nit: include these? Done PS10, Line 167: TS1 > nit: the tablet? Or it's possible to tombstone a tablet server now? added a comment to explain PS10, Line 172: TServerDetails* ts1_ets > nit: consider moving this closer to the call site where it's used. Done Line 173: scoped_refptr<TabletReplica> replica; > nit: does it make sense to verify the scenario below works without errors e Makes sense. Done. PS10, Line 181: "A" > Could you add a comment explaining that the actual candidate UUID is not cr Done PS10, Line 249: ts0_replica->consensus()->StepDown(&resp) > nit: should it be some selective error handling based on the code set in th I'm not sure why else it would fail. This is a pretty controlled environment. If a disk fails, all the tests will fail on that Jenkins run so we don't really care about that. PS10, Line 253: ts0 > nit: 'TS0' since it has been referred as that already. Is your comment parser case-sensitive? :) Done PS10, Line 256: FOLLOWER > Could it change to something like 'LEARNER' in the middle and bring some fl Nope. PS10, Line 263: error (nee failed > per other comment, back out this change Done PS10, Line 294: Shut each TS > down Done http://gerrit.cloudera.org:8080/#/c/6960/10/src/kudu/integration-tests/tombstoned_voting-stress-test.cc File src/kudu/integration-tests/tombstoned_voting-stress-test.cc: Line 18: #include <atomic> > nit: from IWYU report: thanks! PS10, Line 44: using std::atomic; : using std::string; : using std::thread; : using std::unique_lock; : using std::vector; > nit: includes? Done PS10, Line 57: : num_workers_(1), > this test seems to be implemented for multiple workers but only has one wor I initially wanted >1 worker but had trouble making it deterministic enough to assert on things I wanted to assert on, so I dropped it to 1 worker. When I tried to simplify the code for a single worker it didn't really seem better, so I kept the more general implementation. http://gerrit.cloudera.org:8080/#/c/6960/10/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: Line 2489: LOG(WARNING) << "Tablet " << tablet->ToString() << " has failed on TS " > wouldn't it be natural to have tablet::SHUTDOWN check here instead? Or it' I thought it was a little unwise to have a crash based on remote state reporting in, but I reverted this change per other comments related to FAILED, and since it's only a DCHECK and not a CHECK it's generally not a big deal to leave it in so I have left it like it was before. http://gerrit.cloudera.org:8080/#/c/6960/10/src/kudu/master/sys_catalog.cc File src/kudu/master/sys_catalog.cc: PS10, Line 347: // TODO(mpercy): Set to failed if Init() fails? > Does this still apply? Without automatic re-replication of the master the utility of doing much of anything is limited. Anyway, thanks for the catch, I'll do that. http://gerrit.cloudera.org:8080/#/c/6960/10/src/kudu/tablet/tablet_metadata.h File src/kudu/tablet/tablet_metadata.h: PS10, Line 357: Has no meaning for non-tombstoned tablets > Is this to say that this will be boost::none for non-tombstoned tablets? It will be boost::none for TABLET_DATA_READY tablets. http://gerrit.cloudera.org:8080/#/c/6960/10/src/kudu/tablet/tablet_replica.cc File src/kudu/tablet/tablet_replica.cc: PS10, Line 304: LOG(DFATAL) > The rest of the CHECKs would work for non-debug builds as well. What is th changed to FATAL http://gerrit.cloudera.org:8080/#/c/6960/10/src/kudu/tablet/tablet_replica.h File src/kudu/tablet/tablet_replica.h: PS10, Line 199: SetError > nit: sort of brought this up on Slack, I think the naming for this could be reverted per other convo http://gerrit.cloudera.org:8080/#/c/6960/10/src/kudu/tserver/tablet_copy_client.cc File src/kudu/tserver/tablet_copy_client.cc: PS10, Line 298: RETURN_NOT_OK(WriteConsensusMetadata()); > This change isn't noted in the commit message. Mind documenting it, or at l Added a comment here and a bullet point to the commit message. -- To view, visit http://gerrit.cloudera.org:8080/6960 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ia19d75b185299443b27f41e468bbae20065e7570 Gerrit-PatchSet: 10 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy <mpe...@apache.org> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com> Gerrit-Reviewer: Edward Fancher <e...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <mpe...@apache.org> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-HasComments: Yes