Todd Lipcon has posted comments on this change. Change subject: KUDU-871. Support tombstoned voting ......................................................................
Patch Set 6: (25 comments) http://gerrit.cloudera.org:8080/#/c/6960/6/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: PS6, Line 1485: && state_ == kRunning why is this condition necesary? if we've just been tombstoned it still makes sense that we should withhold votes for a period of time, right? PS6, Line 1509: CheckSafeToVoteUnlocked the duplicate logic here with CheckSafeToVoteUnlocked is a little unnerving. What do you think about essentially inlining its logic here and merging with this if condition? eg something of this nature: switch (state_) { case kShutdown: return IllegalState... case kRunning: local_last_logged = GetLatestFromLog() break; default: if (FLAGS_allow_tombstone_voting && tombstone_last_logged_opid) { local_last_logged = tombstone_last_logged; } else { return IllegalState } break; } I think that might be easier to follow and net less lines of code PS6, Line 1514: DCHECK(tombstone_last_logged_opid) << LogPrefixUnlocked() << "expected last-logged opid"; : local_last_logged_opid = *tombstone_last_logged_opid; if this DCHECK fails, would we crash anyway trying to dereference the boost::none? if so, I think we should either use CHECK (so we get a nice crash log) or we should try to handle this case by just denying the vote (if we aren't 100% sure that this would always be the case, we can always vote conservatively) http://gerrit.cloudera.org:8080/#/c/6960/6/src/kudu/consensus/raft_consensus.h File src/kudu/consensus/raft_consensus.h: Line 130: RaftConsensus(ConsensusOptions options, can the ctor be made private? (may need ALLOW_MAKE_SHARED) I don't see any external usage of it but maybe my grepping was bad. PS6, Line 271: term > nit: current_term() would be more explicit I think CurrentTerm or Term would be better since this takes lock_ and that lock is often held for long periods of time (thus the call may be quite slow compared to your typical getter) PS6, Line 292: Tombstoned voting > not super clear what "tombstoned voting" is maybe explain it somewhere? agreed. Maybe better here to just say that it synchronously transitions to kStopped state, and refer to the kStopped definition in the state enum for what behavior that entails? Line 297: void Shutdown(); same Line 339: enum State { it would be nice to have a state transition diagram of sorts here. Or alternatively, for each state, list out the potential subsequent states and what triggers the transition? eg I can't remember now whether we can transition back from kStopped to kRunning or not. PS6, Line 340: // RaftConsensus has been freshly constructed. : kNew, is this state ever externally exposed, considering we now have the factory method that returns a post-init object? if not, worth noting as much PS6, Line 343: // RaftConsensus has been initialized. : kInitialized, in terms of behavior, how is this state different than Stopped? eg let's say I start up a server which has some tablets in tombstoned state, do their RaftConsensus instances end up in kInitialized state or kStopped state? Is it important to make a distinction? PS6, Line 391: // Initializes the RaftConsensus object. This should be called before : // publishing this object to any thread other than the thread that invoked : // the constructor. : Status Init(); since the constructor is only invoked by the factory method, maybe not necessary to be so explicit? PS6, Line 647: // existence of a known last logged opid. : Status CheckSafeToVoteUnlocked(const boost::optional<OpId>& tombstone_last_logged_opid) : similar to David's comment on the implementation: I think it would make more sense to pass a 'bool tombstoned_op_id_known' or somesuch, and in the implementation add a comment explaining the scenario(s) in which we would _not_ know the tombstoned op id and why it's not safe to vote in those scenarios. Use of a bool instead of the opid makes it more clear that we just care about whether we know a valid opid, and aren't deciding here based on its value. Line 783: AtomicBool stopped_; it seems it was already like this, but what's the distinction between this 'stopped_' member and just checking 'state_ == kStopped'? mind adding a comment explaining and perhaps a TODO to get rid of it if possible? (not going to make you get rid of it in this patch since it's pre-existing and I dont think you made it worse) http://gerrit.cloudera.org:8080/#/c/6960/6/src/kudu/integration-tests/tombstoned_voting-itest.cc File src/kudu/integration-tests/tombstoned_voting-itest.cc: PS6, Line 80: call Stop() on you mean 'manually tombstone' right? I was confused by reading this comment into thinking you were SIGSTOPping TS1 and then expecting it to vote or something, which didn't make much sense. http://gerrit.cloudera.org:8080/#/c/6960/4/src/kudu/integration-tests/tombstoned_voting-stress-test.cc File src/kudu/integration-tests/tombstoned_voting-stress-test.cc: Line 60: protected: this seems unused? Line 182: // We want to control leader election manually and we only want 2 replicas. would be nice to have another more stressful test that doesn't attempt to verify any kind of correctness or coordinate the lifecycle changes with the vote requests. I think the coordination here has some effect that reduces the probability of races http://gerrit.cloudera.org:8080/#/c/6960/6/src/kudu/tablet/tablet_replica.cc File src/kudu/tablet/tablet_replica.cc: PS6, Line 142: consensus_ = std::move(consensus); : set_state(INITIALIZED); : i'm surprised these lines don't need to take lock_? is this method always called before publishing a TabletReplica instance? Line 315: case STOPPED: hrm, no case for STOPPING? http://gerrit.cloudera.org:8080/#/c/6960/6/src/kudu/tablet/tablet_replica.h File src/kudu/tablet/tablet_replica.h: PS6, Line 81: // Initializes RaftConsensus. : Status Init(ThreadPool* raft_pool); left a similar question in the implementation: is it expected that this be called before publishing an instance? might also be nice to add a note that, even if this fails, it leaves the instance in a self-consistent state (which is in contrast to some other Init methods) so that it can be left registered in the replica map, etc? PS6, Line 96: tombstoned voting I think it's better to generalize this a bit. Is this _only_ used in the case of tombstoning? or can all stopped replicas vote? is it legal to call this without first putting the tablet into tombstoned state? Line 102: // Shut down this tablet replica permanently. does this require that Stop be called first? again would be nice to have some diagram of the state transitions to refer to for these lifecycle methods, perhaps in the class doc http://gerrit.cloudera.org:8080/#/c/6960/4/src/kudu/tserver/tablet_service.cc File src/kudu/tserver/tablet_service.cc: Line 947: // from the TabletMetadata into RaftConsensus::RequestVote(). > Yeah, it's racy, but it's safe because we're checking after we have a refer I guess my concern is less with the safety of the race but more with whether the state checks are even proper, given they are race. I did some hacking on the stress test locally to inject some sleeps after grabbing the state and managed to catch the case where the tablet transitioned from READY to TOMBSTONED. This resulted in a call into RaftConsensus::RequestVote() with no last_logged_opid, so it returned: > Illegal state: RaftConsensus is not running: State = Stopped given we already need to handle this racy check within the RaftConsensus vote code itself, I'm not sure what extra value this is bringing to check it additionally here? BTW, it would be really nice to include some class-level lifecycle comments somewhere. I keep forgetting what happens to the TabletReplica, Tablet, TabletMetadata, and RaftConsensus instances in the case of a tombstone and a re-copy -- which ones are reused, how we ensure that the old ones aren't used anymore after the new ones are created, etc. I remember drawing such a diagram with you a couple weeks ago, and would be nice to have it in the code for reference. PS4, Line 966: LOG(INFO) << "Received RequestConsensusVote() RPC: " << Secu > > when would this equal MinimumOpId? if it's never been tombstoned? > Additionally, just using max() seems wrong in case there was a log truncation > involved. We want to use our latest opid state. If there is truncation then the truncating OpId would be 'higher' than the other one, because it has a higher term, right? i.e comparison is lexicographic comparison of <term, index> tuples http://gerrit.cloudera.org:8080/#/c/6960/6/src/kudu/tserver/tablet_service.cc File src/kudu/tserver/tablet_service.cc: PS6, Line 979: replica->tablet_metadata()->tombstone_last_logged_opid maybe cleaner for it to just return an optional<OpId> itself rather than the special uninitialized value? http://gerrit.cloudera.org:8080/#/c/6960/6/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 663: Status s = DeleteTabletData(replica->tablet_metadata(), cmeta_manager_, delete_type, per discussion on Slack, we may need to modify this so that, if there was no valid log prior to the deletion, we explicitly _clear_ the last_logged_opid, or set to some sentinel value or something to indicate that it's not safe to vote due to potentially lost log state -- To view, visit http://gerrit.cloudera.org:8080/6960 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ia19d75b185299443b27f41e468bbae20065e7570 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy <mpe...@apache.org> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <mpe...@apache.org> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-HasComments: Yes