Mike Percy has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
......................................................................

Patch Set 17:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/7440/17//COMMIT_MSG
Commit Message:

PS17, Line 30: is added
this is repeated


PS17, Line 31: failed tablets while running
tablets that fail while running (due to what?)


http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 629: NotifyObserversOfFailedFollower(peer_uuid, current_term, reason);
nit: No need to hold the lock while calling this method.


http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 170: DEFINE_bool(master_tombstone_failed_tablet_replicas, true,
Should be removed per below. See master_tombstone_evicted_tablet_replicas.


PS17, Line 2473: if (FLAGS_master_tombstone_failed_tablet_replicas) {
               :   SendDeleteReplicaRequest(report.tablet_id(), TABLET_DATA_TOMBSTONED,
               :                            boost::none,
               :                            tablet->table(), ts_desc->permanent_uuid(),
               :                            Substitute("Tablet failed: $0", s.ToString()));
               : }
Is this required? The leader will now evict a failed follower because of the changes in the queue in this patch. Once that eviction is committed as a new config change, the master should find out and automatically delete this replica, which is now part of a stale config (in a safe way that passes in cas_config_opid_index_less_or_equal). See the usage of FLAGS_master_tombstone_evicted_tablet_replicas in this file.


http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS17, Line 655: metadata
Couldn't this simply happen if one of the data disks failed?


PS17, Line 658: is unclear
Shouldn't the contract of DeleteTabletData() be a crash-consistent one? In fact, I think it is (though perhaps not well documented), given the order in which we delete things. It's extensively tested in ts_recovery-itest.


Line 752: auto fail_tablet = MakeScopedCleanup([&]() {
I like this approach.

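To illustrate the nit on consensus_queue.cc Line 629: a minimal sketch of the usual pattern, which is to copy whatever the callback needs while holding the lock, release the lock, and only then notify observers. The class FollowerTracker and its methods are hypothetical stand-ins written for this example; they are not the actual Kudu queue code, which may structure this differently.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a queue that tracks followers.
// Not Kudu's real class; names are assumptions for illustration only.
class FollowerTracker {
 public:
  using Observer = std::function<void(const std::string& peer_uuid,
                                      std::int64_t term,
                                      const std::string& reason)>;

  void AddObserver(Observer observer) {
    std::lock_guard<std::mutex> l(lock_);
    observers_.push_back(std::move(observer));
  }

  void SetTerm(std::int64_t term) {
    std::lock_guard<std::mutex> l(lock_);
    current_term_ = term;
  }

  std::int64_t current_term() {
    std::lock_guard<std::mutex> l(lock_);
    return current_term_;
  }

  // Snapshot the state under the lock, then notify with the lock released.
  void MarkFollowerFailed(const std::string& peer_uuid,
                          const std::string& reason) {
    std::int64_t term;
    std::vector<Observer> observers;
    {
      std::lock_guard<std::mutex> l(lock_);
      term = current_term_;
      observers = observers_;
    }
    // The lock is no longer held here, so an observer may safely call
    // back into this object without deadlocking on the non-recursive mutex.
    for (const auto& observer : observers) {
      observer(peer_uuid, term, reason);
    }
  }

 private:
  std::mutex lock_;
  std::int64_t current_term_ = 0;
  std::vector<Observer> observers_;
};
```

The trade-off in this sketch is that observers see a snapshot of the term, which may be slightly stale by the time the callback runs. In exchange, the critical section stays short and a callback that re-enters the object cannot deadlock.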
-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 17
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes