Mike Percy has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
......................................................................


Patch Set 17:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/7440/17//COMMIT_MSG
Commit Message:

PS17, Line 30: is added 
this is repeated


PS17, Line 31: failed tablets while running
tablets that fail while running (due to what?)


http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 629:     NotifyObserversOfFailedFollower(peer_uuid, current_term, reason);
nit: No need to hold the lock while calling this method.
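A minimal sketch of the pattern this nit suggests, assuming a simple observer list guarded by a std::mutex: update shared state under the lock, copy what the callbacks need, then invoke them after releasing it. FollowerTracker, MarkFailed, NumFailed, and the observer signature are hypothetical illustrations, not Kudu's PeerMessageQueue API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the consensus queue; names are illustrative only.
class FollowerTracker {
 public:
  using Observer = std::function<void(const std::string& uuid, int64_t term,
                                      const std::string& reason)>;

  void AddObserver(Observer obs) {
    std::lock_guard<std::mutex> l(lock_);
    observers_.push_back(std::move(obs));
  }

  void MarkFailed(const std::string& uuid, const std::string& reason) {
    int64_t term;
    std::vector<Observer> observers;
    {
      // Mutate and snapshot shared state while holding the lock...
      std::lock_guard<std::mutex> l(lock_);
      failed_.push_back(uuid);
      term = current_term_;
      observers = observers_;
    }
    // ...then notify with the lock released, so an observer that calls back
    // into this object (or blocks) cannot deadlock or stall other callers.
    for (const auto& obs : observers) obs(uuid, term, reason);
  }

  size_t NumFailed() {
    std::lock_guard<std::mutex> l(lock_);
    return failed_.size();
  }

 private:
  std::mutex lock_;
  int64_t current_term_ = 1;
  std::vector<std::string> failed_;
  std::vector<Observer> observers_;
};
```

Because std::mutex is not recursive, an observer that calls NumFailed() would deadlock if MarkFailed() still held lock_ during notification.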


http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 170: DEFINE_bool(master_tombstone_failed_tablet_replicas, true,
Should be removed per the comment below. See master_tombstone_evicted_tablet_replicas.


PS17, Line 2473:     if (FLAGS_master_tombstone_failed_tablet_replicas) {
               :       SendDeleteReplicaRequest(report.tablet_id(), TABLET_DATA_TOMBSTONED,
               :                                boost::none,
               :                                tablet->table(), ts_desc->permanent_uuid(),
               :                                Substitute("Tablet failed: $0", s.ToString()));
               :     }
Is this required? With the queue changes in this patch, the leader will now
evict a failed follower. Once that eviction is committed as a new config change,
the master should find out and automatically delete this replica, since it is
part of a stale config (and it does so safely, by passing
cas_config_opid_index_less_or_equal). See
FLAGS_master_tombstone_evicted_tablet_replicas usage in this file.
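A simplified sketch of why the CAS-guarded path is the safer one, assuming a replica that tracks the opid index of its committed config: the delete succeeds only if that index is no newer than the one the master based its decision on. Note that the snippet above passes boost::none for that argument, which makes its delete unconditional. Replica, DeleteResult, and TombstoneIfConfigNotNewer are hypothetical; std::optional stands in for boost::optional.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical, simplified model of a replica's persisted state.
struct Replica {
  int64_t committed_config_opid_index;
  bool tombstoned = false;
};

// Hypothetical result codes.
enum class DeleteResult { kOk, kConfigTooNew };

// Tombstones the replica only if its committed config is no newer than the
// config the master saw. If the replica has since committed a newer config
// (for example, it was re-added), the delete is rejected. An empty optional
// means "no check", i.e. an unconditional delete.
DeleteResult TombstoneIfConfigNotNewer(
    Replica* r, std::optional<int64_t> cas_config_opid_index_less_or_equal) {
  if (cas_config_opid_index_less_or_equal &&
      r->committed_config_opid_index > *cas_config_opid_index_less_or_equal) {
    return DeleteResult::kConfigTooNew;
  }
  r->tombstoned = true;
  return DeleteResult::kOk;
}
```

With the guard, a master acting on a stale view of the config cannot tombstone a replica that has already moved on to a newer config.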


http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS17, Line 655: metadata
Couldn't this simply happen if one of the data disks failed?


PS17, Line 658: is unclear
Shouldn't the contract of DeleteTabletData() be a crash-consistent one? In 
fact, I think it is (perhaps not well documented) from the perspective of the 
order in which we delete things. It's extensively tested in ts_recovery-itest.


Line 752:   auto fail_tablet = MakeScopedCleanup([&]() {
I like this approach.
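A minimal sketch of the scoped-cleanup idea praised here, assuming the intent is "mark the tablet failed on every early-exit path unless we reach the success point". ScopedCleanup, MakeCleanup, State, and OpenTablet are hypothetical illustrations; Kudu's own MakeScopedCleanup helper may differ in details.

```cpp
#include <cassert>
#include <utility>

// Hypothetical minimal scope guard: runs the functor on scope exit unless
// cancel() was called first.
template <typename F>
class ScopedCleanup {
 public:
  explicit ScopedCleanup(F f) : f_(std::move(f)) {}
  ScopedCleanup(ScopedCleanup&& other)
      : f_(std::move(other.f_)), cancelled_(other.cancelled_) {
    other.cancelled_ = true;  // The moved-from guard must not fire.
  }
  ScopedCleanup(const ScopedCleanup&) = delete;
  ScopedCleanup& operator=(const ScopedCleanup&) = delete;
  ~ScopedCleanup() {
    if (!cancelled_) f_();
  }
  void cancel() { cancelled_ = true; }

 private:
  F f_;
  bool cancelled_ = false;
};

template <typename F>
ScopedCleanup<F> MakeCleanup(F f) {
  return ScopedCleanup<F>(std::move(f));
}

enum class State { kBootstrapping, kRunning, kFailed };

// Hypothetical open routine: any early return leaves the tablet marked failed.
bool OpenTablet(State* state, bool bootstrap_ok, bool start_ok) {
  *state = State::kBootstrapping;
  auto fail_tablet = MakeCleanup([state]() { *state = State::kFailed; });
  if (!bootstrap_ok) return false;  // Cleanup runs: tablet marked failed.
  if (!start_ok) return false;      // Cleanup runs: tablet marked failed.
  fail_tablet.cancel();             // Success path: disarm the cleanup.
  *state = State::kRunning;
  return true;
}
```

The benefit is that new early-return paths added later are covered automatically, instead of each needing its own "mark failed" call.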


-- 
To view, visit http://gerrit.cloudera.org:8080/7440

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 17
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes
