Andrew Wong has posted comments on this change.

Change subject: disk failure: reassign failed tablets
......................................................................


Patch Set 8:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/client/scanner-internal.cc
File src/kudu/client/scanner-internal.cc:

PS7, Line 232: case tserver::TabletServerErrorPB::TABLET_FAILED: // fall-through
> would it make more sense to have this be like: TABLET_NOT_FOUND? How do we 
Hrm, maybe, but I'm keeping this as is for now. The reasoning is that, before 
this change, a tablet in the FAILED state was treated as TABLET_NOT_RUNNING 
(TNR). Looking at client/scanner-internal.cc, it seems we blacklist the 
location on TNR (if there's somewhere else I should be looking, please let me 
know).

I'm not sure it makes sense to retry on TNR. A retry could succeed if the 
tablet were NOT_STARTED or BOOTSTRAPPING, but tablets in QUIESCING and SHUTDOWN 
are also considered NOT_RUNNING, and retrying those seems pointless.
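
To make the two points above concrete, here is a rough sketch. It is not the 
actual client code: every enum and function name below is a hypothetical 
stand-in. The claim that a FAILED replica cannot recover on its own is also my 
assumption. The sketch shows TABLET_FAILED falling through to the same handling 
as TABLET_NOT_RUNNING, and which not-running states a retry could plausibly 
help with.

```cpp
// Sketch only: hypothetical stand-ins, not Kudu's real types or client logic.
enum class ErrorCode { TABLET_NOT_RUNNING, TABLET_FAILED, OTHER };
enum class Action { BLACKLIST_LOCATION, PROPAGATE_ERROR };

// States that are all reported to the client as "not running".
enum class NotRunningState {
  NOT_STARTED, BOOTSTRAPPING, FAILED, QUIESCING, SHUTDOWN
};

// TABLET_FAILED shares the TABLET_NOT_RUNNING handling via fall-through,
// so the location is blacklisted in both cases.
Action HandleScanError(ErrorCode code) {
  switch (code) {
    case ErrorCode::TABLET_FAILED:  // fall-through
    case ErrorCode::TABLET_NOT_RUNNING:
      return Action::BLACKLIST_LOCATION;
    default:
      return Action::PROPAGATE_ERROR;
  }
}

// A retry against the same replica can only help if the replica is on its
// way to running. Assumption: FAILED, QUIESCING and SHUTDOWN never recover
// without outside intervention.
bool RetryCouldHelp(NotRunningState state) {
  switch (state) {
    case NotRunningState::NOT_STARTED:
    case NotRunningState::BOOTSTRAPPING:
      return true;
    case NotRunningState::FAILED:
    case NotRunningState::QUIESCING:
    case NotRunningState::SHUTDOWN:
      return false;
  }
  return false;  // Unreachable for valid enum values.
}
```

Since the client only sees the error code and not the underlying state, it 
cannot tell the recoverable cases from the rest, which is why blacklisting on 
TNR seems like the safer default to me.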


http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

PS7, Line 284: sponse_.error().code() == TabletServerErrorPB::TABLET_FAILED) 
> maybe in this case we should directly call: NotifyObserversOfFailedFollower
Done.


http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS7, Line 638: // Initiate Tablet Copy on the peer if the tablet is not found.
             :     if (response.has_error()) {
             :       CHECK_EQ(tserver::TabletServerErrorPB::TABLET_NOT_FOUND, response.error().code());
             :       peer->needs_tablet_copy = true;
             :       VLOG_WITH_PREFIX_UNLOCKED(1) << "Marked peer as needing tablet copy: "
             :                                     << peer->ToString();
             :       *more_pending = true;
             :       return;
             :     }
             : 
             :     // Sanity checks.
             :     // Some of these can be eventually removed, but they are handy for now.
             :     DCHECK(response.status().IsInitialized())
             :         << "Error: Uninitialized: " << response.InitializationErrorString()
             :         << ". Response: "<< SecureShortDebugString(response);
             :     // TODO: Include uuid in error messages as well.
             :     DCHECK(response.has_responder_uuid() && !response.responder_uuid().empty())
             :      
> see my comment on the call site
Done.


http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

PS7, Line 170: DEFINE_bool(master_tombstone_failed_tablet_replicas, true,
             :             "Whether the master should tombstone (delete) tablet replicas that "
             :             "are reporting a failed state. Only for testing!");
             : TAG_FLAG(master_tombstone_failed_tablet_replicas, hidden);
> is this a test only thing?
As of now, yes. Will update to make that clear.


http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/tablet/metadata.proto
File src/kudu/tablet/metadata.proto:

PS7, Line 161: the tablet will be evicted and
> ??
Should read "evicted and replaced."


-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes
