Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7440

to look at the new patch set (#6).

Change subject: disk failure: reassign failed tablets
......................................................................

disk failure: reassign failed tablets

Tablets put into the state tablet::FAILED are left until they are
manually deleted. This is an issue because failed tablets don't get
evicted and reassigned (e.g. if a tablet fails to bootstrap, it will
sit, responding to heartbeats, doing nothing else).

To remediate this, this patch changes the tserver response generated by
FAILED tablets to a new TABLET_FAILED, which is ignored by leaders to
promote eviction.

Additionally, a new tablet state is added: FAILED_AND_SHUTDOWN. Like
QUIESCING and SHUTDOWN, TabletReplica::Shutdown() can wait on
FAILED_AND_SHUTDOWN. This is useful if a failed tablet needs to be shut
down and still needs to be reassigned. Calling normal Shutdown() cannot
leave the replica in the FAILED state, and the SHUTDOWN state cannot
itself indicate the need for eviction.

Prior to this patch, tablets were set to FAILED when they failed to
delete metadata. This is no longer the case. Since error statuses
during deletion are only returned during IO to the metadata directory,
and because the metadata directory is a single point of failure,
failures on this codepath are made fatal for now. Once this is no
longer the case, these failures should be made benign, as proper error
handling should make files in the failed metadata directory
unreachable. This ensures the tablets that were meant to be deleted are
not reassigned.

The test raft_consensus-itest is updated to ensure that failed tablets
are evicted and replaced. The test tablet_server-test is also updated
to ensure that, instead of TABLET_NOT_RUNNING, TABLET_FAILED is
returned by failed tablets.

This patch is a part of a series of patches to handle disk failure.
See section 2.5 in this doc:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
---
M src/kudu/client/scanner-internal.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
12 files changed, 105 insertions(+), 56 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/6
--
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>