Hello David Ribeiro Alves, Kudu Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7440

to look at the new patch set (#14).

Change subject: KUDU-1407: reassign failed tablets
......................................................................

KUDU-1407: reassign failed tablets

Tablets put into the state tablet::FAILED are left until they are
manually deleted; they are not evicted and reassigned. If a tablet
fails to bootstrap, it sits idle, responding to heartbeats and doing
nothing else.

This patch ensures failed tablets will be reassigned. As the tablets
are not used, rather than directly setting replicas to FAILED, an error
is first recorded and then TabletReplica::Shutdown() is called, leaving
the final state as FAILED. A replica can no longer leave the FAILED
state (calls to Shutdown() leave it FAILED).

The tserver response generated by FAILED tablets is now TABLET_FAILED.
Upon receiving this, a leader will immediately evict the peer.

Prior to this patch, a tablet was marked FAILED if its WAL or metadata
failed to delete (after already shutting down). If this occurs, there
may be an inconsistency on disk. This has been made fatal.

Testing is done in a few places:
- raft_consensus-itest is updated to ensure that tablets that fail to
  bootstrap are evicted and replaced.
- tablet_server-test is also updated to ensure that failed tablets
  return TABLET_FAILED instead of TABLET_NOT_RUNNING.
- a test is added to ts_tablet_manager-itest to verify that tablets
  that fail while running are evicted and replaced.

This patch is a part of a series of patches to handle disk failure.
See section 2.5 in this doc:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
---
M src/kudu/client/scanner-internal.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
14 files changed, 160 insertions(+), 57 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/14
--
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 14
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
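
Illustrative note (not code from the patch): the state rule in the
commit message, where an error is recorded first, Shutdown() then leaves
the replica FAILED, FAILED is terminal, and a leader evicts a peer that
answers TABLET_FAILED, might be modeled roughly as below. ReplicaSketch,
SetError(), HandleRequest() and ShouldEvictPeer() are hypothetical names
chosen for this sketch; only FAILED, Shutdown(), TABLET_FAILED and
TABLET_NOT_RUNNING come from the message itself, and the real
TabletReplica is considerably more involved.

```cpp
#include <string>

// Hypothetical, simplified model; these are NOT Kudu's actual classes.
enum class ReplicaState { RUNNING, SHUTDOWN, FAILED };
enum class ResponseCode { OK, TABLET_NOT_RUNNING, TABLET_FAILED };

class ReplicaSketch {
 public:
  // Record an error. The state only changes on the next Shutdown().
  void SetError(const std::string& msg) { error_ = msg; }

  // Shut down the replica. If an error was recorded, the final state is
  // FAILED; once FAILED, further calls leave the state untouched.
  void Shutdown() {
    if (state_ == ReplicaState::FAILED) return;  // FAILED is terminal.
    state_ = error_.empty() ? ReplicaState::SHUTDOWN : ReplicaState::FAILED;
  }

  // Response code a tablet server would return for this replica.
  ResponseCode HandleRequest() const {
    switch (state_) {
      case ReplicaState::RUNNING:  return ResponseCode::OK;
      case ReplicaState::FAILED:   return ResponseCode::TABLET_FAILED;
      case ReplicaState::SHUTDOWN: return ResponseCode::TABLET_NOT_RUNNING;
    }
    return ResponseCode::TABLET_NOT_RUNNING;  // Unreachable; quiets warnings.
  }

  ReplicaState state() const { return state_; }

 private:
  ReplicaState state_ = ReplicaState::RUNNING;
  std::string error_;
};

// Leader-side decision in this sketch: evict only on TABLET_FAILED.
inline bool ShouldEvictPeer(ResponseCode code) {
  return code == ResponseCode::TABLET_FAILED;
}
```

In this model a cleanly shut down replica answers TABLET_NOT_RUNNING and
is not evicted, while a replica shut down after SetError() answers
TABLET_FAILED and is.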