[kudu-CR] KUDU-1407: reassign failed tablets
Mike Percy has submitted this change and it was merged. Change subject: KUDU-1407: reassign failed tablets .. KUDU-1407: reassign failed tablets Tablets put into the state tablet::FAILED are left until they are manually deleted; they are not evicted and reassigned. If a tablet fails to bootstrap, it will sit, responding to heartbeats, doing nothing else. This patch ensures failed tablets will be reassigned. As the tablets are not used, rather than directly setting replicas to FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving the final state as FAILED. A replica can no longer leave the FAILED state (calls to Shutdown() leave it FAILED). The tserver response generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a leader will immediately evict the peer. Prior to this patch, a tablet was marked FAILED if its WAL or metadata failed to delete (after already shutting down). If this occurs, there is no guarantee that the tablet's metadata reflects the deleted state. This has been made fatal. Testing is done in a few places: - raft_consensus-itest is updated to ensure that tablets that fail to bootstrap are evicted and replaced. - tablet_server-test is also updated to ensure that, instead of TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets. - a test is added to ts_tablet_manager-itest to test that a tablet that is manually failed while running is evicted and replaced. This patch is a part of a series of patches to handle disk failure. See section 2.5 in this doc: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Reviewed-on: http://gerrit.cloudera.org:8080/7440 Tested-by: Kudu Jenkins Reviewed-by: Mike Percy--- M src/kudu/client/scanner-internal.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/integration-tests/ts_tablet_manager-itest.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto 13 files changed, 154 insertions(+), 58 deletions(-) Approvals: Mike Percy: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 21 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1407: reassign failed tablets
Mike Percy has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 20: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 20 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1407: reassign failed tablets
Mike Percy has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 19: (1 comment) http://gerrit.cloudera.org:8080/#/c/7440/19/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS19, Line 646: replica->state() == tablet::FAILED); > Without your tombstoned voting patch, if the replica has been shut down, co Gotcha. OK, let's leave it and I will remove it in my tombstoned voting patch. -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 19 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1407: reassign failed tablets
Andrew Wong has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 20: (6 comments) http://gerrit.cloudera.org:8080/#/c/7440/19/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 627: void PeerMessageQueue::NotifyPeerHasFailed(const string& peer_uuid, const string& reason) { > consider: Done PS19, Line 628: l(queue_lo > This should be protected by queue_lock_ Done PS19, Line 632: ignored. > This should be protected by queue_lock_ Done PS19, Line 633: int64_t current_term = queue_st > No need to hold queue_lock_ while calling NotifyObserversOfFailedFollower() Ah I see, thanks for clarifying. http://gerrit.cloudera.org:8080/#/c/7440/19/src/kudu/integration-tests/ts_recovery-itest.cc File src/kudu/integration-tests/ts_recovery-itest.cc: Line 310: // its FAILED state. > nit: add a comment for why we are specifying this flag Done http://gerrit.cloudera.org:8080/#/c/7440/19/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS19, Line 646: > Why this change? I don't think this is necessary or desired. Without your tombstoned voting patch, if the replica has been shut down, consensus will be shut down. -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 20 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1407: reassign failed tablets
Hello David Ribeiro Alves, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7440 to look at the new patch set (#20). Change subject: KUDU-1407: reassign failed tablets .. KUDU-1407: reassign failed tablets Tablets put into the state tablet::FAILED are left until they are manually deleted; they are not evicted and reassigned. If a tablet fails to bootstrap, it will sit, responding to heartbeats, doing nothing else. This patch ensures failed tablets will be reassigned. As the tablets are not used, rather than directly setting replicas to FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving the final state as FAILED. A replica can no longer leave the FAILED state (calls to Shutdown() leave it FAILED). The tserver response generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a leader will immediately evict the peer. Prior to this patch, a tablet was marked FAILED if its WAL or metadata failed to delete (after already shutting down). If this occurs, there is no guarantee that the tablet's metadata reflects the deleted state. This has been made fatal. Testing is done in a few places: - raft_consensus-itest is updated to ensure that tablets that fail to bootstrap are evicted and replaced. - tablet_server-test is also updated to ensure that, instead of TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets. - a test is added to ts_tablet_manager-itest to test that a tablet that is manually failed while running is evicted and replaced. This patch is a part of a series of patches to handle disk failure. See section 2.5 in this doc: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 --- M src/kudu/client/scanner-internal.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/integration-tests/ts_tablet_manager-itest.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto 13 files changed, 154 insertions(+), 58 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/20 -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 20 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1407: reassign failed tablets
Mike Percy has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 19: (1 comment) http://gerrit.cloudera.org:8080/#/c/7440/19/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS19, Line 646: replica->state() == tablet::FAILED); Why this change? I don't think this is necessary or desired. -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 19 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1407: reassign failed tablets
Mike Percy has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 19: (1 comment) http://gerrit.cloudera.org:8080/#/c/7440/19/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 627: void PeerMessageQueue::NotifyPeerHasFailed(const string& peer_uuid, const string& reason) { consider: std::unique_lock l(queue_lock_); TrackedPeer* peer = FindPtrOrNull(peers_map_, peer_uuid); if (peer) { // Use the current term to ensure the peer will be evicted, otherwise this // notification may be ignored. int64_t current_term = queue_state_.current_term; l.unlock(); NotifyObserversOfFailedFollower(peer_uuid, current_term, reason); } -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 19 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1407: reassign failed tablets
Mike Percy has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 19: (6 comments) http://gerrit.cloudera.org:8080/#/c/7440/19/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: PS19, Line 628: peers_map_ This should be protected by queue_lock_ PS19, Line 632: queue_state_.current_term This should be protected by queue_lock_ PS19, Line 633: NotifyObserversOfFailedFollower No need to hold queue_lock_ while calling NotifyObserversOfFailedFollower() because the thread pool it uses is set in the constructor http://gerrit.cloudera.org:8080/#/c/7440/19/src/kudu/integration-tests/ts_recovery-itest.cc File src/kudu/integration-tests/ts_recovery-itest.cc: Line 310: { "--master_tombstone_evicted_tablet_replicas=false" })); nit: add a comment for why we are specifying this flag http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS17, Line 655: > Failures when writing the data directory are passed off as WARN_NOT_OK() (s ok PS17, Line 658: > Ah, I see. I'll update the comment. ok. -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 19 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1407: reassign failed tablets
Andrew Wong has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 19: The jenkins failure doesn't seem to be related to these changes. This patch is still good for review. -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 19 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1407: reassign failed tablets
Hello David Ribeiro Alves, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7440 to look at the new patch set (#19). Change subject: KUDU-1407: reassign failed tablets .. KUDU-1407: reassign failed tablets Tablets put into the state tablet::FAILED are left until they are manually deleted; they are not evicted and reassigned. If a tablet fails to bootstrap, it will sit, responding to heartbeats, doing nothing else. This patch ensures failed tablets will be reassigned. As the tablets are not used, rather than directly setting replicas to FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving the final state as FAILED. A replica can no longer leave the FAILED state (calls to Shutdown() leave it FAILED). The tserver response generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a leader will immediately evict the peer. Prior to this patch, a tablet was marked FAILED if its WAL or metadata failed to delete (after already shutting down). If this occurs, there is no guarantee that the tablet's metadata reflects the deleted state. This has been made fatal. Testing is done in a few places: - raft_consensus-itest is updated to ensure that tablets that fail to bootstrap are evicted and replaced. - tablet_server-test is also updated to ensure that, instead of TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets. - a test is added to ts_tablet_manager-itest to test that a tablet that is manually failed while running is evicted and replaced. This patch is a part of a series of patches to handle disk failure. See section 2.5 in this doc: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 --- M src/kudu/client/scanner-internal.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/integration-tests/ts_tablet_manager-itest.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto 13 files changed, 151 insertions(+), 58 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/19 -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 19 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1407: reassign failed tablets
Andrew Wong has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 17: (8 comments) http://gerrit.cloudera.org:8080/#/c/7440/17//COMMIT_MSG Commit Message: PS17, Line 30: is added > this is repeated Done PS17, Line 31: failed tablets while running > tablets that fail while running (due to what?) Done http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 629: NotifyObserversOfFailedFollower(peer_uuid, current_term, reason); > nit: No need to hold the lock while calling this method. Done http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: Line 170: DEFINE_bool(master_tombstone_failed_tablet_replicas, true, > Should be removed per below. See master_tombstone_evicted_tablet_replica Done PS17, Line 2473: if (FLAGS_master_tombstone_failed_tablet_replicas) { : SendDeleteReplicaRequest(report.tablet_id(), TABLET_DATA_TOMBSTONED, :boost::none, :tablet->table(), ts_desc->permanent_uuid(), :Substitute("Tablet failed: $0", s.ToString())); : } > Is this required? The leader will now evict a failed follower because of th I think you're right; when the leader sees the failed tablet, it should evict and config change, and then report to the master. http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS17, Line 655: metadata > Couldn't this simply happen if one of the data disks failed? Failures when writing the data directory are passed off as WARN_NOT_OK() (see TabletMetadata::DeleteOrphanedBlocks), since the blocks can always be removed in the future (eg when we next startup). PS17, Line 658: is unclear > Shouldn't the contract of DeleteTabletData() be a crash-consistent one? In Ah, I see. I'll update the comment. Line 752: auto fail_tablet = MakeScopedCleanup([&]() { > I like this approach. It is quite clean indeed! Credit to Adar for the suggestion. -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 17 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1407: reassign failed tablets
Hello David Ribeiro Alves, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7440 to look at the new patch set (#18). Change subject: KUDU-1407: reassign failed tablets .. KUDU-1407: reassign failed tablets Tablets put into the state tablet::FAILED are left until they are manually deleted; they are not evicted and reassigned. If a tablet fails to bootstrap, it will sit, responding to heartbeats, doing nothing else. This patch ensures failed tablets will be reassigned. As the tablets are not used, rather than directly setting replicas to FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving the final state as FAILED. A replica can no longer leave the FAILED state (calls to Shutdown() leave it FAILED). The tserver response generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a leader will immediately evict the peer. Prior to this patch, a tablet was marked FAILED if its WAL or metadata failed to delete (after already shutting down). If this occurs, there is no guarantee that the tablet's metadata reflects the deleted state. This has been made fatal. Testing is done in a few places: - raft_consensus-itest is updated to ensure that tablets that fail to bootstrap are evicted and replaced. - tablet_server-test is also updated to ensure that, instead of TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets. - a test is added to ts_tablet_manager-itest to test that a tablet that is manually failed while running is evicted and replaced. This patch is a part of a series of patches to handle disk failure. See section 2.5 in this doc: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 --- M src/kudu/client/scanner-internal.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/integration-tests/ts_tablet_manager-itest.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto 13 files changed, 151 insertions(+), 59 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/18 -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 18 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1407: reassign failed tablets
Mike Percy has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 17: (8 comments) http://gerrit.cloudera.org:8080/#/c/7440/17//COMMIT_MSG Commit Message: PS17, Line 30: is added this is repeated PS17, Line 31: failed tablets while running tablets that fail while running (due to what?) http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 629: NotifyObserversOfFailedFollower(peer_uuid, current_term, reason); nit: No need to hold the lock while calling this method. http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: Line 170: DEFINE_bool(master_tombstone_failed_tablet_replicas, true, Should be removed per below. See master_tombstone_evicted_tablet_replica PS17, Line 2473: if (FLAGS_master_tombstone_failed_tablet_replicas) { : SendDeleteReplicaRequest(report.tablet_id(), TABLET_DATA_TOMBSTONED, :boost::none, :tablet->table(), ts_desc->permanent_uuid(), :Substitute("Tablet failed: $0", s.ToString())); : } Is this required? The leader will now evict a failed follower because of the changes in the queue in this patch. Once that eviction is committed as a new config change, the master should find out and automatically delete this replica that is part of a stale config (in a safe way that passes in cas_config_opid_index_less_or_equal). See FLAGS_master_tombstone_evicted_tablet_replicas usage in this file. http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS17, Line 655: metadata Couldn't this simply happen if one of the data disks failed? PS17, Line 658: is unclear Shouldn't the contract of DeleteTabletData() be a crash-consistent one? In fact, I think it is (perhaps not well documented) from the perspective of the order in which we delete things. It's extensively tested in ts_recovery-itest. Line 752: auto fail_tablet = MakeScopedCleanup([&]() { I like this approach. -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 17 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1407: reassign failed tablets
David Ribeiro Alves has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 17: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 17 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1407: reassign failed tablets
Andrew Wong has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 17: This patch has been +2'd, but some of the logic around failing tablet replicas overlaps with the tombstoned voting patch. Overall, this patch shouldn't need to change much to make it compatible. -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 17 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1407: reassign failed tablets
Andrew Wong has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 15: Just rebased this in my series of start-time failure-handling patches, since it's necessary for https://gerrit.cloudera.org/#/c/7766/ to reassign tablets that fail to bootstrap. Currently still going through the tombstoned voting patch to see how this needs to be updated. -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 15 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1407: reassign failed tablets
Hello David Ribeiro Alves, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7440 to look at the new patch set (#14). Change subject: KUDU-1407: reassign failed tablets .. KUDU-1407: reassign failed tablets Tablets put into the state tablet::FAILED are left until they are manually deleted; they are not evicted and reassigned. If a tablet fails to bootstrap, it will sit, responding to heartbeats, doing nothing else. This patch ensures failed tablets will be reassigned. As the tablets are not used, rather than directly setting replicas to FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving the final state as FAILED. A replica can no longer leave the FAILED state (calls to Shutdown() leave it FAILED). The tserver response generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a leader will immediately evict the peer. Prior to this patch, a tablet was marked FAILED if its WAL or metadata failed to delete (after already shutting down). If this occurs, there may be an inconsistency on-disk. This has been made fatal. Testing is done in a few places: - raft_consensus-itest is updated to ensure that tablets that fail to bootstrap are evicted and replaced. - tablet_server-test is also updated to ensure that, instead of TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets. - a test is added to ts_tablet_manager-itest is added to test that failed tablets while running are evicted and replaced. This patch is a part of a series of patches to handle disk failure. See section 2.5 in this doc: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 --- M src/kudu/client/scanner-internal.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/integration-tests/ts_tablet_manager-itest.cc M src/kudu/master/catalog_manager.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto 14 files changed, 160 insertions(+), 57 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/14 -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 14 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1407: reassign failed tablets
Andrew Wong has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 13: (1 comment) @dralves Opted out of doing the enums thing we discussed offline. I think there are few enough cases for now for it to be ok just add case handling as needed. http://gerrit.cloudera.org:8080/#/c/7440/12/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS12, Line 753: replica->SetError(s); : replica->Shutdown(); > I don't think you have to lose the error messages: Ah, fair enough. Done. -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 13 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1407: reassign failed tablets
Hello David Ribeiro Alves, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7440 to look at the new patch set (#13). Change subject: KUDU-1407: reassign failed tablets .. KUDU-1407: reassign failed tablets Tablets put into the state tablet::FAILED are left until they are manually deleted; they are not evicted and reassigned. If a tablet fails to bootstrap, it will sit, responding to heartbeats, doing nothing else. This patch ensures failed tablets will be reassigned. As the tablets are not used, rather than directly setting replicas to FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving the final state as FAILED. A replica can no longer leave the FAILED state (calls to Shutdown() leave it FAILED). The tserver response generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a leader will immediately evict the peer. Prior to this patch, a tablet was marked FAILED if its WAL or metadata failed to delete (after already shutting down). If this occurs, there may be an inconsistency on-disk. This has been made fatal. Testing is done in a few places: - raft_consensus-itest is updated to ensure that tablets that fail to bootstrap are evicted and replaced. - tablet_server-test is also updated to ensure that, instead of TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets. - a test is added to ts_tablet_manager-itest is added to test that failed tablets while running are evicted and replaced. This patch is a part of a series of patches to handle disk failure. See section 2.5 in this doc: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 --- M src/kudu/client/scanner-internal.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/integration-tests/ts_tablet_manager-itest.cc M src/kudu/master/catalog_manager.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto 14 files changed, 162 insertions(+), 58 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/13 -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 13 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1407: reassign failed tablets
Adar Dembo has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 12: (1 comment) http://gerrit.cloudera.org:8080/#/c/7440/12/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS12, Line 753: replica->SetError(s); : replica->Shutdown(); > I'm going to keep it this way since it preserves some useful error messages I don't think you have to lose the error messages: Status s; auto cleanup = MakeScopedCleanup([&]() { replica->SetError(s); replica->Shutdown(); } s = cmeta_manager_->Load(...); if (!s.ok()) { LOG(ERROR) << ...; return; } This way you get a customized log message per error site, but you centralize the cleanup in one place. -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 12 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1407: reassign failed tablets
Andrew Wong has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 12: Code-Review-1 (1 comment) I'm a little suspicious of the build failure; I'm currently running a dist-test to make sure this won't flake delete_table-itest http://gerrit.cloudera.org:8080/#/c/7440/12/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS12, Line 753: replica->SetError(s); : replica->Shutdown(); > Should put this in a ScopedCleanup (and cancel it at the end of the functio I'm going to keep it this way since it preserves some useful error messages. Scoped cleanup might make it a bit trickier. -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 12 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1407: reassign failed tablets
Adar Dembo has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 12: (1 comment) Feel free to ignore my feedback if you weren't going to generate a new patch since it's minor. http://gerrit.cloudera.org:8080/#/c/7440/12/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS12, Line 753: replica->SetError(s); : replica->Shutdown(); Should put this in a ScopedCleanup (and cancel it at the end of the function) so you don't have to repeat it over and over. -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 12 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1407: reassign failed tablets
David Ribeiro Alves has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 12: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 12 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1407: reassign failed tablets
Andrew Wong has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 11: (3 comments) http://gerrit.cloudera.org:8080/#/c/7440/11/src/kudu/integration-tests/ts_tablet_manager-itest.cc File src/kudu/integration-tests/ts_tablet_manager-itest.cc: Line 76: using tserver::TSTabletManager; > warning: using decl 'TSTabletManager' is unused [misc-unused-using-decls] Done http://gerrit.cloudera.org:8080/#/c/7440/11/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 620: // TODO: There's actually a race here between the check and shutdown, but > warning: missing username/bug in TODO [google-readability-todo] Done Line 753: replica->SetFailed(s); > error: no member named 'SetFailed' in 'kudu::tablet::TabletReplica' [clang- Done -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 11 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1407: reassign failed tablets
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7440 to look at the new patch set (#12). Change subject: KUDU-1407: reassign failed tablets .. KUDU-1407: reassign failed tablets Tablets put into the state tablet::FAILED are left until they are manually deleted; they are not evicted and reassigned. If a tablet fails to bootstrap, it will sit, responding to heartbeats, doing nothing else. This patch ensures failed tablets will be reassigned. As the tablets are not used, rather than directly setting replicas to FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving the final state as FAILED. A replica can no longer leave the FAILED state (calls to Shutdown() leave it FAILED). The tserver response generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a leader will immediately evict the peer. Prior to this patch, a tablet was marked FAILED if its WAL or metadata failed to delete (after already shutting down). If this occurs, there may be an inconsistency on-disk. This has been made fatal. Testing is done in a few places: - raft_consensus-itest is updated to ensure that tablets that fail to bootstrap are evicted and replaced. - tablet_server-test is also updated to ensure that, instead of TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets. - a test is added to ts_tablet_manager-itest is added to test that failed tablets while running are evicted and replaced. This patch is a part of a series of patches to handle disk failure. See section 2.5 in this doc: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 --- M src/kudu/client/scanner-internal.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/integration-tests/ts_tablet_manager-itest.cc M src/kudu/master/catalog_manager.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto 14 files changed, 167 insertions(+), 67 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/12 -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 12 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1407: reassign failed tablets
Andrew Wong has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 7: (7 comments) http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/client/scanner-internal.cc File src/kudu/client/scanner-internal.cc: PS7, Line 232: case tserver::TabletServerErrorPB::TABLET_FAILED: // fall-through > the difference seems to be that TABLET_NOT_FOUND forces a new metadata fetc Ah, thanks for pointing that out. It seems like the the staleness needs to be refreshed when there is a config change. In this case, even though we know a config change is impending, it may not have happened yet, but it's a fair point that we may still want to retry. If we assume the time between a failure and the config change (maybe a bit longer than the heartbeat interval) is shorter than the time between the client staleness check and it doing the actual metadata re-fetch, this isn't too big a deal (and still not a big deal otherwise). Went with your suggestion of TABLET_NOT_FOUND. http://gerrit.cloudera.org:8080/#/c/7440/8/src/kudu/consensus/consensus_peers.cc File src/kudu/consensus/consensus_peers.cc: Line 291: return; > warning: do not use 'else' after 'return' [readability-else-after-return] Done http://gerrit.cloudera.org:8080/#/c/7440/10/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: PS10, Line 618: ::str > nit remove Done PS10, Line 623: > nit: explain that otherwise this notification is ignored Done http://gerrit.cloudera.org:8080/#/c/7440/10/src/kudu/tablet/tablet_replica.cc File src/kudu/tablet/tablet_replica.cc: PS10, Line 233: turn; > doesn't this have to include FAILED too? Done http://gerrit.cloudera.org:8080/#/c/7440/10/src/kudu/tserver/tablet_service.cc File src/kudu/tserver/tablet_service.cc: PS10, Line 176: SetupErrorAndRespond(resp->mutable_error(), s, TabletServerErrorPB::TABLET_NOT_RUNNING, > likely slightly cleaner to do: Done http://gerrit.cloudera.org:8080/#/c/7440/10/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS10, Line 659: usMessage("De > more informative status message Done -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1407: reassign failed tablets
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7440 to look at the new patch set (#11). Change subject: KUDU-1407: reassign failed tablets .. KUDU-1407: reassign failed tablets Tablets put into the state tablet::FAILED are left until they are manually deleted; they are not evicted and reassigned. If a tablet fails to bootstrap, it will sit, responding to heartbeats, doing nothing else. This patch ensures failed tablets will be reassigned. As the tablets are not used, rather than directly setting replicas to FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving the final state as FAILED. A replica can no longer leave the FAILED state (calls to Shutdown() leave it FAILED). The tserver response generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a leader will immediately evict the peer. Prior to this patch, a tablet was marked FAILED if its WAL or metadata failed to delete (after already shutting down). If this occurs, there may be an inconsistency on-disk. This has been made fatal. Testing is done in a few places: - raft_consensus-itest is updated to ensure that tablets that fail to bootstrap are evicted and replaced. - tablet_server-test is also updated to ensure that, instead of TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets. - a test is added to ts_tablet_manager-itest is added to test that failed tablets while running are evicted and replaced. This patch is a part of a series of patches to handle disk failure. See section 2.5 in this doc: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 --- M src/kudu/client/scanner-internal.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/integration-tests/ts_tablet_manager-itest.cc M src/kudu/master/catalog_manager.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto 14 files changed, 161 insertions(+), 61 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/11 -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 11 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1407: reassign failed tablets
David Ribeiro Alves has posted comments on this change. Change subject: KUDU-1407: reassign failed tablets .. Patch Set 10: (6 comments) http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/client/scanner-internal.cc File src/kudu/client/scanner-internal.cc: PS7, Line 232: case tserver::TabletServerErrorPB::TABLET_FAILED: // fall-through > Hrm, maybe, but I'm keeping this as is for now. Reasoning here was that bef the difference seems to be that TABLET_NOT_FOUND forces a new metadata fetch (as well as blacklisting the server) whereas TABLET_NOT_RUNNING only blacklists. not sure whether it wouldn't be best to re-fetch the metadata since the tablet would have to be forcibly re-replicated. up to you. http://gerrit.cloudera.org:8080/#/c/7440/10/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: PS10, Line 618: std:: nit remove PS10, Line 623: Use the current term to ensure the peer will be evicted. nit: explain that otherwise this notification is ignored http://gerrit.cloudera.org:8080/#/c/7440/10/src/kudu/tablet/tablet_replica.cc File src/kudu/tablet/tablet_replica.cc: PS10, Line 233: state_ == QUIESCING || state_ == SHUTDOWN doesn't this have to include FAILED too? http://gerrit.cloudera.org:8080/#/c/7440/10/src/kudu/tserver/tablet_service.cc File src/kudu/tserver/tablet_service.cc: PS10, Line 176: SetupErrorAndRespond(resp->mutable_error(), s, TabletServerErrorPB::TABLET_FAILED, context); likely slightly cleaner to do: TabletServerErrorPB::Code code = TabletServerErrorPB::TABLET_NOT_RUNNING; if (state == tablet::FAILED) { s = s.CloneAndAppend((*replica)->error().ToString()); code = TabletServerErrorPB::TABLET_FAILED; } SetupErrorAndRespond(resp->mutable_error(), s, code, context); http://gerrit.cloudera.org:8080/#/c/7440/10/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS10, Line 659: s.ToString(); more informative status message -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 10 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1407: reassign failed tablets
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7440 to look at the new patch set (#9). Change subject: KUDU-1407: reassign failed tablets .. KUDU-1407: reassign failed tablets Tablets put into the state tablet::FAILED are left until they are manually deleted; they are not evicted and reassigned. If a tablet fails to bootstrap, it will sit, responding to heartbeats, doing nothing else. This patch ensures failed tablets will be reassigned. As the tablets are not used, rather than directly setting replicas to FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving the final state as FAILED. A replica can no longer leave the FAILED state (calls to Shutdown() leave it FAILED). The tserver response generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a leader will immediately evict the peer. Prior to this patch, a tablet was marked FAILED if its WAL or metadata failed to delete (after already shutting down). If this occurs, there may be an inconsistency on-disk. This has been made fatal. Testing is done in a few places: - raft_consensus-itest is updated to ensure that tablets that fail to bootstrap are evicted and replaced. - tablet_server-test is also updated to ensure that, instead of TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets. - a test is added to ts_tablet_manager-itest is added to test that failed tablets while running are evicted and replaced. This patch is a part of a series of patches to handle disk failure. See section 2.5 in this doc: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 --- M src/kudu/client/scanner-internal.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/integration-tests/ts_tablet_manager-itest.cc M src/kudu/master/catalog_manager.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto 14 files changed, 160 insertions(+), 59 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/9 -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 9 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon