[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-23 Thread Mike Percy (Code Review)
Mike Percy has submitted this change and it was merged.

Change subject: KUDU-1407: reassign failed tablets
..


KUDU-1407: reassign failed tablets

Tablets put into the state tablet::FAILED are left until they are
manually deleted; they are not evicted and reassigned. If a tablet
fails to bootstrap, it will sit, responding to heartbeats, doing nothing
else. This patch ensures failed tablets will be reassigned.

As the tablets are not used, rather than directly setting replicas to
FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving
the final state as FAILED. A replica can no longer leave the FAILED
state (calls to Shutdown() leave it FAILED). The tserver response
generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a
leader will immediately evict the peer.

Prior to this patch, a tablet was marked FAILED if its WAL or metadata
failed to delete (after already shutting down). If this occurs, there is
no guarantee that the tablet's metadata reflects the deleted state. This
has been made fatal.

Testing is done in a few places:
- raft_consensus-itest is updated to ensure that tablets that fail to
  bootstrap are evicted and replaced.
- tablet_server-test is also updated to ensure that, instead of
  TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets.
- a test is added to ts_tablet_manager-itest to test that a tablet that
  is manually failed while running is evicted and replaced.

This patch is a part of a series of patches to handle disk failure. See
section 2.5 in this doc:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Reviewed-on: http://gerrit.cloudera.org:8080/7440
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy 
---
M src/kudu/client/scanner-internal.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
13 files changed, 154 insertions(+), 58 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 21
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-23 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 20: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 20
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-23 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 19:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7440/19/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS19, Line 646:  replica->state() == tablet::FAILED);
> Without your tombstoned voting patch, if the replica has been shut down, co
Gotcha. OK, let's leave it and I will remove it in my tombstoned voting patch.


-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 19
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-23 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 20:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7440/19/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 627: void PeerMessageQueue::NotifyPeerHasFailed(const string& peer_uuid, 
const string& reason) {
> consider:
Done


PS19, Line 628: l(queue_lo
> This should be protected by queue_lock_
Done


PS19, Line 632: ignored.
> This should be protected by queue_lock_
Done


PS19, Line 633: int64_t current_term = queue_st
> No need to hold queue_lock_ while calling NotifyObserversOfFailedFollower()
Ah I see, thanks for clarifying.


http://gerrit.cloudera.org:8080/#/c/7440/19/src/kudu/integration-tests/ts_recovery-itest.cc
File src/kudu/integration-tests/ts_recovery-itest.cc:

Line 310:   // its FAILED state.
> nit: add a comment for why we are specifying this flag
Done


http://gerrit.cloudera.org:8080/#/c/7440/19/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS19, Line 646: 
> Why this change? I don't think this is necessary or desired.
Without your tombstoned voting patch, if the replica has been shut down, 
consensus will be shut down.


-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 20
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-23 Thread Andrew Wong (Code Review)
Hello David Ribeiro Alves, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7440

to look at the new patch set (#20).

Change subject: KUDU-1407: reassign failed tablets
..

KUDU-1407: reassign failed tablets

Tablets put into the state tablet::FAILED are left until they are
manually deleted; they are not evicted and reassigned. If a tablet
fails to bootstrap, it will sit, responding to heartbeats, doing nothing
else. This patch ensures failed tablets will be reassigned.

As the tablets are not used, rather than directly setting replicas to
FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving
the final state as FAILED. A replica can no longer leave the FAILED
state (calls to Shutdown() leave it FAILED). The tserver response
generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a
leader will immediately evict the peer.

Prior to this patch, a tablet was marked FAILED if its WAL or metadata
failed to delete (after already shutting down). If this occurs, there is
no guarantee that the tablet's metadata reflects the deleted state. This
has been made fatal.

Testing is done in a few places:
- raft_consensus-itest is updated to ensure that tablets that fail to
  bootstrap are evicted and replaced.
- tablet_server-test is also updated to ensure that, instead of
  TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets.
- a test is added to ts_tablet_manager-itest to test that a tablet that
  is manually failed while running is evicted and replaced.

This patch is a part of a series of patches to handle disk failure. See
section 2.5 in this doc:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
---
M src/kudu/client/scanner-internal.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
13 files changed, 154 insertions(+), 58 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/20
-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 20
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-23 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 19:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7440/19/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS19, Line 646:  replica->state() == tablet::FAILED);
Why this change? I don't think this is necessary or desired.


-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 19
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-22 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 19:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7440/19/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 627: void PeerMessageQueue::NotifyPeerHasFailed(const string& peer_uuid, 
const string& reason) {
consider:

  std::unique_lock l(queue_lock_);
  TrackedPeer* peer = FindPtrOrNull(peers_map_, peer_uuid);
  if (peer) {
// Use the current term to ensure the peer will be evicted, otherwise this
// notification may be ignored.
int64_t current_term = queue_state_.current_term;
l.unlock();
NotifyObserversOfFailedFollower(peer_uuid, current_term, reason);
  }


-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 19
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-22 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 19:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7440/19/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS19, Line 628: peers_map_
This should be protected by queue_lock_


PS19, Line 632: queue_state_.current_term
This should be protected by queue_lock_


PS19, Line 633: NotifyObserversOfFailedFollower
No need to hold queue_lock_ while calling NotifyObserversOfFailedFollower() 
because the thread pool it uses is set in the constructor


http://gerrit.cloudera.org:8080/#/c/7440/19/src/kudu/integration-tests/ts_recovery-itest.cc
File src/kudu/integration-tests/ts_recovery-itest.cc:

Line 310:   { "--master_tombstone_evicted_tablet_replicas=false" }));
nit: add a comment for why we are specifying this flag


http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS17, Line 655: 
> Failures when writing the data directory are passed off as WARN_NOT_OK() (s
ok


PS17, Line 658: 
> Ah, I see. I'll update the comment.
ok.


-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 19
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-22 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 19:

The jenkins failure doesn't seem to be related to these changes. This patch is 
still good for review.

-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 19
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-22 Thread Andrew Wong (Code Review)
Hello David Ribeiro Alves, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7440

to look at the new patch set (#19).

Change subject: KUDU-1407: reassign failed tablets
..

KUDU-1407: reassign failed tablets

Tablets put into the state tablet::FAILED are left until they are
manually deleted; they are not evicted and reassigned. If a tablet
fails to bootstrap, it will sit, responding to heartbeats, doing nothing
else. This patch ensures failed tablets will be reassigned.

As the tablets are not used, rather than directly setting replicas to
FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving
the final state as FAILED. A replica can no longer leave the FAILED
state (calls to Shutdown() leave it FAILED). The tserver response
generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a
leader will immediately evict the peer.

Prior to this patch, a tablet was marked FAILED if its WAL or metadata
failed to delete (after already shutting down). If this occurs, there is
no guarantee that the tablet's metadata reflects the deleted state. This
has been made fatal.

Testing is done in a few places:
- raft_consensus-itest is updated to ensure that tablets that fail to
  bootstrap are evicted and replaced.
- tablet_server-test is also updated to ensure that, instead of
  TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets.
- a test is added to ts_tablet_manager-itest to test that a tablet that
  is manually failed while running is evicted and replaced.

This patch is a part of a series of patches to handle disk failure. See
section 2.5 in this doc:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
---
M src/kudu/client/scanner-internal.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
13 files changed, 151 insertions(+), 58 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/19
-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 19
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-22 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 17:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/7440/17//COMMIT_MSG
Commit Message:

PS17, Line 30: is added 
> this is repeated
Done


PS17, Line 31: failed tablets while running
> tablets that fail while running (due to what?)
Done


http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 629: NotifyObserversOfFailedFollower(peer_uuid, current_term, reason);
> nit: No need to hold the lock while calling this method.
Done


http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 170: DEFINE_bool(master_tombstone_failed_tablet_replicas, true,
> Should be removed per below. See master_tombstone_evicted_tablet_replica
Done


PS17, Line 2473: if (FLAGS_master_tombstone_failed_tablet_replicas) {
   :   SendDeleteReplicaRequest(report.tablet_id(), 
TABLET_DATA_TOMBSTONED,
   :boost::none,
   :tablet->table(), 
ts_desc->permanent_uuid(),
   :Substitute("Tablet failed: $0", 
s.ToString()));
   : }
> Is this required? The leader will now evict a failed follower because of th
I think you're right; when the leader sees the failed tablet, it should evict 
and config change, and then report to the master.


http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS17, Line 655: metadata
> Couldn't this simply happen if one of the data disks failed?
Failures when writing the data directory are passed off as WARN_NOT_OK() (see 
TabletMetadata::DeleteOrphanedBlocks), since the blocks can always be removed 
in the future (eg when we next startup).


PS17, Line 658: is unclear
> Shouldn't the contract of DeleteTabletData() be a crash-consistent one? In 
Ah, I see. I'll update the comment.


Line 752:   auto fail_tablet = MakeScopedCleanup([&]() {
> I like this approach.
It is quite clean indeed!

Credit to Adar for the suggestion.


-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 17
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-22 Thread Andrew Wong (Code Review)
Hello David Ribeiro Alves, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7440

to look at the new patch set (#18).

Change subject: KUDU-1407: reassign failed tablets
..

KUDU-1407: reassign failed tablets

Tablets put into the state tablet::FAILED are left until they are
manually deleted; they are not evicted and reassigned. If a tablet
fails to bootstrap, it will sit, responding to heartbeats, doing nothing
else. This patch ensures failed tablets will be reassigned.

As the tablets are not used, rather than directly setting replicas to
FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving
the final state as FAILED. A replica can no longer leave the FAILED
state (calls to Shutdown() leave it FAILED). The tserver response
generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a
leader will immediately evict the peer.

Prior to this patch, a tablet was marked FAILED if its WAL or metadata
failed to delete (after already shutting down). If this occurs, there is
no guarantee that the tablet's metadata reflects the deleted state. This
has been made fatal.

Testing is done in a few places:
- raft_consensus-itest is updated to ensure that tablets that fail to
  bootstrap are evicted and replaced.
- tablet_server-test is also updated to ensure that, instead of
  TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets.
- a test is added to ts_tablet_manager-itest to test that a tablet that
  is manually failed while running is evicted and replaced.

This patch is a part of a series of patches to handle disk failure. See
section 2.5 in this doc:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
---
M src/kudu/client/scanner-internal.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
13 files changed, 151 insertions(+), 59 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/18
-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 18
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-22 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 17:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/7440/17//COMMIT_MSG
Commit Message:

PS17, Line 30: is added 
this is repeated


PS17, Line 31: failed tablets while running
tablets that fail while running (due to what?)


http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 629: NotifyObserversOfFailedFollower(peer_uuid, current_term, reason);
nit: No need to hold the lock while calling this method.


http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 170: DEFINE_bool(master_tombstone_failed_tablet_replicas, true,
Should be removed per below. See master_tombstone_evicted_tablet_replica


PS17, Line 2473: if (FLAGS_master_tombstone_failed_tablet_replicas) {
   :   SendDeleteReplicaRequest(report.tablet_id(), 
TABLET_DATA_TOMBSTONED,
   :boost::none,
   :tablet->table(), 
ts_desc->permanent_uuid(),
   :Substitute("Tablet failed: $0", 
s.ToString()));
   : }
Is this required? The leader will now evict a failed follower because of the 
changes in the queue in this patch. Once that eviction is committed as a new 
config change, the master should find out and automatically delete this replica 
that is part of a stale config (in a safe way that passes in 
cas_config_opid_index_less_or_equal). See 
FLAGS_master_tombstone_evicted_tablet_replicas usage in this file.


http://gerrit.cloudera.org:8080/#/c/7440/17/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS17, Line 655: metadata
Couldn't this simply happen if one of the data disks failed?


PS17, Line 658: is unclear
Shouldn't the contract of DeleteTabletData() be a crash-consistent one? In 
fact, I think it is (perhaps not well documented) from the perspective of the 
order in which we delete things. It's extensively tested in ts_recovery-itest.


Line 752:   auto fail_tablet = MakeScopedCleanup([&]() {
I like this approach.


-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 17
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-22 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 17: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 17
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-22 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 17:

This patch has been +2'd, but some of the logic around failing tablet replicas 
overlaps with the tombstoned voting patch. Overall, this patch shouldn't need 
to change much to make it compatible.

-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 17
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-21 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 15:

Just rebased this in my series of start-time failure-handling patches, since 
it's necessary for https://gerrit.cloudera.org/#/c/7766/ to reassign tablets 
that fail to bootstrap.

Currently still going through the tombstoned voting patch to see how this needs 
to be updated.

-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 15
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-18 Thread Andrew Wong (Code Review)
Hello David Ribeiro Alves, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7440

to look at the new patch set (#14).

Change subject: KUDU-1407: reassign failed tablets
..

KUDU-1407: reassign failed tablets

Tablets put into the state tablet::FAILED are left until they are
manually deleted; they are not evicted and reassigned. If a tablet
fails to bootstrap, it will sit, responding to heartbeats, doing nothing
else. This patch ensures failed tablets will be reassigned.

As the tablets are not used, rather than directly setting replicas to
FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving
the final state as FAILED. A replica can no longer leave the FAILED
state (calls to Shutdown() leave it FAILED). The tserver response
generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a
leader will immediately evict the peer.

Prior to this patch, a tablet was marked FAILED if its WAL or metadata
failed to delete (after already shutting down). If this occurs, there
may be an inconsistency on-disk. This has been made fatal.

Testing is done in a few places:
- raft_consensus-itest is updated to ensure that tablets that fail to
  bootstrap are evicted and replaced.
- tablet_server-test is also updated to ensure that, instead of
  TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets.
- a test is added to ts_tablet_manager-itest is added to test that
  failed tablets while running are evicted and replaced.

This patch is a part of a series of patches to handle disk failure. See
section 2.5 in this doc:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
---
M src/kudu/client/scanner-internal.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
14 files changed, 160 insertions(+), 57 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/14
-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 14
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-18 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 13:

(1 comment)

@dralves Opted out of doing the enums thing we discussed offline. I think there 
are few enough cases for now for it to be ok just add case handling as needed.

http://gerrit.cloudera.org:8080/#/c/7440/12/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS12, Line 753: replica->SetError(s);
  : replica->Shutdown();
> I don't think you have to lose the error messages:
Ah, fair enough. Done.


-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-18 Thread Andrew Wong (Code Review)
Hello David Ribeiro Alves, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7440

to look at the new patch set (#13).

Change subject: KUDU-1407: reassign failed tablets
..

KUDU-1407: reassign failed tablets

Tablets put into the state tablet::FAILED are left until they are
manually deleted; they are not evicted and reassigned. If a tablet
fails to bootstrap, it will sit, responding to heartbeats, doing nothing
else. This patch ensures failed tablets will be reassigned.

As the tablets are not used, rather than directly setting replicas to
FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving
the final state as FAILED. A replica can no longer leave the FAILED
state (calls to Shutdown() leave it FAILED). The tserver response
generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a
leader will immediately evict the peer.

Prior to this patch, a tablet was marked FAILED if its WAL or metadata
failed to delete (after already shutting down). If this occurs, there
may be an inconsistency on-disk. This has been made fatal.

Testing is done in a few places:
- raft_consensus-itest is updated to ensure that tablets that fail to
  bootstrap are evicted and replaced.
- tablet_server-test is also updated to ensure that, instead of
  TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets.
- a test is added to ts_tablet_manager-itest is added to test that
  failed tablets while running are evicted and replaced.

This patch is a part of a series of patches to handle disk failure. See
section 2.5 in this doc:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
---
M src/kudu/client/scanner-internal.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
14 files changed, 162 insertions(+), 58 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/13
-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-18 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7440/12/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS12, Line 753: replica->SetError(s);
  : replica->Shutdown();
> I'm going to keep it this way since it preserves some useful error messages
I don't think you have to lose the error messages:

  Status s;
  auto cleanup = MakeScopedCleanup([&]() {
replica->SetError(s);
replica->Shutdown();
  }

  s = cmeta_manager_->Load(...);
  if (!s.ok()) {
LOG(ERROR) << ...;
return;
  }

This way you get a customized log message per error site, but you centralize 
the cleanup in one place.


-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-18 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 12: Code-Review-1

(1 comment)

I'm a little suspicious of the build failure; I'm currently running a dist-test 
to make sure this won't flake delete_table-itest

http://gerrit.cloudera.org:8080/#/c/7440/12/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS12, Line 753: replica->SetError(s);
  : replica->Shutdown();
> Should put this in a ScopedCleanup (and cancel it at the end of the functio
I'm going to keep it this way since it preserves some useful error messages. 
Scoped cleanup might make it a bit trickier.


-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-18 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 12:

(1 comment)

Feel free to ignore my feedback if you weren't going to generate a new patch 
since it's minor.

http://gerrit.cloudera.org:8080/#/c/7440/12/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS12, Line 753: replica->SetError(s);
  : replica->Shutdown();
Should put this in a ScopedCleanup (and cancel it at the end of the function) 
so you don't have to repeat it over and over.


-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-18 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 12: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-17 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 11:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7440/11/src/kudu/integration-tests/ts_tablet_manager-itest.cc
File src/kudu/integration-tests/ts_tablet_manager-itest.cc:

Line 76: using tserver::TSTabletManager;
> warning: using decl 'TSTabletManager' is unused [misc-unused-using-decls]
Done


http://gerrit.cloudera.org:8080/#/c/7440/11/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 620:   // TODO: There's actually a race here between the check and 
shutdown, but
> warning: missing username/bug in TODO [google-readability-todo]
Done


Line 753: replica->SetFailed(s);
> error: no member named 'SetFailed' in 'kudu::tablet::TabletReplica' [clang-
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-17 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7440

to look at the new patch set (#12).

Change subject: KUDU-1407: reassign failed tablets
..

KUDU-1407: reassign failed tablets

Tablets put into the state tablet::FAILED are left until they are
manually deleted; they are not evicted and reassigned. If a tablet
fails to bootstrap, it will sit, responding to heartbeats, doing nothing
else. This patch ensures failed tablets will be reassigned.

As the tablets are not used, rather than directly setting replicas to
FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving
the final state as FAILED. A replica can no longer leave the FAILED
state (calls to Shutdown() leave it FAILED). The tserver response
generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a
leader will immediately evict the peer.

Prior to this patch, a tablet was marked FAILED if its WAL or metadata
failed to delete (after already shutting down). If this occurs, there
may be an inconsistency on-disk. This has been made fatal.

Testing is done in a few places:
- raft_consensus-itest is updated to ensure that tablets that fail to
  bootstrap are evicted and replaced.
- tablet_server-test is also updated to ensure that, instead of
  TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets.
- a test is added to ts_tablet_manager-itest is added to test that
  failed tablets while running are evicted and replaced.

This patch is a part of a series of patches to handle disk failure. See
section 2.5 in this doc:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
---
M src/kudu/client/scanner-internal.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
14 files changed, 167 insertions(+), 67 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/12
-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-17 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 7:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/client/scanner-internal.cc
File src/kudu/client/scanner-internal.cc:

PS7, Line 232: case tserver::TabletServerErrorPB::TABLET_FAILED: // fall-through
> the difference seems to be that TABLET_NOT_FOUND forces a new metadata fetc
Ah, thanks for pointing that out. It seems like the the staleness needs to be 
refreshed when there is a config change. In this case, even though we know a 
config change is impending, it may not have happened yet, but it's a fair point 
that we may still want to retry.

If we assume the time between a failure and the config change (maybe a bit 
longer than the heartbeat interval) is shorter than the time between the client 
staleness check and it doing the actual metadata re-fetch, this isn't too big a 
deal (and still not a big deal otherwise). Went with your suggestion of 
TABLET_NOT_FOUND.


http://gerrit.cloudera.org:8080/#/c/7440/8/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

Line 291: return;
> warning: do not use 'else' after 'return' [readability-else-after-return]
Done


http://gerrit.cloudera.org:8080/#/c/7440/10/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS10, Line 618: ::str
> nit remove
Done


PS10, Line 623: 
> nit: explain that otherwise this notification is ignored
Done


http://gerrit.cloudera.org:8080/#/c/7440/10/src/kudu/tablet/tablet_replica.cc
File src/kudu/tablet/tablet_replica.cc:

PS10, Line 233: turn;
> doesn't this have to include FAILED too?
Done


http://gerrit.cloudera.org:8080/#/c/7440/10/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

PS10, Line 176: SetupErrorAndRespond(resp->mutable_error(), s, 
TabletServerErrorPB::TABLET_NOT_RUNNING,
> likely slightly cleaner to do:
Done


http://gerrit.cloudera.org:8080/#/c/7440/10/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS10, Line 659: usMessage("De
> more informative status message
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-17 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7440

to look at the new patch set (#11).

Change subject: KUDU-1407: reassign failed tablets
..

KUDU-1407: reassign failed tablets

Tablets put into the state tablet::FAILED are left until they are
manually deleted; they are not evicted and reassigned. If a tablet
fails to bootstrap, it will sit, responding to heartbeats, doing nothing
else. This patch ensures failed tablets will be reassigned.

As the tablets are not used, rather than directly setting replicas to
FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving
the final state as FAILED. A replica can no longer leave the FAILED
state (calls to Shutdown() leave it FAILED). The tserver response
generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a
leader will immediately evict the peer.

Prior to this patch, a tablet was marked FAILED if its WAL or metadata
failed to delete (after already shutting down). If this occurs, there
may be an inconsistency on-disk. This has been made fatal.

Testing is done in a few places:
- raft_consensus-itest is updated to ensure that tablets that fail to
  bootstrap are evicted and replaced.
- tablet_server-test is also updated to ensure that, instead of
  TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets.
- a test is added to ts_tablet_manager-itest is added to test that
  failed tablets while running are evicted and replaced.

This patch is a part of a series of patches to handle disk failure. See
section 2.5 in this doc:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
---
M src/kudu/client/scanner-internal.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
14 files changed, 161 insertions(+), 61 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/11
-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-17 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-1407: reassign failed tablets
..


Patch Set 10:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/client/scanner-internal.cc
File src/kudu/client/scanner-internal.cc:

PS7, Line 232: case tserver::TabletServerErrorPB::TABLET_FAILED: // fall-through
> Hrm, maybe, but I'm keeping this as is for now. Reasoning here was that bef
the difference seems to be that TABLET_NOT_FOUND forces a new metadata fetch 
(as well as blacklisting the server) whereas TABLET_NOT_RUNNING only 
blacklists. not sure whether it wouldn't be best to re-fetch the metadata since 
the tablet would have to be forcibly re-replicated. up to you.


http://gerrit.cloudera.org:8080/#/c/7440/10/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

PS10, Line 618: std::
nit remove


PS10, Line 623: Use the current term to ensure the peer will be evicted.
nit: explain that otherwise this notification is ignored


http://gerrit.cloudera.org:8080/#/c/7440/10/src/kudu/tablet/tablet_replica.cc
File src/kudu/tablet/tablet_replica.cc:

PS10, Line 233: state_ == QUIESCING || state_ == SHUTDOWN
doesn't this have to include FAILED too?


http://gerrit.cloudera.org:8080/#/c/7440/10/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

PS10, Line 176: SetupErrorAndRespond(resp->mutable_error(), s, 
TabletServerErrorPB::TABLET_FAILED, context);
likely slightly cleaner to do:
TabletServerErrorPB::Code code = TabletServerErrorPB::TABLET_NOT_RUNNING;
if (state == tablet::FAILED) {
  s = s.CloneAndAppend((*replica)->error().ToString());
  code = TabletServerErrorPB::TABLET_FAILED;
}
SetupErrorAndRespond(resp->mutable_error(), s, code, context);


http://gerrit.cloudera.org:8080/#/c/7440/10/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS10, Line 659: s.ToString();
more informative status message


-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1407: reassign failed tablets

2017-08-10 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7440

to look at the new patch set (#9).

Change subject: KUDU-1407: reassign failed tablets
..

KUDU-1407: reassign failed tablets

Tablets put into the state tablet::FAILED are left until they are
manually deleted; they are not evicted and reassigned. If a tablet
fails to bootstrap, it will sit, responding to heartbeats, doing nothing
else. This patch ensures failed tablets will be reassigned.

As the tablets are not used, rather than directly setting replicas to
FAILED, an error is first recorded and the TabletReplica::Shutdown(), leaving
the final state as FAILED. A replica can no longer leave the FAILED
state (calls to Shutdown() leave it FAILED). The tserver response
generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a
leader will immediately evict the peer.

Prior to this patch, a tablet was marked FAILED if its WAL or metadata
failed to delete (after already shutting down). If this occurs, there
may be an inconsistency on-disk. This has been made fatal.

Testing is done in a few places:
- raft_consensus-itest is updated to ensure that tablets that fail to
  bootstrap are evicted and replaced.
- tablet_server-test is also updated to ensure that, instead of
  TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets.
- a test is added to ts_tablet_manager-itest is added to test that
  failed tablets while running are evicted and replaced.

This patch is a part of a series of patches to handle disk failure. See
section 2.5 in this doc:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
---
M src/kudu/client/scanner-internal.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
14 files changed, 160 insertions(+), 59 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/9
-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon