[kudu-CR] WIP disk failure: coordinate disk failure handling

2017-07-07 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7030

to look at the new patch set (#11).

Change subject: WIP disk failure: coordinate disk failure handling
..

WIP disk failure: coordinate disk failure handling

This patch adds the logic required to prevent a crash on disk failure.

Disk failure handling happens in a few places:
- block/container-level functions that call env functions that may
  result in disk failure can run callbacks to fail/shutdown tablets in
  the parent data dir
- tablet-level functions that CHECK for failures are ended early if the
  tablet is known to have data on a bad disk
- transactions can now be canceled to force a shutdown of a tablet
  replica instead of waiting for transactions to complete
- tablets in FAILED or the new FAILED_AND_SHUTDOWN state will trigger
  replication
- scans that run into a disk failure will return with the new
  TABLET_FAILED state, or find another server if fault tolerant
- failure at startup (IO to instance files) is covered in a later patch

A set of basic tests are added in ts_disk_failure-itest.

TODO:
- crash if tablet metadata dir is bad and when opening the tablet
  metadata

Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
---
M src/kudu/client/scanner-internal.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
M src/kudu/tablet/transactions/alter_schema_transaction.h
M src/kudu/tablet/transactions/transaction.h
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/transaction_tracker.h
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tserver/CMakeLists.txt
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
A src/kudu/tserver/ts_disk_failure-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
M src/kudu/tserver/tserver.proto
M src/kudu/util/status.h
35 files changed, 585 insertions(+), 119 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/7030/11
-- 
To view, visit http://gerrit.cloudera.org:8080/7030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] WIP disk failure: coordinate disk failure handling

2017-07-07 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: WIP disk failure: coordinate disk failure handling
..


Patch Set 9:

(23 comments)

http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/block_manager.h
File src/kudu/fs/block_manager.h:

Line 141:   virtual void HandleError(const Status& s, DataDir* dir) const = 0;
> Why does HandleError() need to be part of the external API to WritableBlock
Done


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

Line 45:   RETURN_NOT_OK(s_); \
> Wrong patch, but shouldn't this be just 'return s'? We've already establish
Done


Line 56:   RETURN_NOT_OK(s_); \
> Same.
Done


Line 61: #define HANDLE_DISK_FAILURE(status_expr, err_handler) do { \
> Since this is so similar to RETURN_NOT_OK_HANDLE would prefer if the macro 
I would agree, but RETURN_NOT_OK_HANDLE and RETURN_NOT_OK_HANDLE_ERROR are both 
non-specific to disk failures.


Line 64:   (err_handler); \
> Nit: indentation
Done


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

Line 370: if (PREDICT_FALSE(IsDiskFailure(sync))) {
> Shouldn't these be HANDLE_DISK_FAILURE calls instead?
HandleError() should be sufficient (without the PREDICT_FALSE here)


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 399:   void HandleError(const Status& s, DataDir* dir) const;
> This isn't an accessor; bring it up into the other section and doc it.
Done


Line 402:   friend class LogWritableBlock;
> Don't need? HandleError() is public after all.
Done


Line 498: Status LogBlockContainer::Create(LogBlockManager* block_manager,
> What about disk failures during container creation?
Since this is a static method, the handling wraps the Create() method. With the 
right input to Create(), this could also be handled in the Create() call itself 
to better align with the idea of only wrapping env calls.


Line 604:   // Open the existing metadata and data files for writing.
> What about failures here?
Done


Line 648:   
RETURN_NOT_OK(block_manager()->env()->NewRandomAccessFile(metadata_path, 
_reader));
> What about failures here?
Done


Line 656: read_status = pb_reader.ReadNextPB();
> And here?
Done


Line 858:   return metadata_file_->Append(pb);
> Failures here?
Done


Line 881: return metadata_file_->Sync();
> What about this one?
Done


Line 886: Status LogBlockContainer::ReopenMetadataWriter() {
> And failures here?
Done


Line 1090:   DataDir* mutable_data_dir() {
> Where is this used?
It was used in calls to HandleError (probably should've been private!). Since 
HandleError() doesn't need a specified dir, I've taken it out.


Line 1305:   DataDir* mutable_data_dir() {
> And this?
Same as above.


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/delta_tracker.cc
File src/kudu/tablet/delta_tracker.cc:

Line 696:   Status s = FlushDMS(old_dms.get(), , flush_type);
> Shouldn't we just RETURN_NOT_OK here and drop the CHECK? Isn't the expectat
Done


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/metadata.proto
File src/kudu/tablet/metadata.proto:

Line 160:   // The Tablet has been stopped after a failure. In this state, like 
FAILED,
> Why do we need a new state for this? Why isn't FAILED sufficient?
When running into a disk failure, there's eventually a call to 
TabletReplica::Shutdown(). The final state of this has indicate that the tablet 
is also shutdown (to ensure Shutdown() remains idempotent).


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/tablet_replica_mm_ops.cc
File src/kudu/tablet/tablet_replica_mm_ops.cc:

Line 138: tablet_replica_->tablet()->rowsets_flush_sem_.unlock();
> Can you use an RAII-style guard for the acquisition forof rowsets_flush_sem
Hrm, I suppose it could pass rowsets_flush_sem_ ownership to some ScopedLock.


Line 196: max_idx_to_replay_size);
> Nit: indentation.
Done


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tserver/tserver.proto
File src/kudu/tserver/tserver.proto:

Line 97: // The tablet is in a failed state.
> TABLET_NOT_RUNNING is insufficiently descriptive?
This is here to indicate to the leader which tablets that need to be replicated 
(TABLET_FAILED). TABLET_NOT_RUNNING shouldn't trigger replication.


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/util/status.h
File src/kudu/util/status.h:

Line 433: inline bool IsDiskFailure(const Status& s) {
> Would be nice to add a comment to each explaining why it's included.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 

[kudu-CR] WIP disk failure: coordinate disk failure handling

2017-07-07 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7030

to look at the new patch set (#10).

Change subject: WIP disk failure: coordinate disk failure handling
..

WIP disk failure: coordinate disk failure handling

This patch adds the logic required to prevent a crash on disk failure.

Disk failure handling happens in a few places:
- block/container-level functions that call env functions that may
  result in disk failure can run callbacks to fail/shutdown tablets in
  the parent data dir
- tablet-level functions that CHECK for failures are ended early if the
  tablet is known to have data on a bad disk
- transactions can now be canceled to force a shutdown of a tablet
  replica instead of waiting for transactions to complete
- tablets in FAILED or the new FAILED_AND_SHUTDOWN state will trigger
  replication
- failure at startup (IO to instance files) is covered in a later patch

A set of basic tests are added in ts_disk_failure-itest.

TODO:
- crash if tablet metadata dir is bad

Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
M src/kudu/tablet/transactions/alter_schema_transaction.h
M src/kudu/tablet/transactions/transaction.h
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/transaction_tracker.h
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tserver/CMakeLists.txt
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
A src/kudu/tserver/ts_disk_failure-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
M src/kudu/util/status.h
32 files changed, 511 insertions(+), 101 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/7030/10
-- 
To view, visit http://gerrit.cloudera.org:8080/7030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] WIP disk failure: coordinate disk failure handling

2017-06-30 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: WIP disk failure: coordinate disk failure handling
..


Patch Set 9:

(23 comments)

I focused on the block manager and some MM ops. You should definitely get a 
review from Todd/David/Mike on the transaction changes, the tablet read-path 
changes, and the consensus changes.

http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/block_manager.h
File src/kudu/fs/block_manager.h:

Line 141:   virtual void HandleError(const Status& s, DataDir* dir) const = 0;
Why does HandleError() need to be part of the external API to WritableBlock and 
ReadableBlock? IIUC the macros correctly, it should be sufficient to expose 
them in LogReadableBlock/LogWritableBlock as non-virtual methods.


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/error_manager.h
File src/kudu/fs/error_manager.h:

Line 45:   RETURN_NOT_OK(s_); \
Wrong patch, but shouldn't this be just 'return s'? We've already established 
that s_ isn't OK.


Line 56:   RETURN_NOT_OK(s_); \
Same.


Line 61: #define HANDLE_DISK_FAILURE(status_expr, err_handler) do { \
Since this is so similar to RETURN_NOT_OK_HANDLE would prefer if the macro 
names were a little more similar. Maybe RETURN_NOT_OK_HANDLE_DISK_FAILURE and 
just HANDLE_DISK_FAILURE?


Line 64:   (err_handler); \
Nit: indentation


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

Line 370: if (PREDICT_FALSE(IsDiskFailure(sync))) {
Shouldn't these be HANDLE_DISK_FAILURE calls instead?


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 399:   void HandleError(const Status& s, DataDir* dir) const;
This isn't an accessor; bring it up into the other section and doc it.


Line 402:   friend class LogWritableBlock;
Don't need? HandleError() is public after all.


Line 498: Status LogBlockContainer::Create(LogBlockManager* block_manager,
What about disk failures during container creation?


Line 604:   // Open the existing metadata and data files for writing.
What about failures here?


Line 648:   
RETURN_NOT_OK(block_manager()->env()->NewRandomAccessFile(metadata_path, 
_reader));
What about failures here?


Line 656: read_status = pb_reader.ReadNextPB();
And here?


Line 858:   return metadata_file_->Append(pb);
Failures here?


Line 881: return metadata_file_->Sync();
What about this one?


Line 886: Status LogBlockContainer::ReopenMetadataWriter() {
And failures here?


Line 1090:   DataDir* mutable_data_dir() {
Where is this used?


Line 1305:   DataDir* mutable_data_dir() {
And this?


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/delta_tracker.cc
File src/kudu/tablet/delta_tracker.cc:

Line 696:   Status s = FlushDMS(old_dms.get(), , flush_type);
Shouldn't we just RETURN_NOT_OK here and drop the CHECK? Isn't the expectation 
that the MM op will figure out whether the failure is fatal or not? Why do we 
also have to consider it here?


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/metadata.proto
File src/kudu/tablet/metadata.proto:

Line 160:   // The Tablet has been stopped after a failure. In this state, like 
FAILED,
Why do we need a new state for this? Why isn't FAILED sufficient?


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/tablet_replica_mm_ops.cc
File src/kudu/tablet/tablet_replica_mm_ops.cc:

Line 138: tablet_replica_->tablet()->rowsets_flush_sem_.unlock();
Can you use an RAII-style guard for the acquisition forof rowsets_flush_sem_?


Line 196: max_idx_to_replay_size);
Nit: indentation.


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tserver/tserver.proto
File src/kudu/tserver/tserver.proto:

Line 97: // The tablet is in a failed state.
TABLET_NOT_RUNNING is insufficiently descriptive?


http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/util/status.h
File src/kudu/util/status.h:

Line 433: inline bool IsDiskFailure(const Status& s) {
Would be nice to add a comment to each explaining why it's included.


-- 
To view, visit http://gerrit.cloudera.org:8080/7030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] WIP disk failure: coordinate disk failure handling

2017-06-23 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: WIP disk failure: coordinate disk failure handling
..


Patch Set 8:

(10 comments)

This patch is getting a bit big and there are a few pieces that can be 
separated. Will break it apart.

http://gerrit.cloudera.org:8080/#/c/7030/3/src/kudu/cfile/cfile_reader.cc
File src/kudu/cfile/cfile_reader.cc:

PS3, Line 184: RETURN_NOT_OK(block_->Read(0, ));
> don't quite follow this. if the read fails with an EIO will this continue t
This acts like a RETURN_NOT_OK with an extra step to check the posix code.

You're right though, they shouldn't be in this layer; removed.


http://gerrit.cloudera.org:8080/#/c/7030/4/src/kudu/fs/fs-test-util.h
File src/kudu/fs/fs-test-util.h:

Line 86:   void HandleError(const Status& /* s */, DataDir* /* dir */) const 
override {
> warning: parameter 'dir' is unused [misc-unused-parameters]
Done


Line 86:   void HandleError(const Status& /* s */, DataDir* /* dir */) const 
override {
> warning: parameter 's' is unused [misc-unused-parameters]
Done


Line 87:   }
> warning: redundant return statement at the end of a function with a void re
Done


http://gerrit.cloudera.org:8080/#/c/7030/4/src/kudu/tablet/tablet.cc
File src/kudu/tablet/tablet.cc:

Line 851:   // a batch which contains some duplicate keys) we need to do so now.
> warning: do not use 'else' after 'return' [readability-else-after-return]
Done


http://gerrit.cloudera.org:8080/#/c/7030/4/src/kudu/tserver/ts_disk_failure-test.cc
File src/kudu/tserver/ts_disk_failure-test.cc:

Line 9: #include "kudu/gutil/strings/substitute.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


Line 28: using strings::Substitute;
> warning: using decl 'set' is unused [misc-unused-using-decls]
Done


Line 34: class TSDiskFailureTest : public TabletServerTestBase {
> warning: using decl 'RpcController' is unused [misc-unused-using-decls]
Done


Line 36:   virtual void SetUp() override {
> warning: using decl 'TabletReplica' is unused [misc-unused-using-decls]
Done


Line 37: TabletServerTestBase::SetUp();
> warning: using decl 'TabletStatePB' is unused [misc-unused-using-decls]
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] WIP disk failure: coordinate disk failure handling

2017-06-22 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7030

to look at the new patch set (#8).

Change subject: WIP disk failure: coordinate disk failure handling
..

WIP disk failure: coordinate disk failure handling

This patch adds the logic required to prevent a crash on disk failure.

Disk failure handling happens in a few places:
- block/container-level functions that call env functions that may result in
disk failure can run callbacks to fail/shutdown tablets in the parent data dir
- tablet-level functions that CHECK for failures are ended early if the tablet
is known to have data on a bad disk
- transactions can now be canceled to force a shutdown of a tablet replica
- tablets in FAILED or the new FAILED_AND_SHUTDOWN state will trigger
replication

A set of basic tests are added in ts_disk_failure-itest.

TODO:
- crash if tablet metadata dir is bad
- don't crash if a disk can't be read at startup and we try to
  CheckIntegrity, but it's missing

Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/fs-test-util.h
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
M src/kudu/tablet/transactions/alter_schema_transaction.h
M src/kudu/tablet/transactions/transaction.h
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/transaction_tracker.h
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tserver/CMakeLists.txt
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
A src/kudu/tserver/ts_disk_failure-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
M src/kudu/util/status.h
35 files changed, 487 insertions(+), 79 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/7030/8
-- 
To view, visit http://gerrit.cloudera.org:8080/7030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] WIP disk failure: coordinate disk failure handling

2017-06-21 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7030

to look at the new patch set (#6).

Change subject: WIP disk failure: coordinate disk failure handling
..

WIP disk failure: coordinate disk failure handling

This patch adds the logic required to prevent a crash on disk failure.
Checks that previously depended on successful disk IO now yield if the
returned status' POSIX code matches one corresponding to disk failure.

For the most part, failure handling is done by the lowest abstraction to
touch disk: blocks and containers.

A set of basic tests are added in ts_disk_failure-itest.

TODO:
- crash if tablet metadata dir is bad
- don't crash if a disk can't be read at startup and we try to
  CheckIntegrity, but it's missing

Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/fs-test-util.h
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
M src/kudu/tablet/transactions/alter_schema_transaction.h
M src/kudu/tablet/transactions/transaction.h
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/transaction_tracker.h
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tserver/CMakeLists.txt
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
A src/kudu/tserver/ts_disk_failure-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
M src/kudu/util/status.h
35 files changed, 488 insertions(+), 80 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/7030/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] WIP disk failure: coordinate disk failure handling

2017-06-20 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7030

to look at the new patch set (#5).

Change subject: WIP disk failure: coordinate disk failure handling
..

WIP disk failure: coordinate disk failure handling

This patch adds the logic required to prevent a crash on disk failure.
Checks that previously depended on successful disk IO now yield if the
returned status' POSIX code matches one corresponding to disk failure.

For the most part, failure handling is done by the lowest abstraction to
touch disk: blocks and containers.

A set of basic tests are added in ts_disk_failure-itest.

TODO:
- crash if tablet metadata dir is bad
- don't crash if a disk can't be read at startup and we try to
  CheckIntegrity, but it's missing

Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/fs/block_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/fs-test-util.h
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_metadata.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tserver/CMakeLists.txt
M src/kudu/tserver/tablet_service.cc
A src/kudu/tserver/ts_disk_failure-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
24 files changed, 408 insertions(+), 72 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/7030/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] WIP disk failure: coordinate disk failure handling

2017-06-16 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7030

to look at the new patch set (#4).

Change subject: WIP disk failure: coordinate disk failure handling
..

WIP disk failure: coordinate disk failure handling

This patch adds the logic required to prevent failure on disk failure.
Checks that previously depended on successful disk IO now yield if the
returned status' POSIX code matches one corresponding to disk failure.

For the most part, failure handling is done by the lowest abstraction to
touch disk: blocks and containers.

A set of tests are added in ts_disk_failure-itest.

TODO:
- crash if tablet metadata dir is bad
- don't crash if a disk can't be read at startup and we try to
  CheckIntegrity, but it's missing

Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
---
M src/kudu/fs/block_manager.h
M src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/fs-test-util.h
M src/kudu/fs/log_block_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/diskrowset.cc
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tserver/CMakeLists.txt
A src/kudu/tserver/ts_disk_failure-test.cc
14 files changed, 327 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/7030/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon