[kudu-CR] WIP disk failure: coordinate disk failure handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7030 to look at the new patch set (#11). Change subject: WIP disk failure: coordinate disk failure handling .. WIP disk failure: coordinate disk failure handling This patch adds the logic required to prevent a crash on disk failure. Disk failure handling happens in a few places: - block/container-level functions that call env functions that may result in disk failure can run callbacks to fail/shutdown tablets in the parent data dir - tablet-level functions that CHECK for failures are ended early if the tablet is known to have data on a bad disk - transactions can now be canceled to force a shutdown of a tablet replica instead of waiting for transactions to complete - tablets in FAILED or the new FAILED_AND_SHUTDOWN state will trigger replication - scans that run into a disk failure will return with the new TABLET_FAILED state, or find another server if fault tolerant - failure at startup (IO to instance files) is covered in a later patch A set of basic tests are added in ts_disk_failure-itest. TODO: - crash if tablet metadata dir is bad and when opening the tablet metadata Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c --- M src/kudu/client/scanner-internal.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/fs/data_dirs.h M src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/log_block_manager.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/delta_tracker.cc M src/kudu/tablet/metadata.proto M src/kudu/tablet/mvcc.cc M src/kudu/tablet/mvcc.h M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet.h M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tablet/tablet_replica_mm_ops.cc M src/kudu/tablet/transactions/alter_schema_transaction.h M src/kudu/tablet/transactions/transaction.h M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tablet/transactions/transaction_driver.h M src/kudu/tablet/transactions/transaction_tracker-test.cc M src/kudu/tablet/transactions/transaction_tracker.cc M src/kudu/tablet/transactions/transaction_tracker.h M src/kudu/tablet/transactions/write_transaction.cc M src/kudu/tablet/transactions/write_transaction.h M src/kudu/tools/kudu-tool-test.cc M src/kudu/tserver/CMakeLists.txt M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc A src/kudu/tserver/ts_disk_failure-test.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h M src/kudu/tserver/tserver.proto M src/kudu/util/status.h 35 files changed, 585 insertions(+), 119 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/7030/11 -- To view, visit http://gerrit.cloudera.org:8080/7030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c Gerrit-PatchSet: 11 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] WIP disk failure: coordinate disk failure handling
Andrew Wong has posted comments on this change. Change subject: WIP disk failure: coordinate disk failure handling .. Patch Set 9: (23 comments) http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/block_manager.h File src/kudu/fs/block_manager.h: Line 141: virtual void HandleError(const Status& s, DataDir* dir) const = 0; > Why does HandleError() need to be part of the external API to WritableBlock Done http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: Line 45: RETURN_NOT_OK(s_); \ > Wrong patch, but shouldn't this be just 'return s'? We've already establish Done Line 56: RETURN_NOT_OK(s_); \ > Same. Done Line 61: #define HANDLE_DISK_FAILURE(status_expr, err_handler) do { \ > Since this is so similar to RETURN_NOT_OK_HANDLE would prefer if the macro I would agree, but RETURN_NOT_OK_HANDLE and RETURN_NOT_OK_HANDLE_ERROR are both non-specific to disk failures. Line 64: (err_handler); \ > Nit: indentation Done http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/file_block_manager.cc File src/kudu/fs/file_block_manager.cc: Line 370: if (PREDICT_FALSE(IsDiskFailure(sync))) { > Shouldn't these be HANDLE_DISK_FAILURE calls instead? HandleError() should be sufficient (without the PREDICT_FALSE here) http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/log_block_manager.cc File src/kudu/fs/log_block_manager.cc: Line 399: void HandleError(const Status& s, DataDir* dir) const; > This isn't an accessor; bring it up into the other section and doc it. Done Line 402: friend class LogWritableBlock; > Don't need? HandleError() is public after all. Done Line 498: Status LogBlockContainer::Create(LogBlockManager* block_manager, > What about disk failures during container creation? Since this is a static method, the handling wraps the Create() method. With the right input to Create(), this could also be handled in the Create() call itself to better align with the idea of only wrapping env calls. Line 604: // Open the existing metadata and data files for writing. > What about failures here? Done Line 648: RETURN_NOT_OK(block_manager()->env()->NewRandomAccessFile(metadata_path, _reader)); > What about failures here? Done Line 656: read_status = pb_reader.ReadNextPB(); > And here? Done Line 858: return metadata_file_->Append(pb); > Failures here? Done Line 881: return metadata_file_->Sync(); > What about this one? Done Line 886: Status LogBlockContainer::ReopenMetadataWriter() { > And failures here? Done Line 1090: DataDir* mutable_data_dir() { > Where is this used? It was used in calls to HandleError (probably should've been private!). Since HandleError() doesn't need a specified dir, I've taken it out. Line 1305: DataDir* mutable_data_dir() { > And this? Same as above. http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/delta_tracker.cc File src/kudu/tablet/delta_tracker.cc: Line 696: Status s = FlushDMS(old_dms.get(), , flush_type); > Shouldn't we just RETURN_NOT_OK here and drop the CHECK? Isn't the expectat Done http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/metadata.proto File src/kudu/tablet/metadata.proto: Line 160: // The Tablet has been stopped after a failure. In this state, like FAILED, > Why do we need a new state for this? Why isn't FAILED sufficient? When running into a disk failure, there's eventually a call to TabletReplica::Shutdown(). The final state of this has indicate that the tablet is also shutdown (to ensure Shutdown() remains idempotent). http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/tablet_replica_mm_ops.cc File src/kudu/tablet/tablet_replica_mm_ops.cc: Line 138: tablet_replica_->tablet()->rowsets_flush_sem_.unlock(); > Can you use an RAII-style guard for the acquisition forof rowsets_flush_sem Hrm, I suppose it could pass rowsets_flush_sem_ ownership to some ScopedLock. Line 196: max_idx_to_replay_size); > Nit: indentation. Done http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tserver/tserver.proto File src/kudu/tserver/tserver.proto: Line 97: // The tablet is in a failed state. > TABLET_NOT_RUNNING is insufficiently descriptive? This is here to indicate to the leader which tablets that need to be replicated (TABLET_FAILED). TABLET_NOT_RUNNING shouldn't trigger replication. http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/util/status.h File src/kudu/util/status.h: Line 433: inline bool IsDiskFailure(const Status& s) { > Would be nice to add a comment to each explaining why it's included. Done -- To view, visit http://gerrit.cloudera.org:8080/7030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c Gerrit-PatchSet: 9 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong
[kudu-CR] WIP disk failure: coordinate disk failure handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7030 to look at the new patch set (#10). Change subject: WIP disk failure: coordinate disk failure handling .. WIP disk failure: coordinate disk failure handling This patch adds the logic required to prevent a crash on disk failure. Disk failure handling happens in a few places: - block/container-level functions that call env functions that may result in disk failure can run callbacks to fail/shutdown tablets in the parent data dir - tablet-level functions that CHECK for failures are ended early if the tablet is known to have data on a bad disk - transactions can now be canceled to force a shutdown of a tablet replica instead of waiting for transactions to complete - tablets in FAILED or the new FAILED_AND_SHUTDOWN state will trigger replication - failure at startup (IO to instance files) is covered in a later patch A set of basic tests are added in ts_disk_failure-itest. TODO: - crash if tablet metadata dir is bad Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c --- M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/log_block_manager.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/delta_tracker.cc M src/kudu/tablet/metadata.proto M src/kudu/tablet/mvcc.cc M src/kudu/tablet/mvcc.h M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet.h M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tablet/tablet_replica_mm_ops.cc M src/kudu/tablet/transactions/alter_schema_transaction.h M src/kudu/tablet/transactions/transaction.h M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tablet/transactions/transaction_driver.h M src/kudu/tablet/transactions/transaction_tracker-test.cc M src/kudu/tablet/transactions/transaction_tracker.cc M src/kudu/tablet/transactions/transaction_tracker.h M src/kudu/tablet/transactions/write_transaction.cc M src/kudu/tablet/transactions/write_transaction.h M src/kudu/tools/kudu-tool-test.cc M src/kudu/tserver/CMakeLists.txt M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc A src/kudu/tserver/ts_disk_failure-test.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto M src/kudu/util/status.h 32 files changed, 511 insertions(+), 101 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/7030/10 -- To view, visit http://gerrit.cloudera.org:8080/7030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c Gerrit-PatchSet: 10 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] WIP disk failure: coordinate disk failure handling
Adar Dembo has posted comments on this change. Change subject: WIP disk failure: coordinate disk failure handling .. Patch Set 9: (23 comments) I focused on the block manager and some MM ops. You should definitely get a review from Todd/David/Mike on the transaction changes, the tablet read-path changes, and the consensus changes. http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/block_manager.h File src/kudu/fs/block_manager.h: Line 141: virtual void HandleError(const Status& s, DataDir* dir) const = 0; Why does HandleError() need to be part of the external API to WritableBlock and ReadableBlock? IIUC the macros correctly, it should be sufficient to expose them in LogReadableBlock/LogWritableBlock as non-virtual methods. http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/error_manager.h File src/kudu/fs/error_manager.h: Line 45: RETURN_NOT_OK(s_); \ Wrong patch, but shouldn't this be just 'return s'? We've already established that s_ isn't OK. Line 56: RETURN_NOT_OK(s_); \ Same. Line 61: #define HANDLE_DISK_FAILURE(status_expr, err_handler) do { \ Since this is so similar to RETURN_NOT_OK_HANDLE would prefer if the macro names were a little more similar. Maybe RETURN_NOT_OK_HANDLE_DISK_FAILURE and just HANDLE_DISK_FAILURE? Line 64: (err_handler); \ Nit: indentation http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/file_block_manager.cc File src/kudu/fs/file_block_manager.cc: Line 370: if (PREDICT_FALSE(IsDiskFailure(sync))) { Shouldn't these be HANDLE_DISK_FAILURE calls instead? http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/fs/log_block_manager.cc File src/kudu/fs/log_block_manager.cc: Line 399: void HandleError(const Status& s, DataDir* dir) const; This isn't an accessor; bring it up into the other section and doc it. Line 402: friend class LogWritableBlock; Don't need? HandleError() is public after all. Line 498: Status LogBlockContainer::Create(LogBlockManager* block_manager, What about disk failures during container creation? Line 604: // Open the existing metadata and data files for writing. What about failures here? Line 648: RETURN_NOT_OK(block_manager()->env()->NewRandomAccessFile(metadata_path, _reader)); What about failures here? Line 656: read_status = pb_reader.ReadNextPB(); And here? Line 858: return metadata_file_->Append(pb); Failures here? Line 881: return metadata_file_->Sync(); What about this one? Line 886: Status LogBlockContainer::ReopenMetadataWriter() { And failures here? Line 1090: DataDir* mutable_data_dir() { Where is this used? Line 1305: DataDir* mutable_data_dir() { And this? http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/delta_tracker.cc File src/kudu/tablet/delta_tracker.cc: Line 696: Status s = FlushDMS(old_dms.get(), , flush_type); Shouldn't we just RETURN_NOT_OK here and drop the CHECK? Isn't the expectation that the MM op will figure out whether the failure is fatal or not? Why do we also have to consider it here? http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/metadata.proto File src/kudu/tablet/metadata.proto: Line 160: // The Tablet has been stopped after a failure. In this state, like FAILED, Why do we need a new state for this? Why isn't FAILED sufficient? http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tablet/tablet_replica_mm_ops.cc File src/kudu/tablet/tablet_replica_mm_ops.cc: Line 138: tablet_replica_->tablet()->rowsets_flush_sem_.unlock(); Can you use an RAII-style guard for the acquisition forof rowsets_flush_sem_? Line 196: max_idx_to_replay_size); Nit: indentation. http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/tserver/tserver.proto File src/kudu/tserver/tserver.proto: Line 97: // The tablet is in a failed state. TABLET_NOT_RUNNING is insufficiently descriptive? http://gerrit.cloudera.org:8080/#/c/7030/9/src/kudu/util/status.h File src/kudu/util/status.h: Line 433: inline bool IsDiskFailure(const Status& s) { Would be nice to add a comment to each explaining why it's included. -- To view, visit http://gerrit.cloudera.org:8080/7030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c Gerrit-PatchSet: 9 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] WIP disk failure: coordinate disk failure handling
Andrew Wong has posted comments on this change. Change subject: WIP disk failure: coordinate disk failure handling .. Patch Set 8: (10 comments) This patch is getting a bit big and there are a few pieces that can be separated. Will break it apart. http://gerrit.cloudera.org:8080/#/c/7030/3/src/kudu/cfile/cfile_reader.cc File src/kudu/cfile/cfile_reader.cc: PS3, Line 184: RETURN_NOT_OK(block_->Read(0, )); > don't quite follow this. if the read fails with an EIO will this continue t This acts like a RETURN_NOT_OK with an extra step to check the posix code. You're right though, they shouldn't be in this layer; removed. http://gerrit.cloudera.org:8080/#/c/7030/4/src/kudu/fs/fs-test-util.h File src/kudu/fs/fs-test-util.h: Line 86: void HandleError(const Status& /* s */, DataDir* /* dir */) const override { > warning: parameter 'dir' is unused [misc-unused-parameters] Done Line 86: void HandleError(const Status& /* s */, DataDir* /* dir */) const override { > warning: parameter 's' is unused [misc-unused-parameters] Done Line 87: } > warning: redundant return statement at the end of a function with a void re Done http://gerrit.cloudera.org:8080/#/c/7030/4/src/kudu/tablet/tablet.cc File src/kudu/tablet/tablet.cc: Line 851: // a batch which contains some duplicate keys) we need to do so now. > warning: do not use 'else' after 'return' [readability-else-after-return] Done http://gerrit.cloudera.org:8080/#/c/7030/4/src/kudu/tserver/ts_disk_failure-test.cc File src/kudu/tserver/ts_disk_failure-test.cc: Line 9: #include "kudu/gutil/strings/substitute.h" > warning: #includes are not sorted properly [llvm-include-order] Done Line 28: using strings::Substitute; > warning: using decl 'set' is unused [misc-unused-using-decls] Done Line 34: class TSDiskFailureTest : public TabletServerTestBase { > warning: using decl 'RpcController' is unused [misc-unused-using-decls] Done Line 36: virtual void SetUp() override { > warning: using decl 'TabletReplica' is unused [misc-unused-using-decls] Done Line 37: TabletServerTestBase::SetUp(); > warning: using decl 'TabletStatePB' is unused [misc-unused-using-decls] Done -- To view, visit http://gerrit.cloudera.org:8080/7030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] WIP disk failure: coordinate disk failure handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7030 to look at the new patch set (#8). Change subject: WIP disk failure: coordinate disk failure handling .. WIP disk failure: coordinate disk failure handling This patch adds the logic required to prevent a crash on disk failure. Disk failure handling happens in a few places: - block/container-level functions that call env functions that may result in disk failure can run callbacks to fail/shutdown tablets in the parent data dir - tablet-level functions that CHECK for failures are ended early if the tablet is known to have data on a bad disk - transactions can now be canceled to force a shutdown of a tablet replica - tablets in FAILED or the new FAILED_AND_SHUTDOWN state will trigger replication A set of basic tests are added in ts_disk_failure-itest. TODO: - crash if tablet metadata dir is bad - don't crash if a disk can't be read at startup and we try to CheckIntegrity, but it's missing Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c --- M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/fs/block_manager.h M src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/fs-test-util.h M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/delta_tracker.cc M src/kudu/tablet/metadata.proto M src/kudu/tablet/mvcc.cc M src/kudu/tablet/mvcc.h M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet.h M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tablet/tablet_replica_mm_ops.cc M src/kudu/tablet/transactions/alter_schema_transaction.h M src/kudu/tablet/transactions/transaction.h M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tablet/transactions/transaction_driver.h M src/kudu/tablet/transactions/transaction_tracker-test.cc M src/kudu/tablet/transactions/transaction_tracker.cc M src/kudu/tablet/transactions/transaction_tracker.h M src/kudu/tablet/transactions/write_transaction.cc M src/kudu/tablet/transactions/write_transaction.h M src/kudu/tools/kudu-tool-test.cc M src/kudu/tserver/CMakeLists.txt M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc A src/kudu/tserver/ts_disk_failure-test.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto M src/kudu/util/status.h 35 files changed, 487 insertions(+), 79 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/7030/8 -- To view, visit http://gerrit.cloudera.org:8080/7030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] WIP disk failure: coordinate disk failure handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7030 to look at the new patch set (#6). Change subject: WIP disk failure: coordinate disk failure handling .. WIP disk failure: coordinate disk failure handling This patch adds the logic required to prevent a crash on disk failure. Checks that previously depended on successful disk IO now yield if the returned status' POSIX code matches one corresponding to disk failure. For the most part, failure handling is done by the lowest abstraction to touch disk: blocks and containers. A set of basic tests are added in ts_disk_failure-itest. TODO: - crash if tablet metadata dir is bad - don't crash if a disk can't be read at startup and we try to CheckIntegrity, but it's missing Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c --- M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/fs/block_manager.h M src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/fs-test-util.h M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/delta_tracker.cc M src/kudu/tablet/metadata.proto M src/kudu/tablet/mvcc.cc M src/kudu/tablet/mvcc.h M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet.h M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tablet/tablet_replica_mm_ops.cc M src/kudu/tablet/transactions/alter_schema_transaction.h M src/kudu/tablet/transactions/transaction.h M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tablet/transactions/transaction_driver.h M src/kudu/tablet/transactions/transaction_tracker-test.cc M src/kudu/tablet/transactions/transaction_tracker.cc M src/kudu/tablet/transactions/transaction_tracker.h M src/kudu/tablet/transactions/write_transaction.cc M src/kudu/tablet/transactions/write_transaction.h M src/kudu/tools/kudu-tool-test.cc M src/kudu/tserver/CMakeLists.txt M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc A src/kudu/tserver/ts_disk_failure-test.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto M src/kudu/util/status.h 35 files changed, 488 insertions(+), 80 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/7030/6 -- To view, visit http://gerrit.cloudera.org:8080/7030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] WIP disk failure: coordinate disk failure handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7030 to look at the new patch set (#5). Change subject: WIP disk failure: coordinate disk failure handling .. WIP disk failure: coordinate disk failure handling This patch adds the logic required to prevent a crash on disk failure. Checks that previously depended on successful disk IO now yield if the returned status' POSIX code matches one corresponding to disk failure. For the most part, failure handling is done by the lowest abstraction to touch disk: blocks and containers. A set of basic tests are added in ts_disk_failure-itest. TODO: - crash if tablet metadata dir is bad - don't crash if a disk can't be read at startup and we try to CheckIntegrity, but it's missing Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c --- M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/fs/block_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/fs-test-util.h M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/delta_tracker.cc M src/kudu/tablet/metadata.proto M src/kudu/tablet/mvcc.cc M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet.h M src/kudu/tablet/tablet_metadata.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tablet/tablet_replica_mm_ops.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tablet/transactions/write_transaction.cc M src/kudu/tserver/CMakeLists.txt M src/kudu/tserver/tablet_service.cc A src/kudu/tserver/ts_disk_failure-test.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto 24 files changed, 408 insertions(+), 72 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/7030/5 -- To view, visit http://gerrit.cloudera.org:8080/7030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] WIP disk failure: coordinate disk failure handling
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7030 to look at the new patch set (#4). Change subject: WIP disk failure: coordinate disk failure handling .. WIP disk failure: coordinate disk failure handling This patch adds the logic required to prevent failure on disk failure. Checks that previously depended on successful disk IO now yield if the returned status' POSIX code matches one corresponding to disk failure. For the most part, failure handling is done by the lowest abstraction to touch disk: blocks and containers. A set of tests are added in ts_disk_failure-itest. TODO: - crash if tablet metadata dir is bad - don't crash if a disk can't be read at startup and we try to CheckIntegrity, but it's missing Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c --- M src/kudu/fs/block_manager.h M src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/fs-test-util.h M src/kudu/fs/log_block_manager.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/diskrowset.cc M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet.h M src/kudu/tablet/tablet_replica_mm_ops.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tablet/transactions/write_transaction.cc M src/kudu/tserver/CMakeLists.txt A src/kudu/tserver/ts_disk_failure-test.cc 14 files changed, 327 insertions(+), 55 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/7030/4 -- To view, visit http://gerrit.cloudera.org:8080/7030 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ia03bfb711a1b022d7516f4adb37fe9fb28ec949c Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon