Mike Percy has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
......................................................................

Patch Set 3:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/7766/3//COMMIT_MSG
Commit Message:

PS3, Line 21: Testing is done by loading data into a cluster with multi-disk
            : servers, failing a single directory of one of the servers, and ensuring
            : that the tablets spread across the failed disk get replicated upon the
            : next startup.
How about:

  Testing is done by loading data into a cluster configured to use multiple
  directories for data blocks, failing a single directory on one of the tablet
  servers, and ensuring that the tablets with blocks on the failed directory
  get re-replicated at startup time.


http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 1702:   return Status::OK();
I'm not sure why we are returning OK here. Also, new API semantics should be documented at the interface level.


http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

PS3, Line 43: TabletServerIntegrationTestBase
Would you mind inheriting from ExternalMiniClusterITestBase instead in this class? The newer tests inherit from that.


PS3, Line 96: server
a tablet server


PS3, Line 97: server
tablet server


PS3, Line 98: .
while it is shut down.


Line 109:   write_workload.Setup();
This creates a table. Why are you creating the table yourself above?


Line 110:   write_workload.Start();
You should call write_workload.StopAndJoin() at some point during the test to shut the writer thread down again. Did you want it running this whole time?


PS3, Line 114: WaitForTSAndReplicas
What is the purpose of calling this function?


PS3, Line 124: NO_FATALS(SetServerSurvivalFlags(ext_tservers));
> why is this not set on boot?
agree


http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 765:     LOG(ERROR) << "Exiting bootstrapping early; tablet is in a failed directory";
how about:

  LOG(ERROR) << LogPrefix(tablet_id) << "aborting tablet bootstrap: tablet has data in a failed directory";


-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes