Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/7270 to look at the new patch set (#16).

Change subject: disk failure: add persistent disk states
......................................................................

disk failure: add persistent disk states

This patch adds a new version of the path set so that failed disks are
not used the next time the server is run. A previously failed disk may
continue to produce IOErrors after restart, but we may still want to
start the server without the disk.

To accomplish this, disk states, path names, and a timestamp are added
to the path set. Additionally, if any disks are failed when loading
instances from disk, an 'unhealthy' instance with no path set metadata
will be created instead of failing the startup process.

CheckIntegrity() is updated to accommodate this. Rather than comparing
all instances against an agreed-upon set of UUIDs, the single path set
with the largest timestamp is used as the main one to compare against
(if only old path sets are available, the first healthy one is used).
If no instances are healthy, CheckIntegrity() will fail, as all disks
are failed.

Additionally, the notion of an unhealthy instance is added to allow
startup in the presence of disk failures, e.g. in failing to
canonicalize or in failing to read a path instance.

The main path set is checked to ensure its integrity (e.g. no
duplicates), after which it is upgraded with the extra metadata if
needed. It is then checked to ensure it is consistent with the healthy
instances (e.g. having the right UUIDs and paths).

Testing is done in a new iteration of CheckIntegrity(). Further testing
is done in data_dirs-test to ensure the directory manager can be opened
with failed disks. Testing is also added to fs_manager-test to ensure
the FS layout can be loaded with a failed directory.

Some notes:
- Disk failures during FS layout creation are not tolerated.
  In these cases, there is presumably no data on the server anyway, so
  operators should easily be able to fix their cluster.
- In the case of a server restart where all healthy disks fail to start
  up and some known failed disks start working again, the server will
  successfully start up with the bad disks (which may have
  partially-written data).
- If there are any unhealthy instances when upgrading the path sets
  (i.e. adding disk states, paths, timestamp), a complete mapping of
  UUIDs to paths will not be available, and CheckIntegrity() will fail.
- The main path set's disk states are updated to reflect the failure of
  the instances. Since data on a failed disk cannot be trusted, disks
  that are successfully read from but are already marked FAILED by the
  main path set are not marked HEALTHY.

This patch is a part of a series of patches to handle disk failures.
To see how this fits, see 2.6 in:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ifddf0817fe1a82044077f5544c400c88de20769f
---
M src/kudu/fs/block_manager_util-test.cc
M src/kudu/fs/block_manager_util.cc
M src/kudu/fs/block_manager_util.h
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/fs.proto
M src/kudu/fs/fs_manager-test.cc
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
12 files changed, 1,200 insertions(+), 272 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/7270/16
--
To view, visit http://gerrit.cloudera.org:8080/7270
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ifddf0817fe1a82044077f5544c400c88de20769f
Gerrit-PatchSet: 16
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>