[kudu-CR] open FS layout in presence of disk failure
Mike Percy has submitted this change and it was merged. Change subject: open FS layout in presence of disk failure .. open FS layout in presence of disk failure Currently, if a Kudu server starts up with a failed disk, the server will crash. There are a number of reasons for this, but the pressing one is that the path instance files may not be readable, meaning the directories' UUIDs may not be available. This patch changes this by introducing an "unhealthy" instance in-memory for each instance that fails to lock, load, canonicalize, etc. Such instances are ignored when it comes to checking the integrity of the FS layout, and are simply marked failed by the directory manager. Testing is done in data_dirs-test, log_block_manager-test, and fs_manager-test to ensure failed directories do not impede the managers' startups. Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Reviewed-on: http://gerrit.cloudera.org:8080/7784 Reviewed-by: Alexey Serbin Tested-by: Kudu Jenkins Reviewed-by: Mike Percy Reviewed-by: David Ribeiro Alves --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager_util-test.cc M src/kudu/fs/block_manager_util.cc M src/kudu/fs/block_manager_util.h M src/kudu/fs/data_dirs-test.cc M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h M src/kudu/fs/fs_manager-test.cc M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc 13 files changed, 645 insertions(+), 173 deletions(-) Approvals: David Ribeiro Alves: Looks good to me, approved Mike Percy: Looks good to me, approved Alexey Serbin: Looks good to me, but someone else must approve Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot
[kudu-CR] open FS layout in presence of disk failure
David Ribeiro Alves has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 6: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: No
[kudu-CR] open FS layout in presence of disk failure
Mike Percy has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 6: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: No
[kudu-CR] open FS layout in presence of disk failure
Mike Percy has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 6: I retriggered the build -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: No
[kudu-CR] open FS layout in presence of disk failure
Andrew Wong has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 6: Failure seems to be a bunch of clock sync errors in DEBUG mode. -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: No
[kudu-CR] open FS layout in presence of disk failure
Alexey Serbin has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 6: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: No
[kudu-CR] open FS layout in presence of disk failure
Mike Percy has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 6: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: No
[kudu-CR] open FS layout in presence of disk failure
Andrew Wong has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 6: (6 comments) http://gerrit.cloudera.org:8080/#/c/7784/5/src/kudu/fs/block_manager_util.cc File src/kudu/fs/block_manager_util.cc: PS5, Line 68: > nit: this is not needed, right? Usually those macros are used as Done http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.h File src/kudu/fs/block_manager_util.h: PS4, Line 56: // On success, either 'metadata_' is overwritten with the contents of the : // file, or, in the case of disk failure, returns OK and sets : // 'health_status_' to a non-OK value. : Status LoadFromDisk(); > I'm not sure I fully agree. For one the error message is confusing: "LOG(ER Hrm, I'd prefer to keep this internal. If anything of the PIMF fails, it gets set to failed. I'll improve the error message though. PS4, Line 56: // On success, either 'metadata_' is overwritten with the contents of the : // file, or, in the case of disk failure, returns OK and sets : // 'health_status_' to a non-OK value. : Status LoadFromDisk(); > Andrew and I previously discussed this approach. You can see our comments h Thanks for resurfacing this. http://gerrit.cloudera.org:8080/#/c/7784/5/src/kudu/fs/data_dirs-test.cc File src/kudu/fs/data_dirs-test.cc: PS5, Line 244: Status s = dd_manager_->MarkDataDirFailed(kNumDirs - 1); : ASSERT_STR_CONTAINS(s.ToString(), "All data dirs have failed"); > As I see the call to CreateDataDirGroup() is removed. Is that correct? I Ah, good point, I suppose I can add it in again. Done PS5, Line 363: // manager. : KuduTest::SetUp(); : } : > nit: consider making this method returning Status and then using ASSERT_OK( Done http://gerrit.cloudera.org:8080/#/c/7784/5/src/kudu/fs/log_block_manager-test.cc File src/kudu/fs/log_block_manager-test.cc: PS5, Line 1315: ASSERT_NE(data_dir, test_dirs[failed_idx]); > paranoid nit: does it make sense to check that the failed dir is exactly te Sure, although there is test coverage for this in data_dirs-test -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
Hello Mike Percy, Adar Dembo, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7784 to look at the new patch set (#6). Change subject: open FS layout in presence of disk failure .. open FS layout in presence of disk failure Currently, if a Kudu server starts up with a failed disk, the server will crash. There are a number of reasons for this, but the pressing one is that the path instance files may not be readable, meaning the directories' UUIDs may not be available. This patch changes this by introducing an "unhealthy" instance in-memory for each instance that fails to lock, load, canonicalize, etc. Such instances are ignored when it comes to checking the integrity of the FS layout, and are simply marked failed by the directory manager. Testing is done in data_dirs-test, log_block_manager-test, and fs_manager-test to ensure failed directories do not impede the managers' startups. Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager_util-test.cc M src/kudu/fs/block_manager_util.cc M src/kudu/fs/block_manager_util.h M src/kudu/fs/data_dirs-test.cc M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h M src/kudu/fs/fs_manager-test.cc M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc 13 files changed, 645 insertions(+), 173 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/84/7784/6 -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot
[kudu-CR] open FS layout in presence of disk failure
David Ribeiro Alves has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.h File src/kudu/fs/block_manager_util.h: PS4, Line 56: // On success, either 'metadata_' is overwritten with the contents of the : // file, or, in the case of disk failure, returns OK and sets : // 'health_status_' to a non-OK value. : Status LoadFromDisk(); > Andrew and I previously discussed this approach. You can see our comments h Even if internalizing the failure would be ok (which I still think it isn't, though I don't feel super strong about it), the error message that we're ignoring the failure is still wrong, since we arent' -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
Alexey Serbin has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/7784/5/src/kudu/fs/log_block_manager-test.cc File src/kudu/fs/log_block_manager-test.cc: PS5, Line 1315: ASSERT_EQ(1, dd_manager_->GetFailedDataDirs().size()); paranoid nit: does it make sense to check that the failed dir is exactly test_dirs[failed_idx] ? -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
Alexey Serbin has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 5: (3 comments) http://gerrit.cloudera.org:8080/#/c/7784/5/src/kudu/fs/block_manager_util.cc File src/kudu/fs/block_manager_util.cc: PS5, Line 68: ; nit: this is not needed, right? Usually those macros are used as RETURN_NOT_OK_FAIL_INSTANCE_PREPEND(...); so the semicolon is added at the call site. http://gerrit.cloudera.org:8080/#/c/7784/5/src/kudu/fs/data_dirs-test.cc File src/kudu/fs/data_dirs-test.cc: PS5, Line 244: Status s = dd_manager_->MarkDataDirFailed(kNumDirs - 1); : ASSERT_STR_CONTAINS(s.ToString(), "All data dirs have failed"); As I see the call to CreateDataDirGroup() is removed. Is that correct? I mean, I would expect to see an attempt to do something with the object after marking all directories as failed (that looks like just a preparation for some meaningful action). PS5, Line 363: void OpenDataDirManager() { : ASSERT_OK(DataDirManager::OpenExistingForTests(env_, test_roots_, : DataDirManagerOptions(), &dd_manager_)); : } nit: consider making this method returning Status and then using ASSERT_OK() -- that way it would be more easier to find the place where it fails, if it does so. -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
Mike Percy has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 5: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: No
[kudu-CR] open FS layout in presence of disk failure
Adar Dembo has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.h File src/kudu/fs/block_manager_util.h: PS4, Line 56: // On success, either 'metadata_' is overwritten with the contents of the : // file, or, in the case of disk failure, returns OK and sets : // 'health_status_' to a non-OK value. : Status LoadFromDisk(); > I'm not sure I fully agree. For one the error message is confusing: "LOG(ER Andrew and I previously discussed this approach. You can see our comments here: https://gerrit.cloudera.org/#/c/7270/6/src/kudu/fs/data_dirs.cc@408. FWIW, I find the idea of "internalizing" the disk failure to PIMF to be reasonable. -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
David Ribeiro Alves has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.h File src/kudu/fs/block_manager_util.h: PS4, Line 56: // On success, either 'metadata_' is overwritten with the contents of the : // file, or, in the case of disk failure, returns OK and sets : // 'health_status_' to a non-OK value. : Status LoadFromDisk(); > Disk failure errors here would be ignored anyway. By adding this to the con I'm not sure I fully agree. For one the error message is confusing: "LOG(ERROR) << "Ignoring error with status: " << _s_prepended.ToString(); \" seems like we're ignoring the error when we really aren't. Then I'd expect that upon getting an OK here the metadata had been loaded, when it's not the case. Given there is a single call site I think it would make sense to be explicit. That is to return a non-ok status and to special case the isDiskFailure() status along with logging the error message there. Otherwise you're pushing logic from the call site onto this class which is a bit dirtier. -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
Andrew Wong has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 4: (16 comments) http://gerrit.cloudera.org:8080/#/c/7784/3//COMMIT_MSG Commit Message: PS3, Line 20: mangers' > looks like a typo It is, indeed, "managers'," referring to the startups of the directory manager, log block manager, and fs manager. http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.cc File src/kudu/fs/block_manager_util.cc: PS4, Line 189: it's > nit its Done PS4, Line 189: them > nit s/them/it Done http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.h File src/kudu/fs/block_manager_util.h: PS4, Line 56: // On success, either 'metadata_' is overwritten with the contents of the : // file, or, in the case of disk failure, returns OK and sets : // 'health_status_' to a non-OK value. : Status LoadFromDisk(); > this is weird. whats the reason for this not to return a non-ok status righ Disk failure errors here would be ignored anyway. By adding this to the contract, we don't have to sully the call-site with disk-handling code, and we can keep the error handling (i.e. setting health_status_) internal. PS4, Line 83: // Whether or not the instance is healthy. If the instance file lives on a : // disk that has failed, this should return false. : bool healthy() const { : return health_status_.ok(); : } > we usually don't have this extra getter (i.e. consider just doing pimf.heal As per Mike's comment, I think it helps for readability. Also, I get the feeling that we don't use this elsewhere (at least for tablets/replicas) because other components have multiple states. In this case, it's pretty cut-and-dry: healthy, not healthy. PS4, Line 83: // Whether or not the instance is healthy. If the instance file lives on a : // disk that has failed, this should return false. : bool healthy() const { : return health_status_.ok(); : } > I had the same initial reaction but it seems to make the code that uses it Done http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: Line 116: using std::make_pair; > warning: using decl 'make_pair' is unused [misc-unused-using-decls] Done PS4, Line 794: if (FindUuidByRoot(root, &uuid)) { : return FindUuidIndexByUuid(uuid, uuid_idx); : } : return false; > there are no chances of races here, right? Nope, the first call is guarded by a scoped lock, as is the second. These values don't change after they're first set in Open(). http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: PS4, Line 366: Finds the UUID for the given canonicalized root directory name. > nit, for consistency append the same suffix as the previous new method (ret Done http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/fs_manager-test.cc File src/kudu/fs/fs_manager-test.cc: Line 279: FLAGS_env_inject_eio = 0; > nit: shouldn't need this due to flag saver in KuduTest unless I'm missing s See Adar's comment. Line 279: FLAGS_env_inject_eio = 0; > Unfortunately, KuduTest doesn't restore the flags until _after_ TearDown() Thanks for the explanation. Line 302: FLAGS_env_inject_eio = 0; > same See above. http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/fs_manager.cc File src/kudu/fs/fs_manager.cc: PS3, Line 180: set all_roots; : all_roots.insert(wal_fs_root_); : for (const string& data_fs_root : data_fs_roots_) { : all_roots.insert(data_fs_root); : } > nit: consider Done PS3, Line 228: if (!ContainsKey(unique_roots, root.path)) { : canonicalized_data_fs_roots_.emplace_back(root); : canonicalized_all_fs_roots_.emplace_back(root); : InsertOrDie(&unique_roots, root.path); : } > Consider using InsertIfNotPresent() instead to avoid double look-ups: Done PS3, Line 248: string wal_root > Is it necessary to make a copy here? Why not to use const reference here i Done PS3, Line 251: string meta_root > ditto Done -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
Hello Mike Percy, Adar Dembo, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7784 to look at the new patch set (#5). Change subject: open FS layout in presence of disk failure .. open FS layout in presence of disk failure Currently, if a Kudu server starts up with a failed disk, the server will crash. There are a number of reasons for this, but the pressing one is that the path instance files may not be readable, meaning the directories' UUIDs may not be available. This patch changes this by introducing an "unhealthy" instance in-memory for each instance that fails to lock, load, canonicalize, etc. Such instances are ignored when it comes to checking the integrity of the FS layout, and are simply marked failed by the directory manager. Testing is done in data_dirs-test, log_block_manager-test, and fs_manager-test to ensure failed directories do not impede the managers' startups. Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager_util-test.cc M src/kudu/fs/block_manager_util.cc M src/kudu/fs/block_manager_util.h M src/kudu/fs/data_dirs-test.cc M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h M src/kudu/fs/fs_manager-test.cc M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc 13 files changed, 635 insertions(+), 174 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/84/7784/5 -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot
[kudu-CR] open FS layout in presence of disk failure
Adar Dembo has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 4: Code-Review+1 Don't have anything to add to David/Mike's comments. -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: No
[kudu-CR] open FS layout in presence of disk failure
Mike Percy has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 4: Code-Review+1 (1 comment) http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.h File src/kudu/fs/block_manager_util.h: PS4, Line 83: // Whether or not the instance is healthy. If the instance file lives on a : // disk that has failed, this should return false. : bool healthy() const { : return health_status_.ok(); : } > we usually don't have this extra getter (i.e. consider just doing pimf.heal I had the same initial reaction but it seems to make the code that uses it a little more reader-friendly. -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
Adar Dembo has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/fs_manager-test.cc File src/kudu/fs/fs_manager-test.cc: Line 279: FLAGS_env_inject_eio = 0; > nit: shouldn't need this due to flag saver in KuduTest unless I'm missing s Unfortunately, KuduTest doesn't restore the flags until _after_ TearDown() runs, and env_inject_eio!=0 can cause test directory cleanup to fail. That said, having to do this for every test is annoying, so maybe we should manually destroy the FlagSaver at the beginning of TearDown instead? -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
Alexey Serbin has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 3: (5 comments) A quick first pass http://gerrit.cloudera.org:8080/#/c/7784/3//COMMIT_MSG Commit Message: PS3, Line 20: mangers' looks like a typo Is that supposed to be "directory managers'" or "managers'" ? http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/fs_manager.cc File src/kudu/fs/fs_manager.cc: PS3, Line 180: set all_roots; : all_roots.insert(wal_fs_root_); : for (const string& data_fs_root : data_fs_roots_) { : all_roots.insert(data_fs_root); : } nit: consider set all_roots(data_fs_roots_.begin(), data_fs_roots_.end()); all_roots.insert(wal_fs_root_); PS3, Line 228: if (!ContainsKey(unique_roots, root.first)) { : canonicalized_data_fs_roots_.emplace_back(root); : canonicalized_all_fs_roots_.emplace_back(root); : InsertOrDie(&unique_roots, root.first); : } Consider using InsertIfNotPresent() instead to avoid double look-ups: if (InsertIfNotPresent(&unique_roots, root.first)) { canonicalized_data_fs_roots_.emplace_back(root); canonicalized_all_fs_roots_.emplace_back(root); } PS3, Line 248: string wal_root Is it necessary to make a copy here? Why not to use const reference here instead? PS3, Line 251: string meta_root ditto -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
Mike Percy has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 3: (3 comments) couple other minor nits, overall lgtm http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: PS3, Line 831: dirs_to_update unused? http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/fs_manager-test.cc File src/kudu/fs/fs_manager-test.cc: Line 279: FLAGS_env_inject_eio = 0; nit: shouldn't need this due to flag saver in KuduTest unless I'm missing something Line 302: FLAGS_env_inject_eio = 0; same -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
David Ribeiro Alves has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 4: (6 comments) http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.cc File src/kudu/fs/block_manager_util.cc: PS4, Line 189: them nit s/them/it PS4, Line 189: it's nit its http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/block_manager_util.h File src/kudu/fs/block_manager_util.h: PS4, Line 56: // On success, either 'metadata_' is overwritten with the contents of the : // file, or, in the case of disk failure, returns OK and sets : // 'health_status_' to a non-OK value. : Status LoadFromDisk(); this is weird. whats the reason for this not to return a non-ok status right away? PS4, Line 83: // Whether or not the instance is healthy. If the instance file lives on a : // disk that has failed, this should return false. : bool healthy() const { : return health_status_.ok(); : } we usually don't have this extra getter (i.e. consider just doing pimf.healt_status().ok()) http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: PS4, Line 794: if (FindUuidByRoot(root, &uuid)) { : return FindUuidIndexByUuid(uuid, uuid_idx); : } : return false; there are no chances of races here, right? http://gerrit.cloudera.org:8080/#/c/7784/4/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: PS4, Line 366: Finds the UUID for the given canonicalized root directory name. nit, for consistency append the same suffix as the previous new method (returning false...) -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7784 to look at the new patch set (#4). Change subject: open FS layout in presence of disk failure .. open FS layout in presence of disk failure Currently, if a Kudu server starts up with a failed disk, the server will crash. There are a number of reasons for this, but the pressing one is that the path instance files may not be readable, meaning the directories' UUIDs may not be available. This patch changes this by introducing an "unhealthy" instance in-memory for each instance that fails to lock, load, canonicalize, etc. Such instances are ignored when it comes to checking the integrity of the FS layout, and are simply marked failed by the directory manager. Testing is done in data_dirs-test, log_block_manager-test, and fs_manager-test to ensure failed directories do not impede the mangers' startups. Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager_util-test.cc M src/kudu/fs/block_manager_util.cc M src/kudu/fs/block_manager_util.h M src/kudu/fs/data_dirs-test.cc M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h M src/kudu/fs/fs_manager-test.cc M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc 13 files changed, 636 insertions(+), 169 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/84/7784/4 -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot
[kudu-CR] open FS layout in presence of disk failure
Andrew Wong has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 3: (12 comments) http://gerrit.cloudera.org:8080/#/c/7784/3//COMMIT_MSG Commit Message: PS3, Line 15: failsto > fails to Done PS3, Line 15: all each > just "each" Done PS3, Line 15: failsto > fails to Done PS3, Line 15: all > s/all// Done http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/block_manager_util.h File src/kudu/fs/block_manager_util.h: PS3, Line 57: not OK() > 1. Why not just return an error? 1. Errors here would be ignored anyway. By adding this to the contract, we don't have to dirty the call-site with disk-handling code and we can keep the error handling (i.e. setting health_status_) internal. 2. Done PS3, Line 88: Status > nit: const Status& ? Done http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/data_dirs-test.cc File src/kudu/fs/data_dirs-test.cc: PS3, Line 244: ASSERT_DEATH({ : dd_manager_->MarkDataDirFailed(kNumDirs - 1); : }, ".*All data dirs have failed"); > I still don't like this change; I prefer that "library code" like the Direc Done http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: Line 385: return Status::IOError("Could not create directory manager with disks failed"); > Maybe include root.second in the error message? Done Line 607: MarkDataDirFailed(uuid_idx, refresh_status.ToString()); > As future cleanup, it would be nice if MarkDataDirFailed() took a Status ra Fair point. I'll add a TODO in MarkDataDirFailed. http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: PS3, Line 53: typedef std::pair CanonicalizedRoot; > I think it would be easier to understand the code that uses this if it was Done http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/fs_manager.h File src/kudu/fs/fs_manager.h: PS3, Line 302: Returned values > This comment is a bit confusing. I think it would make more sense to say it Done http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/log_block_manager.cc File src/kudu/fs/log_block_manager.cc: Line 1592: if (dd_manager_->IsDataDirFailed(uuid_idx)) { > If we stored the status passed to us in MarkDataDirFailed, we could return Added TODOs here and in MarkDataDirFailed(). I also wouldn't mind putting them in this patch, but for now, TODOs. -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
Mike Percy has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 3: (6 comments) I'm still looking through this patch but so far I only have nits http://gerrit.cloudera.org:8080/#/c/7784/3//COMMIT_MSG Commit Message: PS3, Line 15: all s/all// PS3, Line 15: failsto fails to http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/block_manager_util.h File src/kudu/fs/block_manager_util.h: PS3, Line 57: not OK() 1. Why not just return an error? 2. If we really want to return OK, be a little more explicit: "... or, in the case of disk failure, returns OK but will set health_status_ to a non-OK value." PS3, Line 88: Status nit: const Status& ? http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: PS3, Line 53: typedef std::pair CanonicalizedRoot; I think it would be easier to understand the code that uses this if it was renamed and made a struct: struct CanonicalizedRootAndStatus { std::string path; Status status; }; http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/fs_manager.h File src/kudu/fs/fs_manager.h: PS3, Line 302: Returned values This comment is a bit confusing. I think it would make more sense to say it like: Canonicalized forms of the root directories. Canonicalized during Init() with ordering maintained. -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
Adar Dembo has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 3: (6 comments) http://gerrit.cloudera.org:8080/#/c/7784/3//COMMIT_MSG Commit Message: PS3, Line 15: all each just "each" PS3, Line 15: failsto fails to http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/data_dirs-test.cc File src/kudu/fs/data_dirs-test.cc: PS3, Line 244: ASSERT_DEATH({ : dd_manager_->MarkDataDirFailed(kNumDirs - 1); : }, ".*All data dirs have failed"); I still don't like this change; I prefer that "library code" like the DirectoryManager pass back an appropriate Status that would let callers decide whether to crash or not. http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: Line 385: return Status::IOError("Could not create directory manager with disks failed"); Maybe include root.second in the error message? Line 607: MarkDataDirFailed(uuid_idx, refresh_status.ToString()); As future cleanup, it would be nice if MarkDataDirFailed() took a Status rather than a string. The former has more information in it. http://gerrit.cloudera.org:8080/#/c/7784/3/src/kudu/fs/log_block_manager.cc File src/kudu/fs/log_block_manager.cc: Line 1592: if (dd_manager_->IsDataDirFailed(uuid_idx)) { If we stored the status passed to us in MarkDataDirFailed, we could return it to callers here so we wouldn't need to create artificial Statuses like this. -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
Andrew Wong has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 3: Just rebased. -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: No
[kudu-CR] open FS layout in presence of disk failure
Andrew Wong has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 1: (1 comment) Ah, it seems the issue was with some error-handling wiring that I changed. Should be good now. http://gerrit.cloudera.org:8080/#/c/7784/1/src/kudu/fs/fs_manager.cc File src/kudu/fs/fs_manager.cc: Line 35: #include "kudu/fs/error_manager.h" > warning: #includes are not sorted properly [llvm-include-order] Done -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7784 to look at the new patch set (#2). Change subject: open FS layout in presence of disk failure .. open FS layout in presence of disk failure Currently, if a Kudu server starts up with a failed disk, the server will crash. There are a number of reasons for this, but the pressing ones are that the path instance files may not be readable, meaning the directories' UUIDs may not be available. This patch changes this by introducing an "unhealthy" instance in-memory for all each instance that failsto load, lock, canonicalize, etc. Such instances are ignored when it comes to checking the integrity of the FS layout, and are simply marked failed by the directory manager. Testing is done in data_dirs-test, log_block_manager-test, and fs_manager-test to ensure failed directories do not impede the mangers' startups. Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager_util-test.cc M src/kudu/fs/block_manager_util.cc M src/kudu/fs/block_manager_util.h M src/kudu/fs/data_dirs-test.cc M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h M src/kudu/fs/fs_manager-test.cc M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc 13 files changed, 612 insertions(+), 162 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/84/7784/2 -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot
[kudu-CR] open FS layout in presence of disk failure
Andrew Wong has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 1: looking into the jenkins failure. -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: No
[kudu-CR] open FS layout in presence of disk failure
Andrew Wong has posted comments on this change. Change subject: open FS layout in presence of disk failure .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/7784/1/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: PS1, Line 592: uid_by_root_.swap(uuid_by_root); : : // From this point onwards, the above in-memory maps must be consistent with : // the main path set. : : // Initialize the 'fullness' status of the data directories. : for (const auto& dd : data_dirs_) { : uint16_t uuid_idx; : CHECK(FindUuidIndexByDataDir(dd.get(), &uuid_idx)); : if (ContainsKey(failed_data_dirs_, uuid_idx)) { : continue; : } : Status refresh_status = dd->RefreshIsFull(DataDir::RefreshMode::ALWAYS); : if (PREDICT_FALSE(!refresh_status.ok())) { : if (refresh_status.IsDiskFailure()) { : MarkDataDirFailed(uuid_idx, refresh_status.ToString()); : continue; : } : return refresh_status; : } : } re: why is this here and not higher up, yes this is here because MarkDataDirFailed needs the above to work. -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-HasComments: Yes
[kudu-CR] open FS layout in presence of disk failure
Andrew Wong has uploaded a new change for review. http://gerrit.cloudera.org:8080/7784 Change subject: open FS layout in presence of disk failure .. open FS layout in presence of disk failure Currently, if a Kudu server starts up with a failed disk, the server will crash. There are a number of reasons for this, but the pressing ones are that the path instance files may not be readable, meaning the directories' UUIDs may not be available. This patch changes this by introducing an "unhealthy" instance in-memory for all each instance that failsto load, lock, canonicalize, etc. Such instances are ignored when it comes to checking the integrity of the FS layout, and are simply marked failed by the directory manager. Testing is done in data_dirs-test, log_block_manager-test, and fs_manager-test to ensure failed directories do not impede the mangers' startups. Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 --- M src/kudu/fs/block_manager-stress-test.cc M src/kudu/fs/block_manager-test.cc M src/kudu/fs/block_manager_util-test.cc M src/kudu/fs/block_manager_util.cc M src/kudu/fs/block_manager_util.h M src/kudu/fs/data_dirs-test.cc M src/kudu/fs/data_dirs.cc M src/kudu/fs/data_dirs.h M src/kudu/fs/fs_manager-test.cc M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc 13 files changed, 607 insertions(+), 162 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/84/7784/1 -- To view, visit http://gerrit.cloudera.org:8080/7784 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Ib2a1c824526ed52a6b90ddfbc735cecc4c491118 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong