Andrew Wong has posted comments on this change.

Change subject: disk failure: add persistent disk states
......................................................................


Patch Set 15:

(30 comments)

http://gerrit.cloudera.org:8080/#/c/7270/15//COMMIT_MSG
Commit Message:

Line 11: failed disk may be partially written and thus should not be used.
> I dont quite follow this logic. If we crashed in the middle of writing a bl
Right, so the Status.IsDiskFailure() doesn't account for "out of space" errors. 
In those cases, we'll still crash. There was a bit of discussion about this a 
couple months ago and it seems like the "out of space" behavior _should_ be 
different. For now, we just crash, but I don't think it'd be terribly 
unreasonable to drop some tablets on the out-of-space disk.

In any case, yeah, I've updated this.


http://gerrit.cloudera.org:8080/#/c/7270/15/src/kudu/fs/block_manager_util-test.cc
File src/kudu/fs/block_manager_util-test.cc:

Line 151: TEST_F(KuduTest, CheckIntegrity) {
> There's a lot to process here. In total, do you get full coverage of every 
All paths out of CheckIntegrity(), yes. "Update these instance files" is 
covered in DataDirManagerTest::TestOpenWithFailedDirs.


http://gerrit.cloudera.org:8080/#/c/7270/6/src/kudu/fs/block_manager_util.cc
File src/kudu/fs/block_manager_util.cc:

Line 88: 
> I don't see this change in PS15.
Done.


http://gerrit.cloudera.org:8080/#/c/7270/15/src/kudu/fs/block_manager_util.cc
File src/kudu/fs/block_manager_util.cc:

PS15, Line 54: CloneAndPrepend(msg)
> CloneAndPrepend() is work; can we defer it to if !s.ok() and we want to ret
Done


PS15, Line 69: Status::OK()
> This is the default value anyway; you can omit health_status_ from the list
Done


PS15, Line 121:   // If the write fails, don't update the timestamp.
              :   uint64_t original_timestamp = pb->path_set().timestamp_us();
              :   auto reset_timestamp = MakeScopedCleanup([&] {
              :     
pb->mutable_path_set()->set_timestamp_us(original_timestamp);
              :   });
> This doesn't seem that important, but OK.
Sure, removed.


PS15, Line 137:   // Some instances on failed disks (e.g. those that failed to 
canonicalize
              :   // their path) are prefixed with illegal whitespaces to 
signify failure.
> As we discussed offline, rather than modifying the path to indicate a failu
Done


PS15, Line 197:   // Identify the instance that is the most up-to-date and 
check that there
              :   // are no duplicate UUIDs among the healthy input instances.
> Should add that the old invariant was that all instances have to have ident
Since that and the TODO are more function-wide comments, I pushed it to the top.


PS15, Line 205: max_timestamp
> Nit: add the units (max_timestamp_us?).
Done


Line 208:       const PathSetPB path_set = instances[i]->metadata()->path_set();
> Store a cref; no need to make a copy.
Done


Line 218:         updated_instances->insert(instances[i]);
> Can we modify a temporary set first, so that if we return an error we avoid
Done


Line 231:   int path_set_uuids_size;
> Isn't this always the same as main_uuids.size() after L232-244?
Done


Line 232:   if (main_path_set->all_paths_size() == 0) {
> It's rather easy to forget that this means "legacy". Perhaps store it in an
Done


PS15, Line 234:     
main_uuids.insert(main_path_set->deprecated_all_uuids().begin(),
              :                       
main_path_set->deprecated_all_uuids().end());
              :     path_set_uuids_size = 
main_path_set->deprecated_all_uuids_size();
> Nit: flip the order so it's the same as L239-243.
Done


Line 240:     DCHECK_GT(path_set_uuids_size, 0);
> I think this is trivially correct (by virtue of being in this else statemen
Done


Line 242:       main_uuids.insert(path.uuid());
> Should this InsertOrDie? Or can we expect duplicates?
No, this entire if statement is meant to deduplicate the UUIDs. I've added a 
comment since it seems that was the source of confusion of one of the above 
comments.


PS15, Line 265: depcreated
> deprecated
Done


PS15, Line 267: main_path_updated
> main_path_set_updated?
Done


PS15, Line 302: main_path_set->mutable_all_paths(i)
> How about using pb_at_i here (though make it a pointer, not a cref).
Done


PS15, Line 325: path
> path_set
Done


http://gerrit.cloudera.org:8080/#/c/7270/15/src/kudu/fs/data_dirs-test.cc
File src/kudu/fs/data_dirs-test.cc:

PS15, Line 61: // Use log block mode. This should not have much effect on the 
directory
             : // manager itself, but log blocks are expected to be more common 
in production.
             : static const char* kBlockType = "log";
> This is persisted (and compared) in PathInstanceMetadataFile, so it doesn't
Done


PS15, Line 94:   vector<string> GetDirNames(int num_dirs) {
             :     vector<string> ret;
             :     for (int i = 0; i < num_dirs; i++) {
             :       string dir_name = Substitute("$0-$1", kDirNamePrefix, i);
             :       ret.push_back(GetTestPath(dir_name));
             :       CHECK_OK(env_->CreateDir(ret[i]));
             :     }
             :     return ret;
             :   }
> Don't need this anymore.
Done


Line 367: class DataDirManagerTest : public DataDirsTest {
> Is this class hierarchy absolutely necessary? I'd rather have one fixture t
Good point, not particularly. I do think two different SetUp()s are needed, but 
that warrants two classes, not two subclasses and a superclass.


Line 399:   FLAGS_env_inject_eio = 1.0;
> Does this need to be reset at the end? Or is it OK because of eio_globs?
Ah, yeah it should be reset.


http://gerrit.cloudera.org:8080/#/c/7270/15/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

Line 159:   void SetFailed();
> Doc.
Removed.


Line 394:   // Synchronizes the in-memory path sets to match the currently 
failed dirs.
> What's the significance of the return value?
Done


http://gerrit.cloudera.org:8080/#/c/7270/15/src/kudu/fs/fs_manager.cc
File src/kudu/fs/fs_manager.cc:

Line 356:     LOG(INFO) << root;
> Still want this?
Done


http://gerrit.cloudera.org:8080/#/c/7270/15/src/kudu/fs/log_block_manager-test.cc
File src/kudu/fs/log_block_manager-test.cc:

Line 1287:     ASSERT_OK(env_util::CreateDirIfMissing(env_, dir));
> CreateDir() should suffice.
Done


http://gerrit.cloudera.org:8080/#/c/7270/15/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

PS15, Line 1537:       // Blocks should not be placed in failed directories.
               :       limit = 0;
> I don't think this has the effect you want. If we use a limit of 0, we'll s
Ah, fair point. Actually in this case, I don't think the limit 
matters--containers shouldn't be put on a failed disk.


PS15, Line 1619: FindDataDirByUuidIndex(uuid_indices[i])
> Isn't this the same iteration order as dd_manager_->data_dirs()? That is, c
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7270
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifddf0817fe1a82044077f5544c400c88de20769f
Gerrit-PatchSet: 15
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes

Reply via email to