Adar Dembo has posted comments on this change.

Change subject: disk failure: make DataDirManager failure-aware
......................................................................


Patch Set 1:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/7028/1/src/kudu/fs/data_dirs-test.cc
File src/kudu/fs/data_dirs-test.cc:

Line 86:   CreateBlockOptions non_existent_tablet_opts(test_block_opts_);
Why do we need to copy test_block_opts_? Can't we just pass it directly to 
GetNextDataDir()?


http://gerrit.cloudera.org:8080/#/c/7028/1/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

Line 443:   shared_lock<rw_spinlock> health_lock(dir_health_lock_.get_lock());
This is why I asked about the contention profiles of the locks: use and 
acquisition of multiple locks should only be done if absolutely necessary, 
because it's really easy to make a mistake and cause a deadlock via lock 
inversion.


PS1, Line 460:       group_target_size = std::min(group_target_size,
             :           static_cast<uint32_t>(data_dirs_.size() - 
failed_data_dirs_.size()));
If someone requested a group of a particular size, why not fail that request if 
there have been too many failures? Why be permissive?


PS1, Line 484:     // This should only be reached by some tests; in cases where 
there is no
             :     // natural tablet_id, select a data dir from any of the 
directories.
So in this case there's no point in considering failed_data_dirs_?


Line 503:         return Status::IOError("No healthy directories exist in 
tablet's "
No ENODEV here?


Line 587: bool DataDirManager::FindTabletsByDataDirUuidIdx(uint16_t uuid_idx, 
set<string>** tablet_ids) {
This needs some locking.


Line 601:   LOG(ERROR) << Substitute("Dir with uuid index $0 ($1) marked as 
failed", uuid_idx, dd->dir());
Perhaps avoid logging this multiple times if failed_data_dirs_ already has this 
uuid_idx?


http://gerrit.cloudera.org:8080/#/c/7028/1/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

Line 263:   bool FindTabletsByDataDirUuidIdx(uint16_t uuid_idx, 
std::set<std::string>**
I imagine you're trying to avoid copies by passing the caller a pointer into 
the map, but this makes synchronization and lifecycle management much harder. 
To simplify, just have tablet_ids be a pointer and copy the set's contents.

Also, I don't understand the difference between returning false and returning 
true with an empty set. It seems to me that uuid_idx is guaranteed to be in the 
map; only the contents of tablet_ids can vary (from empty to non-empty). So you 
could return the set directly instead of passing it as an argument.


PS1, Line 268: MarkDataDirFailure
Nit: how about MarkDataDirFailed, for better symmetry with IsDirectoryFailed?


Line 271:   bool IsDirectoryFailed(uint16_t uuid_idx) const;
Nit: Likewise, IsDataDirFailed (or HasDataDirFailed)


Line 273:   void GetFailedDataDirs(std::set<uint16_t>* data_dirs) const {
How about just returning the data_dirs? Easier for callers.


Line 274:     *data_dirs = failed_data_dirs_;
This needs to acquire a lock, no?


Line 329:   // Lock protecting changes to failed_data_dirs_.
Why do we need a different lock for this? Do you expect either this lock or 
dir_group_lock_ to be highly contended?


-- 
To view, visit http://gerrit.cloudera.org:8080/7028
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iee212793152de5de5198751d649ab34fb97f6aa2
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes

Reply via email to