Todd Lipcon has posted comments on this change.

Change subject: Add implementation of the block cache on NVM running in 
persistent mode.
......................................................................


Patch Set 6:

(32 comments)

Spent a few hours and left some preliminary comments.

I still think the refactoring's not quite right - leaking some implementation 
details into block_cache and cfile_reader in a way that's not super clean. I'll 
take a look at that part to make a concrete suggestion, but left a bunch of 
smaller things on the implementation of the cache itself that are worth looking 
at in parallel.

http://gerrit.cloudera.org:8080/#/c/1347/6/cmake_modules/FindPmem.cmake
File cmake_modules/FindPmem.cmake:

Line 1: # - Find Required PMEM libraries (libvmem, libpmem, libpmemobj)
add apache license


http://gerrit.cloudera.org:8080/#/c/1347/6/src/kudu/cfile/block_cache.h
File src/kudu/cfile/block_cache.h:

Line 77:   bool Lookup(const Slice& key, Cache::CacheBehavior behavior,
since you've moved the CacheKey struct to be a part of the public interface, it 
makes more sense to use it as the argument here instead of the slice (since I 
think at the call site you're just passing CacheKey.slice)


Line 87:   bool Insert(void* key, size_t key_size, Slice &block_data,
this API also now looks pretty strange, since we expect the key is supposed to 
just be a CacheKey, and yet you're making the user pass in a size.


Line 111:   void Free(const Slice& key, void* p);
odd that you have to pass a key here, when Allocate doesn't either take or 
return a key. It's confusing how I might use this.


http://gerrit.cloudera.org:8080/#/c/1347/6/src/kudu/util/nvm_cache.cc
File src/kudu/util/nvm_cache.cc:

Line 91: struct KeyVal {
add comment explaining that this is a variable-length structure which stores 
the actual key/value pairs in pmem


Line 94:   uint64_t  key_size;
given that this is persistent memory, we should try to be space-efficient, so I 
think my earlier comment stands that we shouldn't store redundant info, and 
instead should add an accessor somewhere here which calculates kv_size in terms 
of key_size + value_size + sizeof(KeyVal)


Line 104: void KvConstructor(PMEMobjpool* pop, void* ptr, void* arg) {
should be in an anonymous namespace (because this symbol isn't used outside of 
nvm_cache.cc)


Line 110:     pmemobj_persist(pop, &kv->valid, sizeof(kv->valid));
I spent some time reading the libpmemobj source, and it wasn't clear whether 
these pmemobj_persist calls are necessary. It says it makes the constructor 
atomic and persistent, but couldn't quite tell whether the whole memory area is 
persisted or just the existence of the object.

Looking at the test code, it seems like there's a mix of some constructs which 
do persist their data and others which don't. It would be good to clarify this 
in the libpmemobj man pages as well.

Anyway, if it _is_ required, then:
- you're persisting kv->kv_size and kv->valid but not the other fields. That 
seems not quite right. It actually seems like maybe you _only_ need to persist 
the 'valid' flag, since the rest can be arbitrary junk which we'll overwrite 
later before persisting again during Insert, right?

- each pmemobj_persist call requires a clwb/sfence/pcommit, which is somewhat 
expensive, so if in fact we do need to persist multiple fields, we should 
probably just do a single call to persist(pop, kv, sizeof(*kv));


Line 113: // Finds the persistent memory object id for the give ptr and offset.
typo: 'given'


Line 115: static PMEMoid FindOid(const uint8_t* ptr, size_t offset, 
PMEMobjpool* pop,
- 'offset' here is always offsetof(KeyValue, key_value), so shouldn't be a 
parameter.
- 'pop' and 'root' are always pop_ and root_, so this should probably be a 
member function so they don't need to be parameters


Line 120:   uintptr_t kv_ptr = (uintptr_t)(ptr - offset);
- a DCHECK_GT(ptr, offset) is probably useful here
- style:use C++ casts


Line 122:   newoid.pool_uuid_lo = root.pool_uuid_lo;
here you're using internals of the PMEMoid type. Is there no API in libpmemobj 
that will do this for you? eg something like PMEMoid 
pmemobj_fromptr(PMEMobjpool* pool, void* ptr)? Seems like it would be a good 
API to request adding (and if it's not something they want to support then 
maybe we shouldn't be reaching into the internals to do it ourselves)


Line 129: POBJ_LAYOUT_BEGIN(kudublockcache);
style: kudu_block_cache (we try to avoid runningwordstogether)


Line 163:   uint8_t* key_ptr; // Only used in persistent mode.
seems potentially redundant - in persistent mode, aren't the key and value 
always contiguous?

in non-persistent mode, will this be a pointer to 'key_data'? it seems like 
you're treating it that way in the implementation of key() below, but should be 
noted.

Some docs/ASCII diagrams on the various structures and layouts would be very 
useful to add at the top of this file, since we've got several different VLAs, 
which sometimes seem to have inline data and sometimes have pointed-to data, 
and it gets really hard to remember in which modes the various pointers are 
used.


Line 373:           FreeEntry(e, true);
nit: indentation of above four lines is off


Line 398:   ScopedLeakCheckDisabler disabler;
I tried removing the ScopedLeakCheckDisabler and I dont see any leaks. Still 
not following why our leak checker would think that persistent allocations are 
allocations (because they aren't memory allocations via malloc, after all, and 
the ASAN checker works by trapping malloc).

Which test are you seeing a leak on when you dont have this?


Line 405:   struct KeyVal *tmp = static_cast<struct 
KeyVal*>(pmemobj_direct(oid));
- under what circumstances would pmemobj_direct fail on a just-allocated pmem 
obj? wouldn't this be a bug in the pmem library? (in which case maybe CHECK is 
fine)
- no need to say 'struct KeyVal' - can just say 'KeyVal'


Line 429:       e->deleter->Delete(e->key(), e->value);
nit: indented too much


Line 453:                                     void (*constructor)(PMEMobjpool*, 
void*, void*)) {
use a typedef for this function pointer type


Line 584:       FreeEntry(e, false);
Not following this. If the last reference is held by a user, this means it's 
not in the LRU list anymore, (i.e. it was evicted while a user still had a ref 
to it). So, when that caller releases it, the PMEM space should be reclaimed.


Line 642:     pmemobj_persist(pop_, &kv->key_value, kv->value_size);
instead of separately persisting each subfield, probably more efficient to do a 
single call to persist the whole struct (including its appended VLA)


Line 644:     pmemobj_persist(pop_, &kv->valid, sizeof(kv->valid));
Here we only care about an ordering property that all of the KeyValue fields 
(minus 'valid') get persisted before 'valid=1' gets persisted. Doing two 
separate 'persist' calls ensures that the first is fully durable before we make 
'valid' durable, which is stricter than necessary (and expensive on 
PCIe-attached)

I _think_ you could use pmem_flush for the all the data you want to persist, 
then an sfence, and then pmem_flush for 'valid', and then pmem_drain() to issue 
the final sfence pcommit sequence.

Worth chatting with the libpmemobj people to see what the minimum number of 
flush/drains calls we can get away with for a caching use case where we don't 
really care about durability, and only need weaker ordering guarantees. On real 
pmem it's not so important, but it would be awesome if this code could work on 
PCIe-attached devices where pmem_flush and pmem_persist turn into actual 
msync() calls


Line 683:       FreeEntry(e, false);
also doesn't seem right - if you explicitly erase something from the cache, 
you're OK with it being gone-gone (not just gone until the next restart)


Line 776:       for (int i = 0; i < num; i++) {
this loop looks like a lot of repeated code from above... why can't we just use 
a POBJ_FOREACH_TYPE(pop_, kv) {
  ... loop content ...
}

instead of the double-loop-through?


Line 792:       // Volatile mode
this is mostly repeated code from up above, can't it be moved out of the 
conditional part?


Line 914:         PLOG_IF(FATAL, pop == NULL)
you dont need the PLOG_IF since you just checked the same condition above


Line 916:             << FLAGS_nvm_cache_path.c_str();
no need to call c_str() (you can concat the std::string)


Line 926:                 << FLAGS_nvm_cache_path.c_str();
same


Line 943:             << FLAGS_nvm_cache_path.c_str();
same


http://gerrit.cloudera.org:8080/#/c/1347/6/src/kudu/util/nvm_cache.h
File src/kudu/util/nvm_cache.h:

Line 24: #include <libvmem.h>
shouldn't need to include any of these in the header (since users dont need any 
of the implementation details)


http://gerrit.cloudera.org:8080/#/c/1347/6/thirdparty/build-definitions.sh
File thirdparty/build-definitions.sh:

Line 312:       for LIB in libvmem libpmem libpmemobj; do
nit: tab


Line 313:     EXTRA_CFLAGS="$DEBUG_CFLAGS" make -j$PARALLEL $LIB DEBUG=0
should say EXTRA_CFLAGS=$EXTRA_CFLAGS

(DEBUG_CFLAGS is no longer set)


-- 
To view, visit http://gerrit.cloudera.org:8080/1347
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2dcf474bce9f2375b0ddde38312db7d82a81835c
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Sarah Jelinek <[email protected]>
Gerrit-Reviewer: Anonymous Coward #90
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Sarah Jelinek <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to