Todd Lipcon has posted comments on this change. Change subject: Add implementation of the block cache on NVM running in persistent mode. ......................................................................
Patch Set 6: (32 comments) Spent a few hours and left some preliminary comments. I still think the refactoring's not quite right - leaking some implementation details into block_cache and cfile_reader in a way that's not super clean. I'll take a look at that part to make a concrete suggestion, but left a bunch of smaller things on the implementation of the cache itself that are worth looking at in parallel. http://gerrit.cloudera.org:8080/#/c/1347/6/cmake_modules/FindPmem.cmake File cmake_modules/FindPmem.cmake: Line 1: # - Find Required PMEM libraries (libvmem, libpmem, libpmemobj) add apache license http://gerrit.cloudera.org:8080/#/c/1347/6/src/kudu/cfile/block_cache.h File src/kudu/cfile/block_cache.h: Line 77: bool Lookup(const Slice& key, Cache::CacheBehavior behavior, since you've moved the CacheKey struct to be a part of the public interface, it makes more sense to use it as the argument here instead of the slice (since I think at the call site you're just passing CacheKey.slice) Line 87: bool Insert(void* key, size_t key_size, Slice &block_data, this API also now looks pretty strange, since we expect the key is supposed to just be a CacheKey, and yet you're making the user pass in a size. Line 111: void Free(const Slice& key, void* p); odd that you have to pass a key here, when Allocate doesn't either take or return a key. It's confusing how I might use this. http://gerrit.cloudera.org:8080/#/c/1347/6/src/kudu/util/nvm_cache.cc File src/kudu/util/nvm_cache.cc: Line 91: struct KeyVal { add comment explaining that this is a variable-length structure which stores the actual key/value pairs in pmem Line 94: uint64_t key_size; given that this is persistent memory, we should try to be space-efficient, so I think my earlier comment stands that we shouldn't store redundant info, and instead should add an accessor somewhere here which calculates kv_size in terms of key_size + value_size + sizeof(KeyVal) Line 104: void KvConstructor(PMEMobjpool* pop, void* ptr, void* arg) { should be in an anonymous namespace (because this symbol isn't used outside of nvm_cache.cc) Line 110: pmemobj_persist(pop, &kv->valid, sizeof(kv->valid)); I spent some time reading the libpmemobj source, and it wasn't clear whether these pmemobj_persist calls are necessary. It says it makes the constructor atomic and persistent, but couldn't quite tell whether the whole memory area is persisted or just the existence of the object. Looking at the test code, it seems like there's a mix of some constructs which do persist their data and others which don't. It would be good to clarify this in the libpmemobj man pages as well. Anyway, if it _is_ required, then: - you're persisting kv->kv_size and kv->valid but not the other fields. That seems not quite right. It actually seems like maybe you _only_ need to persist the 'valid' flag, since the rest can be arbitrary junk which we'll overwrite later before persisting again during Insert, right? - each pmemobj_persist call requires a clwb/sfence/pcommit, which is somewhat expensive, so if in fact we do need to persist multiple fields, we should probably just do a single call to persist(pop, kv, sizeof(*kv)); Line 113: // Finds the persistent memory object id for the give ptr and offset. typo: 'given' Line 115: static PMEMoid FindOid(const uint8_t* ptr, size_t offset, PMEMobjpool* pop, - 'offset' here is always offsetof(KeyValue, key_value), so shouldn't be a parameter. - 'pop' and 'root' are always pop_ and root_, so this should probably be a member function so they don't need to be parameters Line 120: uintptr_t kv_ptr = (uintptr_t)(ptr - offset); - a DCHECK_GT(ptr, offset) is probably useful here - style:use C++ casts Line 122: newoid.pool_uuid_lo = root.pool_uuid_lo; here you're using internals of the PMEMoid type. Is there no API in libpmemobj that will do this for you? eg something like PMEMoid pmemobj_fromptr(PMEMobjpool* pool, void* ptr)? Seems like it would be a good API to request adding (and if it's not something they want to support then maybe we shouldn't be reaching into the internals to do it ourselves) Line 129: POBJ_LAYOUT_BEGIN(kudublockcache); style: kudu_block_cache (we try to avoid runningwordstogether) Line 163: uint8_t* key_ptr; // Only used in persistent mode. seems potentially redundant - in persistent mode, aren't the key and value always contiguous? in non-persistent mode, will this be a pointer to 'key_data'? it seems like you're treating it that way in the implementation of key() below, but should be noted. Some docs/ASCII diagrams on the various structures and layouts would be very useful to add at the top of this file, since we've got several different VLAs, which sometimes seem to have inline data and sometimes have pointed-to data, and it gets really hard to remember in which modes the various pointers are used. Line 373: FreeEntry(e, true); nit: indentation of above four lines is off Line 398: ScopedLeakCheckDisabler disabler; I tried removing the ScopedLeakCheckDisabler and I dont see any leaks. Still not following why our leak checker would think that persistent allocations are allocations (because they aren't memory allocations via malloc, after all, and the ASAN checker works by trapping malloc). Which test are you seeing a leak on when you dont have this? Line 405: struct KeyVal *tmp = static_cast<struct KeyVal*>(pmemobj_direct(oid)); - under what circumstances would pmemobj_direct fail on a just-allocated pmem obj? wouldn't this be a bug in the pmem library? (in which case maybe CHECK is fine) - no need to say 'struct KeyVal' - can just say 'KeyVal' Line 429: e->deleter->Delete(e->key(), e->value); nit: indented too much Line 453: void (*constructor)(PMEMobjpool*, void*, void*)) { use a typedef for this function pointer type Line 584: FreeEntry(e, false); Not following this. If the last reference is held by a user, this means it's not in the LRU list anymore, (i.e. it was evicted while a user still had a ref to it). So, when that caller releases it, the PMEM space should be reclaimed. Line 642: pmemobj_persist(pop_, &kv->key_value, kv->value_size); instead of separately persisting each subfield, probably more efficient to do a single call to persist the whole struct (including its appended VLA) Line 644: pmemobj_persist(pop_, &kv->valid, sizeof(kv->valid)); Here we only care about an ordering property that all of the KeyValue fields (minus 'valid') get persisted before 'valid=1' gets persisted. Doing two separate 'persist' calls ensures that the first is fully durable before we make 'valid' durable, which is stricter than necessary (and expensive on PCIe-attached) I _think_ you could use pmem_flush for the all the data you want to persist, then an sfence, and then pmem_flush for 'valid', and then pmem_drain() to issue the final sfence pcommit sequence. Worth chatting with the libpmemobj people to see what the minimum number of flush/drains calls we can get away with for a caching use case where we don't really care about durability, and only need weaker ordering guarantees. On real pmem it's not so important, but it would be awesome if this code could work on PCIe-attached devices where pmem_flush and pmem_persist turn into actual msync() calls Line 683: FreeEntry(e, false); also doesn't seem right - if you explicitly erase something from the cache, you're OK with it being gone-gone (not just gone until the next restart) Line 776: for (int i = 0; i < num; i++) { this loop looks like a lot of repeated code from above... why can't we just use a POBJ_FOREACH_TYPE(pop_, kv) { ... loop content ... } instead of the double-loop-through? Line 792: // Volatile mode this is mostly repeated code from up above, can't it be moved out of the conditional part? Line 914: PLOG_IF(FATAL, pop == NULL) you dont need the PLOG_IF since you just checked the same condition above Line 916: << FLAGS_nvm_cache_path.c_str(); no need to call c_str() (you can concat the std::string) Line 926: << FLAGS_nvm_cache_path.c_str(); same Line 943: << FLAGS_nvm_cache_path.c_str(); same http://gerrit.cloudera.org:8080/#/c/1347/6/src/kudu/util/nvm_cache.h File src/kudu/util/nvm_cache.h: Line 24: #include <libvmem.h> shouldn't need to include any of these in the header (since users dont need any of the implementation details) http://gerrit.cloudera.org:8080/#/c/1347/6/thirdparty/build-definitions.sh File thirdparty/build-definitions.sh: Line 312: for LIB in libvmem libpmem libpmemobj; do nit: tab Line 313: EXTRA_CFLAGS="$DEBUG_CFLAGS" make -j$PARALLEL $LIB DEBUG=0 should say EXTRA_CFLAGS=$EXTRA_CFLAGS (DEBUG_CFLAGS is no longer set) -- To view, visit http://gerrit.cloudera.org:8080/1347 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2dcf474bce9f2375b0ddde38312db7d82a81835c Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Sarah Jelinek <[email protected]> Gerrit-Reviewer: Anonymous Coward #90 Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Sarah Jelinek <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
