Todd Lipcon has posted comments on this change.

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................
Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7113/3//COMMIT_MSG
Commit Message:

Line 32: The bulk of the changes here are to tests:
> As this is a severe bug, I'd like to better understand why our existing bla

From what I can tell, block_manager-stress-test missed it for a couple of reasons:

- it only restarted once, as you mentioned
- it set the fd cache size to 512, which was enough to cause some churn in the FBM, but not much in the LBM. For this bug to trigger, you need "just the right amount" of churn: it has to be big enough that you will reuse a cached FD when re-appending to the metadata after restart, but small enough that you will then evict and re-open the same FD again before the next restart
- block deletion is faster than block creation, so the deleter threads were "keeping up with" the writer threads, and on restart we always had only a very small number of live containers. Most were 100% dead and therefore rarely got compacted, but rather just removed.

Even with the changes I made, it seemed to reproduce reliably with some runtimes (e.g. 4 seconds) but not with slightly longer or shorter ones. Since it was hard to get it to reproduce reliably, I went and added the more specific test.

As for other tests (not specific to the block manager), the typical amounts of data are small enough that the FD cache doesn't really churn. Given that you need the specific interleaving described above, we were unlikely to hit it in our other various tests which do multiple restarts.

As for tests outside the automated ctests, we did actually hit this on the automated long-lived YCSB runs that we run internally at Cloudera. We didn't hit it sooner due to some test infra subtleties: the runs weren't properly tracking master for the last week or two. When it rains, it pours, though, and at least 3 of our test clusters have now hit it, so we do at least have some ability to feel confident in our tests.
http://gerrit.cloudera.org:8080/#/c/7113/3/src/kudu/fs/log_block_manager-test.cc
File src/kudu/fs/log_block_manager-test.cc:

PS3, Line 1242: Since we have more metadata files than
             : // we have file_cache capacity, this will also generate a mix of cache hits,
             : // misses, and re-insertions.
> Without the fix, how often does this test trigger the bug?

Pretty much every time (at least, it triggered every time I ran it).

http://gerrit.cloudera.org:8080/#/c/7113/3/src/kudu/util/file_cache-test.cc
File src/kudu/util/file_cache-test.cc:

Line 242: // fd.
> Nit: odd wrapping?

Done

Line 245: ASSERT_EQ(size, kData1.size());
> Nit: expected should come first (as per L259).

Done

http://gerrit.cloudera.org:8080/#/c/7113/3/src/kudu/util/file_cache.cc
File src/kudu/util/file_cache.cc:

Line 143: void MarkInvalidated() {
> Nit: add a comment explaining what it means to be invalidated.

Done

Line 525: void FileCache<FileType>::Invalidate(const string& file_name) {
> There are three interesting things going on here and I want to make sure I

Yes, your understanding is correct. I think that to handle the race we can do the following, in order:

1. under the lock: mark any existing descriptor as invalidated, or insert an already-invalidated descriptor if there is nothing in the map
2. erase the entry from the cache
3. under the lock again: erase the descriptor from the map

Thinking about a concurrent OpenExistingFile() call at each of these points:
- before 1: continues to open the old file (i.e. serialized before Invalidate())
- before 2: fetches the invalidated descriptor and gets an "invalidated" error on first access (i.e. serialized before Invalidate())
- before 3: continues to get the invalidated descriptor (same as above)
- after 3: creates a new Descriptor and gets a cache miss, thus opening the _new_ path (serialized after Invalidate())

For concurrent access to _existing_ descriptors:
- before 1: nothing notable (serialized before)
- any time after 1: gets an error because the descriptor is invalidated

If we expected this to be used concurrently, I'd write up a stress test for it, but the expected use case is that callers externally synchronize, so I don't think it's worth the effort.

--
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
