Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )
Change subject: IMPALA-8341: Data cache for remote reads ...................................................................... Patch Set 1: (23 comments) I think it would be good to have some tests that make sure that data gets freed up on the disk by the hole punching mechanism as expected. Additionally, maybe a test to make sure that mtime and other parts of the key are properly considered would be nice to have. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc File be/src/runtime/io/data-cache-test.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@52 PS1, Line 52: // The callback is invoked by each thread in the multi-threaded tests below. callback reads like it's called when something is done, how about ThreadFn? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@157 PS1, Line 157: // Read the same same range inserted previously and they should still all in the cache. nit: be http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@196 PS1, Line 196: // Create a buffer way larger than the cache. Why does it need to be larger? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@201 PS1, Line 201: ASSERT_TRUE(cache.Store("foobar", 12345, 0, large_buffer.get(), cache_size)); Use variables instead of inline constants? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@220 PS1, Line 220: const int64_t cache_size = 1 << 22; I find these easier to read in the form of 4 * 1024 * 1024 http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@248 PS1, Line 248: differen typo http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@301 PS1, Line 301: ootprint) { Can you add a comment what this test does? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@330 PS1, Line 330: int main(int argc, char **argv) { This is not needed anymore if you add your test to the unified backend tests (but better confirm with Joe). http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@56 PS1, Line 56: /// We should mention in the comment how lookup works, in particular what happens when we look up data that is a sub-piece of something in the cache and the lookup by offset wouldn't work (i.e. state that only exact matches are implemented) http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@110 PS1, Line 110: /// is installed successfully. Maybe point out explicitly that it returns false for errors and if the content was already in the cache? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@129 PS1, Line 129: /// Creates a partition at the given directory 'path' with quota 'capacity'. Mention that capacity is bytes? Maybe even add a suffix that indicates it's bytes (and not number of entries, here and elsewhere) http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@157 PS1, Line 157: void EvictedEntry(kudu::Slice key, kudu::Slice value) override; Can you add the virtual and override keywords to all virtual methods to make it obvious which ones are part of the interface? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@226 PS1, Line 226: /// failure. Please note that the backing file is unlinked right after it's created How will these files look to an administrator? Will this make it harder to debug out-of-space issues? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@239 PS1, Line 239: /// Please note that this function will CHECK fail if there is any checksum mismatch. Does that mean "crash a release build"? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@254 PS1, Line 254: std::unique_ptr<kudu::faststring>* key); You could return the pointer by value http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@66 PS1, Line 66: DRAM_CACHE Should we add DISK_CACHE here? Otherwise, can you add a comment to explain why we use DRAM_CACHE? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@193 PS1, Line 193: ssize_t bytes_written = 0; declare right before usage http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@222 PS1, Line 222: done: You could use a scoped cleanup lambda and returns instead of goto http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@262 PS1, Line 262: for (const string& cache_config : all_cache_configs) { gutil has SplitStringIntoKeyValuePairs() in strings.h that could be helpful here. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc File be/src/runtime/io/disk-io-mgr.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@295 PS1, Line 295: LOG(ERROR) << "Failed to initialize data cache."; Have you considered making this fatal? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.h File be/src/runtime/io/hdfs-file-reader.h: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.h@48 PS1, Line 48: virtual void CachedFile(uint8_t** data, int64_t* length) override; Maybe rename to ReadHdfsCachedFile to make its purpose more clear at the call-sites? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc File be/src/runtime/io/hdfs-file-reader.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc@124 PS1, Line 124: if (exclusive_hdfs_fh_ != nullptr) { I think a comment here would help explain why we do this http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc@193 PS1, Line 193: remote_data_cache->Store(*scan_range_->file_string(), scan_range_->mtime(), I think this will not be effective if the cache already has a smaller entry. Should we remove the smaller entry before adding the larger one? In either case it's probably good to add a comment that states the intent. It'd probably be good to document here or in the comment of Store() why it's safe to ignore the return value -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 1 Gerrit-Owner: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Thomas Marshall <tmarsh...@cloudera.com> Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Comment-Date: Wed, 10 Apr 2019 19:23:35 +0000 Gerrit-HasComments: Yes