Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19532 )

Change subject: IMPALA-11904: Data cache support dumping for reloading
......................................................................


Patch Set 9:

(5 comments)

Sorry for the delay, I'm back to reviewing this.

>From a testing point of view, it would be nice to have a custom cluster test 
>that starts up Impala with data cache enabled (along with this functionality), 
>does some queries, does a graceful shutdown, then starts back up, verifies 
>metrics, and then does some queries and verifies hits/misses/etc.

It would basically do a mixture of tests/custom_cluster/test_data_cache.py and 
tests/custom_cluster/test_restart_services.py (which does some cases of 
shutting down gracefully and then starting back up).

Here are some things it would be useful to verify:
1. The metrics for the size of the data cache and number of entries are the 
same before graceful shutdown (for an idle system) and after startup.
2. If we know the data cache has entries for a particular query, then after 
startup we should be able to run the same query and see it get cache hits like 
usual.
3. It would be nice to do a sanity check on the read-only aspect. One way to do 
that would be to start a query that will try to write to the cache in one 
thread, have another thread immediately do shutdown, they race and we want to 
make sure nothing crashes and the cache is usable after startup. In some custom 
cluster tests, we process the impalad logs to look for a particular message, so 
it might be useful to look for messages and indicate errors in loading the 
cache files. We use that technique in custom_cluster/test_thrift_socket.py 
https://github.com/apache/impala/blob/master/tests/custom_cluster/test_thrift_socket.py#L93-L95
4. Maybe fill up a cache of one size and then restart with a smaller size and 
verify that things got evicted.

In my hand tests, things work as expected, and it is nice to see this 
functionality.

http://gerrit.cloudera.org:8080/#/c/19532/9/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/19532/9/be/src/runtime/io/data-cache.cc@131
PS9, Line 131: DEFINE_bool(data_cache_enable_loading, false,
             :     "(Experimental) With loading enabled, data cache will try to 
load the previously dump"
             :     " file (created by the dumping feature enabled by the 
'data_cache_enable_dumping') "
             :     "before create a empty cache. if loading fails, it will 
proceed with regular "
             :     "initialization.");
I'm thinking that we can combine data_cache_enable_dumping and 
data_cache_enable_loading into a single parameter. I don't think you would set 
one without the other, but if there is a use case that I'm not thinking about, 
let me know.

Maybe something like "data_cache_keep_across_restarts"?


http://gerrit.cloudera.org:8080/#/c/19532/9/be/src/runtime/io/data-cache.cc@684
PS9, Line 684:   // Check if there is enough space available at this point in 
time.
             :   uint64_t available_bytes;
             :   RETURN_IF_ERROR(FileSystemUtil::GetSpaceAvailable(path_, 
&available_bytes));
             :   if (available_bytes < capacity_) {
             :     const string& err = Substitute("Insufficient space for $0. 
Required $1. Only $2 is "
             :         "available", path_, PrettyPrinter::PrintBytes(capacity_),
             :         PrettyPrinter::PrintBytes(available_bytes));
             :     LOG(ERROR) << err;
             :     return Status(err);
             :   }
I think this check can fire if we have the cache on its own device with the 
cache capacity close to the size of the device. The cache files are still there 
consuming space, and this check doesn't understand that the files are part of 
the cache. i.e. If the capacity is 200GB and the device has 250GB but the 
existing files use 200GB, it will look at the remaining 50GB and say it isn't 
enough.


http://gerrit.cloudera.org:8080/#/c/19532/9/be/src/runtime/io/data-cache.cc@1056
PS9, Line 1056: Status DataCache::Partition::Load() {
I think we want to delete files that are not listed in the metadata. It 
shouldn't happen, but if it does, we want to have the cache directory to only 
have files that we know about. Anything left over that we don't know about is 
taking up space on the device and could cause problems for our calculations of 
space usage.

Another question is whether we should delete the metadata file after we have 
loaded it. Once we start writing to the data cache, the metadata file can't be 
used any more, so it might be better to remove it to free up space.


http://gerrit.cloudera.org:8080/#/c/19532/4/be/src/util/cache/cache-internal.h
File be/src/util/cache/cache-internal.h:

http://gerrit.cloudera.org:8080/#/c/19532/4/be/src/util/cache/cache-internal.h@316
PS4, Line 316:         handles.emplace_back(
             :             reinterpret_cast<Cache::Handle*>(handle), 
Cache::HandleDeleter(this));
> > I haven't tried this, but I think we should be able to move the
Nevermind, I thought handle was a unique_ptr, and it isn't. This is fine as-is.


http://gerrit.cloudera.org:8080/#/c/19532/9/be/src/util/cache/cache.h
File be/src/util/cache/cache.h:

http://gerrit.cloudera.org:8080/#/c/19532/9/be/src/util/cache/cache.h@288
PS9, Line 288:  Traverse
Nit: Since we are exporting all entries as a single vector, let's call this 
Dump rather than Traverse (and apply the name change to the other pieces of the 
cache implementation).



--
To view, visit http://gerrit.cloudera.org:8080/19532
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id867f4fc7343898e4906332c3caa40eb57a03101
Gerrit-Change-Number: 19532
Gerrit-PatchSet: 9
Gerrit-Owner: Zihao Ye <eyiz...@163.com>
Gerrit-Reviewer: David Rorke <dro...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Zihao Ye <eyiz...@163.com>
Gerrit-Comment-Date: Sat, 06 May 2023 18:44:41 +0000
Gerrit-HasComments: Yes

Reply via email to