[Impala-ASF-CR] IMPALA-11470: Add Cache For Codegen Functions

Yida Wu (Code Review) Tue, 22 Nov 2022 18:49:18 -0800

Yida Wu has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19181 )


Change subject: IMPALA-11470: Add Cache For Codegen Functions
......................................................................


Patch Set 10:

(28 comments)

Thanks Qifan, Csaba and Michael. The patch mainly fixes minor issues. Will add 
more testcases in next patch.

http://gerrit.cloudera.org:8080/#/c/19181/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19181/9//COMMIT_MSG@34
PS9, Line 34: Codegen cache is disabled if the fragment contains a udf, because
> This paragraph is not clear to me. Does this mean that if any UDF is used s
Yes, if any udf is in the fragment, so the execution engine of that fragment 
won't be cached.

I think the bitcode hasn't been changed, because it can hit the cache.

The way to reproduce the issue is to run twice or three times on 
query_test/test_udfs.py, it recreates the database, and recreates the udf 
functions with a random function name, then it crashes in the udf shared 
library when calling the udf function. So far, I don't see issues when rerun 
other testcases, only the udf tests crash. The theory is that the replacement 
of the udf shared library during the recreation leads to the crash. I think it 
needs more investigation, but if blocking the udf for codegen cache doesn't 
bring problems, it could be a safe way for the patch.
Rewrote the comment.


http://gerrit.cloudera.org:8080/#/c/19181/9//COMMIT_MSG@42
PS9, Line 42:  the cache for udf, we add a logic in the expression
            : analysis, if a fragment contains a
> Do the tpch tests use UDFs?
I think the test wants to bring the information that it doesn't affect the 
regular performance, because the logic also affects regular paths. Therefore 
the functional part could be the only concern.
Looks it is not necessary to mention, removed.


http://gerrit.cloudera.org:8080/#/c/19181/10//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19181/10//COMMIT_MSG@36
PS10, Line 36: cenario
nit. typo scenario


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen-cache.h
File be/src/codegen/llvm-codegen-cache.h:

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen-cache.h@38
PS9, Line 38: hash code
> of sizeof(struct HashCode) (or two uint64_t)
Done


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen-cache.h@69
PS9, Line 69: 
> This was a little confusing, a longer explanation would be helpful.
My understanding is correct, the unordered_set requires a structure to hash the 
customized type, which is HashCode in our case, with a return of "long unsigned 
int" compiled in my env, so I pick up the first 64bits of HashCode as the hash. 
Since the unodered_set will handle collisions, it should be okay.
Tried to integrate it to the HashCode struct, but not working, not sure if 
there is a better way.


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen-cache.h@80
PS9, Line 80:   /// The type of the length of a general element in the codegen 
cache key.
> This could be implemented in terms of lhs == rhs since you implemented oper
Seems we don't need the equal function, because unordered_set can call it from 
struct HashCode. Removed.


http://gerrit.cloudera.org:8080/#/c/19181/5/be/src/codegen/llvm-codegen-cache.cc
File be/src/codegen/llvm-codegen-cache.cc:

http://gerrit.cloudera.org:8080/#/c/19181/5/be/src/codegen/llvm-codegen-cache.cc@66
PS5, Line 66:   string result;
> I don't think this was actually addressed. What I was suggesting is to hash
Sorry that I misunderstood, changed.


http://gerrit.cloudera.org:8080/#/c/19181/8/be/src/codegen/llvm-codegen-cache.cc
File be/src/codegen/llvm-codegen-cache.cc:

http://gerrit.cloudera.org:8080/#/c/19181/8/be/src/codegen/llvm-codegen-cache.cc@162
PS8, Line 162:
> It seems that there are two cases here when a hash collision takes place.
Actually, we don't expect a different content for a specific key. It is sort of 
a trade-off.
Even with the function names, we can't guranttee the case is unique if using 
the hash code as key, even though the chance is very small.
But with the verification of the function names, the engine should not crash 
due to no function found in the cache.
The worst case could be wrong results. In that case, we may suggest the user to 
use the debug mode, which contains the full key (could be over 50KB each 
compared to 128bits) to ensure the key is unique, but would be slower and 
consumes more memory.
The future plan would be to find out a better key for it.


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen-cache.cc
File be/src/codegen/llvm-codegen-cache.cc:

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen-cache.cc@185
PS9, Line 185:   {
             :     // Because the Cache::In
> A similar restriction is mentioned for lookup:
Thanks for pointing this out. I didn't notice it, and surprised the cache can't 
handle a read-read conflict on the same key.
From the implementation, 
https://github.com/apache/impala/blob/c3ec9272c591624e8dfc130a85c98da318735f14/be/src/kudu/util/cache.cc#L398
, it looks like a false comment, because I don't see the reason why it can't 
support multiple access to the same key, it is holding a mutex during the 
lookup, and the mutex is also held during insertion, should not be a problem 
for write-read conflict. In the tests, I don't see any issues so far.

I agree that the client should not need a complex locking, but a mutex in a 
daemon level without shards for every lookup and store, seems too big. I can 
add some concurrent testcases to make sure it is working without the look up 
mutex, from the code, we should not need to put a big mutex at least for cache 
look up.


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen.h
File be/src/codegen/llvm-codegen.h:

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen.h@933
PS9, Line 933: code
> typo
Done


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen.cc
File be/src/codegen/llvm-codegen.cc:

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen.cc@1130
PS9, Line 1130: ent
> nit. entry_exist
Done


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen.cc@1131
PS9, Line 1131:   LOG(INFO) << DebugCacheEntryString(cache_key, true 
/*is_lookup*/,
              :       state_->query_options().codegen_cache_debug_mode, entry
> This will run once per fragment per query on a node, right? Just a bit worr
Yeah, but non-debug mode mainly contains the query id and the hash, they are 
useful when issue happens that we can track down the issue by the hash code. 
Because we can turn on the debug mode in the query option after issue happens, 
and use the hash code to know more details about the problematic fragment. 
Maybe there could be a better log level?


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen.cc@1138
PS9, Line 1138:       LOG(WARNING)
> I wish we could guarantee no incorrect use, but this plus the increased ent
I think the debug mode should not have this issue, but consumes much more 
memory and slower. A trade off. Hope to find a reliable solution using a 
smaller key.


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen.cc@1156
PS9, Line 1156: test,
> nit typo?
Done


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen.cc@1160
PS9, Line 1160:       void* jitted_function = reinterpret_cast<void*>(
              :           
execution_engine_cached_->getFunctionAddress(function->getName()));
> Would it be possible to check the argument and return types of the function
Not sure how to get those from the jitted_function, but took a look at the 
llvm::Function, there could be some good interfaces to get the arguments and 
return type from the llvm::Function.
We may need to store these info to the cache entry, and verify them after 
looking up.
It may requires some work and currently the module bitcode and the check of 
function names should largely ensure the safety of getting the correct 
functions. Since the current patch is big, would it be possible to put it as an 
safety enhancement in later patch?


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen.cc@1191
PS9, Line 1191: llvm::Function* function = fns_to_jit
> We may need to beef up the formation of function name list here since the f
Done


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen.cc@1191
PS9, Line 1191: llvm::Function* function = fns_to_jit
> We may need to beef up the formation of function name list here since the f
Done


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen.cc@1980
PS9, Line 1980: p cod
> nit. Store to codegen cache
Done


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/exprs/agg-fn.cc
File be/src/exprs/agg-fn.cc:

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/exprs/agg-fn.cc@89
PS9, Line 89:  Disable the codegen cache if the fragment co
> nit. describe the reason here.
Added comments.


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/exprs/scalar-expr.cc
File be/src/exprs/scalar-expr.cc:

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/exprs/scalar-expr.cc@213
PS9, Line 213:   // Disable the codegen cache if the expression contains a udf, 
beca
> Can you add a comment here about disabling codegen cache for UDFs?
Added comments.


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/runtime/exec-env.cc
File be/src/runtime/exec-env.cc:

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/runtime/exec-env.cc@459
PS9, Line 459:
> nit. better to print like 2.0Gb, 250Mb. etc to make it simple.
Done


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/runtime/exec-env.cc@466
PS9, Line 466: e_->Init(
> nit . typo.
Done


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/runtime/fragment-state.h
File be/src/runtime/fragment-state.h:

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/runtime/fragment-state.h@231
PS9, Line 231: is disall
> nit is
Done


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/runtime/krpc-data-stream-sender.h
File be/src/runtime/krpc-data-stream-sender.h:

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/runtime/krpc-data-stream-sender.h@176
PS9, Line 176: uint64_t seed
> Why is this necessary WRT codegen cache?
There were some discussions in patch 3: 
https://gerrit.cloudera.org/#/c/19181/3/be/src/runtime/krpc-data-stream-sender.cc@93.
The main idea is that the hash seed is changed according to different query id, 
as a result, if the seed is stored in the codegen function, because we use the 
bitcode of the codegen function as key, the dynamic seed would change the 
bitcode, then we can't hit the cache even they are the same codegen function. 
So the solution is moving the seed out of the function, and pass it as an 
argument.


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/service/query-options.h
File be/src/service/query-options.h:

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/service/query-options.h@279
PS9, Line 279:
> Probably ADVANCED would be more fitting?
Thanks, it is better.


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/service/query-options.h@281
PS9, Line 281:
> This should be DEVELOPMENT IMO
Done


http://gerrit.cloudera.org:8080/#/c/19181/9/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/19181/9/common/thrift/ImpalaService.thrift@748
PS9, Line 748: 128
> Phrasing here confused my, I think 'mere' is superfluous.
Done


http://gerrit.cloudera.org:8080/#/c/19181/9/tests/custom_cluster/test_codegen_cache.py
File tests/custom_cluster/test_codegen_cache.py:

http://gerrit.cloudera.org:8080/#/c/19181/9/tests/custom_cluster/test_codegen_cache.py@57
PS9, Line 57: n 0.
> nit. this and the following line are not mentioned in the comment at line 5
Done



--
To view, visit http://gerrit.cloudera.org:8080/19181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If42c78a7f51fd582e5fe331fead494dadf544eb1
Gerrit-Change-Number: 19181
Gerrit-PatchSet: 10
Gerrit-Owner: Yida Wu <wydbaggio...@gmail.com>
Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Michael Smith <michael.sm...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qfc...@hotmail.com>
Gerrit-Reviewer: Yida Wu <wydbaggio...@gmail.com>
Gerrit-Comment-Date: Wed, 23 Nov 2022 02:49:00 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11470: Add Cache For Codegen Functions

Reply via email to