[Impala-ASF-CR] IMPALA-11470: Add Cache For Codegen Functions

Qifan Chen (Code Review) Fri, 18 Nov 2022 13:46:15 -0800

Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19181 )


Change subject: IMPALA-11470: Add Cache For Codegen Functions
......................................................................


Patch Set 9:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen-cache.h
File be/src/codegen/llvm-codegen-cache.h:

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen-cache.h@38
PS9, Line 38: hash code
of sizeof(struct HashCode) (or two uint64_t)


http://gerrit.cloudera.org:8080/#/c/19181/8/be/src/codegen/llvm-codegen-cache.h
File be/src/codegen/llvm-codegen-cache.h:

http://gerrit.cloudera.org:8080/#/c/19181/8/be/src/codegen/llvm-codegen-cache.h@69
PS8, Line 69: d as the has
> Added comments. The structure is used for customized unordered set. It is a
Yeah, this is one way to customize a data structure. You may keep it but IMHO 
it is a complex way.


http://gerrit.cloudera.org:8080/#/c/19181/8/be/src/codegen/llvm-codegen-cache.h@76
PS8, Line 76:
            :   /// Used as the comparator function in unordered_set for struct 
HashCode.
            :   struct HashCodeEqual {
            :     bool operator()(const HashCode& lhs, const HashCode& rhs) 
const {
            :
> Added comments. Same as above.
The operator() here does the same as HashCode::operator==().  Code duplication 
smell.

Suggest to remove this struct and use the HashCode::operator==() instead.


http://gerrit.cloudera.org:8080/#/c/19181/8/be/src/codegen/llvm-codegen-cache.cc
File be/src/codegen/llvm-codegen-cache.cc:

http://gerrit.cloudera.org:8080/#/c/19181/8/be/src/codegen/llvm-codegen-cache.cc@162
PS8, Line 162: override the existing entry
> For normal procedure, the caller would call a Lookup() before StoreInternal
It seems that there are two cases here when a hash collision takes place.

1). The two sources are not the same;
2). The two sources are the same.

Since we can not avoid collision, my comment above is for case 1).

To distinguish case 1 from case 2, we need to compare the original keys. 
Comparing hash code of the original keys may not be sufficient.


http://gerrit.cloudera.org:8080/#/c/19181/8/be/src/codegen/llvm-codegen-cache.cc@171
PS8, Line 171: codegen_cache_entries_in_use_->Increment(1);
             :   codegen_cache_entries_in_use_bytes_->Increment(mem_charge);
> When the old entry with the same key is replaced, it will be evicted, so th
Done


http://gerrit.cloudera.org:8080/#/c/19181/8/be/src/codegen/llvm-codegen-cache.cc@185
PS8, Line 185: ler to avoid multiple handles fo
> Added comments. I think one case is that when we run two same queries at th
I see. Thanks for the example.

It may also be useful to compare the size of the entries to minimize the chance 
of hash collision due to different entries.


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen.cc
File be/src/codegen/llvm-codegen.cc:

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen.cc@1130
PS9, Line 1130: suc
nit. entry_exist


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen.cc@1156
PS9, Line 1156: teste
nit typo?


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen.cc@1191
PS9, Line 1191: result << function->getName().data();
We may need to beef up the formation of function name list here since the 
following two cases would produce same list.

a) foobar()
b)  foo() and bar()

One way to do so would be to prepend the # of functions at the beginning of the 
list and use ',' to separate function names.

1,foobar
2,foo,bar


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/codegen/llvm-codegen.cc@1980
PS9, Line 1980: cache
nit. Store to codegen cache


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/exprs/agg-fn.cc
File be/src/exprs/agg-fn.cc:

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/exprs/agg-fn.cc@89
PS9, Line 89:  Codegen cache is disabled for udf functions.
nit. describe the reason here.


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/runtime/exec-env.cc
File be/src/runtime/exec-env.cc:

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/runtime/exec-env.cc@459
PS9, Line 459: codegen_cache_capacity
nit. better to print like 2.0Gb, 250Mb. etc to make it simple.


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/runtime/exec-env.cc@466
PS9, Line 466: initialed
nit . typo.


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/runtime/fragment-state.h
File be/src/runtime/fragment-state.h:

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/runtime/fragment-state.h@231
PS9, Line 231: should be
nit is


http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/runtime/krpc-data-stream-sender.h
File be/src/runtime/krpc-data-stream-sender.h:

http://gerrit.cloudera.org:8080/#/c/19181/9/be/src/runtime/krpc-data-stream-sender.h@176
PS9, Line 176: uint64_t seed
Why is this necessary WRT codegen cache?


http://gerrit.cloudera.org:8080/#/c/19181/8/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/19181/8/common/thrift/ImpalaService.thrift@747
PS8, Line 747:   // insert a full key for the cache and allow more statistics, 
otherwise, a hashcode
> Added comments. Is it 128 bits? The size of CodeGenCacheKey::HashCode.
Sorry about the typo. Yes, it should be two uint64_t.


http://gerrit.cloudera.org:8080/#/c/19181/9/tests/custom_cluster/test_codegen_cache.py
File tests/custom_cluster/test_codegen_cache.py:

http://gerrit.cloudera.org:8080/#/c/19181/9/tests/custom_cluster/test_codegen_cache.py@57
PS9, Line 57: entries-in-use'
nit. this and the following line are not mentioned in the comment at line 55.


http://gerrit.cloudera.org:8080/#/c/19181/9/tests/custom_cluster/test_codegen_cache.py@71
PS9, Line 71:     assert self.get_metric('impala.codegen-cache.misses') > 0
Need to add more cases.

1). Queries with UDF functions
2). Queries with different column types:
  a) select * from functional.alltypes where int_col > 0
  b) select * from functional.alltypes where short_int_col > 0
3). Queries with polymorphic SQL function names against different column types:
  a) select * from functional.alltypes where CHAR_LENGTH(string_col) > 0
  b) select * from functional.alltypes where CHAR_LENGTH(char_col) > 0



--
To view, visit http://gerrit.cloudera.org:8080/19181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If42c78a7f51fd582e5fe331fead494dadf544eb1
Gerrit-Change-Number: 19181
Gerrit-PatchSet: 9
Gerrit-Owner: Yida Wu <wydbaggio...@gmail.com>
Gerrit-Reviewer: Abhishek Rawat <ara...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Michael Smith <michael.sm...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qfc...@hotmail.com>
Gerrit-Reviewer: Yida Wu <wydbaggio...@gmail.com>
Gerrit-Comment-Date: Fri, 18 Nov 2022 21:44:55 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-11470: Add Cache For Codegen Functions

Reply via email to