Daniel Becker has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/20168 )
Change subject: IMPALA-12269: Codegen cache false negative because of function names hash ...................................................................... IMPALA-12269: Codegen cache false negative because of function names hash Codegen cache entries (execution engines holding an LLVM code module) are stored by keys derived from the unoptimised llvm modules: the key is either the whole unoptimised module (normal mode) or its hash (optimal mode). Because hash collisions are possible (in optimal mode), as an extra precaution we also compare the hashes of the function names in the current and the cached module. However, when assembling the function name list we do not filter out duplicate function names, which may result in cases where the unoptimised llvm modules are identical but the function name hashes do not match. Example: First query: select int_col, tinyint_col from alltypessmall order by int_col desc limit 20; Second query: select tinyint_col from alltypessmall order by int_col desc limit 20; In the first query, there are two 'SlotRef' objects referencing 'tinyint_col' which want to codegen a 'GetSlotRef()' function. The second invokation of 'SlotRef::GetCodegendComputeFnImpl()' checks the already codegen'd functions, finds the function created by its first invokation and returns that. The two 'SlotRef' objects will use the same 'llvm::Function' and there will be only one copy of it in the module, but both 'SlotRef's will call 'LlvmCodeGen::AddFunctionToJit()' with this function in order for their respective function pointers to be set after JIT-compilation. 'LlvmCodeGen::GetAllFunctionNames()' will return the names of all functions with which 'LlvmCodeGen::AddFunctionToJit()' has been called, including duplicates. The second query generates the same unoptimised module as the first query (for the corresponding fragment), but does not have a duplicated 'GetSlotRef()' function in its function name list, so the cached module is rejected. Note that this also results in the cached module being evicted when the new module from the second query is inserted into the cache because the new module will have the same key as the cached one (the modules are identical). This change fixes this problem by filtering out duplicates and sorting the elements when creating the list of function names. Testing: - Added a test in test_codegen_cache.py that asserts that there is a cache hit and no eviction in the above example. Change-Id: Ibf1d2b424c969fbba181ab90bf9c7bf22355f139 --- M be/src/codegen/llvm-codegen.cc M be/src/codegen/llvm-codegen.h M tests/custom_cluster/test_codegen_cache.py 3 files changed, 51 insertions(+), 6 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/20168/4 -- To view, visit http://gerrit.cloudera.org:8080/20168 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibf1d2b424c969fbba181ab90bf9c7bf22355f139 Gerrit-Change-Number: 20168 Gerrit-PatchSet: 4 Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Yida Wu <wydbaggio...@gmail.com>