Daniel Becker has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/20168 )

Change subject: IMPALA-12269: Codegen cache false negative because of function 
names hash
......................................................................

IMPALA-12269: Codegen cache false negative because of function names hash

Codegen cache entries (execution engines holding an LLVM code module)
are stored by keys derived from the unoptimised llvm modules: the key is
either the whole unoptimised module (normal mode) or its hash (optimal
mode). Because hash collisions are possible (in optimal mode), as an
extra precaution we also compare the hashes of the function names in the
current and the cached module. However, when assembling the function
name list we do not filter out duplicate function names, which may
result in cases where the unoptimised llvm modules are identical but the
function name hashes do not match.

Example:
First query:
  select int_col, tinyint_col
  from alltypessmall
  order by int_col desc
  limit 20;

Second query:
  select tinyint_col
  from alltypessmall
  order by int_col desc
  limit 20;

In the first query, there are two 'SlotRef' objects referencing
'tinyint_col' which want to codegen a 'GetSlotRef()' function. The
second invokation of 'SlotRef::GetCodegendComputeFnImpl()' checks the
already codegen'd functions, finds the function created by its first
invokation and returns that. The two 'SlotRef' objects will use the same
'llvm::Function' and there will be only one copy of it in the module,
but both 'SlotRef's will call 'LlvmCodeGen::AddFunctionToJit()' with
this function in order for their respective function pointers to be set
after JIT-compilation.

'LlvmCodeGen::GetAllFunctionNames()' will return the names of all
functions with which 'LlvmCodeGen::AddFunctionToJit()' has been called,
including duplicates.

The second query generates the same unoptimised module as the first
query (for the corresponding fragment), but does not have a duplicated
'GetSlotRef()' function in its function name list, so the cached module
is rejected.

Note that this also results in the cached module being evicted when the
new module from the second query is inserted into the cache because the
new module will have the same key as the cached one (the modules are
identical).

This change fixes this problem by using a de-duplicated and sorted
function name list.

Testing:
  - Added a test in test_codegen_cache.py that asserts that there is a
    cache hit and no eviction in the above example.

Change-Id: Ibf1d2b424c969fbba181ab90bf9c7bf22355f139
Reviewed-on: http://gerrit.cloudera.org:8080/20168
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
---
M be/src/codegen/llvm-codegen-cache-test.cc
M be/src/codegen/llvm-codegen.cc
M be/src/codegen/llvm-codegen.h
M tests/custom_cluster/test_codegen_cache.py
4 files changed, 125 insertions(+), 61 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/20168
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ibf1d2b424c969fbba181ab90bf9c7bf22355f139
Gerrit-Change-Number: 20168
Gerrit-PatchSet: 9
Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Yida Wu <wydbaggio...@gmail.com>

Reply via email to