Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7099

to look at the new patch set (#5).

Change subject: codegen: fix regression in inlining after LLVM 4.0 upgrade
......................................................................

codegen: fix regression in inlining after LLVM 4.0 upgrade

With the upgrade to LLVM 4.0, the performance of our generated code
regressed significantly. After looking at the generated assembly from
codegen-test, I could see that there were many 'call' instructions that
didn't appear in the earlier output, which led me to suspect the
inliner.

After adding a 'module->dump()' call and looking at the output, I could
see that the calls were to utility functions like BitmapTest() which
should be inlined due to their very small size. However, the function
was marked 'noinline' in precompiled.ll. After a bit of Googling I came
across a thread[1] in which someone else noticed that all of their
functions had
unexpected 'noinline' attributes after upgrading to 4.0.

After following the breadcrumbs, I found some advice to change the way
in which we emit precompiled.ll to disable the built-in LLVM passes
which were responsible for adding this attribute.

This had one unintended side-effect which initially broke codegen in
TSAN builds (which use libc++ instead of libstdcxx). Fixing this required
changing the debug builds to still run the most basic ('-O0')
optimization passes at JIT time instead of skipping the optimization
pass manager entirely. Additionally I needed to explicitly run Global
Dead Code Elimination (GlobalDCE). See the new comments in
module_builder.cc for details.

That also caused the size of generated code in UBSAN builds to increase,
so I had to increase the limit on the number of dumped instructions per
function to avoid breaking one of the tests.

During my investigation, I also found that we should have been passing
the optimization and size levels into the inlining pass, so I threw
that change in for good measure.

This fixed the perf regression. I benchmarked using:
  memrowset-test --roundtrip_num_rows=10000000 \
                 --gtest_filter=\*InsertCount\* \
                 --gtest_repeat=10

LLVM 3.9 (before the regression) (asm at [2]):

  I0607 12:00:17.540612 23532 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.295s    user 0.294s     sys 0.000s
  I0607 12:00:23.842226 23532 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.297s    user 0.297s     sys 0.000s
  I0607 12:00:29.710149 23532 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.296s    user 0.296s     sys 0.000s
  I0607 12:00:35.587321 23532 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.296s    user 0.296s     sys 0.000s
  I0607 12:00:41.451495 23532 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.296s    user 0.296s     sys 0.000s
  I0607 12:00:47.309710 23532 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.297s    user 0.297s     sys 0.000s
  I0607 12:00:53.183537 23532 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.297s    user 0.297s     sys 0.000s
  I0607 12:00:59.051522 23532 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.297s    user 0.297s     sys 0.000s
  I0607 12:01:04.935763 23532 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.297s    user 0.298s     sys 0.000s
  I0607 12:01:10.821828 23532 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.297s    user 0.299s     sys 0.000s

LLVM 4.0 (before this fix) (asm at [3]):

  I0607 12:14:07.771476 12117 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.524s    user 0.525s     sys 0.000s
  I0607 12:14:14.283260 12117 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.525s    user 0.525s     sys 0.000s
  I0607 12:14:20.376360 12117 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.543s    user 0.543s     sys 0.000s
  I0607 12:14:26.457681 12117 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.530s    user 0.530s     sys 0.000s
  I0607 12:14:32.535305 12117 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.532s    user 0.532s     sys 0.000s
  I0607 12:14:38.615105 12117 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.531s    user 0.532s     sys 0.000s
  I0607 12:14:44.697655 12117 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.531s    user 0.531s     sys 0.000s
  I0607 12:14:50.779160 12117 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.532s    user 0.532s     sys 0.000s
  I0607 12:14:56.852665 12117 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.531s    user 0.530s     sys 0.000s
  I0607 12:15:02.954947 12117 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.538s    user 0.540s     sys 0.000s

LLVM 4.0 (after this fix) (asm at [4]):

  I0607 12:17:37.953876 17916 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.292s      user 0.292s     sys 0.000s
  I0607 12:17:44.214912 17916 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.295s      user 0.295s     sys 0.000s
  I0607 12:17:50.057200 17916 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.299s      user 0.299s     sys 0.000s
  I0607 12:17:55.882935 17916 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.292s      user 0.292s     sys 0.000s
  I0607 12:18:01.725445 17916 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.293s      user 0.293s     sys 0.000s
  I0607 12:18:07.562417 17916 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.293s      user 0.294s     sys 0.000s
  I0607 12:18:13.421221 17916 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.293s      user 0.293s     sys 0.000s
  I0607 12:18:19.268613 17916 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.293s      user 0.293s     sys 0.000s
  I0607 12:18:25.113490 17916 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.293s      user 0.294s     sys 0.000s
  I0607 12:18:30.950954 17916 memrowset-test.cc:446] Time spent Scanning rows 
where all are committed: real 0.293s      user 0.294s     sys 0.000s

Looking at the assembly for the three tests above, it's clear that the
inlining is broken in 4.0 without this fix, and with it, the generated
assembly is a little better (shorter) than 3.9.

[1] http://lists.llvm.org/pipermail/llvm-dev/2017-May/113175.html
[2] https://gist.github.com/b4b0365ea1570f33d55c6c3dffd4a8af
[3] https://gist.github.com/128ecb292fddd069d5397a76dc53fc11
[4] https://gist.github.com/0cb01f454737d6c2dc457f0cdb6bb52d

Change-Id: I7e449df80e5cd02b9ce2dbf4daa5cf8151007dcf
---
M src/kudu/codegen/CMakeLists.txt
M src/kudu/codegen/code_generator.cc
M src/kudu/codegen/codegen-test.cc
M src/kudu/codegen/module_builder.cc
4 files changed, 53 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/7099/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7099
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7e449df80e5cd02b9ce2dbf4daa5cf8151007dcf
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>

Reply via email to