Hello Tidy Bot, Andrew Wong, Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13591

to look at the new patch set (#2).

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
......................................................................

KUDU-2846 (part 1): optimize predicate evaluation for primitives

This changes to an optimized unrolled-by-8 predicate evaluation for
primitive columns. Performance is improved by 1.6-2.5x depending on the
particular predicate, type, and nullability (average around 2x).
Branches are reduced by about 7.5x and branch-misses by about 19.6x.

Looking at the "after" perf-stat results, the instructions-per-cycle are
way down, which indicates we're probably stalled on instruction
dependencies or port saturation. This is also indicated by the fact that
the smaller ints don't seem to run any faster than the large ints (which
wouldn't be the case if we were limited by load/store bandwidth). Likely
the next fix here is to use SIMD to do comparisons in parallel as
suggested in the JIRA. Unfortunately, the compiler doesn't seem to
auto-vectorize these loops, so if we want further gain, we'll have to
add some more hand-written vectorization code. So, we'll start with this
easy win.
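
For readers who have not opened the diff, the general shape of an unrolled-by-8 predicate loop can be sketched as below. This is only an illustration of the technique the description names, not the code in column_predicate.cc: the function name EvalRangeUnrolled8, the LSB-first selection bitmap layout, and the omission of null handling are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not Kudu's API): evaluates `lo <= v < hi` for `n`
// values and ANDs the outcome into `sel`, a bitmap holding one bit per row,
// least-significant bit first. Null handling is deliberately left out.
template <typename T>
void EvalRangeUnrolled8(const T* vals, size_t n, T lo, T hi, uint8_t* sel) {
  assert(vals != nullptr && sel != nullptr);

  // Each comparison yields 0 or 1. Combining the two sides with `&` instead
  // of `&&` keeps the per-row work free of data-dependent branches.
  auto match = [lo, hi](T v) -> unsigned {
    return static_cast<unsigned>((v >= lo) & (v < hi));
  };

  size_t i = 0;
  // Main loop: 8 rows per iteration produce exactly one byte of the bitmap.
  for (; i + 8 <= n; i += 8) {
    const T* v = vals + i;
    unsigned bits = match(v[0])      | match(v[1]) << 1 |
                    match(v[2]) << 2 | match(v[3]) << 3 |
                    match(v[4]) << 4 | match(v[5]) << 5 |
                    match(v[6]) << 6 | match(v[7]) << 7;
    sel[i / 8] &= static_cast<uint8_t>(bits);
  }

  // Tail: fewer than 8 rows remain, so clear bits one at a time.
  // Bits past row n-1 in the last byte are left untouched.
  for (; i < n; ++i) {
    unsigned clear = (match(vals[i]) ^ 1u) << (i % 8);
    sel[i / 8] &= static_cast<uint8_t>(~clear);
  }
}
```

Called with a selection bitmap initialized to all ones, the function leaves a bit set only for rows that satisfy the range. Writing a whole byte per iteration is what removes most of the per-row branches; whether the compiler keeps the comparisons branch-free depends on the target and optimization level.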

perf-stat before:

 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      82185.366028      task-clock (msec)         #    0.997 CPUs utilized
   288,909,311,749      cycles                    #    3.515 GHz
   956,410,925,173      instructions              #    3.31  insn per cycle
   149,468,823,714      branches                  # 1818.679 M/sec
     1,237,139,955      branch-misses             #    0.83% of all branches

perf-stat after:

 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      42626.067916      task-clock (msec)         #    0.996 CPUs utilized
   149,363,412,476      cycles                    #    3.504 GHz
   190,514,045,889      instructions              #    1.28  insn per cycle
    19,902,815,659      branches                  #  466.917 M/sec
        63,130,874      branch-misses             #    0.32% of all branches

Detailed results before:

Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int8 NOT NULL: real 1.730s user 1.730s sys 0.002s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int8 NULL: real 2.097s user 2.096s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int8 NOT NULL: real 1.755s user 1.756s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int8 NULL: real 2.631s user 2.632s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int8 NOT NULL: real 1.850s user 1.848s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int8 NULL: real 2.808s user 2.808s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int16 NOT NULL: real 1.753s user 1.752s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int16 NULL: real 2.248s user 2.244s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int16 NOT NULL: real 1.750s user 1.752s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int16 NULL: real 2.420s user 2.416s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int16 NOT NULL: real 1.811s user 1.808s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int16 NULL: real 5.321s user 5.313s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int32 NOT NULL: real 1.834s user 1.824s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int32 NULL: real 2.233s user 2.232s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int32 NOT NULL: real 1.797s user 1.793s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int32 NULL: real 2.791s user 2.774s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int32 NOT NULL: real 1.873s user 1.869s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int32 NULL: real 3.104s user 3.071s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int64 NOT NULL: real 1.781s user 1.779s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int64 NULL: real 2.209s user 2.203s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int64 NOT NULL: real 1.741s user 1.739s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int64 NULL: real 2.374s user 2.374s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int64 NOT NULL: real 1.769s user 1.767s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int64 NULL: real 3.113s user 3.099s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type float NOT NULL: real 1.766s user 1.765s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type float NULL: real 2.305s user 2.299s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type float NOT NULL: real 1.755s user 1.752s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type float NULL: real 2.685s user 2.678s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type float NOT NULL: real 1.777s user 1.771s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type float NULL: real 2.940s user 2.929s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type double NOT NULL: real 1.756s user 1.749s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type double NULL: real 2.443s user 2.438s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type double NOT NULL: real 1.819s user 1.819s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type double NULL: real 2.744s user 2.724s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type double NOT NULL: real 1.753s user 1.746s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type double NULL: real 2.481s user 2.460s sys 0.004s

Detailed results after:

Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int8 NOT NULL: real 1.082s user 1.073s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int8 NULL: real 1.069s user 1.063s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int8 NOT NULL: real 1.085s user 1.076s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int8 NULL: real 1.071s user 1.068s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int8 NOT NULL: real 1.191s user 1.191s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int8 NULL: real 1.209s user 1.206s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int16 NOT NULL: real 1.099s user 1.099s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int16 NULL: real 1.123s user 1.106s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int16 NOT NULL: real 1.100s user 1.100s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int16 NULL: real 1.070s user 1.068s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int16 NOT NULL: real 1.211s user 1.212s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int16 NULL: real 1.220s user 1.220s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int32 NOT NULL: real 1.104s user 1.104s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int32 NULL: real 1.105s user 1.104s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int32 NOT NULL: real 1.107s user 1.108s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int32 NULL: real 1.081s user 1.080s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int32 NOT NULL: real 1.230s user 1.228s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int32 NULL: real 1.219s user 1.220s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int64 NOT NULL: real 1.071s user 1.072s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int64 NULL: real 1.090s user 1.088s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int64 NOT NULL: real 1.069s user 1.067s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int64 NULL: real 1.083s user 1.084s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int64 NOT NULL: real 1.253s user 1.252s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int64 NULL: real 1.248s user 1.248s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type float NOT NULL: real 1.144s user 1.144s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type float NULL: real 1.144s user 1.144s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type float NOT NULL: real 1.159s user 1.160s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type float NULL: real 1.214s user 1.216s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type float NOT NULL: real 1.439s user 1.436s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type float NULL: real 1.457s user 1.458s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type double NOT NULL: real 1.196s user 1.195s sys 0.000s
Time spent evaluating c = 0: 1000000 batches of 1024 rows for type double NULL: real 1.213s user 1.212s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type double NOT NULL: real 1.232s user 1.230s sys 0.000s
Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type double NULL: real 1.256s user 1.241s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type double NOT NULL: real 1.419s user 1.418s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type double NULL: real 1.430s user 1.426s sys 0.000s

Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
---
M src/kudu/common/CMakeLists.txt
M src/kudu/common/column_predicate-test.cc
M src/kudu/common/column_predicate.cc
3 files changed, 136 insertions(+), 13 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/91/13591/2
--
To view, visit http://gerrit.cloudera.org:8080/13591
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13591
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <andrew.w...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>