Hello Tidy Bot, Andrew Wong, Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13591

to look at the new patch set (#3).

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
......................................................................

KUDU-2846 (part 1): optimize predicate evaluation for primitives

This switches predicate evaluation for primitive columns to an
optimized loop that is unrolled by 8.
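
For readers who have not opened the diff, here is a minimal sketch of
what "unrolled by 8" can look like. It is not the code in this patch:
EvalPredicateUnrolled8 is a hypothetical name, the output is assumed to
be a plain bitmap (one bit per row, least-significant bit first) that is
overwritten rather than merged, and nullability is ignored.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not Kudu's actual implementation.
// Evaluates `pred` over `n` values and writes one result bit per value
// into `out_bitmap` (bit i % 8 of byte i / 8). `out_bitmap` must hold at
// least (n + 7) / 8 bytes.
template <typename T, typename Pred>
void EvalPredicateUnrolled8(const T* vals, size_t n, Pred pred,
                            uint8_t* out_bitmap) {
  size_t i = 0;
  // Main loop: 8 values per iteration. The results are combined with
  // shifts and ORs, so the source has no per-value `if`.
  for (; i + 8 <= n; i += 8) {
    const T* v = vals + i;
    unsigned bits = 0;
    bits |= static_cast<unsigned>(pred(v[0])) << 0;
    bits |= static_cast<unsigned>(pred(v[1])) << 1;
    bits |= static_cast<unsigned>(pred(v[2])) << 2;
    bits |= static_cast<unsigned>(pred(v[3])) << 3;
    bits |= static_cast<unsigned>(pred(v[4])) << 4;
    bits |= static_cast<unsigned>(pred(v[5])) << 5;
    bits |= static_cast<unsigned>(pred(v[6])) << 6;
    bits |= static_cast<unsigned>(pred(v[7])) << 7;
    out_bitmap[i / 8] = static_cast<uint8_t>(bits);
  }
  // Tail: fewer than 8 values remain, handle them one at a time.
  if (i < n) {
    unsigned bits = 0;
    for (size_t j = 0; i + j < n; ++j) {
      bits |= static_cast<unsigned>(pred(vals[i + j])) << j;
    }
    out_bitmap[i / 8] = static_cast<uint8_t>(bits);
  }
}
```

With a lambda such as [](int32_t c) { return c == 0; } as the
predicate, each full group of 8 rows produces one output byte and a
single loop branch, which is the kind of change that would cut the
branch count.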

Performance is improved by 1.6-2.5x depending on the particular
predicate, type, and nullability (average around 2x). Branches are
reduced by about 7.5x and branch-misses by about 19.6x.

Looking at the "after" perf-stat results, instructions-per-cycle is way
down (from 3.31 to 1.28), which indicates we're probably stalled on
instruction dependencies or port saturation. This is also suggested by
the fact that the smaller ints don't run any faster than the larger
ints (which wouldn't be the case if we were limited by load/store
bandwidth). The likely next step is to use SIMD to do comparisons in
parallel, as suggested in the JIRA. Unfortunately, the compiler doesn't
seem to auto-vectorize these loops, so any further gain will require
hand-written vectorization code. For now, we'll start with this easy
win.
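
To illustrate the SIMD direction mentioned above (this is not part of
the patch), here is a rough sketch of comparing 16 int8 values against a
constant at once with SSE2. EqualsMask16 is a hypothetical name, the
input is assumed to have exactly 16 readable values, and a scalar
fallback is used when SSE2 is not available at compile time.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper: returns a 16-bit mask in which bit i is set when
// vals[i] == target. `vals` must point to at least 16 readable values.
inline uint16_t EqualsMask16(const int8_t* vals, int8_t target) {
#if defined(__SSE2__)
  // Unaligned load of 16 bytes, then a lane-wise equality compare.
  const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(vals));
  const __m128i t = _mm_set1_epi8(target);
  const __m128i eq = _mm_cmpeq_epi8(v, t);
  // Collect the top bit of every byte lane into a 16-bit integer mask.
  return static_cast<uint16_t>(_mm_movemask_epi8(eq));
#else
  // Portable fallback with the same result, one value at a time.
  uint16_t mask = 0;
  for (int i = 0; i < 16; ++i) {
    mask = static_cast<uint16_t>(
        mask | (static_cast<unsigned>(vals[i] == target) << i));
  }
  return mask;
#endif
}
```

Wider types fit fewer values per 128-bit register, so unlike the scalar
loop, a version along these lines would be expected to favor the smaller
ints.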

perf-stat before:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      82185.366028      task-clock (msec)         #    0.997 CPUs utilized
   288,909,311,749      cycles                    #    3.515 GHz
   956,410,925,173      instructions              #    3.31  insn per cycle
   149,468,823,714      branches                  # 1818.679 M/sec
     1,237,139,955      branch-misses             #    0.83% of all branches

perf-stat after:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      42626.067916      task-clock (msec)         #    0.996 CPUs utilized
   149,363,412,476      cycles                    #    3.504 GHz
   190,514,045,889      instructions              #    1.28  insn per cycle
    19,902,815,659      branches                  #  466.917 M/sec
        63,130,874      branch-misses             #    0.32% of all branches
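
The ratios quoted in the summary can be re-derived from the two
perf-stat runs above. The snippet below is only a convenience for
checking that arithmetic; the struct and function names are made up for
illustration, and the numbers are copied from the counters in this
message.

```cpp
#include <cstdio>

// Counter values copied from the two perf-stat runs in this message.
struct PerfCounters {
  double task_clock_ms;
  double cycles;
  double instructions;
  double branches;
  double branch_misses;
};

constexpr PerfCounters kBefore = {82185.366028, 288909311749.0,
                                  956410925173.0, 149468823714.0,
                                  1237139955.0};
constexpr PerfCounters kAfter = {42626.067916, 149363412476.0,
                                 190514045889.0, 19902815659.0,
                                 63130874.0};

// Ratio of a "before" counter to the matching "after" counter.
constexpr double Reduction(double before, double after) {
  return before / after;
}

// Instructions retired per cycle for one run.
constexpr double Ipc(const PerfCounters& c) {
  return c.instructions / c.cycles;
}

// Prints the before/after ratio for each counter and the IPC of each run.
void PrintReductions() {
  std::printf("task-clock:    %.2fx\n",
              Reduction(kBefore.task_clock_ms, kAfter.task_clock_ms));
  std::printf("instructions:  %.2fx\n",
              Reduction(kBefore.instructions, kAfter.instructions));
  std::printf("branches:      %.2fx\n",
              Reduction(kBefore.branches, kAfter.branches));
  std::printf("branch-misses: %.2fx\n",
              Reduction(kBefore.branch_misses, kAfter.branch_misses));
  std::printf("IPC: %.2f before, %.2f after\n", Ipc(kBefore), Ipc(kAfter));
}
```

This works out to roughly 1.93x less task-clock time, about 7.5x fewer
branches, and about 19.6x fewer branch-misses, in line with the summary.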

Detailed results before:
  int8   NOT NULL   (c = 0)             573.9M evals/sec   4.78 cycles/eval
  int8   NULL       (c = 0)             456.2M evals/sec   6.14 cycles/eval
  int8   NOT NULL   (c >= 0)            573.5M evals/sec   4.79 cycles/eval
  int8   NULL       (c >= 0)            420.3M evals/sec   6.71 cycles/eval
  int8   NOT NULL   (c >= 0 AND c < 2)  565.1M evals/sec   4.87 cycles/eval
  int8   NULL       (c >= 0 AND c < 2)  372.0M evals/sec   7.53 cycles/eval
  int16  NOT NULL   (c = 0)             577.0M evals/sec   4.75 cycles/eval
  int16  NULL       (c = 0)             460.5M evals/sec   6.06 cycles/eval
  int16  NOT NULL   (c >= 0)            568.9M evals/sec   4.80 cycles/eval
  int16  NULL       (c >= 0)            400.4M evals/sec   6.96 cycles/eval
  int16  NOT NULL   (c >= 0 AND c < 2)  577.9M evals/sec   4.73 cycles/eval
  int16  NULL       (c >= 0 AND c < 2)  299.4M evals/sec   9.40 cycles/eval
  int32  NOT NULL   (c = 0)             543.8M evals/sec   5.05 cycles/eval
  int32  NULL       (c = 0)             446.2M evals/sec   6.21 cycles/eval
  int32  NOT NULL   (c >= 0)            565.5M evals/sec   4.84 cycles/eval
  int32  NULL       (c >= 0)            380.4M evals/sec   7.36 cycles/eval
  int32  NOT NULL   (c >= 0 AND c < 2)  561.8M evals/sec   4.91 cycles/eval
  int32  NULL       (c >= 0 AND c < 2)  308.6M evals/sec   9.18 cycles/eval
  int64  NOT NULL   (c = 0)             566.6M evals/sec   4.88 cycles/eval
  int64  NULL       (c = 0)             463.9M evals/sec   6.07 cycles/eval
  int64  NOT NULL   (c >= 0)            555.5M evals/sec   4.97 cycles/eval
  int64  NULL       (c >= 0)            385.3M evals/sec   7.28 cycles/eval
  int64  NOT NULL   (c >= 0 AND c < 2)  567.1M evals/sec   4.83 cycles/eval
  int64  NULL       (c >= 0 AND c < 2)  194.7M evals/sec  14.61 cycles/eval
  float  NOT NULL   (c = 0)             584.5M evals/sec   4.68 cycles/eval
  float  NULL       (c = 0)             441.4M evals/sec   6.29 cycles/eval
  float  NOT NULL   (c >= 0)            576.6M evals/sec   4.74 cycles/eval
  float  NULL       (c >= 0)            361.1M evals/sec   7.74 cycles/eval
  float  NOT NULL   (c >= 0 AND c < 2)  577.9M evals/sec   4.73 cycles/eval
  float  NULL       (c >= 0 AND c < 2)  301.5M evals/sec   9.34 cycles/eval
  double NOT NULL   (c = 0)             589.9M evals/sec   4.64 cycles/eval
  double NULL       (c = 0)             450.0M evals/sec   6.15 cycles/eval
  double NOT NULL   (c >= 0)            571.5M evals/sec   4.78 cycles/eval
  double NULL       (c >= 0)            367.8M evals/sec   7.60 cycles/eval
  double NOT NULL   (c >= 0 AND c < 2)  577.8M evals/sec   4.77 cycles/eval
  double NULL       (c >= 0 AND c < 2)  429.5M evals/sec   6.49 cycles/eval

Detailed results after:
  int8   NOT NULL   (c = 0)             926.7M evals/sec   3.01 cycles/eval
  int8   NULL       (c = 0)             935.2M evals/sec   2.98 cycles/eval
  int8   NOT NULL   (c >= 0)            913.6M evals/sec   3.03 cycles/eval
  int8   NULL       (c >= 0)            903.2M evals/sec   3.08 cycles/eval
  int8   NOT NULL   (c >= 0 AND c < 2)  824.3M evals/sec   3.35 cycles/eval
  int8   NULL       (c >= 0 AND c < 2)  814.5M evals/sec   3.38 cycles/eval
  int16  NOT NULL   (c = 0)             900.6M evals/sec   3.07 cycles/eval
  int16  NULL       (c = 0)             946.9M evals/sec   2.93 cycles/eval
  int16  NOT NULL   (c >= 0)            925.8M evals/sec   2.99 cycles/eval
  int16  NULL       (c >= 0)            922.6M evals/sec   3.00 cycles/eval
  int16  NOT NULL   (c >= 0 AND c < 2)  819.7M evals/sec   3.35 cycles/eval
  int16  NULL       (c >= 0 AND c < 2)  822.8M evals/sec   3.34 cycles/eval
  int32  NOT NULL   (c = 0)             894.0M evals/sec   3.09 cycles/eval
  int32  NULL       (c = 0)             916.3M evals/sec   3.01 cycles/eval
  int32  NOT NULL   (c >= 0)            916.2M evals/sec   3.02 cycles/eval
  int32  NULL       (c >= 0)            933.2M evals/sec   2.97 cycles/eval
  int32  NOT NULL   (c >= 0 AND c < 2)  863.5M evals/sec   3.17 cycles/eval
  int32  NULL       (c >= 0 AND c < 2)  866.4M evals/sec   3.16 cycles/eval
  int64  NOT NULL   (c = 0)             949.9M evals/sec   2.92 cycles/eval
  int64  NULL       (c = 0)             936.2M evals/sec   2.96 cycles/eval
  int64  NOT NULL   (c >= 0)            950.2M evals/sec   2.92 cycles/eval
  int64  NULL       (c >= 0)            926.0M evals/sec   2.99 cycles/eval
  int64  NOT NULL   (c >= 0 AND c < 2)  835.5M evals/sec   3.29 cycles/eval
  int64  NULL       (c >= 0 AND c < 2)  835.6M evals/sec   3.30 cycles/eval
  float  NOT NULL   (c = 0)             936.5M evals/sec   2.95 cycles/eval
  float  NULL       (c = 0)             933.0M evals/sec   2.97 cycles/eval
  float  NOT NULL   (c >= 0)            852.2M evals/sec   3.27 cycles/eval
  float  NULL       (c >= 0)            838.3M evals/sec   3.32 cycles/eval
  float  NOT NULL   (c >= 0 AND c < 2)  691.9M evals/sec   3.97 cycles/eval
  float  NULL       (c >= 0 AND c < 2)  705.3M evals/sec   3.90 cycles/eval
  double NOT NULL   (c = 0)             898.3M evals/sec   3.08 cycles/eval
  double NULL       (c = 0)             879.7M evals/sec   3.14 cycles/eval
  double NOT NULL   (c >= 0)            800.0M evals/sec   3.46 cycles/eval
  double NULL       (c >= 0)            836.6M evals/sec   3.32 cycles/eval
  double NOT NULL   (c >= 0 AND c < 2)  719.2M evals/sec   3.83 cycles/eval
  double NULL       (c >= 0 AND c < 2)  721.1M evals/sec   3.82 cycles/eval

Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
---
M src/kudu/common/CMakeLists.txt
M src/kudu/common/column_predicate-test.cc
M src/kudu/common/column_predicate.cc
3 files changed, 146 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/91/13591/3
--
To view, visit http://gerrit.cloudera.org:8080/13591
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13591
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <andrew.w...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
