Hello Tidy Bot, Andrew Wong, Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13591

to look at the new patch set (#2).

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
......................................................................

KUDU-2846 (part 1): optimize predicate evaluation for primitives

This switches predicate evaluation for primitive columns to an
optimized implementation that is unrolled by 8.
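
As a rough illustration of the unrolled-by-8 idea (not the code in this
patch), the sketch below evaluates a range predicate over 8 rows per
iteration and writes one selection byte at a time. The helper names
InRange and ApplyRangePredicate, the LSB-first one-bit-per-row selection
layout, and the omission of null handling are all assumptions made for
the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: branch-free check of lower <= v < upper.
// Bitwise & is used instead of && to avoid a short-circuit branch.
template <typename T>
inline bool InRange(T v, T lower, T upper) {
  return (v >= lower) & (v < upper);
}

// Hypothetical helper: clears the bit in 'sel' (one bit per row, LSB
// first) for every row whose value fails the predicate.
template <typename T>
void ApplyRangePredicate(const T* data, size_t n, T lower, T upper,
                         uint8_t* sel) {
  assert(data != nullptr && sel != nullptr);
  size_t i = 0;
  // Main loop: 8 comparisons per iteration yield exactly one mask byte,
  // so there is one loop branch per 8 rows and no per-row branch.
  for (; i + 8 <= n; i += 8) {
    const T* p = data + i;
    const unsigned mask =
        (static_cast<unsigned>(InRange(p[0], lower, upper)) << 0) |
        (static_cast<unsigned>(InRange(p[1], lower, upper)) << 1) |
        (static_cast<unsigned>(InRange(p[2], lower, upper)) << 2) |
        (static_cast<unsigned>(InRange(p[3], lower, upper)) << 3) |
        (static_cast<unsigned>(InRange(p[4], lower, upper)) << 4) |
        (static_cast<unsigned>(InRange(p[5], lower, upper)) << 5) |
        (static_cast<unsigned>(InRange(p[6], lower, upper)) << 6) |
        (static_cast<unsigned>(InRange(p[7], lower, upper)) << 7);
    sel[i / 8] &= static_cast<uint8_t>(mask);
  }
  // Tail: handle the remaining (n % 8) rows one bit at a time.
  for (; i < n; ++i) {
    const unsigned fail = !InRange(data[i], lower, upper);
    sel[i / 8] &= static_cast<uint8_t>(~(fail << (i % 8)));
  }
}
```

Building the mask with shifts and ORs rather than per-row conditionals
is what would account for the kind of drop in branches and
branch-misses reported below, although the actual patch may structure
its loop differently.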

Performance improves by roughly 1.2-2.6x depending on the particular
predicate, type, and nullability (about 2x on average; the int16 NULL
range predicate is an outlier at about 4.4x). Branches are reduced by
about 7.5x and branch-misses by about 19.6x.

Looking at the "after" perf-stat results, instructions per cycle drop
from 3.31 to 1.28, which suggests we're now stalled on instruction
dependencies or port saturation. This is supported by the fact that
the smaller ints don't run any faster than the larger ints, which
wouldn't be the case if we were limited by load/store bandwidth. The
likely next step is to use SIMD to do comparisons in parallel, as
suggested in the JIRA. Unfortunately, the compiler doesn't seem to
auto-vectorize these loops, so any further gain will require
hand-written vectorization code. For now, we'll start with this
easy win.
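
For reference, a hand-vectorized comparison of the kind mentioned above
might look roughly like the following. This is only a sketch and is not
part of this patch: the helper name EqualsMask16 is made up, it covers
only int8 equality, and it assumes an x86 target with SSE2 (with a
scalar fallback elsewhere).

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper: compares 16 consecutive int8 values against
// 'value' and returns a 16-bit mask with bit i set when data[i] == value.
inline uint16_t EqualsMask16(const int8_t* data, int8_t value) {
  assert(data != nullptr);
#if defined(__SSE2__)
  // Load 16 bytes (unaligned), compare all lanes at once, then gather
  // the top bit of each lane into an integer mask.
  const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data));
  const __m128i needle = _mm_set1_epi8(static_cast<char>(value));
  const __m128i eq = _mm_cmpeq_epi8(v, needle);
  return static_cast<uint16_t>(_mm_movemask_epi8(eq));
#else
  // Scalar fallback with the same result on non-SSE2 targets.
  uint16_t mask = 0;
  for (int i = 0; i < 16; ++i) {
    mask |= static_cast<uint16_t>((data[i] == value) << i);
  }
  return mask;
#endif
}
```

With narrower types more values fit in each vector register, which is
why a SIMD version would be expected to favor the smaller ints in a way
the scalar unrolled loop does not.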

perf-stat before:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      82185.366028      task-clock (msec)         #    0.997 CPUs utilized
   288,909,311,749      cycles                    #    3.515 GHz
   956,410,925,173      instructions              #    3.31  insn per cycle
   149,468,823,714      branches                  # 1818.679 M/sec
     1,237,139,955      branch-misses             #    0.83% of all branches

perf-stat after:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      42626.067916      task-clock (msec)         #    0.996 CPUs utilized
   149,363,412,476      cycles                    #    3.504 GHz
   190,514,045,889      instructions              #    1.28  insn per cycle
    19,902,815,659      branches                  #  466.917 M/sec
        63,130,874      branch-misses             #    0.32% of all branches
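
The reduction factors quoted in the summary can be recomputed from the
two perf-stat runs above. The snippet below is just that arithmetic;
the function and constant names are made up for the example, and the
values are copied from the output above.

```cpp
#include <cassert>

// Ratio of a "before" counter to its "after" counterpart.
inline double ReductionFactor(double before, double after) {
  assert(after > 0.0);
  return before / after;
}

// Counter values copied from the perf-stat output above.
constexpr double kTaskClockBeforeMs = 82185.366028;
constexpr double kTaskClockAfterMs = 42626.067916;
constexpr double kBranchesBefore = 149468823714.0;
constexpr double kBranchesAfter = 19902815659.0;
constexpr double kBranchMissesBefore = 1237139955.0;
constexpr double kBranchMissesAfter = 63130874.0;
```

ReductionFactor(kTaskClockBeforeMs, kTaskClockAfterMs) comes out at
about 1.93, the branch ratio at about 7.5, and the branch-miss ratio at
about 19.6.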

Detailed results before:
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int8 NOT NULL: real 1.730s user 1.730s sys 0.002s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int8 NULL: real 2.097s user 2.096s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int8 NOT NULL: real 1.755s user 1.756s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int8 NULL: real 2.631s user 2.632s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int8 NOT NULL: real 1.850s user 1.848s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int8 NULL: real 2.808s user 2.808s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int16 NOT NULL: real 1.753s user 1.752s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int16 NULL: real 2.248s user 2.244s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int16 NOT NULL: real 1.750s user 1.752s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int16 NULL: real 2.420s user 2.416s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int16 NOT NULL: real 1.811s user 1.808s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int16 NULL: real 5.321s user 5.313s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int32 NOT NULL: real 1.834s user 1.824s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int32 NULL: real 2.233s user 2.232s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int32 NOT NULL: real 1.797s user 1.793s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int32 NULL: real 2.791s user 2.774s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int32 NOT NULL: real 1.873s user 1.869s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int32 NULL: real 3.104s user 3.071s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int64 NOT NULL: real 1.781s user 1.779s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int64 NULL: real 2.209s user 2.203s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int64 NOT NULL: real 1.741s user 1.739s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int64 NULL: real 2.374s user 2.374s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int64 NOT NULL: real 1.769s user 1.767s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int64 NULL: real 3.113s user 3.099s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type float NOT NULL: real 1.766s user 1.765s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type float NULL: real 2.305s user 2.299s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type float NOT NULL: real 1.755s user 1.752s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type float NULL: real 2.685s user 2.678s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type float NOT NULL: real 1.777s user 1.771s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type float NULL: real 2.940s user 2.929s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type double NOT NULL: real 1.756s user 1.749s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type double NULL: real 2.443s user 2.438s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type double NOT NULL: real 1.819s user 1.819s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type double NULL: real 2.744s user 2.724s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type double NOT NULL: real 1.753s user 1.746s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type double NULL: real 2.481s user 2.460s sys 0.004s

Detailed results after:
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int8 NOT NULL: real 1.082s user 1.073s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int8 NULL: real 1.069s user 1.063s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int8 NOT NULL: real 1.085s user 1.076s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int8 NULL: real 1.071s user 1.068s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int8 NOT NULL: real 1.191s user 1.191s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int8 NULL: real 1.209s user 1.206s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int16 NOT NULL: real 1.099s user 1.099s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int16 NULL: real 1.123s user 1.106s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int16 NOT NULL: real 1.100s user 1.100s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int16 NULL: real 1.070s user 1.068s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int16 NOT NULL: real 1.211s user 1.212s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int16 NULL: real 1.220s user 1.220s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int32 NOT NULL: real 1.104s user 1.104s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int32 NULL: real 1.105s user 1.104s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int32 NOT NULL: real 1.107s user 1.108s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int32 NULL: real 1.081s user 1.080s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int32 NOT NULL: real 1.230s user 1.228s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int32 NULL: real 1.219s user 1.220s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int64 NOT NULL: real 1.071s user 1.072s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type int64 NULL: real 1.090s user 1.088s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int64 NOT NULL: real 1.069s user 1.067s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type int64 NULL: real 1.083s user 1.084s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int64 NOT NULL: real 1.253s user 1.252s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type int64 NULL: real 1.248s user 1.248s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type float NOT NULL: real 1.144s user 1.144s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type float NULL: real 1.144s user 1.144s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type float NOT NULL: real 1.159s user 1.160s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type float NULL: real 1.214s user 1.216s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type float NOT NULL: real 1.439s user 1.436s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type float NULL: real 1.457s user 1.458s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type double NOT NULL: real 1.196s user 1.195s sys 0.000s
  Time spent evaluating c = 0: 1000000 batches of 1024 rows for type double NULL: real 1.213s user 1.212s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type double NOT NULL: real 1.232s user 1.230s sys 0.000s
  Time spent evaluating c >= 0: 1000000 batches of 1024 rows for type double NULL: real 1.256s user 1.241s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type double NOT NULL: real 1.419s user 1.418s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 1000000 batches of 1024 rows for type double NULL: real 1.430s user 1.426s sys 0.000s

Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
---
M src/kudu/common/CMakeLists.txt
M src/kudu/common/column_predicate-test.cc
M src/kudu/common/column_predicate.cc
3 files changed, 136 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/91/13591/2
--
To view, visit http://gerrit.cloudera.org:8080/13591
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13591
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <andrew.w...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
