[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives

2019-06-12 Thread Todd Lipcon (Code Review)
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, Adar Dembo,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/13591

to look at the new patch set (#5).

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..

KUDU-2846 (part 1): optimize predicate evaluation for primitives

This change switches primitive columns to an optimized, unrolled-by-8
predicate evaluation.

Performance is improved by up to 7.2x depending on the particular
predicate, type, and nullability (average around 4.8x). Branches are
reduced by about 6.5x and branch-misses by about 22x.

Hand-coded SIMD could possibly improve on this a little, but it is
likely not worth the effort.

perf-stat before:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      73905.379627  task-clock (msec)   #    0.997 CPUs utilized
   272,810,081,028  cycles              #    3.691 GHz
   938,488,388,743  instructions        #    3.44  insn per cycle
   148,052,698,322  branches            # 2003.274 M/sec
       882,311,138  branch-misses       #    0.60% of all branches

perf-stat after:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      15354.077654  task-clock (msec)   #    0.992 CPUs utilized
    56,850,629,856  cycles              #    3.703 GHz
   181,599,095,960  instructions        #    3.19  insn per cycle
    22,496,453,160  branches            # 1465.178 M/sec
        38,662,626  branch-misses       #    0.17% of all branches
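
The reduction factors quoted in the summary can be re-derived from the two perf-stat runs above. The following is only a sketch of that arithmetic; the struct and function names are hypothetical, and the counter values are copied from the output above.

```cpp
#include <cstdio>

// Hypothetical container for the counters quoted in this message.
struct PerfCounters {
  double task_clock_msec;
  double cycles;
  double branches;
  double branch_misses;
};

constexpr PerfCounters kBefore = {73905.379627, 272810081028.0,
                                  148052698322.0, 882311138.0};
constexpr PerfCounters kAfter = {15354.077654, 56850629856.0,
                                 22496453160.0, 38662626.0};

// How many times smaller the "after" value is than the "before" value.
constexpr double Reduction(double before, double after) {
  return before / after;
}

void PrintReductions() {
  std::printf("task-clock:    %.2fx\n",
              Reduction(kBefore.task_clock_msec, kAfter.task_clock_msec));
  std::printf("cycles:        %.2fx\n",
              Reduction(kBefore.cycles, kAfter.cycles));
  std::printf("branches:      %.2fx\n",
              Reduction(kBefore.branches, kAfter.branches));
  std::printf("branch-misses: %.2fx\n",
              Reduction(kBefore.branch_misses, kAfter.branch_misses));
}
```

The task-clock ratio is the whole-benchmark speedup, which is where the "average around 4.8x" figure appears to come from.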

Detailed results before:
  int8   NOT NULL  (c = 0)              632.1M evals/sec   4.44 cycles/eval
  int8   NULL      (c = 0)              515.6M evals/sec   5.48 cycles/eval
  int8   NOT NULL  (c >= 0)             630.8M evals/sec   4.45 cycles/eval
  int8   NULL      (c >= 0)             426.8M evals/sec   6.64 cycles/eval
  int8   NOT NULL  (c >= 0 AND c < 2)   632.6M evals/sec   4.44 cycles/eval
  int8   NULL      (c >= 0 AND c < 2)   384.7M evals/sec   7.38 cycles/eval
  int16  NOT NULL  (c = 0)              644.4M evals/sec   4.34 cycles/eval
  int16  NULL      (c = 0)              524.6M evals/sec   5.37 cycles/eval
  int16  NOT NULL  (c >= 0)             638.4M evals/sec   4.37 cycles/eval
  int16  NULL      (c >= 0)             458.8M evals/sec   6.17 cycles/eval
  int16  NOT NULL  (c >= 0 AND c < 2)   635.3M evals/sec   4.40 cycles/eval
  int16  NULL      (c >= 0 AND c < 2)   335.1M evals/sec   8.50 cycles/eval
  int32  NOT NULL  (c = 0)              645.2M evals/sec   4.34 cycles/eval
  int32  NULL      (c = 0)              492.6M evals/sec   5.77 cycles/eval
  int32  NOT NULL  (c >= 0)             608.6M evals/sec   4.64 cycles/eval
  int32  NULL      (c >= 0)             440.7M evals/sec   6.48 cycles/eval
  int32  NOT NULL  (c >= 0 AND c < 2)   637.8M evals/sec   4.43 cycles/eval
  int32  NULL      (c >= 0 AND c < 2)   348.0M evals/sec   8.22 cycles/eval
  int64  NOT NULL  (c = 0)              642.7M evals/sec   4.36 cycles/eval
  int64  NULL      (c = 0)              505.3M evals/sec   5.60 cycles/eval
  int64  NOT NULL  (c >= 0)             643.5M evals/sec   4.34 cycles/eval
  int64  NULL      (c >= 0)             472.8M evals/sec   6.00 cycles/eval
  int64  NOT NULL  (c >= 0 AND c < 2)   634.2M evals/sec   4.43 cycles/eval
  int64  NULL      (c >= 0 AND c < 2)   396.7M evals/sec   7.21 cycles/eval
  float  NOT NULL  (c = 0)              604.6M evals/sec   4.63 cycles/eval
  float  NULL      (c = 0)              406.7M evals/sec   7.05 cycles/eval
  float  NOT NULL  (c >= 0)             545.3M evals/sec   5.20 cycles/eval
  float  NULL      (c >= 0)             384.4M evals/sec   7.39 cycles/eval
  float  NOT NULL  (c >= 0 AND c < 2)   583.2M evals/sec   4.80 cycles/eval
  float  NULL      (c >= 0 AND c < 2)   312.2M evals/sec   9.12 cycles/eval
  double NOT NULL  (c = 0)              614.0M evals/sec   4.56 cycles/eval
  double NULL      (c = 0)              471.5M evals/sec   5.99 cycles/eval
  double NOT NULL  (c >= 0)             623.0M evals/sec   4.48 cycles/eval
  double NULL      (c >= 0)             379.9M evals/sec   7.47 cycles/eval
  double NOT NULL  (c >= 0 AND c < 2)   599.5M evals/sec   4.67 cycles/eval
  double NULL      (c >= 0 AND c < 2)   415.2M evals/sec   6.82 cycles/eval

Detailed results after:
  int8   NOT NULL  (c = 0)             3660.3M evals/sec   0.76 cycles/eval
  int8   NULL      (c = 0)             3657.1M evals/sec   0.76 cycles/eval
  int8   NOT NULL  (c >= 0)            3712.0M evals/sec   0.75 cycles/eval
  int8   NULL      (c >= 0)            3618.9M evals/sec   0.78 cycles/eval
  int8   NOT NULL  (c >= 0 AND c < 2)  1661.9M evals/sec   1.73 cycles/eval
  int8   NULL      (c >= 0 AND c < 2)  1663.4M evals/sec   1.77 cycles/eval
  int16  NOT NULL  (c = 0)             3781.4M evals/sec   0.73 cycles/eval
  int16  NULL      (c = 0)             3738.3M evals/sec   0.74 cycles/eval
  int16  NOT NULL  (c >= 0)            3672.9M evals/sec   0.76 cycles/eval
  int16  NULL      (c >= 0)            3767.4M evals/sec   0.75 cycles/eval
  int16  NOT NULL  (c >= 0 AND c < 2)  1654.3M evals/sec   1.77 cycles/eval
  int16  

[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives

2019-06-11 Thread Todd Lipcon (Code Review)
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, Adar Dembo,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/13591

to look at the new patch set (#4).

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..

KUDU-2846 (part 1): optimize predicate evaluation for primitives

This change switches primitive columns to an optimized, unrolled-by-8
predicate evaluation.

Performance is improved by 1.6-2.5x depending on the particular
predicate, type, and nullability (average around 2x). Branches are
reduced by about 7.5x and branch-misses by about 19.6x.

Looking at the "after" perf-stat results, the instructions-per-cycle are
way down, which indicates we're probably stalled on instruction
dependencies or port saturation. This is also indicated by the fact that
the smaller ints don't seem to run any faster than the large ints (which
wouldn't be the case if we were limited by load/store bandwidth). Likely
the next fix here is to use SIMD to do comparisons in parallel as
suggested in the JIRA.  Unfortunately, the compiler doesn't seem to
auto-vectorize these loops, so if we want further gain, we'll have to
add some more hand-written vectorization code. So, we'll start with this
easy win.

perf-stat before:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      73905.379627  task-clock (msec)   #    0.997 CPUs utilized
               504  context-switches    #    0.007 K/sec
                19  cpu-migrations      #    0.000 K/sec
             1,296  page-faults         #    0.018 K/sec
   272,810,081,028  cycles              #    3.691 GHz
   938,488,388,743  instructions        #    3.44  insn per cycle
   148,052,698,322  branches            # 2003.274 M/sec
       882,311,138  branch-misses       #    0.60% of all branches

perf-stat after:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      38024.082495  task-clock (msec)   #    0.996 CPUs utilized
               252  context-switches    #    0.007 K/sec
                 7  cpu-migrations      #    0.000 K/sec
             1,295  page-faults         #    0.034 K/sec
   142,231,469,257  cycles              #    3.741 GHz
   172,437,810,470  instructions        #    1.21  insn per cycle
    18,460,117,439  branches            #  485.485 M/sec
        60,960,125  branch-misses       #    0.33% of all branches

Detailed results before:
  int8   NOT NULL  (c = 0)              632.1M evals/sec   4.44 cycles/eval
  int8   NULL      (c = 0)              515.6M evals/sec   5.48 cycles/eval
  int8   NOT NULL  (c >= 0)             630.8M evals/sec   4.45 cycles/eval
  int8   NULL      (c >= 0)             426.8M evals/sec   6.64 cycles/eval
  int8   NOT NULL  (c >= 0 AND c < 2)   632.6M evals/sec   4.44 cycles/eval
  int8   NULL      (c >= 0 AND c < 2)   384.7M evals/sec   7.38 cycles/eval
  int16  NOT NULL  (c = 0)              644.4M evals/sec   4.34 cycles/eval
  int16  NULL      (c = 0)              524.6M evals/sec   5.37 cycles/eval
  int16  NOT NULL  (c >= 0)             638.4M evals/sec   4.37 cycles/eval
  int16  NULL      (c >= 0)             458.8M evals/sec   6.17 cycles/eval
  int16  NOT NULL  (c >= 0 AND c < 2)   635.3M evals/sec   4.40 cycles/eval
  int16  NULL      (c >= 0 AND c < 2)   335.1M evals/sec   8.50 cycles/eval
  int32  NOT NULL  (c = 0)              645.2M evals/sec   4.34 cycles/eval
  int32  NULL      (c = 0)              492.6M evals/sec   5.77 cycles/eval
  int32  NOT NULL  (c >= 0)             608.6M evals/sec   4.64 cycles/eval
  int32  NULL      (c >= 0)             440.7M evals/sec   6.48 cycles/eval
  int32  NOT NULL  (c >= 0 AND c < 2)   637.8M evals/sec   4.43 cycles/eval
  int32  NULL      (c >= 0 AND c < 2)   348.0M evals/sec   8.22 cycles/eval
  int64  NOT NULL  (c = 0)              642.7M evals/sec   4.36 cycles/eval
  int64  NULL      (c = 0)              505.3M evals/sec   5.60 cycles/eval
  int64  NOT NULL  (c >= 0)             643.5M evals/sec   4.34 cycles/eval
  int64  NULL      (c >= 0)             472.8M evals/sec   6.00 cycles/eval
  int64  NOT NULL  (c >= 0 AND c < 2)   634.2M evals/sec   4.43 cycles/eval
  int64  NULL      (c >= 0 AND c < 2)   396.7M evals/sec   7.21 cycles/eval
  float  NOT NULL  (c = 0)              604.6M evals/sec   4.63 cycles/eval
  float  NULL      (c = 0)              406.7M evals/sec   7.05 cycles/eval
  float  NOT NULL  (c >= 0)             545.3M evals/sec   5.20 cycles/eval
  float  NULL      (c >= 0)             384.4M evals/sec   7.39 cycles/eval
  float  NOT NULL  (c >= 0 AND c < 2)   583.2M evals/sec   4.80 cycles/eval
  float  NULL      (c >= 0 AND c < 2)   312.2M evals/sec   9.12 cycles/eval
  double NOT NULL  (c = 0)              614.0M evals/sec   4.56 cycles/eval
  double NULL      (c = 0)              471.5M evals/sec   5.99 cycles/eval
  double NOT NULL  (c >= 0)             623.0M evals/sec   4.48 cycles/eval
  double NULL      (c >= 0)             379.9M evals/sec   7.47 cycles/eval
  

[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives

2019-06-11 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13591 )

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..


Patch Set 3:

(3 comments)

Code looks fine but the tests seem angry.

http://gerrit.cloudera.org:8080/#/c/13591/3/src/kudu/common/column_predicate-test.cc
File src/kudu/common/column_predicate-test.cc:

http://gerrit.cloudera.org:8080/#/c/13591/3/src/kudu/common/column_predicate-test.cc@1538
PS3, Line 1538: num_ret += selvec.CountSelected();
Should probably check the final value of num_ret.


http://gerrit.cloudera.org:8080/#/c/13591/3/src/kudu/common/column_predicate.cc
File src/kudu/common/column_predicate.cc:

http://gerrit.cloudera.org:8080/#/c/13591/3/src/kudu/common/column_predicate.cc@654
PS3, Line 654:   return 0x8040201008040201 * t >> 56;
:exploding_head:
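
For readers puzzling over the line above, here is a minimal sketch of what that expression does, assuming each of the eight bytes of `t` holds either 0 or 1 (how the patch builds `t` is not shown in this thread). The constant has a single bit set at positions 0, 9, 18, ..., 63, so the multiplication gathers the eight flags into the top byte of the product, and the shift moves that byte down. The names PackFlags and MakeWord are hypothetical.

```cpp
#include <cstdint>

// Hypothetical name. Assumes every byte of t is 0 or 1.
// Byte i of t lands on bit (63 - i) of the 64-bit product, and no two
// partial products overlap, so there are no carries. After the shift,
// byte i of t is bit (7 - i) of the result.
inline uint8_t PackFlags(uint64_t t) {
  return static_cast<uint8_t>((0x8040201008040201ULL * t) >> 56);
}

// Hypothetical helper for the example: puts flags[i] (0 or 1) into
// byte i of a 64-bit word, independent of host endianness.
inline uint64_t MakeWord(const uint8_t (&flags)[8]) {
  uint64_t t = 0;
  for (int i = 0; i < 8; ++i) {
    t |= static_cast<uint64_t>(flags[i]) << (8 * i);
  }
  return t;
}
```

With this layout the lowest byte of `t` maps to the most significant bit of the packed byte, so the bit order is reversed relative to the byte order of `t`.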


http://gerrit.cloudera.org:8080/#/c/13591/3/src/kudu/common/column_predicate.cc@686
PS3, Line 686:   return n_chunks * 8;
nit: Could you doc the return value?



--
To view, visit http://gerrit.cloudera.org:8080/13591
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13591
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 12 Jun 2019 02:36:22 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives

2019-06-11 Thread Andrew Wong (Code Review)
Andrew Wong has removed Andrew Wong from this change. ( http://gerrit.cloudera.org:8080/13591 )

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..


Removed reviewer Andrew Wong.

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteReviewer
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13591
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives

2019-06-11 Thread Todd Lipcon (Code Review)
Hello Tidy Bot, Andrew Wong, Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/13591

to look at the new patch set (#3).

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..

KUDU-2846 (part 1): optimize predicate evaluation for primitives

This change switches primitive columns to an optimized, unrolled-by-8
predicate evaluation.

Performance is improved by 1.6-2.5x depending on the particular
predicate, type, and nullability (average around 2x). Branches are
reduced by about 7.5x and branch-misses by about 19.6x.

Looking at the "after" perf-stat results, the instructions-per-cycle are
way down, which indicates we're probably stalled on instruction
dependencies or port saturation. This is also indicated by the fact that
the smaller ints don't seem to run any faster than the large ints (which
wouldn't be the case if we were limited by load/store bandwidth). Likely
the next fix here is to use SIMD to do comparisons in parallel as
suggested in the JIRA.  Unfortunately, the compiler doesn't seem to
auto-vectorize these loops, so if we want further gain, we'll have to
add some more hand-written vectorization code. So, we'll start with this
easy win.

perf-stat before:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      82185.366028  task-clock (msec)   #    0.997 CPUs utilized
   288,909,311,749  cycles              #    3.515 GHz
   956,410,925,173  instructions        #    3.31  insn per cycle
   149,468,823,714  branches            # 1818.679 M/sec
     1,237,139,955  branch-misses       #    0.83% of all branches

perf-stat after:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      42626.067916  task-clock (msec)   #    0.996 CPUs utilized
   149,363,412,476  cycles              #    3.504 GHz
   190,514,045,889  instructions        #    1.28  insn per cycle
    19,902,815,659  branches            #  466.917 M/sec
        63,130,874  branch-misses       #    0.32% of all branches

Detailed results before:
  int8   NOT NULL  (c = 0)              573.9M evals/sec   4.78 cycles/eval
  int8   NULL      (c = 0)              456.2M evals/sec   6.14 cycles/eval
  int8   NOT NULL  (c >= 0)             573.5M evals/sec   4.79 cycles/eval
  int8   NULL      (c >= 0)             420.3M evals/sec   6.71 cycles/eval
  int8   NOT NULL  (c >= 0 AND c < 2)   565.1M evals/sec   4.87 cycles/eval
  int8   NULL      (c >= 0 AND c < 2)   372.0M evals/sec   7.53 cycles/eval
  int16  NOT NULL  (c = 0)              577.0M evals/sec   4.75 cycles/eval
  int16  NULL      (c = 0)              460.5M evals/sec   6.06 cycles/eval
  int16  NOT NULL  (c >= 0)             568.9M evals/sec   4.80 cycles/eval
  int16  NULL      (c >= 0)             400.4M evals/sec   6.96 cycles/eval
  int16  NOT NULL  (c >= 0 AND c < 2)   577.9M evals/sec   4.73 cycles/eval
  int16  NULL      (c >= 0 AND c < 2)   299.4M evals/sec   9.40 cycles/eval
  int32  NOT NULL  (c = 0)              543.8M evals/sec   5.05 cycles/eval
  int32  NULL      (c = 0)              446.2M evals/sec   6.21 cycles/eval
  int32  NOT NULL  (c >= 0)             565.5M evals/sec   4.84 cycles/eval
  int32  NULL      (c >= 0)             380.4M evals/sec   7.36 cycles/eval
  int32  NOT NULL  (c >= 0 AND c < 2)   561.8M evals/sec   4.91 cycles/eval
  int32  NULL      (c >= 0 AND c < 2)   308.6M evals/sec   9.18 cycles/eval
  int64  NOT NULL  (c = 0)              566.6M evals/sec   4.88 cycles/eval
  int64  NULL      (c = 0)              463.9M evals/sec   6.07 cycles/eval
  int64  NOT NULL  (c >= 0)             555.5M evals/sec   4.97 cycles/eval
  int64  NULL      (c >= 0)             385.3M evals/sec   7.28 cycles/eval
  int64  NOT NULL  (c >= 0 AND c < 2)   567.1M evals/sec   4.83 cycles/eval
  int64  NULL      (c >= 0 AND c < 2)   194.7M evals/sec  14.61 cycles/eval
  float  NOT NULL  (c = 0)              584.5M evals/sec   4.68 cycles/eval
  float  NULL      (c = 0)              441.4M evals/sec   6.29 cycles/eval
  float  NOT NULL  (c >= 0)             576.6M evals/sec   4.74 cycles/eval
  float  NULL      (c >= 0)             361.1M evals/sec   7.74 cycles/eval
  float  NOT NULL  (c >= 0 AND c < 2)   577.9M evals/sec   4.73 cycles/eval
  float  NULL      (c >= 0 AND c < 2)   301.5M evals/sec   9.34 cycles/eval
  double NOT NULL  (c = 0)              589.9M evals/sec   4.64 cycles/eval
  double NULL      (c = 0)              450.0M evals/sec   6.15 cycles/eval
  double NOT NULL  (c >= 0)             571.5M evals/sec   4.78 cycles/eval
  double NULL      (c >= 0)             367.8M evals/sec   7.60 cycles/eval
  double NOT NULL  (c >= 0 AND c < 2)   577.8M evals/sec   4.77 cycles/eval
  double NULL      (c >= 0 AND c < 2)   429.5M evals/sec   6.49 cycles/eval

Detailed results after:
  int8   NOT NULL  (c = 0)              926.7M evals/sec   3.01 cycles/eval
  int8   NULL      (c = 0)              935.2M evals/sec   2.98 cycles/eval
  int8   NOT NULL  (c >= 0)             913.6M evals/sec   3.03 cycles/eval
  int8   NULL   (c >= 0) 903.2M 

[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives

2019-06-11 Thread Todd Lipcon (Code Review)
Hello Tidy Bot, Andrew Wong, Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/13591

to look at the new patch set (#2).

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..

KUDU-2846 (part 1): optimize predicate evaluation for primitives

This change switches primitive columns to an optimized, unrolled-by-8
predicate evaluation.

Performance is improved by 1.6-2.5x depending on the particular
predicate, type, and nullability (average around 2x). Branches are
reduced by about 7.5x and branch-misses by about 19.6x.

Looking at the "after" perf-stat results, the instructions-per-cycle are
way down, which indicates we're probably stalled on instruction
dependencies or port saturation. This is also indicated by the fact that
the smaller ints don't seem to run any faster than the large ints (which
wouldn't be the case if we were limited by load/store bandwidth). Likely
the next fix here is to use SIMD to do comparisons in parallel as
suggested in the JIRA.  Unfortunately, the compiler doesn't seem to
auto-vectorize these loops, so if we want further gain, we'll have to
add some more hand-written vectorization code. So, we'll start with this
easy win.

perf-stat before:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      82185.366028  task-clock (msec)   #    0.997 CPUs utilized
   288,909,311,749  cycles              #    3.515 GHz
   956,410,925,173  instructions        #    3.31  insn per cycle
   149,468,823,714  branches            # 1818.679 M/sec
     1,237,139,955  branch-misses       #    0.83% of all branches

perf-stat after:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      42626.067916  task-clock (msec)   #    0.996 CPUs utilized
   149,363,412,476  cycles              #    3.504 GHz
   190,514,045,889  instructions        #    1.28  insn per cycle
    19,902,815,659  branches            #  466.917 M/sec
        63,130,874  branch-misses       #    0.32% of all branches

Detailed results before:
  Time spent evaluating c = 0: 100 batches of 1024 rows for type int8 NOT NULL: real 1.730s user 1.730s sys 0.002s
  Time spent evaluating c = 0: 100 batches of 1024 rows for type int8 NULL: real 2.097s user 2.096s sys 0.000s
  Time spent evaluating c >= 0: 100 batches of 1024 rows for type int8 NOT NULL: real 1.755s user 1.756s sys 0.000s
  Time spent evaluating c >= 0: 100 batches of 1024 rows for type int8 NULL: real 2.631s user 2.632s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int8 NOT NULL: real 1.850s user 1.848s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int8 NULL: real 2.808s user 2.808s sys 0.000s
  Time spent evaluating c = 0: 100 batches of 1024 rows for type int16 NOT NULL: real 1.753s user 1.752s sys 0.000s
  Time spent evaluating c = 0: 100 batches of 1024 rows for type int16 NULL: real 2.248s user 2.244s sys 0.000s
  Time spent evaluating c >= 0: 100 batches of 1024 rows for type int16 NOT NULL: real 1.750s user 1.752s sys 0.000s
  Time spent evaluating c >= 0: 100 batches of 1024 rows for type int16 NULL: real 2.420s user 2.416s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int16 NOT NULL: real 1.811s user 1.808s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int16 NULL: real 5.321s user 5.313s sys 0.000s
  Time spent evaluating c = 0: 100 batches of 1024 rows for type int32 NOT NULL: real 1.834s user 1.824s sys 0.000s
  Time spent evaluating c = 0: 100 batches of 1024 rows for type int32 NULL: real 2.233s user 2.232s sys 0.000s
  Time spent evaluating c >= 0: 100 batches of 1024 rows for type int32 NOT NULL: real 1.797s user 1.793s sys 0.000s
  Time spent evaluating c >= 0: 100 batches of 1024 rows for type int32 NULL: real 2.791s user 2.774s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int32 NOT NULL: real 1.873s user 1.869s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int32 NULL: real 3.104s user 3.071s sys 0.000s
  Time spent evaluating c = 0: 100 batches of 1024 rows for type int64 NOT NULL: real 1.781s user 1.779s sys 0.000s
  Time spent evaluating c = 0: 100 batches of 1024 rows for type int64 NULL: real 2.209s user 2.203s sys 0.000s
  Time spent evaluating c >= 0: 100 batches of 1024 rows for type int64 NOT 
NULL: real 

[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives

2019-06-11 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/13591 )

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13591/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13591/1//COMMIT_MSG@41
PS1, Line 41: perf-stat after:
> Could you include the time elapsed for 'after' too?
Oh, I actually meant to remove it from 'before', because 'task-clock' is
the same thing (it's a single-threaded, CPU-bound workload).




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13591
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 11 Jun 2019 23:50:32 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives

2019-06-11 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/13591 )

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13591/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13591/1//COMMIT_MSG@41
PS1, Line 41: perf-stat after:
Could you include the time elapsed for 'after' too?




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13591
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 11 Jun 2019 23:42:20 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives

2019-06-11 Thread Todd Lipcon (Code Review)
Hello Andrew Wong,

I'd like you to do a code review. Please visit

http://gerrit.cloudera.org:8080/13591

to review the following change.


Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..

KUDU-2846 (part 1): optimize predicate evaluation for primitives

This change switches primitive columns to an optimized, unrolled-by-8
predicate evaluation.

Performance is improved by 1.6-2.5x depending on the particular
predicate, type, and nullability (average around 2x). Branches are
reduced by about 7.5x and branch-misses by about 19.6x.

Looking at the "after" perf-stat results, the instructions-per-cycle are
way down, which indicates we're probably stalled on instruction
dependencies or port saturation. This is also indicated by the fact that
the smaller ints don't seem to run any faster than the large ints (which
wouldn't be the case if we were limited by load/store bandwidth). Likely
the next fix here is to use SIMD to do comparisons in parallel as
suggested in the JIRA.  Unfortunately, the compiler doesn't seem to
auto-vectorize these loops, so if we want further gain, we'll have to
add some more hand-written vectorization code. So, we'll start with this
easy win.

perf-stat before:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      82185.366028  task-clock (msec)   #    0.997 CPUs utilized
   288,909,311,749  cycles              #    3.515 GHz
   956,410,925,173  instructions        #    3.31  insn per cycle
   149,468,823,714  branches            # 1818.679 M/sec
     1,237,139,955  branch-misses       #    0.83% of all branches

      82.398392581 seconds time elapsed

      82.132012000 seconds user
       0.055937000 seconds sys

perf-stat after:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      42626.067916  task-clock (msec)   #    0.996 CPUs utilized
   149,363,412,476  cycles              #    3.504 GHz
   190,514,045,889  instructions        #    1.28  insn per cycle
    19,902,815,659  branches            #  466.917 M/sec
        63,130,874  branch-misses       #    0.32% of all branches

Detailed results before:
  Time spent evaluating c = 0: 100 batches of 1024 rows for type int8 NOT NULL: real 1.730s user 1.730s sys 0.002s
  Time spent evaluating c = 0: 100 batches of 1024 rows for type int8 NULL: real 2.097s user 2.096s sys 0.000s
  Time spent evaluating c >= 0: 100 batches of 1024 rows for type int8 NOT NULL: real 1.755s user 1.756s sys 0.000s
  Time spent evaluating c >= 0: 100 batches of 1024 rows for type int8 NULL: real 2.631s user 2.632s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int8 NOT NULL: real 1.850s user 1.848s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int8 NULL: real 2.808s user 2.808s sys 0.000s
  Time spent evaluating c = 0: 100 batches of 1024 rows for type int16 NOT NULL: real 1.753s user 1.752s sys 0.000s
  Time spent evaluating c = 0: 100 batches of 1024 rows for type int16 NULL: real 2.248s user 2.244s sys 0.000s
  Time spent evaluating c >= 0: 100 batches of 1024 rows for type int16 NOT NULL: real 1.750s user 1.752s sys 0.000s
  Time spent evaluating c >= 0: 100 batches of 1024 rows for type int16 NULL: real 2.420s user 2.416s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int16 NOT NULL: real 1.811s user 1.808s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int16 NULL: real 5.321s user 5.313s sys 0.000s
  Time spent evaluating c = 0: 100 batches of 1024 rows for type int32 NOT NULL: real 1.834s user 1.824s sys 0.000s
  Time spent evaluating c = 0: 100 batches of 1024 rows for type int32 NULL: real 2.233s user 2.232s sys 0.000s
  Time spent evaluating c >= 0: 100 batches of 1024 rows for type int32 NOT NULL: real 1.797s user 1.793s sys 0.000s
  Time spent evaluating c >= 0: 100 batches of 1024 rows for type int32 NULL: real 2.791s user 2.774s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int32 NOT NULL: real 1.873s user 1.869s sys 0.000s
  Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int32 NULL: real 3.104s user 3.071s sys 0.000s
  Time spent evaluating c = 0: 100 batches of 1024 rows for type int64 NOT NULL: real 1.781s user 1.779s sys 0.000s
  Time spent evaluating c = 0: 100 batches of 1024 rows for type int64 NULL: real 2.209s user 2.203s sys 0.000s
  Time spent evaluating c >= 0: