Hello Tidy Bot, Kudu Jenkins, Andrew Wong, Adar Dembo,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13591

to look at the new patch set (#4).

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
......................................................................

KUDU-2846 (part 1): optimize predicate evaluation for primitives

This change switches primitive columns to an optimized, unrolled-by-8
predicate evaluation. Performance is improved by 1.6-2.5x depending on
the particular predicate, type, and nullability (average around 2x).
Branches are reduced by about 7.5x and branch-misses by about 19.6x.

Looking at the "after" perf-stat results, instructions per cycle are
way down (3.44 to 1.21), which indicates we're probably stalled on
instruction dependencies or port saturation. This is also indicated by
the fact that the smaller ints don't seem to run any faster than the
large ints (which wouldn't be the case if we were limited by load/store
bandwidth).

Likely the next fix here is to use SIMD to do comparisons in parallel,
as suggested in the JIRA. Unfortunately, the compiler doesn't seem to
auto-vectorize these loops, so if we want further gains, we'll have to
add some hand-written vectorization code. So, we'll start with this
easy win.

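For readers unfamiliar with the technique, here is a minimal sketch of
what an unrolled-by-8, branch-free range predicate could look like. It
is not the code in this patch: the names EvalRangeUnrolled8 and
SelectInRange are hypothetical, and the sketch assumes a selection
bitmap with one bit per row (least significant bit first), so that
eight comparisons fill exactly one byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical kernel (not Kudu's API): ANDs the result of
// (lo <= data[i] && data[i] < hi) into the bitmap 'sel', one bit per row.
template <typename T>
void EvalRangeUnrolled8(const T* data, size_t n, T lo, T hi, uint8_t* sel) {
  // Bitwise '&' on 0/1 values keeps the comparison free of branches.
  auto match = [lo, hi](T v) -> uint8_t {
    return static_cast<uint8_t>(static_cast<uint8_t>(v >= lo) &
                                static_cast<uint8_t>(v < hi));
  };

  size_t i = 0;
  // Main loop: 8 rows per iteration, packed into one bitmap byte.
  for (; i + 8 <= n; i += 8) {
    const T* p = data + i;
    const uint8_t bits = static_cast<uint8_t>(
        match(p[0])        | (match(p[1]) << 1) |
        (match(p[2]) << 2) | (match(p[3]) << 3) |
        (match(p[4]) << 4) | (match(p[5]) << 5) |
        (match(p[6]) << 6) | (match(p[7]) << 7));
    sel[i / 8] &= bits;
  }
  // Tail: fewer than 8 rows remain, so clear bits one at a time.
  for (; i < n; ++i) {
    if (!match(data[i])) {
      sel[i / 8] &= static_cast<uint8_t>(~(1u << (i % 8)));
    }
  }
}

// Hypothetical convenience wrapper: starts with every row selected and
// applies the predicate. Padding bits past the last row are left set.
std::vector<uint8_t> SelectInRange(const std::vector<int32_t>& col,
                                   int32_t lo, int32_t hi) {
  std::vector<uint8_t> sel((col.size() + 7) / 8, 0xFF);
  EvalRangeUnrolled8(col.data(), col.size(), lo, hi, sel.data());
  return sel;
}
```

The main loop has a single loop branch per 8 rows and no
data-dependent branches, which is the kind of structure that would be
expected to cut both branches and branch-misses. Under the same
assumptions, a nullable column could be handled by also ANDing each
byte with the column's non-null bitmap.
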
perf-stat before:

 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      73905.379627      task-clock (msec)         #    0.997 CPUs utilized
               504      context-switches          #    0.007 K/sec
                19      cpu-migrations            #    0.000 K/sec
             1,296      page-faults               #    0.018 K/sec
   272,810,081,028      cycles                    #    3.691 GHz
   938,488,388,743      instructions              #    3.44  insn per cycle
   148,052,698,322      branches                  # 2003.274 M/sec
       882,311,138      branch-misses             #    0.60% of all branches

perf-stat after:

 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      38024.082495      task-clock (msec)         #    0.996 CPUs utilized
               252      context-switches          #    0.007 K/sec
                 7      cpu-migrations            #    0.000 K/sec
             1,295      page-faults               #    0.034 K/sec
   142,231,469,257      cycles                    #    3.741 GHz
   172,437,810,470      instructions              #    1.21  insn per cycle
    18,460,117,439      branches                  #  485.485 M/sec
        60,960,125      branch-misses             #    0.33% of all branches

Detailed results before:

  int8    NOT NULL  (c = 0)               632.1M evals/sec  4.44 cycles/eval
  int8    NULL      (c = 0)               515.6M evals/sec  5.48 cycles/eval
  int8    NOT NULL  (c >= 0)              630.8M evals/sec  4.45 cycles/eval
  int8    NULL      (c >= 0)              426.8M evals/sec  6.64 cycles/eval
  int8    NOT NULL  (c >= 0 AND c < 2)    632.6M evals/sec  4.44 cycles/eval
  int8    NULL      (c >= 0 AND c < 2)    384.7M evals/sec  7.38 cycles/eval
  int16   NOT NULL  (c = 0)               644.4M evals/sec  4.34 cycles/eval
  int16   NULL      (c = 0)               524.6M evals/sec  5.37 cycles/eval
  int16   NOT NULL  (c >= 0)              638.4M evals/sec  4.37 cycles/eval
  int16   NULL      (c >= 0)              458.8M evals/sec  6.17 cycles/eval
  int16   NOT NULL  (c >= 0 AND c < 2)    635.3M evals/sec  4.40 cycles/eval
  int16   NULL      (c >= 0 AND c < 2)    335.1M evals/sec  8.50 cycles/eval
  int32   NOT NULL  (c = 0)               645.2M evals/sec  4.34 cycles/eval
  int32   NULL      (c = 0)               492.6M evals/sec  5.77 cycles/eval
  int32   NOT NULL  (c >= 0)              608.6M evals/sec  4.64 cycles/eval
  int32   NULL      (c >= 0)              440.7M evals/sec  6.48 cycles/eval
  int32   NOT NULL  (c >= 0 AND c < 2)    637.8M evals/sec  4.43 cycles/eval
  int32   NULL      (c >= 0 AND c < 2)    348.0M evals/sec  8.22 cycles/eval
  int64   NOT NULL  (c = 0)               642.7M evals/sec  4.36 cycles/eval
  int64   NULL      (c = 0)               505.3M evals/sec  5.60 cycles/eval
  int64   NOT NULL  (c >= 0)              643.5M evals/sec  4.34 cycles/eval
  int64   NULL      (c >= 0)              472.8M evals/sec  6.00 cycles/eval
  int64   NOT NULL  (c >= 0 AND c < 2)    634.2M evals/sec  4.43 cycles/eval
  int64   NULL      (c >= 0 AND c < 2)    396.7M evals/sec  7.21 cycles/eval
  float   NOT NULL  (c = 0)               604.6M evals/sec  4.63 cycles/eval
  float   NULL      (c = 0)               406.7M evals/sec  7.05 cycles/eval
  float   NOT NULL  (c >= 0)              545.3M evals/sec  5.20 cycles/eval
  float   NULL      (c >= 0)              384.4M evals/sec  7.39 cycles/eval
  float   NOT NULL  (c >= 0 AND c < 2)    583.2M evals/sec  4.80 cycles/eval
  float   NULL      (c >= 0 AND c < 2)    312.2M evals/sec  9.12 cycles/eval
  double  NOT NULL  (c = 0)               614.0M evals/sec  4.56 cycles/eval
  double  NULL      (c = 0)               471.5M evals/sec  5.99 cycles/eval
  double  NOT NULL  (c >= 0)              623.0M evals/sec  4.48 cycles/eval
  double  NULL      (c >= 0)              379.9M evals/sec  7.47 cycles/eval
  double  NOT NULL  (c >= 0 AND c < 2)    599.5M evals/sec  4.67 cycles/eval
  double  NULL      (c >= 0 AND c < 2)    415.2M evals/sec  6.82 cycles/eval

Detailed results after:

  int8    NOT NULL  (c = 0)              1053.2M evals/sec  2.74 cycles/eval
  int8    NULL      (c = 0)              1044.6M evals/sec  2.77 cycles/eval
  int8    NOT NULL  (c >= 0)             1044.6M evals/sec  2.77 cycles/eval
  int8    NULL      (c >= 0)             1045.0M evals/sec  2.76 cycles/eval
  int8    NOT NULL  (c >= 0 AND c < 2)    943.8M evals/sec  3.03 cycles/eval
  int8    NULL      (c >= 0 AND c < 2)    933.9M evals/sec  3.07 cycles/eval
  int16   NOT NULL  (c = 0)              1039.2M evals/sec  2.78 cycles/eval
  int16   NULL      (c = 0)              1037.2M evals/sec  2.79 cycles/eval
  int16   NOT NULL  (c >= 0)             1041.2M evals/sec  2.78 cycles/eval
  int16   NULL      (c >= 0)             1049.2M evals/sec  2.76 cycles/eval
  int16   NOT NULL  (c >= 0 AND c < 2)    948.3M evals/sec  3.00 cycles/eval
  int16   NULL      (c >= 0 AND c < 2)    951.1M evals/sec  2.99 cycles/eval
  int32   NOT NULL  (c = 0)              1049.5M evals/sec  2.74 cycles/eval
  int32   NULL      (c = 0)              1050.3M evals/sec  2.74 cycles/eval
  int32   NOT NULL  (c >= 0)             1040.9M evals/sec  2.76 cycles/eval
  int32   NULL      (c >= 0)             1050.1M evals/sec  2.75 cycles/eval
  int32   NOT NULL  (c >= 0 AND c < 2)    944.7M evals/sec  2.99 cycles/eval
  int32   NULL      (c >= 0 AND c < 2)    931.0M evals/sec  3.03 cycles/eval
  int64   NOT NULL  (c = 0)              1040.7M evals/sec  2.75 cycles/eval
  int64   NULL      (c = 0)              1040.8M evals/sec  2.76 cycles/eval
  int64   NOT NULL  (c >= 0)             1036.6M evals/sec  2.77 cycles/eval
  int64   NULL      (c >= 0)             1044.9M evals/sec  2.75 cycles/eval
  int64   NOT NULL  (c >= 0 AND c < 2)    941.2M evals/sec  3.02 cycles/eval
  int64   NULL      (c >= 0 AND c < 2)    930.9M evals/sec  3.04 cycles/eval
  float   NOT NULL  (c = 0)              1040.6M evals/sec  2.77 cycles/eval
  float   NULL      (c = 0)              1035.7M evals/sec  2.78 cycles/eval
  float   NOT NULL  (c >= 0)              960.5M evals/sec  3.00 cycles/eval
  float   NULL      (c >= 0)              955.2M evals/sec  3.01 cycles/eval
  float   NOT NULL  (c >= 0 AND c < 2)    797.5M evals/sec  3.56 cycles/eval
  float   NULL      (c >= 0 AND c < 2)    797.6M evals/sec  3.56 cycles/eval
  double  NOT NULL  (c = 0)              1036.4M evals/sec  2.77 cycles/eval
  double  NULL      (c = 0)               988.7M evals/sec  2.91 cycles/eval
  double  NOT NULL  (c >= 0)              924.2M evals/sec  3.11 cycles/eval
  double  NULL      (c >= 0)              930.9M evals/sec  3.10 cycles/eval
  double  NOT NULL  (c >= 0 AND c < 2)    800.0M evals/sec  3.55 cycles/eval
  double  NULL      (c >= 0 AND c < 2)    802.5M evals/sec  3.52 cycles/eval

Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
---
M src/kudu/common/CMakeLists.txt
M src/kudu/common/column_predicate-test.cc
M src/kudu/common/column_predicate.cc
3 files changed, 152 insertions(+), 13 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/91/13591/4
--
To view, visit http://gerrit.cloudera.org:8080/13591
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13591
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>