[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, Adar Dembo,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13591

to look at the new patch set (#5).

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..

KUDU-2846 (part 1): optimize predicate evaluation for primitives

This changes to an optimized unrolled-by-8 predicate evaluation for
primitive columns. Performance is improved by up to 7.2x depending on
the particular predicate, type, and nullability (average around 4.8x).
Branches are reduced by about 6.5x and branch-misses by about 22x.

It's possible that hand-coded SIMD could improve on this a little bit
but likely not worth the effort.

perf-stat before:

 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      73905.379627      task-clock (msec)   #    0.997 CPUs utilized
   272,810,081,028      cycles              #    3.691 GHz
   938,488,388,743      instructions        #    3.44  insn per cycle
   148,052,698,322      branches            # 2003.274 M/sec
       882,311,138      branch-misses       #    0.60% of all branches

perf-stat after:

 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      15354.077654      task-clock (msec)   #    0.992 CPUs utilized
    56,850,629,856      cycles              #    3.703 GHz
   181,599,095,960      instructions        #    3.19  insn per cycle
    22,496,453,160      branches            # 1465.178 M/sec
        38,662,626      branch-misses       #    0.17% of all branches

Detailed results before:

int8   NOT NULL  (c = 0)               632.1M evals/sec   4.44 cycles/eval
int8   NULL      (c = 0)               515.6M evals/sec   5.48 cycles/eval
int8   NOT NULL  (c >= 0)              630.8M evals/sec   4.45 cycles/eval
int8   NULL      (c >= 0)              426.8M evals/sec   6.64 cycles/eval
int8   NOT NULL  (c >= 0 AND c < 2)    632.6M evals/sec   4.44 cycles/eval
int8   NULL      (c >= 0 AND c < 2)    384.7M evals/sec   7.38 cycles/eval
int16  NOT NULL  (c = 0)               644.4M evals/sec   4.34 cycles/eval
int16  NULL      (c = 0)               524.6M evals/sec   5.37 cycles/eval
int16  NOT NULL  (c >= 0)              638.4M evals/sec   4.37 cycles/eval
int16  NULL      (c >= 0)              458.8M evals/sec   6.17 cycles/eval
int16  NOT NULL  (c >= 0 AND c < 2)    635.3M evals/sec   4.40 cycles/eval
int16  NULL      (c >= 0 AND c < 2)    335.1M evals/sec   8.50 cycles/eval
int32  NOT NULL  (c = 0)               645.2M evals/sec   4.34 cycles/eval
int32  NULL      (c = 0)               492.6M evals/sec   5.77 cycles/eval
int32  NOT NULL  (c >= 0)              608.6M evals/sec   4.64 cycles/eval
int32  NULL      (c >= 0)              440.7M evals/sec   6.48 cycles/eval
int32  NOT NULL  (c >= 0 AND c < 2)    637.8M evals/sec   4.43 cycles/eval
int32  NULL      (c >= 0 AND c < 2)    348.0M evals/sec   8.22 cycles/eval
int64  NOT NULL  (c = 0)               642.7M evals/sec   4.36 cycles/eval
int64  NULL      (c = 0)               505.3M evals/sec   5.60 cycles/eval
int64  NOT NULL  (c >= 0)              643.5M evals/sec   4.34 cycles/eval
int64  NULL      (c >= 0)              472.8M evals/sec   6.00 cycles/eval
int64  NOT NULL  (c >= 0 AND c < 2)    634.2M evals/sec   4.43 cycles/eval
int64  NULL      (c >= 0 AND c < 2)    396.7M evals/sec   7.21 cycles/eval
float  NOT NULL  (c = 0)               604.6M evals/sec   4.63 cycles/eval
float  NULL      (c = 0)               406.7M evals/sec   7.05 cycles/eval
float  NOT NULL  (c >= 0)              545.3M evals/sec   5.20 cycles/eval
float  NULL      (c >= 0)              384.4M evals/sec   7.39 cycles/eval
float  NOT NULL  (c >= 0 AND c < 2)    583.2M evals/sec   4.80 cycles/eval
float  NULL      (c >= 0 AND c < 2)    312.2M evals/sec   9.12 cycles/eval
double NOT NULL  (c = 0)               614.0M evals/sec   4.56 cycles/eval
double NULL      (c = 0)               471.5M evals/sec   5.99 cycles/eval
double NOT NULL  (c >= 0)              623.0M evals/sec   4.48 cycles/eval
double NULL      (c >= 0)              379.9M evals/sec   7.47 cycles/eval
double NOT NULL  (c >= 0 AND c < 2)    599.5M evals/sec   4.67 cycles/eval
double NULL      (c >= 0 AND c < 2)    415.2M evals/sec   6.82 cycles/eval

Detailed results after:

int8   NOT NULL  (c = 0)              3660.3M evals/sec   0.76 cycles/eval
int8   NULL      (c = 0)              3657.1M evals/sec   0.76 cycles/eval
int8   NOT NULL  (c >= 0)             3712.0M evals/sec   0.75 cycles/eval
int8   NULL      (c >= 0)             3618.9M evals/sec   0.78 cycles/eval
int8   NOT NULL  (c >= 0 AND c < 2)   1661.9M evals/sec   1.73 cycles/eval
int8   NULL      (c >= 0 AND c < 2)   1663.4M evals/sec   1.77 cycles/eval
int16  NOT NULL  (c = 0)              3781.4M evals/sec   0.73 cycles/eval
int16  NULL      (c = 0)              3738.3M evals/sec   0.74 cycles/eval
int16  NOT NULL  (c >= 0)             3672.9M evals/sec   0.76 cycles/eval
int16  NULL      (c >= 0)             3767.4M evals/sec   0.75 cycles/eval
int16  NOT NULL  (c >= 0 AND c < 2)   1654.3M evals/sec   1.77 cycles/eval
int16
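For readers unfamiliar with the technique the commit message names, a minimal sketch of unrolled-by-8, branch-free predicate evaluation follows. It is not the code from the patch: the function name `EvalGreaterEqual`, the selection-bitmap layout (one bit per row, least significant bit first) and the scalar tail loop are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not Kudu's API. Evaluates `data[i] >= bound` for n rows
// and ANDs each result into a selection bitmap where row i is bit (i % 8) of
// sel[i / 8]. Rows that fail the predicate are deselected.
template <typename T>
void EvalGreaterEqual(const T* data, size_t n, T bound, uint8_t* sel) {
  assert(data != nullptr && sel != nullptr);
  const size_t chunks = n / 8;
  for (size_t c = 0; c < chunks; ++c) {
    const T* d = data + c * 8;
    // Eight comparisons with no data-dependent branch: each comparison
    // yields 0 or 1, which is shifted into its own bit position.
    unsigned bits = 0;
    bits |= static_cast<unsigned>(d[0] >= bound) << 0;
    bits |= static_cast<unsigned>(d[1] >= bound) << 1;
    bits |= static_cast<unsigned>(d[2] >= bound) << 2;
    bits |= static_cast<unsigned>(d[3] >= bound) << 3;
    bits |= static_cast<unsigned>(d[4] >= bound) << 4;
    bits |= static_cast<unsigned>(d[5] >= bound) << 5;
    bits |= static_cast<unsigned>(d[6] >= bound) << 6;
    bits |= static_cast<unsigned>(d[7] >= bound) << 7;
    // One byte-wide update of the selection bitmap per 8 rows.
    sel[c] &= static_cast<uint8_t>(bits);
  }
  // Scalar tail for the last n % 8 rows: clear the row's bit if it fails.
  for (size_t i = chunks * 8; i < n; ++i) {
    const unsigned keep = static_cast<unsigned>(data[i] >= bound);
    sel[i / 8] &= static_cast<uint8_t>(~((keep ^ 1u) << (i % 8)));
  }
}
```

The point of the shape is that the loop body carries one loop-control branch per eight rows instead of one or more branches per row, which is consistent with the drop in branches and branch-misses reported above.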
[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, Adar Dembo,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13591

to look at the new patch set (#4).

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..

KUDU-2846 (part 1): optimize predicate evaluation for primitives

This changes to an optimized unrolled-by-8 predicate evaluation for
primitive columns. Performance is improved by 1.6-2.5x depending on the
particular predicate, type, and nullability (average around 2x).
Branches are reduced by about 7.5x and branch-misses by about 19.6x.

Looking at the "after" perf-stat results, the instructions-per-cycle are
way down, which indicates we're probably stalled on instruction
dependencies or port saturation. This is also indicated by the fact that
the smaller ints don't seem to run any faster than the large ints (which
wouldn't be the case if we were limited by load/store bandwidth).

Likely the next fix here is to use SIMD to do comparisons in parallel as
suggested in the JIRA. Unfortunately, the compiler doesn't seem to
auto-vectorize these loops, so if we want further gain, we'll have to
add some more hand-written vectorization code. So, we'll start with this
easy win.
perf-stat before:

 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      73905.379627      task-clock (msec)   #    0.997 CPUs utilized
               504      context-switches    #    0.007 K/sec
                19      cpu-migrations      #    0.000 K/sec
             1,296      page-faults         #    0.018 K/sec
   272,810,081,028      cycles              #    3.691 GHz
   938,488,388,743      instructions        #    3.44  insn per cycle
   148,052,698,322      branches            # 2003.274 M/sec
       882,311,138      branch-misses       #    0.60% of all branches

perf-stat after:

 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      38024.082495      task-clock (msec)   #    0.996 CPUs utilized
               252      context-switches    #    0.007 K/sec
                 7      cpu-migrations      #    0.000 K/sec
             1,295      page-faults         #    0.034 K/sec
   142,231,469,257      cycles              #    3.741 GHz
   172,437,810,470      instructions        #    1.21  insn per cycle
    18,460,117,439      branches            #  485.485 M/sec
        60,960,125      branch-misses       #    0.33% of all branches

Detailed results before:

int8   NOT NULL  (c = 0)               632.1M evals/sec   4.44 cycles/eval
int8   NULL      (c = 0)               515.6M evals/sec   5.48 cycles/eval
int8   NOT NULL  (c >= 0)              630.8M evals/sec   4.45 cycles/eval
int8   NULL      (c >= 0)              426.8M evals/sec   6.64 cycles/eval
int8   NOT NULL  (c >= 0 AND c < 2)    632.6M evals/sec   4.44 cycles/eval
int8   NULL      (c >= 0 AND c < 2)    384.7M evals/sec   7.38 cycles/eval
int16  NOT NULL  (c = 0)               644.4M evals/sec   4.34 cycles/eval
int16  NULL      (c = 0)               524.6M evals/sec   5.37 cycles/eval
int16  NOT NULL  (c >= 0)              638.4M evals/sec   4.37 cycles/eval
int16  NULL      (c >= 0)              458.8M evals/sec   6.17 cycles/eval
int16  NOT NULL  (c >= 0 AND c < 2)    635.3M evals/sec   4.40 cycles/eval
int16  NULL      (c >= 0 AND c < 2)    335.1M evals/sec   8.50 cycles/eval
int32  NOT NULL  (c = 0)               645.2M evals/sec   4.34 cycles/eval
int32  NULL      (c = 0)               492.6M evals/sec   5.77 cycles/eval
int32  NOT NULL  (c >= 0)              608.6M evals/sec   4.64 cycles/eval
int32  NULL      (c >= 0)              440.7M evals/sec   6.48 cycles/eval
int32  NOT NULL  (c >= 0 AND c < 2)    637.8M evals/sec   4.43 cycles/eval
int32  NULL      (c >= 0 AND c < 2)    348.0M evals/sec   8.22 cycles/eval
int64  NOT NULL  (c = 0)               642.7M evals/sec   4.36 cycles/eval
int64  NULL      (c = 0)               505.3M evals/sec   5.60 cycles/eval
int64  NOT NULL  (c >= 0)              643.5M evals/sec   4.34 cycles/eval
int64  NULL      (c >= 0)              472.8M evals/sec   6.00 cycles/eval
int64  NOT NULL  (c >= 0 AND c < 2)    634.2M evals/sec   4.43 cycles/eval
int64  NULL      (c >= 0 AND c < 2)    396.7M evals/sec   7.21 cycles/eval
float  NOT NULL  (c = 0)               604.6M evals/sec   4.63 cycles/eval
float  NULL      (c = 0)               406.7M evals/sec   7.05 cycles/eval
float  NOT NULL  (c >= 0)              545.3M evals/sec   5.20 cycles/eval
float  NULL      (c >= 0)              384.4M evals/sec   7.39 cycles/eval
float  NOT NULL  (c >= 0 AND c < 2)    583.2M evals/sec   4.80 cycles/eval
float  NULL      (c >= 0 AND c < 2)    312.2M evals/sec   9.12 cycles/eval
double NOT NULL  (c = 0)               614.0M evals/sec   4.56 cycles/eval
double NULL      (c = 0)               471.5M evals/sec   5.99 cycles/eval
double NOT NULL  (c >= 0)              623.0M evals/sec   4.48 cycles/eval
double NULL      (c >= 0)              379.9M evals/sec   7.47 cycles/eval
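The summary ratios and the instructions-per-cycle figures can be re-derived from the raw counters. The sketch below recomputes them from the two perf-stat blocks in this message (patch set 4); the struct and function names are made up for illustration and are not part of the patch or of perf.

```cpp
#include <cassert>

// Hypothetical container for the four counters quoted in the perf-stat output.
struct PerfCounters {
  double cycles;
  double instructions;
  double branches;
  double branch_misses;
};

// Instructions retired per cycle, the "insn per cycle" column in perf-stat.
inline double Ipc(const PerfCounters& c) {
  assert(c.cycles > 0);
  return c.instructions / c.cycles;
}

// How many times smaller the "after" value is than the "before" value.
inline double Ratio(double before, double after) {
  assert(after > 0);
  return before / after;
}

// Counter values copied from the perf-stat blocks of patch set 4.
const PerfCounters kBefore = {272810081028.0, 938488388743.0,
                              148052698322.0, 882311138.0};
const PerfCounters kAfter = {142231469257.0, 172437810470.0,
                             18460117439.0, 60960125.0};
```

With these inputs, `Ipc(kBefore)` is about 3.44 and `Ipc(kAfter)` about 1.21, matching the perf-stat output, and `Ratio(kBefore.cycles, kAfter.cycles)` is about 1.92, in line with the "around 2x" average. The branch and branch-miss ratios come out near 8.0x and 14.5x; the 7.5x and 19.6x quoted in the prose match the counters posted with patch sets 1 through 3.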
[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13591 )

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..

Patch Set 3:

(3 comments)

Code looks fine but the tests seem angry.

http://gerrit.cloudera.org:8080/#/c/13591/3/src/kudu/common/column_predicate-test.cc
File src/kudu/common/column_predicate-test.cc:

http://gerrit.cloudera.org:8080/#/c/13591/3/src/kudu/common/column_predicate-test.cc@1538
PS3, Line 1538: num_ret += selvec.CountSelected();
Should probably check the final value of num_ret.

http://gerrit.cloudera.org:8080/#/c/13591/3/src/kudu/common/column_predicate.cc
File src/kudu/common/column_predicate.cc:

http://gerrit.cloudera.org:8080/#/c/13591/3/src/kudu/common/column_predicate.cc@654
PS3, Line 654: return 0x8040201008040201 * t >> 56;
:exploding_head:

http://gerrit.cloudera.org:8080/#/c/13591/3/src/kudu/common/column_predicate.cc@686
PS3, Line 686: return n_chunks * 8;
nit: Could you doc the return value?

--
To view, visit http://gerrit.cloudera.org:8080/13591
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13591
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon
Gerrit-Reviewer: Adar Dembo
Gerrit-Reviewer: Andrew Wong
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon
Gerrit-Comment-Date: Wed, 12 Jun 2019 02:36:22 +
Gerrit-HasComments: Yes
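The line flagged at column_predicate.cc@654 packs eight one-byte booleans into the bits of a single byte with one multiply and one shift. A sketch of why the constant works follows. How `t` is loaded from memory in the patch is not shown in this thread, so the explicit byte placement in `PackBools` (a hypothetical name) is an assumption, and under that assumption the first boolean ends up in the most significant bit of the result.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration. Each bools[i] must be 0 or 1.
inline uint8_t PackBools(const uint8_t bools[8]) {
  // Assumed layout: byte i of `t` holds bools[i], so bit 8*i is set
  // exactly when bools[i] is 1.
  uint64_t t = 0;
  for (int i = 0; i < 8; ++i) {
    assert(bools[i] <= 1);
    t |= static_cast<uint64_t>(bools[i]) << (8 * i);
  }
  // The constant has bit 9*j set for j = 0..7. Multiplying copies the bit at
  // position 8*i to position 8*i + 9*j = 8*(i + j) + j for every j.
  //  - Terms with i + j == 7 land on bits 56..63 (bit 56 + j).
  //  - Terms with i + j < 7 stay below bit 56.
  //  - Terms with i + j > 7 overflow out of the 64-bit product.
  // No two (i, j) pairs map to the same bit, so no carries occur, and the
  // top byte holds bools[7 - k] in bit k.
  return static_cast<uint8_t>((0x8040201008040201ULL * t) >> 56);
}
```

For example, `{1,0,0,0,0,0,0,0}` packs to `0x80` and `{0,0,0,0,0,0,0,1}` packs to `0x01` under this layout. Placing `bools[i]` in byte `7 - i` of `t` instead would reverse the bit order of the result.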
[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives
Andrew Wong has removed Andrew Wong from this change. ( http://gerrit.cloudera.org:8080/13591 )

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..

Removed reviewer Andrew Wong.

--
To view, visit http://gerrit.cloudera.org:8080/13591
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteReviewer
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13591
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon
Gerrit-Reviewer: Adar Dembo
Gerrit-Reviewer: Andrew Wong
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives
Hello Tidy Bot, Andrew Wong, Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13591

to look at the new patch set (#3).

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..

KUDU-2846 (part 1): optimize predicate evaluation for primitives

This changes to an optimized unrolled-by-8 predicate evaluation for
primitive columns. Performance is improved by 1.6-2.5x depending on the
particular predicate, type, and nullability (average around 2x).
Branches are reduced by about 7.5x and branch-misses by about 19.6x.

Looking at the "after" perf-stat results, the instructions-per-cycle are
way down, which indicates we're probably stalled on instruction
dependencies or port saturation. This is also indicated by the fact that
the smaller ints don't seem to run any faster than the large ints (which
wouldn't be the case if we were limited by load/store bandwidth).

Likely the next fix here is to use SIMD to do comparisons in parallel as
suggested in the JIRA. Unfortunately, the compiler doesn't seem to
auto-vectorize these loops, so if we want further gain, we'll have to
add some more hand-written vectorization code. So, we'll start with this
easy win.
perf-stat before:

 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      82185.366028      task-clock (msec)   #    0.997 CPUs utilized
   288,909,311,749      cycles              #    3.515 GHz
   956,410,925,173      instructions        #    3.31  insn per cycle
   149,468,823,714      branches            # 1818.679 M/sec
     1,237,139,955      branch-misses       #    0.83% of all branches

perf-stat after:

 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      42626.067916      task-clock (msec)   #    0.996 CPUs utilized
   149,363,412,476      cycles              #    3.504 GHz
   190,514,045,889      instructions        #    1.28  insn per cycle
    19,902,815,659      branches            #  466.917 M/sec
        63,130,874      branch-misses       #    0.32% of all branches

Detailed results before:

int8   NOT NULL  (c = 0)               573.9M evals/sec   4.78 cycles/eval
int8   NULL      (c = 0)               456.2M evals/sec   6.14 cycles/eval
int8   NOT NULL  (c >= 0)              573.5M evals/sec   4.79 cycles/eval
int8   NULL      (c >= 0)              420.3M evals/sec   6.71 cycles/eval
int8   NOT NULL  (c >= 0 AND c < 2)    565.1M evals/sec   4.87 cycles/eval
int8   NULL      (c >= 0 AND c < 2)    372.0M evals/sec   7.53 cycles/eval
int16  NOT NULL  (c = 0)               577.0M evals/sec   4.75 cycles/eval
int16  NULL      (c = 0)               460.5M evals/sec   6.06 cycles/eval
int16  NOT NULL  (c >= 0)              568.9M evals/sec   4.80 cycles/eval
int16  NULL      (c >= 0)              400.4M evals/sec   6.96 cycles/eval
int16  NOT NULL  (c >= 0 AND c < 2)    577.9M evals/sec   4.73 cycles/eval
int16  NULL      (c >= 0 AND c < 2)    299.4M evals/sec   9.40 cycles/eval
int32  NOT NULL  (c = 0)               543.8M evals/sec   5.05 cycles/eval
int32  NULL      (c = 0)               446.2M evals/sec   6.21 cycles/eval
int32  NOT NULL  (c >= 0)              565.5M evals/sec   4.84 cycles/eval
int32  NULL      (c >= 0)              380.4M evals/sec   7.36 cycles/eval
int32  NOT NULL  (c >= 0 AND c < 2)    561.8M evals/sec   4.91 cycles/eval
int32  NULL      (c >= 0 AND c < 2)    308.6M evals/sec   9.18 cycles/eval
int64  NOT NULL  (c = 0)               566.6M evals/sec   4.88 cycles/eval
int64  NULL      (c = 0)               463.9M evals/sec   6.07 cycles/eval
int64  NOT NULL  (c >= 0)              555.5M evals/sec   4.97 cycles/eval
int64  NULL      (c >= 0)              385.3M evals/sec   7.28 cycles/eval
int64  NOT NULL  (c >= 0 AND c < 2)    567.1M evals/sec   4.83 cycles/eval
int64  NULL      (c >= 0 AND c < 2)    194.7M evals/sec  14.61 cycles/eval
float  NOT NULL  (c = 0)               584.5M evals/sec   4.68 cycles/eval
float  NULL      (c = 0)               441.4M evals/sec   6.29 cycles/eval
float  NOT NULL  (c >= 0)              576.6M evals/sec   4.74 cycles/eval
float  NULL      (c >= 0)              361.1M evals/sec   7.74 cycles/eval
float  NOT NULL  (c >= 0 AND c < 2)    577.9M evals/sec   4.73 cycles/eval
float  NULL      (c >= 0 AND c < 2)    301.5M evals/sec   9.34 cycles/eval
double NOT NULL  (c = 0)               589.9M evals/sec   4.64 cycles/eval
double NULL      (c = 0)               450.0M evals/sec   6.15 cycles/eval
double NOT NULL  (c >= 0)              571.5M evals/sec   4.78 cycles/eval
double NULL      (c >= 0)              367.8M evals/sec   7.60 cycles/eval
double NOT NULL  (c >= 0 AND c < 2)    577.8M evals/sec   4.77 cycles/eval
double NULL      (c >= 0 AND c < 2)    429.5M evals/sec   6.49 cycles/eval

Detailed results after:

int8   NOT NULL  (c = 0)               926.7M evals/sec   3.01 cycles/eval
int8   NULL      (c = 0)               935.2M evals/sec   2.98 cycles/eval
int8   NOT NULL  (c >= 0)              913.6M evals/sec   3.03 cycles/eval
int8   NULL      (c >= 0)              903.2M
[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives
Hello Tidy Bot, Andrew Wong, Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13591

to look at the new patch set (#2).

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..

KUDU-2846 (part 1): optimize predicate evaluation for primitives

This changes to an optimized unrolled-by-8 predicate evaluation for
primitive columns. Performance is improved by 1.6-2.5x depending on the
particular predicate, type, and nullability (average around 2x).
Branches are reduced by about 7.5x and branch-misses by about 19.6x.

Looking at the "after" perf-stat results, the instructions-per-cycle are
way down, which indicates we're probably stalled on instruction
dependencies or port saturation. This is also indicated by the fact that
the smaller ints don't seem to run any faster than the large ints (which
wouldn't be the case if we were limited by load/store bandwidth).

Likely the next fix here is to use SIMD to do comparisons in parallel as
suggested in the JIRA. Unfortunately, the compiler doesn't seem to
auto-vectorize these loops, so if we want further gain, we'll have to
add some more hand-written vectorization code. So, we'll start with this
easy win.
perf-stat before:

 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      82185.366028      task-clock (msec)   #    0.997 CPUs utilized
   288,909,311,749      cycles              #    3.515 GHz
   956,410,925,173      instructions        #    3.31  insn per cycle
   149,468,823,714      branches            # 1818.679 M/sec
     1,237,139,955      branch-misses       #    0.83% of all branches

perf-stat after:

 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      42626.067916      task-clock (msec)   #    0.996 CPUs utilized
   149,363,412,476      cycles              #    3.504 GHz
   190,514,045,889      instructions        #    1.28  insn per cycle
    19,902,815,659      branches            #  466.917 M/sec
        63,130,874      branch-misses       #    0.32% of all branches

Detailed results before:

Time spent evaluating c = 0: 100 batches of 1024 rows for type int8 NOT NULL: real 1.730s user 1.730s sys 0.002s
Time spent evaluating c = 0: 100 batches of 1024 rows for type int8 NULL: real 2.097s user 2.096s sys 0.000s
Time spent evaluating c >= 0: 100 batches of 1024 rows for type int8 NOT NULL: real 1.755s user 1.756s sys 0.000s
Time spent evaluating c >= 0: 100 batches of 1024 rows for type int8 NULL: real 2.631s user 2.632s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int8 NOT NULL: real 1.850s user 1.848s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int8 NULL: real 2.808s user 2.808s sys 0.000s
Time spent evaluating c = 0: 100 batches of 1024 rows for type int16 NOT NULL: real 1.753s user 1.752s sys 0.000s
Time spent evaluating c = 0: 100 batches of 1024 rows for type int16 NULL: real 2.248s user 2.244s sys 0.000s
Time spent evaluating c >= 0: 100 batches of 1024 rows for type int16 NOT NULL: real 1.750s user 1.752s sys 0.000s
Time spent evaluating c >= 0: 100 batches of 1024 rows for type int16 NULL: real 2.420s user 2.416s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int16 NOT NULL: real 1.811s user 1.808s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int16 NULL: real 5.321s user 5.313s sys 0.000s
Time spent evaluating c = 0: 100 batches of 1024 rows for type int32 NOT NULL: real 1.834s user 1.824s sys 0.000s
Time spent evaluating c = 0: 100 batches of 1024 rows for type int32 NULL: real 2.233s user 2.232s sys 0.000s
Time spent evaluating c >= 0: 100 batches of 1024 rows for type int32 NOT NULL: real 1.797s user 1.793s sys 0.000s
Time spent evaluating c >= 0: 100 batches of 1024 rows for type int32 NULL: real 2.791s user 2.774s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int32 NOT NULL: real 1.873s user 1.869s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int32 NULL: real 3.104s user 3.071s sys 0.000s
Time spent evaluating c = 0: 100 batches of 1024 rows for type int64 NOT NULL: real 1.781s user 1.779s sys 0.000s
Time spent evaluating c = 0: 100 batches of 1024 rows for type int64 NULL: real 2.209s user 2.203s sys 0.000s
Time spent evaluating c >= 0: 100 batches of 1024 rows for type int64 NOT NULL: real
[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/13591 )

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..

Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13591/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13591/1//COMMIT_MSG@41
PS1, Line 41: perf-stat after:
> Could you include the time elapsed for 'after' too?
oh, I meant to actually remove it from 'before' because 'task-clock' is
the same thing (it's a single-threaded CPU bound workload)

--
To view, visit http://gerrit.cloudera.org:8080/13591
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13591
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon
Gerrit-Reviewer: Adar Dembo
Gerrit-Reviewer: Andrew Wong
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon
Gerrit-Comment-Date: Tue, 11 Jun 2019 23:50:32 +
Gerrit-HasComments: Yes
[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/13591 )

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..

Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13591/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13591/1//COMMIT_MSG@41
PS1, Line 41: perf-stat after:
Could you include the time elapsed for 'after' too?

--
To view, visit http://gerrit.cloudera.org:8080/13591
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13591
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon
Gerrit-Reviewer: Adar Dembo
Gerrit-Reviewer: Andrew Wong
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 11 Jun 2019 23:42:20 +
Gerrit-HasComments: Yes
[kudu-CR] KUDU-2846 (part 1): optimize predicate evaluation for primitives
Hello Andrew Wong,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/13591

to review the following change.

Change subject: KUDU-2846 (part 1): optimize predicate evaluation for primitives
..

KUDU-2846 (part 1): optimize predicate evaluation for primitives

This changes to an optimized unrolled-by-8 predicate evaluation for
primitive columns. Performance is improved by 1.6-2.5x depending on the
particular predicate, type, and nullability (average around 2x).
Branches are reduced by about 7.5x and branch-misses by about 19.6x.

Looking at the "after" perf-stat results, the instructions-per-cycle are
way down, which indicates we're probably stalled on instruction
dependencies or port saturation. This is also indicated by the fact that
the smaller ints don't seem to run any faster than the large ints (which
wouldn't be the case if we were limited by load/store bandwidth).

Likely the next fix here is to use SIMD to do comparisons in parallel as
suggested in the JIRA. Unfortunately, the compiler doesn't seem to
auto-vectorize these loops, so if we want further gain, we'll have to
add some more hand-written vectorization code. So, we'll start with this
easy win.
perf-stat before:

 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      82185.366028      task-clock (msec)   #    0.997 CPUs utilized
   288,909,311,749      cycles              #    3.515 GHz
   956,410,925,173      instructions        #    3.31  insn per cycle
   149,468,823,714      branches            # 1818.679 M/sec
     1,237,139,955      branch-misses       #    0.83% of all branches

      82.398392581 seconds time elapsed

      82.132012000 seconds user
       0.055937000 seconds sys

perf-stat after:

 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':

      42626.067916      task-clock (msec)   #    0.996 CPUs utilized
   149,363,412,476      cycles              #    3.504 GHz
   190,514,045,889      instructions        #    1.28  insn per cycle
    19,902,815,659      branches            #  466.917 M/sec
        63,130,874      branch-misses       #    0.32% of all branches

Detailed results before:

Time spent evaluating c = 0: 100 batches of 1024 rows for type int8 NOT NULL: real 1.730s user 1.730s sys 0.002s
Time spent evaluating c = 0: 100 batches of 1024 rows for type int8 NULL: real 2.097s user 2.096s sys 0.000s
Time spent evaluating c >= 0: 100 batches of 1024 rows for type int8 NOT NULL: real 1.755s user 1.756s sys 0.000s
Time spent evaluating c >= 0: 100 batches of 1024 rows for type int8 NULL: real 2.631s user 2.632s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int8 NOT NULL: real 1.850s user 1.848s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int8 NULL: real 2.808s user 2.808s sys 0.000s
Time spent evaluating c = 0: 100 batches of 1024 rows for type int16 NOT NULL: real 1.753s user 1.752s sys 0.000s
Time spent evaluating c = 0: 100 batches of 1024 rows for type int16 NULL: real 2.248s user 2.244s sys 0.000s
Time spent evaluating c >= 0: 100 batches of 1024 rows for type int16 NOT NULL: real 1.750s user 1.752s sys 0.000s
Time spent evaluating c >= 0: 100 batches of 1024 rows for type int16 NULL: real 2.420s user 2.416s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int16 NOT NULL: real 1.811s user 1.808s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int16 NULL: real 5.321s user 5.313s sys 0.000s
Time spent evaluating c = 0: 100 batches of 1024 rows for type int32 NOT NULL: real 1.834s user 1.824s sys 0.000s
Time spent evaluating c = 0: 100 batches of 1024 rows for type int32 NULL: real 2.233s user 2.232s sys 0.000s
Time spent evaluating c >= 0: 100 batches of 1024 rows for type int32 NOT NULL: real 1.797s user 1.793s sys 0.000s
Time spent evaluating c >= 0: 100 batches of 1024 rows for type int32 NULL: real 2.791s user 2.774s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int32 NOT NULL: real 1.873s user 1.869s sys 0.000s
Time spent evaluating c >= 0 AND c < 2: 100 batches of 1024 rows for type int32 NULL: real 3.104s user 3.071s sys 0.000s
Time spent evaluating c = 0: 100 batches of 1024 rows for type int64 NOT NULL: real 1.781s user 1.779s sys 0.000s
Time spent evaluating c = 0: 100 batches of 1024 rows for type int64 NULL: real 2.209s user 2.203s sys 0.000s
Time spent evaluating c >= 0: