Github user revans2 commented on the pull request:
https://github.com/apache/storm/pull/765#issuecomment-147124664
I have some new test results. I did a comparison of several different
branches. I looked at this branch, the upgraded-disruptor branch #750,
STORM-855 #694, and apache-master 0.11.0-SNAPSHOT
(04cf3f6162ce6fdd1ec13b758222d889dafd5749). I had to make a few modifications
to get my test to work. I applied the following patch
https://gist.github.com/revans2/84301ef0fde0dc4fbe44 to each of the branches.
For STORM-855 I had to modify the test a bit so it would optionally do
batching. In that case batching was enabled on all streams and all spouts and
bolts.
I then ran the test at various throughputs (100, 200, 400, 800, 1600, 3200,
6400, 10000, 12800, and 25600, plus a few others when looking for the point
where each branch hit its maximum throughput) and with different batch sizes.
Each test ran for 5 minutes. Here are the results of that test, excluding the
tests where the worker could not keep up with the rate.
| 99%-ile ns | 99.9%-ile ns | throughput | branch-batch | mean latency ns | avg service latency ms | std-dev ns |
|---|---|---|---|---|---|---|
| 2,613,247 | 4,673,535 | 100 | STORM-855-0 | 2,006,347.25 | 1.26 | 2,675,778.36 |
| 2,617,343 | 4,423,679 | 200 | STORM-855-0 | 1,991,238.45 | 1.29 | 2,024,687.45 |
| 2,623,487 | 5,619,711 | 400 | STORM-855-0 | 1,999,926.81 | 1.24 | 1,778,335.92 |
| 2,627,583 | 4,603,903 | 1600 | STORM-855-0 | 1,971,888.24 | 1.30 | 893,085.40 |
| 2,635,775 | 8,560,639 | 800 | STORM-855-0 | 2,010,286.65 | 1.35 | 2,134,795.12 |
| 2,654,207 | 302,252,031 | 3200 | STORM-855-0 | 2,942,360.75 | 2.13 | 16,676,136.60 |
| 2,684,927 | 124,190,719 | 3200 | batch-v2-1 | 2,154,234.45 | 1.41 | 6,219,057.66 |
| 2,701,311 | 349,700,095 | 5000 | batch-v2-1 | 2,921,661.67 | 1.78 | 18,274,805.30 |
| 2,715,647 | 7,356,415 | 100 | storm-base-1 | 2,092,991.53 | 1.30 | 2,447,956.21 |
| 2,723,839 | 4,587,519 | 400 | storm-base-1 | 2,082,835.21 | 1.31 | 1,978,424.49 |
| 2,723,839 | 6,049,791 | 100 | dist-upgrade-1 | 2,091,407.68 | 1.31 | 2,222,977.89 |
| 2,725,887 | 10,403,839 | 1600 | batch-v2-1 | 2,010,694.30 | 1.27 | 2,095,223.90 |
| 2,725,887 | 4,607,999 | 200 | storm-base-1 | 2,074,784.50 | 1.30 | 1,951,564.93 |
| 2,727,935 | 4,513,791 | 200 | dist-upgrade-1 | 2,082,025.31 | 1.33 | 2,057,591.08 |
| 2,729,983 | 4,182,015 | 400 | dist-upgrade-1 | 2,056,282.29 | 1.43 | 862,428.67 |
| 2,732,031 | 4,632,575 | 800 | storm-base-1 | 2,092,514.39 | 1.27 | 2,231,550.66 |
| 2,734,079 | 4,472,831 | 800 | dist-upgrade-1 | 2,095,994.08 | 1.28 | 1,870,953.62 |
| 2,740,223 | 4,192,255 | 200 | batch-v2-1 | 2,011,025.19 | 1.21 | 911,556.19 |
| 2,742,271 | 4,726,783 | 1600 | storm-base-1 | 2,089,581.40 | 1.35 | 2,410,668.79 |
| 2,748,415 | 4,444,159 | 400 | batch-v2-1 | 2,055,600.78 | 1.34 | 1,729,257.92 |
| 2,748,415 | 4,575,231 | 100 | batch-v2-1 | 2,035,920.21 | 1.31 | 1,213,874.52 |
| 2,754,559 | 16,875,519 | 1600 | dist-upgrade-1 | 2,098,441.13 | 1.35 | 2,279,870.41 |
| 2,754,559 | 3,969,023 | 800 | batch-v2-1 | 2,026,222.88 | 1.29 | 767,491.71 |
| 2,793,471 | 53,477,375 | 3200 | storm-base-1 | 2,147,360.05 | 1.42 | 3,668,366.37 |
| 2,801,663 | 147,062,783 | 3200 | dist-upgrade-1 | 2,358,863.31 | 1.59 | 7,574,577.81 |
| 13,344,767 | 180,879,359 | 6400 | batch-v2-100 | 11,319,553.69 | 10.62 | 7,777,381.54 |
| 13,369,343 | 15,122,431 | 3200 | batch-v2-100 | 10,699,832.23 | 10.02 | 1,623,949.38 |
| 13,418,495 | 15,392,767 | 800 | batch-v2-100 | 10,589,813.17 | 9.86 | 2,439,134.80 |
| 13,426,687 | 14,680,063 | 400 | batch-v2-100 | 10,738,973.68 | 10.03 | 2,298,229.99 |
| 13,484,031 | 14,368,767 | 200 | batch-v2-100 | 10,941,653.28 | 10.20 | 2,471,899.43 |
| 13,508,607 | 14,262,271 | 100 | batch-v2-100 | 11,099,257.68 | 10.35 | 1,658,054.66 |
| 13,524,991 | 14,376,959 | 1600 | batch-v2-100 | 10,723,471.83 | 10.00 | 1,477,621.07 |
| 346,554,367 | 977,272,831 | 12800 | batch-v2-100 | 18,596,303.93 | 15.59 | 78,326,501.83 |
| 710,934,527 | 827,326,463 | 4000 | STORM-855-100 | 351,305,653.90 | 339.28 | 141,283,307.30 |
| 783,286,271 | 1,268,776,959 | 5000 | STORM-855-100 | 332,417,358.65 | 312.07 | 139,760,316.82 |
| 888,668,159 | 1,022,361,599 | 3200 | STORM-855-100 | 445,646,342.60 | 431.55 | 179,065,279.65 |
| 940,048,383 | 1,363,148,799 | 6400 | storm-base-1 | 20,225,300.17 | 17.17 | 134,848,974.52 |
| 1,043,333,119 | 1,409,286,143 | 10000 | batch-v2-1 | 22,750,840.18 | 6.13 | 146,235,076.73 |
| 1,209,008,127 | 1,786,773,503 | 6400 | dist-upgrade-1 | 28,588,397.01 | 24.70 | 181,801,409.69 |
| 1,747,976,191 | 1,946,157,055 | 1600 | STORM-855-100 | 738,741,774.85 | 734.75 | 374,194,675.56 |
| 2,642,411,519 | 3,124,756,479 | 20000 | batch-v2-100 | 133,706,248.88 | 51.67 | 497,027,226.45 |
| 3,374,317,567 | 3,892,314,111 | 10000 | dist-upgrade-1 | 141,866,760.39 | 69.39 | 589,014,777.73 |
| 3,447,717,887 | 3,869,245,439 | 10000 | storm-base-1 | 139,149,514.03 | 56.45 | 609,509,456.98 |
| 3,456,106,495 | 3,953,131,519 | 22000 | batch-v2-100 | 274,785,584.11 | 93.37 | 743,434,065.83 |
| 3,512,729,599 | 3,898,605,567 | 800 | STORM-855-100 | 1,354,193,514.47 | 1,361.58 | 779,667,263.64 |
| 3,963,617,279 | 4,416,602,111 | 5500 | STORM-855-100 | 450,364,286.22 | 415.96 | 575,017,536.40 |
| 4,185,915,391 | 5,347,737,599 | 4500 | STORM-855-0 | 366,268,233.66 | 259.94 | 995,928,429.75 |
| 4,919,918,591 | 5,582,618,623 | 6000 | STORM-855-100 | 534,520,242.96 | 497.47 | 758,754,139.61 |
| 7,071,596,543 | 7,843,348,479 | 400 | STORM-855-100 | 2,652,137,010.52 | 2,630.51 | 1,589,666,333.78 |
| 14,159,970,303 | 15,653,142,527 | 200 | STORM-855-100 | 5,202,877,719.25 | 5,206.33 | 3,199,275,795.66 |
| 27,648,851,967 | 31,205,621,759 | 100 | STORM-855-100 | 10,201,124,134.76 | 10,169.37 | 6,289,786,882.10 |
I then filtered the list to show the maximum throughput achievable for a given
latency, using several different latency metrics.
99th percentile:
| 99%-ile ns | 99.9%-ile ns | throughput | branch-batch | mean latency ns | avg service latency ms | std-dev ns |
|---|---|---|---|---|---|---|
| 2,613,247 | 4,673,535 | 100 | STORM-855-0 | 2,006,347.25 | 1.26 | 2,675,778.36 |
| 2,617,343 | 4,423,679 | 200 | STORM-855-0 | 1,991,238.45 | 1.29 | 2,024,687.45 |
| 2,623,487 | 5,619,711 | 400 | STORM-855-0 | 1,999,926.81 | 1.24 | 1,778,335.92 |
| 2,627,583 | 4,603,903 | 1600 | STORM-855-0 | 1,971,888.24 | 1.30 | 893,085.40 |
| 2,654,207 | 302,252,031 | 3200 | STORM-855-0 | 2,942,360.75 | 2.13 | 16,676,136.60 |
| 2,701,311 | 349,700,095 | 5000 | batch-v2-1 | 2,921,661.67 | 1.78 | 18,274,805.30 |
| 13,344,767 | 180,879,359 | 6400 | batch-v2-100 | 11,319,553.69 | 10.62 | 7,777,381.54 |
| 346,554,367 | 977,272,831 | 12800 | batch-v2-100 | 18,596,303.93 | 15.59 | 78,326,501.83 |
| 2,642,411,519 | 3,124,756,479 | 20000 | batch-v2-100 | 133,706,248.88 | 51.67 | 497,027,226.45 |
| 3,456,106,495 | 3,953,131,519 | 22000 | batch-v2-100 | 274,785,584.11 | 93.37 | 743,434,065.83 |
99.9th percentile:
| 99%-ile ns | 99.9%-ile ns | throughput | branch-batch | mean latency ns | avg service latency ms | std-dev ns |
|---|---|---|---|---|---|---|
| 2,754,559 | 3,969,023 | 800 | batch-v2-1 | 2,026,222.88 | 1.29 | 767,491.71 |
| 2,627,583 | 4,603,903 | 1600 | STORM-855-0 | 1,971,888.24 | 1.30 | 893,085.40 |
| 13,369,343 | 15,122,431 | 3200 | batch-v2-100 | 10,699,832.23 | 10.02 | 1,623,949.38 |
| 13,344,767 | 180,879,359 | 6400 | batch-v2-100 | 11,319,553.69 | 10.62 | 7,777,381.54 |
| 346,554,367 | 977,272,831 | 12800 | batch-v2-100 | 18,596,303.93 | 15.59 | 78,326,501.83 |
| 2,642,411,519 | 3,124,756,479 | 20000 | batch-v2-100 | 133,706,248.88 | 51.67 | 497,027,226.45 |
| 3,456,106,495 | 3,953,131,519 | 22000 | batch-v2-100 | 274,785,584.11 | 93.37 | 743,434,065.83 |
mean latency:
| 99%-ile ns | 99.9%-ile ns | throughput | branch-batch | mean latency ns | avg service latency ms | std-dev ns |
|---|---|---|---|---|---|---|
| 2,627,583 | 4,603,903 | 1600 | STORM-855-0 | 1,971,888.24 | 1.30 | 893,085.40 |
| 2,793,471 | 53,477,375 | 3200 | storm-base-1 | 2,147,360.05 | 1.42 | 3,668,366.37 |
| 2,701,311 | 349,700,095 | 5000 | batch-v2-1 | 2,921,661.67 | 1.78 | 18,274,805.30 |
| 13,344,767 | 180,879,359 | 6400 | batch-v2-100 | 11,319,553.69 | 10.62 | 7,777,381.54 |
| 346,554,367 | 977,272,831 | 12800 | batch-v2-100 | 18,596,303.93 | 15.59 | 78,326,501.83 |
| 2,642,411,519 | 3,124,756,479 | 20000 | batch-v2-100 | 133,706,248.88 | 51.67 | 497,027,226.45 |
| 3,456,106,495 | 3,953,131,519 | 22000 | batch-v2-100 | 274,785,584.11 | 93.37 | 743,434,065.83 |
service latency ms (storm's complete latency):
| 99%-ile ns | 99.9%-ile ns | throughput | branch-batch | mean latency ns | avg service latency ms | std-dev ns |
|---|---|---|---|---|---|---|
| 2,740,223 | 4,192,255 | 200 | batch-v2-1 | 2,011,025.19 | 1.21 | 911,556.19 |
| 2,623,487 | 5,619,711 | 400 | STORM-855-0 | 1,999,926.81 | 1.24 | 1,778,335.92 |
| 2,725,887 | 10,403,839 | 1600 | batch-v2-1 | 2,010,694.30 | 1.27 | 2,095,223.90 |
| 2,684,927 | 124,190,719 | 3200 | batch-v2-1 | 2,154,234.45 | 1.41 | 6,219,057.66 |
| 2,701,311 | 349,700,095 | 5000 | batch-v2-1 | 2,921,661.67 | 1.78 | 18,274,805.30 |
| 1,043,333,119 | 1,409,286,143 | 10000 | batch-v2-1 | 22,750,840.18 | 6.13 | 146,235,076.73 |
| 346,554,367 | 977,272,831 | 12800 | batch-v2-100 | 18,596,303.93 | 15.59 | 78,326,501.83 |
| 2,642,411,519 | 3,124,756,479 | 20000 | batch-v2-100 | 133,706,248.88 | 51.67 | 497,027,226.45 |
| 3,456,106,495 | 3,953,131,519 | 22000 | batch-v2-100 | 274,785,584.11 | 93.37 | 743,434,065.83 |
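
One way to reproduce this kind of filtering, as a minimal sketch: sort the runs by the chosen latency metric and keep a run only if its throughput exceeds that of every lower-latency run. This rule is an assumption inferred from the filtered tables, not the script that produced them. The `Run` type and the function name are hypothetical, and the sample rows are a small subset of the full results.

```python
from typing import Callable, List, NamedTuple


class Run(NamedTuple):
    label: str        # branch-batch label, e.g. "batch-v2-100"
    throughput: int   # target throughput of the run
    p99_ns: int       # 99%-ile latency in nanoseconds


def max_throughput_frontier(runs: List[Run],
                            metric: Callable[[Run], float]) -> List[Run]:
    """Keep runs that offer more throughput than any run with lower latency."""
    frontier = []
    best = 0
    # Sort by latency; on a latency tie, consider the higher throughput first.
    for run in sorted(runs, key=lambda r: (metric(r), -r.throughput)):
        if run.throughput > best:
            frontier.append(run)
            best = run.throughput
    return frontier


# Subset of rows from the full results table (99%-ile column only).
RUNS = [
    Run("STORM-855-0", 100, 2_613_247),
    Run("STORM-855-0", 1600, 2_627_583),
    Run("STORM-855-0", 800, 2_635_775),
    Run("batch-v2-1", 3200, 2_684_927),
    Run("batch-v2-1", 5000, 2_701_311),
    Run("batch-v2-100", 6400, 13_344_767),
    Run("storm-base-1", 6400, 940_048_383),
]

frontier = max_throughput_frontier(RUNS, metric=lambda r: r.p99_ns)
for run in frontier:
    print(f"{run.p99_ns:>13,} ns  {run.throughput:>6}  {run.label}")
```

Passing a different `metric` (99.9%-ile, mean, or service latency) gives the other views of the same data.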
I also looked at roughly the maximum throughput each branch could handle.
| branch-batch | throughput | mean latency ns | 99%-ile latency ns |
|---|---|---|---|
| STORM-855-0 | 4,500 | 366,268,233.66 | 4,185,915,391 |
| STORM-855-100 | 5,500 | 450,364,286.22 | 3,963,617,279 |
| storm-base-1 | 10,000 | 139,149,514.03 | 3,447,717,887 |
| dist-upgrade-1 | 10,000 | 141,866,760.39 | 3,374,317,567 |
| batch-v2-1 | 10,000 | 22,750,840.18 | 1,043,333,119 |
| batch-v2-100 | 22,000 | 274,785,584.11 | 3,456,106,495 |
I would really like some feedback here, because these numbers seem to
contradict the STORM-855 results obtained with my original speed-of-light test.
I don't really like that test, even though I wrote it, because its throughput
is limited only by Storm. With acking disabled, it measures the latency when we
hit the wall and cannot provide any more throughput. No one should run in
production that way. With acking enabled and max spout pending used for flow
control, the throughput is directly tied to the end-to-end latency. This
shouldn't be the common case in production either, because it means we cannot
keep up with the incoming rate and are falling behind.
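
The coupling between max spout pending and latency can be read as an instance of Little's law: with at most N tuples in flight and an average complete latency W, throughput cannot exceed N / W. The sketch below is only an illustration of that bound. The function name is hypothetical, it assumes the pending limit applies per spout task, and the numbers in the usage lines are made up rather than taken from the test.

```python
def throughput_ceiling(max_spout_pending: int,
                       complete_latency_ms: float,
                       spout_tasks: int = 1) -> float:
    """Upper bound on tuples/second implied by Little's law (N = rate * W)."""
    in_flight = max_spout_pending * spout_tasks
    return in_flight * 1000.0 / complete_latency_ms


# Hypothetical settings: 1000 pending tuples on one spout task.
print(throughput_ceiling(1000, 2.0))    # ceiling at 2 ms complete latency
print(throughput_ceiling(1000, 10.0))   # same limit, 10 ms complete latency
```

With the pending limit fixed, any increase in latency lowers the ceiling proportionally, which is why such a test ends up measuring latency and throughput as one quantity.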
This seems to indicate that the only time STORM-855 makes sense is when
looking at the 99%-ile latency at a very low throughput, and even then it only
seems to have about a 1/20th of a ms advantage over the others. In other cases
it looks like the throughput per host it can support is about 1/2 of that
without the change. This branch, however, has a weakness on the low end: when
batching is enabled it is about 12 ms slower. On the high end, though, it can
handle more than 2x the throughput with little change to the latency. If that
12 ms is important, I think we can mitigate it by allowing the batch size to
self-adjust on a per-queue basis.
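
A self-adjusting batch size could take many forms, and none is specified here. As one hypothetical sketch (not code from this branch or from Storm), each queue could grow its batch size when a batch fills before a flush timeout fires and shrink it when the timeout fires first. The class name, the flush-timeout mechanism, and the doubling/halving policy are all assumptions made for illustration.

```python
class AdaptiveBatchSize:
    """Per-queue batch size that tracks how busy the queue is."""

    def __init__(self, min_size: int = 1, max_size: int = 100) -> None:
        self.min_size = min_size
        self.max_size = max_size
        self.size = min_size  # start unbatched to keep low-rate latency small

    def on_flush(self, filled: bool) -> int:
        """Update after a flush and return the batch size to use next.

        filled=True  -> the batch reached `size` before the timeout (busy queue)
        filled=False -> the timeout fired on a partial batch (idle queue)
        """
        if filled:
            self.size = min(self.size * 2, self.max_size)
        else:
            self.size = max(self.size // 2, self.min_size)
        return self.size


batcher = AdaptiveBatchSize()
for filled in (True, True, True, False):
    print(batcher.on_flush(filled))
```

Under this policy an idle queue drifts back toward a batch size of 1, which is where the low-throughput latency penalty would be smallest, while a saturated queue climbs to the configured maximum.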
I really would like others to look at my numbers and my test to see if
there are issues with it that I am missing, because like I said it seems to
contradict the numbers from STORM-855. The only thing I can think of is that
the messaging layer is the bottleneck in the speed of light test, which is what
it was intended to stress test, and STORM-855 is giving a significant batching
advantage there. If that is the case then we should look at what STORM-855 is
doing around that to try and combine it with the batching we are doing here.
@ptgoetz @d2r @rfarivar @mjsax @kishorvpatil @knusbaum please let me know
what you think.