[GitHub] storm pull request: Disruptor batching v2
Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-152178557 @mjsax I am still very interested in #694 if it makes things better. I am skeptical that it will add much over what this PR does, because they both cover similar areas, but if it does then we should try to use it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
Github user mjsax commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-152122558 I did not find time to rerun the tests and fix #694. If you want to move on with this PR, just go ahead... I will still try to get #694 into better shape --- I still hope I can get it accepted. (If you are still interested.)
Github user kishorvpatil commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-151942417 @d2r @knusbaum @mjsax @harshach @HeartSaVioR @ptgoetz Now that #797 is in, we should consider pulling in the batching solution. Can you please review this PR and pass on your opinion? I am still +1 on this patch.
Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-151596007 Just rebased.
Github user mjsax commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-150163728 Great analysis! I would like to run these tests, too, especially to work on the flushing of batches. It's a pity that so many tuples timed out in your test. To get a fair comparison, I would like to fix this and redo the test. Btw: I am still wondering why #694 behaves so badly at low throughput. From a conceptual point of view there is no reason for this. Maybe my code is not good enough (as I am a Clojure newbie). I hope I can reproduce your result and get better insight into the problem.
Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-150252797 @mjsax At least in my profiling, the next largest consumer of CPU in word count (without acking enabled, after adding in batching) is the metrics. They are horribly slow. I assumed that because Aeolus was only counting batches, not individual tuples, it mitigated a lot of this slowness.
Github user mjsax commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-150250711 I am not sure how the metrics relate to it... Furthermore, I also used Aeolus on the current master and got a 6x improvement (with acking disabled -- acking is currently not supported by Aeolus).
Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-150247138 @mjsax If you want to take some time to debug things for a better comparison, I am happy to wait. It is entirely possible that I somehow messed things up when I was enabling hybrid mode, so getting an apples-to-apples comparison is important. I want the best final solution no matter what it is, but I also don't want to wait forever. For efficiency and cluster utilization purposes I really would like to pull batching into the clusters that I run, but because the configurations/interfaces of the two approaches are incompatible with one another, I don't want to do it until open source has decided on a direction to go in. In the meantime I am going to be working on improving the metrics gathering. You said previously that with your first batching attempt you saw up to 6x the throughput. I suspect a lot of that is because the metrics are so very, very slow.
Github user mjsax commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-150255113 Makes sense. It might be worth exploring counting in batches (i.e., increasing a counter by "batch-size" for each batch), so we still get the correct value. Of course the metric will not increase slowly over time but will "jump" in larger intervals. Not sure if this is acceptable behavior and/or what other metrics there are that cannot be updated in a "batch" fashion... count might be too simple an example.
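The batch-level counting idea above can be sketched in a few lines. This is a minimal illustration with an invented `BatchCounter` class, not Storm's actual metrics API: the total stays exact, but the counter is touched once per batch instead of once per tuple, which is where the hoped-for savings come from.

```java
// Hypothetical sketch of batch-level counting (NOT Storm's IMetric API):
// instead of one update per tuple, the executor bumps the counter once
// per batch by the batch size. The final value is identical; the metric
// just "jumps" by batchSize at each flush.
public class BatchCounter {
    private long count = 0;

    // Per-tuple counting: one metric update for every tuple processed.
    public void incr() { count += 1; }

    // Batch counting: one metric update per batch.
    public void incrBy(long batchSize) { count += batchSize; }

    public long getValueAndReset() {
        long v = count;
        count = 0;
        return v;
    }

    public static void main(String[] args) {
        BatchCounter perTuple = new BatchCounter();
        for (int i = 0; i < 300; i++) perTuple.incr();    // 300 updates

        BatchCounter perBatch = new BatchCounter();
        for (int i = 0; i < 3; i++) perBatch.incrBy(100); // 3 updates
        // Same final value, far fewer metric updates.
        System.out.println(perTuple.getValueAndReset() == perBatch.getValueAndReset()); // true
    }
}
```

As the comment notes, a plain count is the easy case; percentile or histogram metrics would not reduce to a single "increment by batch size" this cleanly.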
Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-149987537 So I have been running a number of tests trying to come to a conclusive decision on how storm should handle batching, and trying to understand the difference between my test results and the test results from #694.

I ran the word count test I wrote as a part of #805 on a 35 node storm cluster. This was done against several different storm versions: the baseline in the #805 pull request; this patch + #805 (batch-v2); and #694 + #805 + modifications to use the hybrid approach to enable acking and batching to work in a multi-process topology (STORM-855). To avoid having all of the numbers be hard to parse I am just going to include some charts, but if anyone wants to see the raw numbers or reproduce it themselves I am happy to provide data and/or branches.

The numbers below were collected after the topology had been running for at least 200 seconds. This is to avoid startup issues like JIT etc. I filtered out any 30 second interval where the measured throughput was not within +/- 10% of the target throughput, on the assumption that if the topology cannot keep up with the desired throughput, or is trying to catch up from previous slowness, it would not be within that range. I did not filter based off of the number of failures that happened, simply because that would have resulted in removing all of the STORM-855 with-batching-enabled results. None of the other test configurations saw any failures at all during testing.

![throughput-vs-latency](https://cloud.githubusercontent.com/assets/3441321/10644336/d0393222-77ed-11e5-849a-0b6be6ac5178.png)

This shows the 99%-ile latency vs measured throughput. It is not too interesting except to note that batching in STORM-855 at low throughput resulted in nothing being fully processed. All of the tuples timed out before they could finish. Only at a medium throughput, above 16,000 sentences/second, were we able to maintain enough tuples to complete batches regularly, but even then many tuples would still time out. This should be fixable with a batch timeout, but that is not implemented yet. To get a better view I adjusted the latency to be a log scale.

![throughput-vs-latency-log](https://cloud.githubusercontent.com/assets/3441321/10644335/d02ab29c-77ed-11e5-883e-a647f6b4279b.png)

From this we can see that on the very low end batch-v2 increases the 99%-ile latency from 5-10 ms to 19-21 ms. Most of that you can get back by configuring the batch size to 1, instead of the default 100 tuples. However, once the baseline stops functioning at around 7000 sentences/sec, the batching code is able to continue working with either a batch size of 1 or 100. I believe that this has to do with the automatic backpressure. In the baseline code backpressure does not take into account the overflow buffer, but in the batching code it does. I think this gives the topology more stability in maintaining a throughput, but I don't have any solid evidence for that.

I then zoomed in on the graphs to show what a 2 second SLA would look like

![throughput-vs-latency-2-sec](https://cloud.githubusercontent.com/assets/3441321/10644332/d0176f5c-77ed-11e5-98c4-d2e7a9e48c70.png)

and a 100 ms SLA.

![throughput-vs-latency-100-ms](https://cloud.githubusercontent.com/assets/3441321/10644334/d0291540-77ed-11e5-9fb3-9c9c97f504f9.png)

In both cases batch-v2 with a batch size of 100 was able to handle the highest throughput for a given latency. Then I wanted to look at memory and CPU utilization.

![throughput-vs-mem](https://cloud.githubusercontent.com/assets/3441321/10644337/d03c3094-77ed-11e5-8cda-cf53fe3a2389.png)

Memory does not show much. The amount of memory used varies a bit from one implementation to another, but if you realize this is for 35 worker processes, it is varying from about 70 MB/worker to about 200 MB/worker. The numbers simply show that as the throughput increases the memory utilization does too, and it does not vary too much from one implementation to another.

![throughput-vs-cpu](https://cloud.githubusercontent.com/assets/3441321/10645834/6ba799e0-77f5-11e5-88fd-7e09475a5b6c.png)

CPU, however, shows that on the low end we are going from 7 or 8 cores worth of CPU time to about 35 cores worth for the batching code. This seems to be the result of the batch flushing threads waking up periodically. We should be able to mitigate this by adjusting that interval to be larger, but that would in turn impact the latency. I believe that with further work we should be able to reduce that CPU utilization and the latency on the low end by dynamically adjusting the batch size and timeout based off of a specified SLA. At this point I feel this branch is ready for a formal review.
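The interval-filtering rule described in the methodology above (keep only 30-second windows whose measured throughput is within +/- 10% of the target) is simple enough to state in code. This is my reconstruction of the stated rule with invented names, not the actual test harness:

```java
import java.util.ArrayList;
import java.util.List;

// Reconstruction of the measurement filter described in the comment:
// keep only intervals whose measured throughput is within +/-10% of the
// target rate, on the assumption that anything outside that band means
// the topology was falling behind or catching up.
public class IntervalFilter {
    public static List<Double> keepSteadyIntervals(double target, List<Double> measured) {
        List<Double> kept = new ArrayList<>();
        for (double m : measured) {
            if (m >= 0.9 * target && m <= 1.1 * target) {
                kept.add(m); // topology held the rate during this interval
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Target 7000 sentences/sec: 7000, 6500, 7600 and 7100 fall inside
        // the +/-10% band; 3000 (a stalled interval) is dropped.
        List<Double> samples = List.of(7000.0, 6500.0, 7600.0, 3000.0, 7100.0);
        System.out.println(keepSteadyIntervals(7000, samples));
    }
}
```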
Github user mjsax commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-148548577 Thanks for clarification. :)
Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-148544657 @mjsax What I saw when testing STORM-855 was that the maximum throughput was cut almost in half, from 10,000 sentences per second to 5,500. But your numbers showed maximum throughput more than doubling, from around 7,960,300 tuples sent in 30 seconds to 16,347,100 in the same time period (no acking), and from 1,832,160 in 30 seconds to 2,323,580, an increase of about 25%, with acking. To me this feels like a contradiction. The only thing I can think of is that the messaging layer is so scary slow that cutting the maximum throughput of a worker by half has no impact on the overall performance if it can double the throughput of the messaging layer by doing more batching. This is likely the case, as on the high end 16,347,100 / 30 seconds / 24 workers is about 22,000 tuples per second per worker, whereas 5,500 sentences per second results in about 181,500 total tuples per second per worker being processed. I'm just looking for feedback from others on this, but it looks like I need to do a distributed apples-to-apples comparison as well to see the impact the messaging layer has.
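The arithmetic in the comment above checks out and can be verified directly. The ~33x sentence-to-tuple fan-out is implied by the quoted numbers rather than stated anywhere, so treat it as an inference:

```java
// Verifying the per-worker throughput figures quoted in the comment.
public class ThroughputMath {
    public static void main(String[] args) {
        long tuples = 16_347_100L;
        int seconds = 30, workers = 24;
        // 16,347,100 tuples / 30 s / 24 workers
        long perWorkerPerSec = tuples / seconds / workers;
        System.out.println(perWorkerPerSec); // 22704, i.e. "about 22,000"

        // Implied fan-out: 181,500 total tuples/sec from 5,500 sentences/sec.
        double fanOut = 181_500.0 / 5_500.0;
        System.out.println(fanOut); // 33.0 tuples per sentence (inferred)
    }
}
```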
Github user mjsax commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-148540526 I had a look at the numbers. It's a lot of stuff and hard for me to parse... I am not sure what you mean by "contradict the numbers"? Can you explain in more detail? However, I agree with your thought about STORM-855 #694: it basically reduces the contention on the output queue, because fewer calls are made there. I profiled Storm once and observed that at high-end data rates (when we hit the wall) the contention on the output queue is the bottleneck (the writing and reading threads have to lock the queue, and waiting for the lock consumes a fair share of the total time).
Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-147124664 I have some new test results. I did a comparison of several different branches. I looked at this branch, the upgraded-disruptor branch #750, STORM-855 #694, and apache-master 0.11.0-SNAPSHOT (04cf3f6162ce6fdd1ec13b758222d889dafd5749). I had to make a few modifications to get my test to work. I applied the following patch https://gist.github.com/revans2/84301ef0fde0dc4fbe44 to each of the branches. For STORM-855 I had to modify the test a bit so it would optionally do batching. In that case batching was enabled on all streams and all spouts and bolts. I then ran the test at various throughputs (100, 200, 400, 800, 1600, 3200, 6400, 1, 12800, 25600, and possibly a few others when looking for it to hit the maximum throughput) and different batch sizes. Each test ran for 5 mins. Here are the results of that test, excluding the tests where the worker could not keep up with the rate.
| 99%-ile ns | 99.9%-ile ns | throughput | branch-batch | mean latency ns | avg service latency ms | std-dev ns |
|---|---|---|---|---|---|---|
| 2,613,247 | 4,673,535 | 100 | STORM-855-0 | 2,006,347.25 | 1.26 | 2,675,778.36 |
| 2,617,343 | 4,423,679 | 200 | STORM-855-0 | 1,991,238.45 | 1.29 | 2,024,687.45 |
| 2,623,487 | 5,619,711 | 400 | STORM-855-0 | 1,999,926.81 | 1.24 | 1,778,335.92 |
| 2,627,583 | 4,603,903 | 1600 | STORM-855-0 | 1,971,888.24 | 1.30 | 893,085.40 |
| 2,635,775 | 8,560,639 | 800 | STORM-855-0 | 2,010,286.65 | 1.35 | 2,134,795.12 |
| 2,654,207 | 302,252,031 | 3200 | STORM-855-0 | 2,942,360.75 | 2.13 | 16,676,136.60 |
| 2,684,927 | 124,190,719 | 3200 | batch-v2-1 | 2,154,234.45 | 1.41 | 6,219,057.66 |
| 2,701,311 | 349,700,095 | 5000 | batch-v2-1 | 2,921,661.67 | 1.78 | 18,274,805.30 |
| 2,715,647 | 7,356,415 | 100 | storm-base-1 | 2,092,991.53 | 1.30 | 2,447,956.21 |
| 2,723,839 | 4,587,519 | 400 | storm-base-1 | 2,082,835.21 | 1.31 | 1,978,424.49 |
| 2,723,839 | 6,049,791 | 100 | dist-upgraade-1 | 2,091,407.68 | 1.31 | 2,222,977.89 |
| 2,725,887 | 10,403,839 | 1600 | batch-v2-1 | 2,010,694.30 | 1.27 | 2,095,223.90 |
| 2,725,887 | 4,607,999 | 200 | storm-base-1 | 2,074,784.50 | 1.30 | 1,951,564.93 |
| 2,727,935 | 4,513,791 | 200 | dist-upgraade-1 | 2,082,025.31 | 1.33 | 2,057,591.08 |
| 2,729,983 | 4,182,015 | 400 | dist-upgraade-1 | 2,056,282.29 | 1.43 | 862,428.67 |
| 2,732,031 | 4,632,575 | 800 | storm-base-1 | 2,092,514.39 | 1.27 | 2,231,550.66 |
| 2,734,079 | 4,472,831 | 800 | dist-upgraade-1 | 2,095,994.08 | 1.28 | 1,870,953.62 |
| 2,740,223 | 4,192,255 | 200 | batch-v2-1 | 2,011,025.19 | 1.21 | 911,556.19 |
| 2,742,271 | 4,726,783 | 1600 | storm-base-1 | 2,089,581.40 | 1.35 | 2,410,668.79 |
| 2,748,415 | 4,444,159 | 400 | batch-v2-1 | 2,055,600.78 | 1.34 | 1,729,257.92 |
| 2,748,415 | 4,575,231 | 100 | batch-v2-1 | 2,035,920.21 | 1.31 | 1,213,874.52 |
| 2,754,559 | 16,875,519 | 1600 | dist-upgraade-1 | 2,098,441.13 | 1.35 | 2,279,870.41 |
| 2,754,559 | 3,969,023 | 800 | batch-v2-1 | 2,026,222.88 | 1.29 | 767,491.71 |
| 2,793,471 | 53,477,375 | 3200 | storm-base-1 | 2,147,360.05 | 1.42 | 3,668,366.37 |
| 2,801,663 | 147,062,783 | 3200 | dist-upgraade-1 | 2,358,863.31 | 1.59 | 7,574,577.81 |
| 13,344,767 | 180,879,359 | 6400 | batch-v2-100 | 11,319,553.69 | 10.62 | 7,777,381.54 |
| 13,369,343 | 15,122,431 | 3200 | batch-v2-100 | 10,699,832.23 | 10.02 | 1,623,949.38 |
| 13,418,495 | 15,392,767 | 800 | batch-v2-100 | 10,589,813.17 | 9.86 | 2,439,134.80 |
| 13,426,687 | 14,680,063 | 400 | batch-v2-100 | 10,738,973.68 | 10.03 | 2,298,229.99 |
| 13,484,031 | 14,368,767 | 200 | batch-v2-100 | 10,941,653.28 | 10.20 | 2,471,899.43 |
| 13,508,607 | 14,262,271 | 100 | batch-v2-100 | 11,099,257.68 | 10.35 | 1,658,054.66 |
| 13,524,991 | 14,376,959 | 1600 | batch-v2-100 | 10,723,471.83 | 10.00 | 1,477,621.07 |
| 346,554,367 | 977,272,831 | 12800 | batch-v2-100 | 18,596,303.93 | 15.59 | 78,326,501.83 |
| 710,934,527 | 827,326,463 | 4000 | STORM-855-100 | 351,305,653.90 | 339.28 | 141,283,307.30 |
| 783,286,271 | 1,268,776,959 | 5000 | STORM-855-100 | 332,417,358.65 | 312.07 | 139,760,316.82 |
| 888,668,159 | 1,022,361,599 | 3200 | STORM-855-100 | 445,646,342.60 | 431.55 | 179,065,279.65 |
| 940,048,383 | 1,363,148,799 | 6400 | storm-base-1 | 20,225,300.17 | 17.17 | 134,848,974.52 |
| 1,043,333,119 | 1,409,286,143 | 1 | batch-v2-1 | 22,750,840.18 | 6.13 | 146,235,076.73 |
| 1,209,008,127 | 1,786,773,503 | 6400 | dist-upgraade-1 | 28,588,397.01 | 24.70 | 181,801,409.69 |
| 1,747,976,191 | 1,946,157,055 | 1600 | STORM-855-100 | 738,741,774.85 | 734.75 | 374,194,675.56 |
| 2,642,411,519 | 3,124,756,479 | 2 | batch-v2-100 | 133,706,248.88 |
Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-147125596 Sorry, I forgot to mention that in one test STORM-855-100 was able to run at a throughput of 6000 sentences/second successfully, but in other runs it failed, so I set the number to 5500, even though 6000 is shown in the results.
Github user kishorvpatil commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-144429694 Ran some sanity tests. LGTM. +1
Github user revans2 commented on a diff in the pull request: https://github.com/apache/storm/pull/765#discussion_r40801096

--- Diff: storm-core/src/clj/backtype/storm/daemon/worker.clj ---
@@ -123,12 +123,13 @@
          port (:port worker)
          storm-cluster-state (:storm-cluster-state worker)
          prev-backpressure-flag @(:backpressure worker)]
-     (if executors
-       (if (reduce #(or %1 %2) (map #(.get-backpressure-flag %1) executors))
-         (reset! (:backpressure worker) true) ;; at least one executor has set backpressure
-         (reset! (:backpressure worker) false))) ;; no executor has backpressure set
+     (when executors
+       (reset! (:backpressure worker)
+         (or @(:transfer-backpressure worker)
--- End diff --

This was actually a bug, and I am going to file a JIRA for it. Having it reuse the worker backpressure flag made it turn off backpressure too soon in some cases, and actually made it get stuck off when it should be on in other cases.
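The fix in the Clojure diff above is small but easy to misread, so here is the same logic restated in Java. Class and field names are invented stand-ins for the Clojure atoms; the point is only that the worker flag becomes the OR of the transfer-queue flag and every executor flag, instead of looking at the executors alone:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Java restatement of the fixed worker.clj logic (invented names):
// worker backpressure = transfer-queue backpressure OR any executor's flag.
public class WorkerBackpressure {
    public final AtomicBoolean backpressure = new AtomicBoolean(false);
    public final AtomicBoolean transferBackpressure = new AtomicBoolean(false);

    public void recompute(List<AtomicBoolean> executorFlags) {
        boolean anyExecutor = executorFlags.stream().anyMatch(AtomicBoolean::get);
        // The old code ignored the transfer queue here, which is the bug the
        // comment describes: backpressure could be cleared while the worker's
        // own transfer queue was still congested.
        backpressure.set(transferBackpressure.get() || anyExecutor);
    }

    public static void main(String[] args) {
        WorkerBackpressure w = new WorkerBackpressure();
        List<AtomicBoolean> execs =
            List.of(new AtomicBoolean(false), new AtomicBoolean(false));
        w.transferBackpressure.set(true); // transfer queue above high-water mark
        w.recompute(execs);
        System.out.println(w.backpressure.get()); // true: stays on despite idle executors
    }
}
```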
Github user zhuoliu commented on a diff in the pull request: https://github.com/apache/storm/pull/765#discussion_r40803217

--- Diff: storm-core/src/clj/backtype/storm/daemon/worker.clj ---
@@ -123,12 +123,13 @@
          port (:port worker)
          storm-cluster-state (:storm-cluster-state worker)
          prev-backpressure-flag @(:backpressure worker)]
-     (if executors
-       (if (reduce #(or %1 %2) (map #(.get-backpressure-flag %1) executors))
-         (reset! (:backpressure worker) true) ;; at least one executor has set backpressure
-         (reset! (:backpressure worker) false))) ;; no executor has backpressure set
+     (when executors
+       (reset! (:backpressure worker)
+         (or @(:transfer-backpressure worker)
--- End diff --

Previously, we had the lowWaterMark function as "do nothing" for the worker transfer queue's callback, which would not turn off the backpressure when the worker's queue population dropped below the lowWaterMark. On the other hand, it is possible that all executors have populations below the lowWaterMark while the worker's queue population is still above the highWaterMark, so the backpressure flag gets cleared "too often". Anyway, I think this is a good fix, together with the _throttleOn flag in DisruptorQueue.
Github user revans2 commented on a diff in the pull request: https://github.com/apache/storm/pull/765#discussion_r40802660

--- Diff: storm-core/src/clj/backtype/storm/daemon/worker.clj ---
@@ -290,6 +290,7 @@
       :transfer-fn (mk-transfer-fn <>)
       :assignment-versions assignment-versions
       :backpressure (atom false) ;; whether this worker is going slow
+      :transfer-backpressure (atom false) ;; whether this worker is going slow
--- End diff --

Yes, you are correct.
Github user zhuoliu commented on a diff in the pull request: https://github.com/apache/storm/pull/765#discussion_r40803550

--- Diff: storm-core/src/clj/backtype/storm/cluster.clj ---
@@ -489,10 +489,10 @@
     (let [path (backpressure-path storm-id node port)
           existed (exists-node? cluster-state path false)]
       (if existed
-        (if (not on?)
-          (delete-node cluster-state path)) ;; delete the znode since the worker is not congested
-        (if on?
-          (set-ephemeral-node cluster-state path nil acls) ;; create the znode since worker is congested
+        (when (not on?)
--- End diff --

This is OK, never mind.
Github user revans2 commented on a diff in the pull request: https://github.com/apache/storm/pull/765#discussion_r40800874

--- Diff: storm-core/src/clj/backtype/storm/cluster.clj ---
@@ -489,10 +489,10 @@
     (let [path (backpressure-path storm-id node port)
           existed (exists-node? cluster-state path false)]
       (if existed
-        (if (not on?)
-          (delete-node cluster-state path)) ;; delete the znode since the worker is not congested
-        (if on?
-          (set-ephemeral-node cluster-state path nil acls) ;; create the znode since worker is congested
+        (when (not on?)
--- End diff --

It is mostly that I put in some debugging and didn't want an ```(if (do (log-message) ...)))``` to fix an issue, and ended up not changing it back after removing the debug logging. If you want me to change it back and add the comments back in, I can.
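For readers less familiar with the Clojure in the diff above, the control flow is: the ephemeral backpressure znode should exist exactly while the worker reports congestion. Here is that flow restated in Java, with an in-memory set of paths standing in for ZooKeeper purely for illustration:

```java
import java.util.HashSet;
import java.util.Set;

// Illustration of the cluster.clj control flow: delete the znode when the
// worker is no longer congested, create it (ephemeral, in real ZooKeeper)
// when congestion starts. A HashSet stands in for the ZooKeeper ensemble.
public class BackpressureZnode {
    private final Set<String> znodes = new HashSet<>();

    public void setWorkerBackpressure(String path, boolean on) {
        boolean existed = znodes.contains(path);
        if (existed) {
            if (!on) znodes.remove(path); // worker is no longer congested
        } else {
            if (on) znodes.add(path);     // worker became congested
        }
        // existed && on, or !existed && !on: nothing to do.
    }

    public boolean exists(String path) { return znodes.contains(path); }

    public static void main(String[] args) {
        BackpressureZnode zk = new BackpressureZnode();
        String p = "/backpressure/storm-id/node-port"; // illustrative path
        zk.setWorkerBackpressure(p, true);
        System.out.println(zk.exists(p));  // true: congestion flagged
        zk.setWorkerBackpressure(p, false);
        System.out.println(zk.exists(p));  // false: flag cleared
    }
}
```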
Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-144059813 @mjsax The difference is not that huge yet. It is 2x in a number of use cases. But we have a lot of slowness in the metrics processing code which, at least in the case of word count, dominates the performance, and which is probably why you were not seeing the 6x performance improvement in your pull request like you were seeing before. I am going to be doing some follow-on optimization work on that. The big thing right now is that the system CPU utilization went from 15-30%, depending on the OS etc., to 2-6%. This frees up a huge amount of CPU for other processing, so the overall throughput capability of a node increased significantly.
Github user zhuoliu commented on a diff in the pull request: https://github.com/apache/storm/pull/765#discussion_r40731960

--- Diff: storm-core/src/clj/backtype/storm/daemon/worker.clj ---
@@ -137,14 +138,11 @@
   check highWaterMark and lowWaterMark for backpressure"
   (disruptor/disruptor-backpressure-handler
     (fn []
-      "When worker's queue is above highWaterMark, we set its backpressure flag"
-      (if (not @(:backpressure worker))
-        (do (reset! (:backpressure worker) true)
-            (WorkerBackpressureThread/notifyBackpressureChecker (:backpressure-trigger worker) ;; set backpressure no matter how the executors are
+      (reset! (:transfer-backpressure worker) true)
+      (WorkerBackpressureThread/notifyBackpressureChecker (:backpressure-trigger worker)))
--- End diff --

Will this cause the WorkerBackpressureThread to get notified too frequently, even when it is not necessary to do so? (In most cases, neither :backpressure nor :transfer-backpressure has changed, so we may not really need to notify the backpressure thread to check all executors' flags, which can be additional overhead.)
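The optimization the question above hints at, notifying only when the flag actually transitions, can be expressed with a compare-and-set guard. This is a hypothetical sketch, not the actual WorkerBackpressureThread code; the counter stands in for the notify call:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of transition-only notification: repeated high-water-mark
// callbacks set the flag once and notify once, because compareAndSet
// succeeds only on the false -> true transition. Hypothetical helper,
// not Storm's actual backpressure handler.
public class NotifyOnChange {
    private final AtomicBoolean transferBackpressure = new AtomicBoolean(false);
    private int notifications = 0;

    public void onHighWaterMark() {
        if (transferBackpressure.compareAndSet(false, true)) {
            notifications++; // stand-in for notifyBackpressureChecker(trigger)
        }
        // Flag already true: no redundant wake-up of the checker thread.
    }

    public int notificationCount() { return notifications; }

    public static void main(String[] args) {
        NotifyOnChange n = new NotifyOnChange();
        for (int i = 0; i < 1000; i++) n.onHighWaterMark();
        System.out.println(n.notificationCount()); // 1
    }
}
```

A symmetric true-to-false guard would be needed on the low-water-mark path for the complete scheme.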
Github user zhuoliu commented on a diff in the pull request: https://github.com/apache/storm/pull/765#discussion_r40730661

--- Diff: storm-core/src/clj/backtype/storm/daemon/worker.clj ---
@@ -123,12 +123,13 @@
          port (:port worker)
          storm-cluster-state (:storm-cluster-state worker)
          prev-backpressure-flag @(:backpressure worker)]
-     (if executors
-       (if (reduce #(or %1 %2) (map #(.get-backpressure-flag %1) executors))
-         (reset! (:backpressure worker) true) ;; at least one executor has set backpressure
-         (reset! (:backpressure worker) false))) ;; no executor has backpressure set
+     (when executors
+       (reset! (:backpressure worker)
+         (or @(:transfer-backpressure worker)
--- End diff --

I agree, it is nice to add a separate flag for the worker's transfer queue.
Github user revans2 commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-143872114 I am just rebasing the code now so you can test that out yourself. This code has no issues with acking. But there are a few real issues. First, the latency on low-throughput queues is much higher. This is because it has to wait for the batch to time out. That timeout is set to 1 ms by default, so it is not that bad, but we should be able to do some on-the-fly adjustments in a follow-on JIRA to dynamically adjust the batch size for each queue to compensate. Second, the number of threads used is a lot more: one more per disruptor queue. I expect to reduce the total number of disruptor queues once I have optimized other parts of the code. As it stands right now I don't want to do that, because the two-queues-per-bolt/spout design still improves performance in many cases. Third, in the worst-case situation it is possible to allocate many more objects than previously. It is not actually that many more; we already allocate a lot of objects, which needs to be looked at in a separate JIRA at some point. Also, I don't want to shove this code in without doing a real comparison between the two approaches and the code. This is one way of doing batching, but there are others that may have advantages over this, or may complement this approach as well. I just want storm to eventually be a lot closer to the 1 million sentences/second mark than it is now.
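The size-or-timeout flushing trade-off described above (full batches add no latency; partial batches wait for the flush interval, serviced by one extra thread per queue) can be modeled minimally. This is an invented stand-in, not the DisruptorQueue implementation; the 1 ms default below is taken from the comment:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal model of size-or-timeout batch flushing: tuples buffer until the
// batch fills or the flush interval expires. maybeFlush() models the extra
// per-queue flusher thread mentioned in the comment. Hypothetical sketch,
// not Storm's DisruptorQueue.
public class BatchBuffer<T> {
    private final int batchSize;
    private final long flushIntervalNanos;
    private final List<T> buffer = new ArrayList<>();
    private final List<List<T>> flushed = new ArrayList<>();
    private long lastFlush = System.nanoTime();

    public BatchBuffer(int batchSize, long flushIntervalNanos) {
        this.batchSize = batchSize;
        this.flushIntervalNanos = flushIntervalNanos;
    }

    public void publish(T tuple) {
        buffer.add(tuple);
        if (buffer.size() >= batchSize) flush(); // full batch: no added latency
    }

    // Periodic call from a flusher thread; the source of the extra latency
    // (and the low-throughput CPU cost) at low rates.
    public void maybeFlush() {
        if (!buffer.isEmpty() && System.nanoTime() - lastFlush >= flushIntervalNanos) flush();
    }

    private void flush() {
        flushed.add(new ArrayList<>(buffer));
        buffer.clear();
        lastFlush = System.nanoTime();
    }

    public List<List<T>> batches() { return flushed; }

    public static void main(String[] args) throws InterruptedException {
        BatchBuffer<Integer> q = new BatchBuffer<>(100, 1_000_000L); // 100 tuples or 1 ms
        for (int i = 0; i < 250; i++) q.publish(i); // two full batches, 50 buffered
        Thread.sleep(2);                            // let the flush interval elapse
        q.maybeFlush();                             // timeout flush of the remainder
        System.out.println(q.batches().size());     // 3
    }
}
```

Dynamically tuning `batchSize` and `flushIntervalNanos` per queue is exactly the follow-on adjustment the comment proposes.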
Github user mjsax commented on the pull request: https://github.com/apache/storm/pull/765#issuecomment-143900063 Curious to see first performance results :)