[ https://issues.apache.org/jira/browse/STORM-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967622#comment-14967622 ]
ASF GitHub Bot commented on STORM-855:
--------------------------------------

Github user revans2 commented on the pull request:

    https://github.com/apache/storm/pull/765#issuecomment-149987537

So I have been running a number of tests trying to come to a conclusive decision on how Storm should handle batching, and trying to understand the difference between my test results and the test results from #694.

I ran the word count test I wrote as part of #805 on a 35-node Storm cluster. This was done against several different Storm versions: the baseline in the #805 pull request; this patch + #805 (batch-v2); and #694 + #805 + modifications to use the hybrid approach so that acking and batching work in a multi-process topology (STORM-855). To keep the numbers from being hard to parse I am just going to include some charts, but if anyone wants to see the raw numbers or reproduce the tests themselves I am happy to provide data and/or branches.

The numbers below were collected after the topology had been running for at least 200 seconds, to avoid startup effects such as JIT compilation. I filtered out any 30-second interval where the measured throughput was not within +/- 10% of the target throughput, on the assumption that if the topology could not keep up with the desired throughput, or was catching up from previous slowness, it would fall outside that range. I did not filter based on the number of failures, simply because that would have removed all of the STORM-855 results with batching enabled; none of the other test configurations saw any failures at all during testing.

[chart: 99%-ile latency vs. measured throughput]

This shows the 99%-ile latency vs. measured throughput. It is not too interesting, except to note that batching in STORM-855 at low throughput resulted in nothing being fully processed: all of the tuples timed out before they could finish. Only at a medium throughput, above 16,000 sentences/second, were we able to keep enough tuples in flight to complete batches regularly, and even then many tuples would still time out. This should be fixable with a batch timeout, but that is not implemented yet. To get a better view I put the latency on a log scale.

[chart: 99%-ile latency (log scale) vs. measured throughput]

From this we can see that on the very low end batch-v2 increases the 99%-ile latency from 5-10 ms to 19-21 ms. Most of that can be recovered by configuring the batch size to 1 instead of the default 100 tuples. However, once the baseline stops functioning at around 7,000 sentences/sec, the batching code is able to keep working with either a batch size of 1 or 100. I believe this has to do with the automatic backpressure: in the baseline code backpressure does not take the overflow buffer into account, but in the batching code it does. I think this gives the topology more stability in maintaining a throughput, but I don't have any solid evidence for that.

I then zoomed in on the graphs to show what a 2-second SLA would look like

[chart: throughput under a 2-second SLA]

and a 100 ms SLA.

[chart: throughput under a 100 ms SLA]

In both cases batch-v2 with a batch size of 100 was able to handle the highest throughput for the given latency. Then I wanted to look at memory and CPU utilization.

[chart: memory utilization vs. measured throughput]

Memory does not show much; the amount of memory used varies a bit from one implementation to the other, but keeping in mind that this is for 35 worker processes, it ranges from about 70 MB/worker to about 200 MB/worker. The numbers simply show that as throughput increases memory utilization does too, and it does not vary much from one implementation to another.
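As an aside, the +/- 10% interval filter described above is simple enough to sketch. The following is only an illustration of that rule; the Interval type and its fields are hypothetical stand-ins, not the actual #805 test harness:

```java
// Minimal sketch of the measurement filter: keep a 30-second window only if it
// starts after the 200-second warm-up and its measured throughput is within
// +/- 10% of the target throughput.
public final class IntervalFilter {

    static final long WARMUP_SECONDS = 200;   // ignore everything before this
    static final double TOLERANCE = 0.10;     // +/- 10% of the target rate

    /** One 30-second measurement window (fields are illustrative). */
    static final class Interval {
        final long startSeconds;          // seconds since topology submission
        final double measuredThroughput;  // sentences/second in this window

        Interval(long startSeconds, double measuredThroughput) {
            this.startSeconds = startSeconds;
            this.measuredThroughput = measuredThroughput;
        }
    }

    /** True if the window should be included in the reported numbers. */
    static boolean keep(Interval interval, double targetThroughput) {
        if (interval.startSeconds < WARMUP_SECONDS) {
            return false; // skip startup effects such as JIT compilation
        }
        double lower = targetThroughput * (1.0 - TOLERANCE);
        double upper = targetThroughput * (1.0 + TOLERANCE);
        return interval.measuredThroughput >= lower
            && interval.measuredThroughput <= upper;
    }
}
```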
[chart: CPU utilization vs. measured throughput]

CPU, however, shows that on the low end we go from 7 or 8 cores' worth of CPU time to about 35 cores' worth for the batching code. This seems to be the result of the batch flushing threads waking up periodically. We should be able to mitigate it by making that interval larger, but that would in turn impact latency. I believe that with further work we should be able to reduce both the CPU utilization and the latency on the low end by dynamically adjusting the batch size and timeout based on a specified SLA.

At this point I feel this branch is ready for a formal review and inclusion into Storm; the drawbacks of this patch do not seem to outweigh its clear advantages. Additionally, given the stability problems associated with #694, I cannot feel good about recommending it at this time. It is clear that some of what it does is worthwhile, and I think we should explore the alternative batch serialization between worker processes as a potential stand-alone piece.

@d2r @knusbaum @kishorvpatil @harshach @HeartSaVioR @ptgoetz If you could please review this, remembering that it is based on #797, I would like to try to get it in soon.


> Add tuple batching
> ------------------
>
>                 Key: STORM-855
>                 URL: https://issues.apache.org/jira/browse/STORM-855
>             Project: Apache Storm
>          Issue Type: New Feature
>          Components: storm-core
>            Reporter: Matthias J. Sax
>            Assignee: Matthias J. Sax
>            Priority: Minor
>
> In order to increase Storm's throughput, multiple tuples can be grouped together in a batch of tuples (i.e., a fat tuple) and transferred from producer to consumer at once.
> The initial idea is taken from https://github.com/mjsax/aeolus. However, we aim to integrate this feature deep into the system (in contrast to building it on top), which has multiple advantages:
> - batching can be even more transparent to the user (e.g., no extra direct streams needed to mimic Storm's data distribution patterns)
> - fault-tolerance (anchoring/acking) can be done at tuple granularity (not at batch granularity, which leads to many more replayed tuples -- and duplicate results -- in case of failure)
> The aim is to extend the TopologyBuilder interface with an additional parameter 'batch_size' to expose this feature to the user. By default, batching will be disabled.
> This batching feature serves a pure tuple-transport purpose, i.e., tuple-by-tuple processing semantics are preserved. An output batch is assembled at the producer and completely disassembled at the consumer. The consumer's output can be batched again, independently of whether its input was batched or not. Thus, batches can be of a different size for each producer-consumer pair. Furthermore, consumers can receive batches of different sizes from different producers (including regular non-batched input).
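To make the assemble-at-the-producer / disassemble-at-the-consumer idea in the description above concrete, here is a rough sketch of a fat-tuple transport. It uses plain Java stand-ins only; the class, its methods, and the transport callback are illustrative and are not Storm's internal transfer code or the API proposed by this patch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Sketch of the "fat tuple" transport: the producer groups outgoing tuples
 * into a batch, the batch is transferred as one unit, and the consumer
 * disassembles it and processes tuple by tuple, so anchoring/acking
 * granularity is unchanged.
 */
final class BatchingEmitter<T> {
    private final int batchSize;                 // a batch size of 1 effectively disables batching
    private final Consumer<List<T>> transport;   // e.g. serialize and hand to the consumer's queue
    private final List<T> buffer = new ArrayList<>();

    BatchingEmitter(int batchSize, Consumer<List<T>> transport) {
        this.batchSize = batchSize;
        this.transport = transport;
    }

    /** Producer side: collect tuples until the batch is full, then ship it. */
    void emit(T tuple) {
        buffer.add(tuple);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /**
     * Ship whatever is buffered. A periodic flush thread would also call this
     * so partial batches do not sit around indefinitely; the wake-ups of such
     * threads are the low-throughput CPU cost discussed in the comment above.
     */
    void flush() {
        if (!buffer.isEmpty()) {
            transport.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    /** Consumer side: disassemble the batch and hand tuples over one at a time. */
    static <T> void onBatchReceived(List<T> batch, Consumer<T> executeTuple) {
        for (T tuple : batch) {
            executeTuple.accept(tuple); // per-tuple processing and acking semantics preserved
        }
    }
}
```

A periodic flush (and, as noted above, a batch timeout so partially filled batches still complete at low throughput) would sit alongside emit/flush; tuning how often it runs is the latency vs. CPU trade-off discussed in the comment.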