[ https://issues.apache.org/jira/browse/STORM-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967622#comment-14967622 ]

ASF GitHub Bot commented on STORM-855:
--------------------------------------

Github user revans2 commented on the pull request:

    https://github.com/apache/storm/pull/765#issuecomment-149987537
  
    So I have been running a number of tests trying to come to a conclusive 
decision on how Storm should handle batching, and trying to understand the 
difference between my test results and the test results from #694. I ran the 
word count test I wrote as part of #805 on a 35 node Storm cluster.  This was 
done against several different Storm versions: the baseline in the #805 pull 
request; this patch + #805 (batch-v2); and #694 + #805 + modifications to use 
the hybrid approach so that acking and batching work in a multi-process 
topology (STORM-855).  To keep the numbers from being hard to parse I am just 
going to include some charts, but if anyone wants to see the raw numbers or 
reproduce the tests themselves I am happy to provide data and/or branches.
    
    The numbers below were collected after the topology had been running for at 
least 200 seconds, to avoid startup effects such as JIT compilation.  I 
filtered out any 30 second interval where the measured throughput was not 
within +/- 10% of the target throughput, on the assumption that if the topology 
could not keep up with the desired throughput, or was trying to catch up from 
previous slowness, it would not be within that range.  I did not filter based 
on the number of failures that happened, simply because that would have removed 
all of the results for STORM-855 with batching enabled.  None of the other test 
configurations saw any failures at all during testing.
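
    To illustrate the filtering, a rough sketch of that steady-state check 
might look like the following (the record shape and names here are illustrative 
only, not the actual test harness code):

```java
import java.util.List;
import java.util.stream.Collectors;

public class IntervalFilter {
    /** One 30 second measurement window (illustrative record, not from the harness). */
    public static class Interval {
        final double targetRate;    // sentences/sec the spout was asked to emit
        final double measuredRate;  // sentences/sec actually completed
        final double latency99Ms;   // 99%-ile complete latency for the window

        Interval(double targetRate, double measuredRate, double latency99Ms) {
            this.targetRate = targetRate;
            this.measuredRate = measuredRate;
            this.latency99Ms = latency99Ms;
        }
    }

    /** Keep only windows where measured throughput is within +/- 10% of the target. */
    public static List<Interval> keepSteadyState(List<Interval> intervals) {
        return intervals.stream()
            .filter(i -> Math.abs(i.measuredRate - i.targetRate) <= 0.10 * i.targetRate)
            .collect(Collectors.toList());
    }
}
```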
    
    
![throughput-vs-latency](https://cloud.githubusercontent.com/assets/3441321/10644336/d0393222-77ed-11e5-849a-0b6be6ac5178.png)
    
    This shows the 99%-ile latency vs measured throughput.  It is not too 
interesting except to note that batching in STORM-855 at low throughput 
resulted in nothing being fully processed; all of the tuples timed out before 
they could finish.  Only at a medium throughput, above 16,000 sentences/second, 
were we able to keep enough tuples in flight to complete batches regularly, and 
even then many tuples would still time out.  This should be fixable with a 
batch timeout, but that is not implemented yet.
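
    For reference, the kind of batch timeout mentioned above could look roughly 
like the sketch below.  This is speculative only and not code from either 
patch; all of the names are made up:

```java
import java.util.ArrayList;
import java.util.List;

/** Buffers tuples and flushes when the batch is full or the oldest tuple is too old. */
class TimedBatch<T> {
    private final int maxSize;
    private final long maxWaitMs;
    private final List<T> buffer = new ArrayList<>();
    private long firstAddedAtMs = -1;

    TimedBatch(int maxSize, long maxWaitMs) {
        this.maxSize = maxSize;
        this.maxWaitMs = maxWaitMs;
    }

    /** Add a tuple; returns a batch ready to send, or null while still buffering. */
    synchronized List<T> add(T tuple, long nowMs) {
        if (buffer.isEmpty()) {
            firstAddedAtMs = nowMs;
        }
        buffer.add(tuple);
        return shouldFlush(nowMs) ? drain() : null;
    }

    /** Called periodically by a flusher thread so low-rate streams still complete. */
    synchronized List<T> flushIfExpired(long nowMs) {
        return (!buffer.isEmpty() && shouldFlush(nowMs)) ? drain() : null;
    }

    private boolean shouldFlush(long nowMs) {
        return buffer.size() >= maxSize || (nowMs - firstAddedAtMs) >= maxWaitMs;
    }

    private List<T> drain() {
        List<T> out = new ArrayList<>(buffer);
        buffer.clear();
        firstAddedAtMs = -1;
        return out;
    }
}
```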
    
    To get a better view I adjusted the latency axis to a log scale.
    
    
![throughput-vs-latency-log](https://cloud.githubusercontent.com/assets/3441321/10644335/d02ab29c-77ed-11e5-883e-a647f6b4279b.png)
    
    From this we can see that on the very low end batching-v2 increases the 
99%-ile latency from 5-10 ms to 19-21 ms.  Most of that latency can be 
recovered by configuring the batch size to 1 instead of the default 100 tuples.  
However, once the baseline stops functioning at around 7,000 sentences/sec, the 
batching code is able to keep working with either a batch size of 1 or 100.  I 
believe this has to do with the automatic backpressure: in the baseline code 
backpressure does not take the overflow buffer into account, but in the 
batching code it does.  I think this gives the topology more stability in 
maintaining throughput, but I don't have any solid evidence for that.
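
    To make that distinction concrete, a simplified version of the check looks 
roughly like this (illustrative names only, not Storm's actual internals):

```java
/** Simplified high-water-mark checks for when a worker should throttle its spouts. */
class BackpressureCheck {

    /** Baseline-style check: only the receive queue population counts. */
    static boolean baselineHighWater(long queued, long capacity, double highWaterMark) {
        return queued >= capacity * highWaterMark;
    }

    /** Batching-branch-style check: tuples parked in the overflow buffer also count. */
    static boolean batchingHighWater(long queued, long overflow, long capacity,
                                     double highWaterMark) {
        // Overflowed tuples are pending work the executor has not absorbed yet,
        // so counting them triggers throttling earlier and more smoothly.
        return (queued + overflow) >= capacity * highWaterMark;
    }
}
```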
    
    I then zoomed in on the graphs to show what a 2 second SLA would look like
    
    
![throughput-vs-latency-2-sec](https://cloud.githubusercontent.com/assets/3441321/10644332/d0176f5c-77ed-11e5-98c4-d2e7a9e48c70.png)
    
    and a 100 ms SLA.
    
    
![throughput-vs-latency-100-ms](https://cloud.githubusercontent.com/assets/3441321/10644334/d0291540-77ed-11e5-9fb3-9c9c97f504f9.png)
    
    In both cases batching-v2 with a batch size of 100 was able to handle the 
highest throughput for the given latency.
    
    Then I wanted to look at memory and CPU utilization.
    
    
![throughput-vs-mem](https://cloud.githubusercontent.com/assets/3441321/10644337/d03c3094-77ed-11e5-8cda-cf53fe3a2389.png)
    
    Memory does not show much.  The amount of memory used varies a bit from one 
implementation to the other, but keep in mind this is for 35 worker processes, 
so it works out to roughly 70 MB/worker to about 200 MB/worker.  The numbers 
simply show that as the throughput increases the memory utilization does too, 
and it does not vary too much from one implementation to another.
    
    
![throughput-vs-cpu](https://cloud.githubusercontent.com/assets/3441321/10645834/6ba799e0-77f5-11e5-88fd-7e09475a5b6c.png)
    
    CPU, however, shows that on the low end we go from 7 or 8 cores' worth of 
CPU time for the baseline to about 35 cores' worth for the batching code.  This 
seems to be the result of the batch flushing threads waking up periodically.  
We should be able to mitigate this by making that interval larger, but that 
would in turn increase the latency.
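
    The CPU vs latency trade-off comes from how often the flusher fires.  As a 
hypothetical sketch (not the actual patch code), each worker runs something 
like the following, where a shorter interval means lower added latency but more 
wakeups:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Periodically runs a flush action that drains any partial batches. */
class BatchFlusher {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    /** flushIntervalMs trades CPU (more wakeups) against added latency. */
    void start(Runnable flushPartialBatches, long flushIntervalMs) {
        scheduler.scheduleAtFixedRate(flushPartialBatches, flushIntervalMs,
                                      flushIntervalMs, TimeUnit.MILLISECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```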
    
    I believe that with further work we should be able to reduce that CPU 
utilization, and the latency on the low end, by dynamically adjusting the batch 
size and timeout based on a specified SLA.  At this point I feel this branch is 
ready for a formal review and inclusion into Storm; the drawbacks of this patch 
do not seem to outweigh its clear advantages.  Additionally, given the 
stability problems associated with #694, I cannot feel good about recommending 
it at this time.  It is clear that some of what it is doing is worthwhile, and 
I think we should explore the alternative batch serialization between worker 
processes as a potential standalone piece.
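
    As a purely speculative sketch of that SLA-driven adjustment (nothing like 
this exists in the patch yet), the idea would be a small control loop along 
these lines:

```java
/** Grows or shrinks the batch size based on observed 99%-ile latency vs a target SLA. */
class SlaBatchTuner {
    private final long slaMs;
    private final int minBatch;
    private final int maxBatch;
    private int batchSize;

    SlaBatchTuner(long slaMs, int minBatch, int maxBatch, int initialBatch) {
        this.slaMs = slaMs;
        this.minBatch = minBatch;
        this.maxBatch = maxBatch;
        this.batchSize = initialBatch;
    }

    /** Called once per metrics window with the observed 99%-ile complete latency. */
    int adjust(long observed99Ms) {
        if (observed99Ms > slaMs) {
            // Over the SLA: smaller batches flush sooner and cut latency.
            batchSize = Math.max(minBatch, batchSize / 2);
        } else if (observed99Ms < slaMs / 2) {
            // Comfortably under the SLA: larger batches cost less CPU per tuple.
            batchSize = Math.min(maxBatch, batchSize * 2);
        }
        return batchSize;
    }
}
```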
    
    @d2r @knusbaum @kishorvpatil @harshach @HeartSaVioR @ptgoetz 
    
    If you could please review this, keeping in mind that it is based on #797, 
I would like to try to get it in soon.


> Add tuple batching
> ------------------
>
>                 Key: STORM-855
>                 URL: https://issues.apache.org/jira/browse/STORM-855
>             Project: Apache Storm
>          Issue Type: New Feature
>          Components: storm-core
>            Reporter: Matthias J. Sax
>            Assignee: Matthias J. Sax
>            Priority: Minor
>
> In order to increase Storm's throughput, multiple tuples can be grouped 
> together into a batch of tuples (i.e., a fat-tuple) and transferred from 
> producer to consumer at once.
> The initial idea is taken from https://github.com/mjsax/aeolus. However, we 
> aim to integrate this feature deep into the system (in contrast to building 
> it on top), which has multiple advantages:
>   - batching can be even more transparent to the user (e.g., no extra 
> direct-streams needed to mimic Storm's data distribution patterns)
>   - fault-tolerance (anchoring/acking) can be done at tuple granularity 
> (not at batch granularity, which would lead to many more replayed tuples -- 
> and result duplicates -- in case of failure)
> The aim is to extend the TopologyBuilder interface with an additional 
> parameter 'batch_size' to expose this feature to the user. By default, 
> batching will be disabled.
> This batching feature serves a pure tuple-transport purpose, i.e., 
> tuple-by-tuple processing semantics are preserved. An output batch is 
> assembled at the producer and completely disassembled at the consumer. The 
> consumer's output can be batched again, independent of whether its input was 
> batched or not. Thus, batches can be of a different size for each 
> producer-consumer pair. Furthermore, consumers can receive batches of 
> different sizes from different producers (including regular non-batched 
> input).
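
(For illustration only: a minimal, hypothetical sketch of the 
assemble-at-producer / disassemble-at-consumer idea described above; none of 
these names come from the actual patch.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Illustration of the fat-tuple idea: a batch is just a list of ordinary tuples. */
class FatTuple<T> {
    private final List<T> tuples = new ArrayList<>();

    /** Producer side: append tuples until the batch is handed to the transport layer. */
    void add(T tuple) {
        tuples.add(tuple);
    }

    /**
     * Consumer side: completely disassemble the batch and process tuple by tuple,
     * so acking/anchoring can still happen at single-tuple granularity.
     */
    void forEachTuple(Consumer<T> process) {
        tuples.forEach(process);
    }

    int size() {
        return tuples.size();
    }
}
```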



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
