[ 
https://issues.apache.org/jira/browse/KAFKA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384660#comment-14384660
 ] 

Rajiv Kurian edited comment on KAFKA-2045 at 3/27/15 9:07 PM:
--------------------------------------------------------------

1. "We can actually make serious performance improvements by improving memory 
allocation patterns" - Yeah this is definitely the crux of it. <rant>Any 
performance improvements should also look at long term effects like GC 
activity, longest GC pause etc in addition to just throughput. Even the 
throughput and latency numbers will have to be looked at for a long time 
especially in an application where things don't fit in the L1 or L2 caches. I 
have usually found that with Java most benchmarks (even ones conducted with 
JMH) lie because of how short in duration they are. Since Java has a Thread 
Local Allocation Buffer, objects  allocated in quick succession get allocated 
next to each other in memory too. So even though an ArrayList of objects is an 
array of pointers to objects, the fact that these objects were allocated next 
to each other means they get 95% (hand wave hand wave) of the benefits of an 
equivalent std::vector of structs in C++. The nice memory-striding effects of 
sequential buffers holds even if it is a linked list of Objects again given 
that the Objects themselves were next to each other. But over time even if a 
single Object is actually not deleted/shuffled in the ArrayList,  a garbage 
collection is very likely to move them around in memory and when this happens 
they don't move as an entire unit but separately. Now what began as sequential 
access degenerates into an array of pointers to randomly laid out objects. And 
performance of these is an order of magnitude lower than arrays of sequentially 
laid out structs in C. A ByteBuffer/sun.misc.Unsafe based approach on the other 
hand never changes memory layout so the benefits continue to hold. This is why 
in my experience the 99.99th and above percentiles of typical POJO based 
solutions tanks and is orders of magnitude worse than the 99th etc, whereas 
solutions based on ByteBuffers and sun.misc.Unsafe have 99.99s that are maybe 
4-5 times worse than the 99th</rant over>. But again there might (will?) be 
other bottlenecks like the network or CRC that might show up before one can get 
the max out of such a design.
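
To make the ByteBuffer idea concrete, here is a minimal flyweight sketch (the RecordFlyweight name and the fixed field offsets are made up for illustration, not Kafka's actual wire format): records live in one contiguous ByteBuffer and are decoded in place, so no collector can ever scatter them.

{code:java}
import java.nio.ByteBuffer;

// Hypothetical fixed-layout record: [int key][long timestamp][int valueLen][value bytes].
// The flyweight is re-pointed at each record instead of materializing a POJO per record.
final class RecordFlyweight {
    private ByteBuffer buf;
    private int offset;

    RecordFlyweight wrap(ByteBuffer buf, int offset) {
        this.buf = buf;
        this.offset = offset;
        return this;
    }

    int key()         { return buf.getInt(offset); }
    long timestamp()  { return buf.getLong(offset + 4); }
    int valueLen()    { return buf.getInt(offset + 12); }
    int sizeInBytes() { return 16 + valueLen(); }
}

final class FlyweightScan {
    // Walk all records sequentially: memory access is a linear stride over a single
    // buffer, and that layout stays fixed no matter what the GC does elsewhere.
    static long sumKeys(ByteBuffer records, int limit) {
        RecordFlyweight r = new RecordFlyweight();
        long sum = 0;
        int pos = 0;
        while (pos < limit) {
            r.wrap(records, pos);
            sum += r.key();
            pos += r.sizeInBytes();
        }
        return sum;
    }
}
{code}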

2. "We don't mangle the code to badly in doing so" - I am planning to write a 
prototype using my own code from scratch that would include things like on the 
fly protocol parsing, buffer management and socket management. I'll  keep 
looking at /copy  the existing code to ensure that I handle errors correctly. 
It is just easier to start from fresh - that way I can work solely on getting 
this to work rather than worrying about how to fit this design in the current 
class hierarchy. A separate prototype will also probably provide the best 
platform for a performance demo since I can use things like primitive array 
based open hash-maps and other non-allocating primitives based data structures 
for metadata management. I can also use char sequences instead of Java's 
allocating strings for topics and such just to see how much of a difference 
they make. It just gives me a lot of options without messing with trunk. If 
this works out and we see an improvement in performance that seems interesting, 
we can work on how best to not mangle the code and/or decide which parts are 
worth mangling for the extra performance. Thoughts?
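
For a sense of what I mean by a primitive-array based open hash map for metadata: something along these lines, two parallel int arrays with linear probing, no boxing and no per-entry node objects (a made-up sketch, not tied to any existing class; it skips resizing and load-factor handling).

{code:java}
import java.util.Arrays;

// Sketch of an open-addressed int -> int map. Capacity must be a power of two,
// keys must be non-negative, and -1 marks an empty slot. No resizing: illustration only.
final class IntIntOpenHashMap {
    private final int[] keys;
    private final int[] values;
    private final int mask;

    IntIntOpenHashMap(int capacityPowerOfTwo) {
        keys = new int[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
        Arrays.fill(keys, -1);
    }

    void put(int key, int value) {
        int slot = key & mask;
        while (keys[slot] != -1 && keys[slot] != key) {
            slot = (slot + 1) & mask;   // linear probing
        }
        keys[slot] = key;
        values[slot] = value;
    }

    int get(int key, int missing) {
        int slot = key & mask;
        while (keys[slot] != -1) {
            if (keys[slot] == key) {
                return values[slot];
            }
            slot = (slot + 1) & mask;
        }
        return missing;
    }
}
{code}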



> Memory Management on the consumer
> ---------------------------------
>
>                 Key: KAFKA-2045
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2045
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Guozhang Wang
>
> We need to add the memory management on the new consumer like we did in the 
> new producer. This would probably include:
> 1. byte buffer re-usage for fetch response partition data.
> 2. byte buffer re-usage for on-the-fly de-compression.
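
For what it's worth, the buffer re-usage in 1. and 2. could start as simple as a free-list style pool along these lines (a hypothetical sketch, not an existing Kafka class):

{code:java}
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Hypothetical pool of fixed-size ByteBuffers: fetch response / decompression code
// acquires a buffer, uses it, and releases it back instead of allocating a fresh one.
final class ByteBufferPool {
    private final int bufferSize;
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();

    ByteBufferPool(int bufferSize, int initialBuffers) {
        this.bufferSize = bufferSize;
        for (int i = 0; i < initialBuffers; i++) {
            free.push(ByteBuffer.allocate(bufferSize));
        }
    }

    ByteBuffer acquire() {
        ByteBuffer b = free.poll();
        return b != null ? b : ByteBuffer.allocate(bufferSize); // grow on demand
    }

    void release(ByteBuffer b) {
        b.clear();
        free.push(b);
    }
}
{code}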


