[jira] [Commented] (KAFKA-3973) Investigate feasibility of caching bytes vs. records

Bill Bejeck (JIRA) Sun, 24 Jul 2016 18:17:31 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391229#comment-15391229
 ]


Bill Bejeck commented on KAFKA-3973:
------------------------------------

The results of the LRU investigation are below. Three cache variants were 
measured:

1. The current cache, which bounds its size by entry count (Control).
2. A cache that bounds its size by max memory (Object). Both keys and values 
count toward the total memory tracked. The max memory for the cache was 
calculated by multiplying the memory of one key/value pair (measured with the 
MemoryMeter class from the jamm library https://github.com/jbellis/jamm) by the 
max size specified for the Control/Bytes caches.
3. A cache that stores bytes (Bytes). The max size of this cache was bounded 
by entry count. Both keys and values are serialized/deserialized.

I have attached the benchmarking class and the modified MemoryLRUCache class 
for reference.
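
For readers without the attachments, here is a minimal sketch of what the 
Bytes variant could look like. It is not the attached MemoryLRUCache: the 
class name ByteLruCache, the BytesKey wrapper, and the use of UTF-8 Strings 
in place of a real serializer are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: an LRU cache that stores serialized bytes and is
// bounded by entry count, similar in spirit to the "Bytes" variant.
class ByteLruCache {

    // byte[] uses identity-based equals/hashCode, so keys are wrapped.
    private static final class BytesKey {
        private final byte[] bytes;

        BytesKey(byte[] bytes) {
            this.bytes = bytes;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof BytesKey && Arrays.equals(bytes, ((BytesKey) o).bytes);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(bytes);
        }
    }

    private final Map<BytesKey, byte[]> map;

    ByteLruCache(final int maxEntries) {
        // accessOrder = true keeps the least recently used entry first.
        this.map = new LinkedHashMap<BytesKey, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<BytesKey, byte[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Every put pays a serialization cost for both key and value.
    void put(String key, String value) {
        map.put(new BytesKey(serialize(key)), serialize(value));
    }

    // Every get serializes the key and, on a hit, deserializes the value.
    String get(String key) {
        byte[] value = map.get(new BytesKey(serialize(key)));
        return value == null ? null : deserialize(value);
    }

    int size() {
        return map.size();
    }

    // Assumption: UTF-8 encoding stands in for a real serializer.
    private static byte[] serialize(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    private static String deserialize(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }
}
```

The point of the sketch is that the serialization cost is paid on every put 
and get, even when everything fits in the cache.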


While complete accuracy in Java benchmarking can be difficult to achieve, the 
results of these benchmarks are sufficient for comparing the different 
approaches to each other.

The cache was set to a max size of 500,000 (or in the memory based cache 
500,000 * key/value memory size). Two rounds of 25 iterations each were run.  
In the first round 500,000 put/get combinations were performed to measure 
behaviour when all records could fit in the cache.  The second round had 
1,000,000 put/get combinations to measure performance with evictions.  There 
were also some benchmarks for raw serialization and memory tracking included as 
well.
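
The measurement loop described above can be sketched roughly as follows. This 
is not the attached CachingPerformanceBenchmarks class; the names 
PutGetTiming, lruCache, and averageMillis are hypothetical, and a plain 
count-bounded LinkedHashMap stands in for the Control cache. Hand-rolled 
timing like this ignores JIT warm-up, so the numbers are only useful for 
rough comparisons.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical timing harness: averages the wall-clock time of a batch of
// put/get pairs over several iterations.
class PutGetTiming {

    // Count-bounded LRU map standing in for the Control cache (assumption).
    static Map<Integer, String> lruCache(final int maxEntries) {
        return new LinkedHashMap<Integer, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Runs `operations` put/get pairs against a fresh cache, `iterations`
    // times, and returns the average time per iteration in milliseconds.
    static double averageMillis(int iterations, int operations, int maxEntries) {
        long totalNanos = 0;
        for (int i = 0; i < iterations; i++) {
            Map<Integer, String> cache = lruCache(maxEntries);
            long start = System.nanoTime();
            for (int k = 0; k < operations; k++) {
                cache.put(k, "value-" + k);
                cache.get(k);
            }
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / iterations;
    }

    public static void main(String[] args) {
        // Round one: every record fits in the cache.
        System.out.println("no evictions:   " + averageMillis(25, 500_000, 500_000));
        // Round two: twice as many records as the cache can hold.
        System.out.println("with evictions: " + averageMillis(25, 1_000_000, 500_000));
    }
}
```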

As expected, the Control group had the best performance. The Object (memory 
tracking) cache was faster than the Bytes (serialization) cache only when the 
MemoryMeter.measure method was used. However, MemoryMeter.measure only captures 
the memory taken by the object itself; it does not take into account any other 
objects in the object graph. For example, here is a debug statement showing the 
memory for the string "Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Donec a porttitor felis. In vel dolor."


MemoryMeter.measure :
24

MemoryMeter.measureDeep :
root [java.lang.String] 232 bytes (24 bytes)
  |
  +--value [char[]] 208 bytes (208 bytes)

232

The MemoryMeter.measure total ignores the char array hanging off String 
objects. With this in mind, we would be forced to use MemoryMeter.measureDeep 
to get an accurate measure of the objects being placed in the cache. In the 
results below, the MemoryMeter.measureDeep method had the slowest performance.
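
The 24 vs. 232 byte gap in the output above can be reproduced with simple 
arithmetic. The sketch below does not call jamm. It assumes a 16-byte array 
header, 2 bytes per char, and 8-byte object alignment, which is typical for a 
64-bit HotSpot JVM with compressed oops and char[]-backed Strings; other JVMs 
may differ. The class name StringFootprint is hypothetical, and the 24-byte 
shallow size is taken from the MemoryMeter.measure output above.

```java
// Hypothetical estimate of shallow vs. deep String size under the layout
// assumptions stated above. Not a replacement for jamm or JOL.
class StringFootprint {
    static final long STRING_SHALLOW_BYTES = 24; // from the measure output above
    static final long ARRAY_HEADER_BYTES = 16;   // assumption: compressed oops
    static final long ALIGNMENT = 8;             // assumption: 8-byte alignment

    // Rounds a size up to the next multiple of the alignment.
    static long align(long bytes) {
        return (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    // Size of a char[] of the given length: header plus 2 bytes per char.
    static long charArrayBytes(int length) {
        return align(ARRAY_HEADER_BYTES + 2L * length);
    }

    // Deep size: the String object itself plus its backing char[].
    static long deepBytes(String s) {
        return STRING_SHALLOW_BYTES + charArrayBytes(s.length());
    }

    public static void main(String[] args) {
        String s = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
                + "Donec a porttitor felis. In vel dolor.";
        System.out.println(charArrayBytes(s.length())); // prints 208
        System.out.println(deepBytes(s));               // prints 232
    }
}
```

The sample string has 95 characters, so the char array takes 16 + 190 = 206 
bytes, rounded up to 208. That is almost nine times the 24 bytes that the 
shallow measurement reports.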

With these results in mind, it looks to me like storing bytes in the cache is 
best going forward.  

Final notes
1. Another tool, Java Object Layout 
(http://openjdk.java.net/projects/code-tools/jol/), shows promise but needs 
evaluation.
2. These benchmarks should be rewritten with JMH 
(http://openjdk.java.net/projects/code-tools/jmh/). Using JMH requires a 
separate module at a minimum, but the JMH Gradle plugin 
(https://github.com/melix/jmh-gradle-plugin) looks interesting, as it gives the 
ability to integrate JMH benchmarking tests into an existing project. Having a 
place to write/run JMH benchmarks could be beneficial to the project as a whole. 
If this seems worthwhile, I will create a Jira ticket and look into adding the 
JMH plugin or creating a separate benchmarking module.
3. Probably should add a benchmarking test utilizing the MemoryLRUCache as well.

Investigation Results
Tests for 500,000 inserts (max cache size: 500K entries, or 500K * key/value memory size)

Control       500K cache put/get results 25 iterations ave time (millis) 53.24
Object        500K cache put/get results 25 iterations ave time (millis) 250.88
Object(Deep)  500K cache put/get results 25 iterations ave time (millis) 1720.08
Bytes         500K cache put/get results 25 iterations ave time (millis) 288.92

Tests for 1,000,000 inserts (max cache size: 500K entries, or 500K * key/value memory size)

Control       1M cache put/get results 25 iterations ave time (millis) 227.48
Object        1M cache put/get results 25 iterations ave time (millis) 488.2
Object(Deep)  1M cache put/get results 25 iterations ave time (millis) 2575.04
Bytes         1M cache put/get results 25 iterations ave time (millis) 852.04

Raw timing of tracking memory (deep) for 500K Strings
Took [567] millis to track memory

Raw timing of tracking memory for 500K Strings
Took [92] millis to track memory

Raw timing of tracking memory (deep) for 500K ComplexObjects
Took [2813] millis to track memory

Raw timing of tracking memory for 500K ComplexObjects
Took [148] millis to track memory

Raw timing of serialization for 500K Strings
Took [133] millis to serialize

Raw timing of serialization for 500K ComplexObjects
Took [525] millis to serialize






> Investigate feasibility of caching bytes vs. records
> ----------------------------------------------------
>
>                 Key: KAFKA-3973
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3973
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: streams
>            Reporter: Eno Thereska
>            Assignee: Bill Bejeck
>             Fix For: 0.10.1.0
>
>         Attachments: CachingPerformanceBenchmarks.java, MemoryLRUCache.java
>
>
> Currently the cache stores and accounts for records, not bytes or objects. 
> This investigation would be around measuring any performance overheads that 
> come from storing bytes or objects. As an outcome we should know whether 1) 
> we should store bytes or 2) we should store objects. 
> If we store objects, the cache still needs to know their size (so that it can 
> know if the object fits in the allocated cache space, e.g., if the cache is 
> 100MB and the object is 10MB, we'd have space for 10 such objects). The 
> investigation needs to figure out how to find out the size of the object 
> efficiently in Java.
> If we store bytes, then we are serialising an object into bytes before 
> caching it, i.e., we take a serialisation cost. The investigation needs to 
> measure how bad this cost can be, especially for the case when all objects fit 
> in cache (and thus any extra serialisation cost would show).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
