[ https://issues.apache.org/jira/browse/KAFKA-14636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684587#comment-17684587 ]

Christo Lolov commented on KAFKA-14636:
---------------------------------------

I incorporated a dictionary into the compression path and created a new JMH
benchmark to measure the performance of the implementation
(https://github.com/apache/kafka/compare/trunk...clolov:kafka:produce-dictionary?expand=1).
There were improvements, but they were significant only on an artificial data
set, as seen below (a sketch of the dictionary flow follows the results):
{code:java}
# Without dictionary
Benchmark                                          (bufferSupplierStr)  (bytes)  (maxBatchSize)  (messageSize)  (messageVersion)   Mode  Cnt     Score
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50             10                 2  thrpt    2  1046.463
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50             50                 2  thrpt    2   957.770
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50            100                 2  thrpt    2   877.248
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100             10                 2  thrpt    2   679.727
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100             50                 2  thrpt    2   642.920
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100            100                 2  thrpt    2   569.959
{code}

{code:java}
# With dictionary
Benchmark                                          (bufferSupplierStr)  (bytes)  (maxBatchSize)  (messageSize)  (messageVersion)   Mode  Cnt     Score
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50             10                 2  thrpt    2  1533.673
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50             50                 2  thrpt    2  1376.801
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES              50            100                 2  thrpt    2  1209.928
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100             10                 2  thrpt    2   878.464
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100             50                 2  thrpt    2   790.505
CompressionBenchmark.measureCompressionThroughput               CREATE     ONES             100            100                 2  thrpt    2   701.102
{code}
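For reference, here is a minimal sketch of the dictionary flow described
above, using the zstd-jni classes Kafka already depends on (ZstdDictTrainer,
ZstdDictCompress, ZstdDictDecompress). The sample data and all sizes are
illustrative assumptions, not the values used in the branch:
{code:java}
import com.github.luben.zstd.Zstd;
import com.github.luben.zstd.ZstdDictCompress;
import com.github.luben.zstd.ZstdDictDecompress;
import com.github.luben.zstd.ZstdDictTrainer;

import java.nio.charset.StandardCharsets;

public class ZstdDictionarySketch {
    public static void main(String[] args) {
        // Train a dictionary from samples. In the real patch the samples
        // would come from the first records of a topic; the sample budget
        // (4 MiB) and target dictionary size (16 KiB) are illustrative.
        ZstdDictTrainer trainer = new ZstdDictTrainer(4 * 1024 * 1024, 16 * 1024);
        for (int i = 0; i < 5_000; i++) {
            byte[] sample = ("user-" + (i % 100) + ",event=click,region=eu-west-1,ts=" + (1_000_000 + i))
                    .getBytes(StandardCharsets.UTF_8);
            trainer.addSample(sample);
        }
        // Throws ZstdException if the sample set is too small or too uniform.
        byte[] dictionary = trainer.trainSamples();

        // Compress a message with the trained dictionary (level 3).
        ZstdDictCompress compressDict = new ZstdDictCompress(dictionary, 3);
        byte[] message = "user-42,event=click,region=eu-west-1,ts=1000042".getBytes(StandardCharsets.UTF_8);
        byte[] compressed = Zstd.compress(message, compressDict);

        // Decompression needs the same dictionary on the consumer side.
        ZstdDictDecompress decompressDict = new ZstdDictDecompress(dictionary);
        byte[] restored = Zstd.decompress(compressed, decompressDict, message.length);

        System.out.println("original=" + message.length
                + " bytes, compressed=" + compressed.length + " bytes");
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
{code}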
On a more "realistic" data set, such as the one given in the link, the
improvements were minimal. I experimented with different dictionary, sample,
and buffer sizes, but could not obtain results similar to those detailed in
https://github.com/facebook/zstd. I tried reaching out to people with
operational knowledge of zstd, but none of the ones I spoke to had employed
dictionaries.
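
For concreteness, the dictionary and sample sizes I varied correspond to the
two ZstdDictTrainer constructor arguments from the sketch above; the exact
values here are illustrative only:
{code:java}
// Illustrative values only: the first argument is the total budget for
// training samples, the second the target dictionary size.
ZstdDictTrainer smallDict = new ZstdDictTrainer(256 * 1024, 4 * 1024);
// 110 KiB is, to my knowledge, the zstd CLI's default --maxdict value.
ZstdDictTrainer largeDict = new ZstdDictTrainer(4 * 1024 * 1024, 110 * 1024);
{code}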

[~ijuma], do you have any thoughts on whether to proceed with this, or any
suggestions for improvement?

> Compression optimization: Use zstd dictionary based (de)compression
> -------------------------------------------------------------------
>
>                 Key: KAFKA-14636
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14636
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Divij Vaidya
>            Assignee: Christo Lolov
>            Priority: Major
>              Labels: needs-kip
>
> Use the dictionary functionality of zstd (de)compression. Train the
> dictionary per topic on the first few MBs of data and then use it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
