[ https://issues.apache.org/jira/browse/KAFKA-14636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684587#comment-17684587 ]
Christo Lolov commented on KAFKA-14636: --------------------------------------- I incorporated a dictionary and created a new JMH benchmark to test the performance of the implementation (https://github.com/apache/kafka/compare/trunk...clolov:kafka:produce-dictionary?expand=1). There were improvements, but they were significant only on an artificial data set, as seen below:
{code:java}
# Without dictionary
Benchmark                                          (bufferSupplierStr) (bytes) (maxBatchSize) (messageSize) (messageVersion)  Mode Cnt    Score
CompressionBenchmark.measureCompressionThroughput               CREATE    ONES             50            10                2 thrpt   2 1046.463
CompressionBenchmark.measureCompressionThroughput               CREATE    ONES             50            50                2 thrpt   2  957.770
CompressionBenchmark.measureCompressionThroughput               CREATE    ONES             50           100                2 thrpt   2  877.248
CompressionBenchmark.measureCompressionThroughput               CREATE    ONES            100            10                2 thrpt   2  679.727
CompressionBenchmark.measureCompressionThroughput               CREATE    ONES            100            50                2 thrpt   2  642.920
CompressionBenchmark.measureCompressionThroughput               CREATE    ONES            100           100                2 thrpt   2  569.959
{code}
{code:java}
# With dictionary
Benchmark                                          (bufferSupplierStr) (bytes) (maxBatchSize) (messageSize) (messageVersion)  Mode Cnt    Score
CompressionBenchmark.measureCompressionThroughput               CREATE    ONES             50            10                2 thrpt   2 1533.673
CompressionBenchmark.measureCompressionThroughput               CREATE    ONES             50            50                2 thrpt   2 1376.801
CompressionBenchmark.measureCompressionThroughput               CREATE    ONES             50           100                2 thrpt   2 1209.928
CompressionBenchmark.measureCompressionThroughput               CREATE    ONES            100            10                2 thrpt   2  878.464
CompressionBenchmark.measureCompressionThroughput               CREATE    ONES            100            50                2 thrpt   2  790.505
CompressionBenchmark.measureCompressionThroughput               CREATE    ONES            100           100                2 thrpt   2  701.102
{code}
On a more "realistic" data set, such as the one given in the link, the improvements were minimal. I experimented with different dictionary, sample, and buffer sizes, but could not obtain results similar to those detailed in https://github.com/facebook/zstd.
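For reference, the dictionary workflow described in the zstd README can be reproduced with the zstd CLI; the paths, file names, and dictionary size below are illustrative, not taken from the benchmark setup:
{code:bash}
# Train a dictionary from a directory of small sample records.
# zstd's trainer works best when the corpus is roughly 100x the dictionary size.
zstd --train samples/*.json -o topic.dict --maxdict=4096

# Compress and decompress a single record using the trained dictionary.
zstd -D topic.dict record.json -o record.json.zst
zstd -d -D topic.dict record.json.zst -o record.decoded.json
{code}
As with the benchmark above, the gains show up mainly when many small, similar payloads are compressed independently, which is the situation the per-topic training in this ticket targets.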
I tried reaching out to people who had operational knowledge of Zstd, but none of the ones I spoke with had employed dictionaries. [~ijuma], do you have any thoughts on whether to proceed with this, or any suggestions for improvement?

> Compression optimization: Use zstd dictionary based (de)compression
> -------------------------------------------------------------------
>
>                 Key: KAFKA-14636
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14636
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Divij Vaidya
>            Assignee: Christo Lolov
>            Priority: Major
>              Labels: needs-kip
>
> Use dictionary functionality of Zstd decompression. Train the dictionary per
> topic for first few MBs and then use it.

-- This message was sent by Atlassian Jira (v8.20.10#820010)