[
https://issues.apache.org/jira/browse/KAFKA-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084235#comment-13084235
]
Jay Kreps commented on KAFKA-79:
--------------------------------
We have some performance comparisons; we should include that information on the
performance page at least by the time this is released. Of course our primary
concern is inter-datacenter bandwidth rather than performance per se. We see a
~30% compression ratio on our Avro tracking data.
Neha should be able to give a diff. I think it was the last checkin on github
before the cutover.
It is important that decompression always happen with the codec used for
compression, so it can't just be a property like
compression.codec=org.apache.kafka.GzipCompressor in the config: a mismatch
between producer and consumer would lead to unreadable data, and if two
people send messages with different codecs you would be totally screwed. This
means the codec used must be maintained with the message set. We do this by
having a compression id, where 0=none, 1=gzip, etc. This doesn't lend itself to
extensibility, since that list has to be predetermined, but we could reserve a
codec id for a "user defined" codec and leave it up to the user to configure it
correctly.
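A minimal sketch of that scheme (illustrative names and framing, not Kafka's actual classes or wire format): the codec id travels with the data itself, so the consumer chooses a decompressor from the byte it reads rather than from its own configuration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CodecIdSketch {
    // Hypothetical codec ids mirroring the scheme described above.
    static final byte NONE = 0;
    static final byte GZIP = 1;

    // Prepend the codec id so decompression is always driven by the data,
    // never by a (possibly mismatched) consumer-side config property.
    static byte[] encode(byte codec, byte[] payload) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(codec);
            if (codec == GZIP) {
                GZIPOutputStream gz = new GZIPOutputStream(out);
                gz.write(payload);
                gz.close();
            } else {
                out.write(payload);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // The first byte tells the consumer which codec was used; no agreement
    // between producer and consumer configs is required.
    static byte[] decode(byte[] data) {
        try {
            byte codec = data[0];
            ByteArrayInputStream in =
                new ByteArrayInputStream(data, 1, data.length - 1);
            if (codec == GZIP) {
                return new GZIPInputStream(in).readAllBytes();
            }
            return in.readAllBytes();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

A reserved "user defined" id would slot into the same dispatch, with the lookup delegated to user configuration instead of a built-in case.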
My intuition is that most people just want a good compression implementation
included out of the box and don't want to fiddle with it, so I think it would be
best to get that right. I think even in the long run there are really only 2-3
algorithms with a reasonable CPU/size tradeoff for compression and
decompression, so it makes sense to implement just those, fully test them for
performance and correctness, and include them in a way that can't break.
> Introduce the compression feature in Kafka
> ------------------------------------------
>
> Key: KAFKA-79
> URL: https://issues.apache.org/jira/browse/KAFKA-79
> Project: Kafka
> Issue Type: New Feature
> Affects Versions: 0.6
> Reporter: Neha Narkhede
> Fix For: 0.7
>
>
> With this feature, we can enable end-to-end block compression in Kafka. The
> idea is to enable compression on the producer for some or all topics, write
> the data in compressed format on the server and make the consumers
> compression aware. The data will be decompressed only on the consumer side.
> Ideally, there should be a choice of compression codecs to be used by the
> producer. That means a change to the message header as well as the network
> byte format. On the consumer side, the state maintenance behavior of the
> zookeeper consumer changes. For compressed data, the consumed offset will be
> advanced one compressed message at a time. For uncompressed data, consumed
> offset will be advanced one message at a time.
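The offset behavior described above can be modeled as follows. This is an illustrative sketch, not Kafka's actual consumer iterator: each shallow log entry is either a plain record or a compressed wrapper holding several inner records, and the consumed offset may only advance once a whole shallow message has been delivered.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSketch {
    // A shallow log entry: either one plain record (innerCount == 1) or a
    // compressed wrapper holding several inner records (innerCount > 1).
    public static final class Entry {
        final boolean compressed;
        final int innerCount;
        public Entry(boolean compressed, int innerCount) {
            this.compressed = compressed;
            this.innerCount = innerCount;
        }
    }

    // Returns the consumed-offset checkpoint recorded after each delivered
    // record. For uncompressed data the offset moves one message at a time;
    // inside a compressed wrapper it stays put until the last inner record
    // has been handed out, then jumps past the whole wrapper.
    public static List<Long> checkpoints(List<Entry> log) {
        List<Long> out = new ArrayList<>();
        long consumed = 0;
        for (Entry e : log) {
            for (int i = 0; i < e.innerCount; i++) {
                if (i == e.innerCount - 1)
                    consumed += 1; // whole shallow message now consumed
                out.add(consumed);
            }
        }
        return out;
    }
}
```

For a log of [plain, wrapper-of-3, plain], the checkpoints after each delivered record are [1, 1, 1, 2, 3]: the offset does not move while records from inside the wrapper are being consumed.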
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira