[ https://issues.apache.org/jira/browse/KAFKA-406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418376#comment-13418376 ]
Jay Kreps commented on KAFKA-406:
---------------------------------

I actually don't see how the compression flag can flip to 1 unnoticed in a checksummed message sent over TCP. You could incorrectly set it to 1, but a client implementation can incorrectly set all kinds of things, including the message contents. Not sure if I am missing something...

I actually share your distaste for some of the details of the compression implementation. But I think batch compression is fundamentally an invasive feature for a message-at-a-time system, so I am not sure we can do better. We would be open to hearing alternative approaches, if fully thought through, though it would likely be a big change. Here are the basic requirements from my point of view:

1. It must be batch compression. Single-message compression doesn't buy much for concise serialization formats, and in any case it can be implemented in the client.

2. Compression must be maintained both in the on-disk format and over the network. On disk, compression triples the effective per-node cache size in our usage. We always produce to a local cluster, which then replicates cross-datacenter, so for us it is the consumer that really must be compressed over the network. However, I am not sure that this is a universal design, so ideally both producer and consumer should allow compression, although I think it is okay if the server decompresses and re-compresses in a different batch size.

3. It should not break the message-at-a-time API.

The alternative we considered was a paged log (e.g. stuffing messages into fixed-size pages). I am not sure if this is better or worse, but we ended up rejecting it due to the complexity of implementation, which would require splitting messages over pages on overflow, etc.
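The first requirement above can be illustrated with a quick measurement. This is a minimal Python sketch, not Kafka code: the message contents and counts are invented for illustration. It compares gzipping each small message individually against gzipping the whole batch at once, showing why batch compression wins for concise, redundant messages:

```python
import gzip

# Hypothetical small, similar messages standing in for a produce batch.
messages = [f'{{"user": {i}, "event": "click", "page": "/home"}}'.encode()
            for i in range(1000)]

# Per-message compression: every message pays the fixed gzip header cost
# and cannot share a dictionary with its neighbours, so small payloads
# barely shrink (and can even grow).
per_message = sum(len(gzip.compress(m)) for m in messages)

# Batch compression: one gzip stream over the concatenated batch lets the
# compressor exploit redundancy *across* messages.
batch = len(gzip.compress(b"".join(messages)))

print(f"per-message: {per_message} bytes, batch: {batch} bytes")
```

On runs like this, the batched stream is typically an order of magnitude smaller than the sum of the individually compressed messages.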
> Gzipped payload is a fully wrapped Message (with headers), not just payload
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-406
>                 URL: https://issues.apache.org/jira/browse/KAFKA-406
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.7.1
>         Environment: N/A
>            Reporter: Lorenzo Alberton
>
> When creating a gzipped MessageSet, the collection of Messages is passed to
> CompressionUtils.compress(), where each message is serialized [1] into a
> buffer (not just the payload, but the full Message with headers, uncompressed),
> then gzipped, and finally wrapped into another Message [2].
> In other words, the consumer has to unwrap the Message flagged as gzipped,
> unzip the payload, and unwrap the unzipped payload again as a non-compressed
> Message.
> Is this double-wrapping the intended behaviour?
>
> [1] messages.foreach(m => m.serializeTo(messageByteBuffer))
> [2] new Message(outputStream.toByteArray, compressionCodec)
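The double-wrapping described above can be sketched end to end. This is a toy Python model, not Kafka's actual wire format: the Message layout here (a single attributes byte carrying the codec, then the payload) and all helper names are invented for illustration, and the real format's magic byte and CRC are omitted:

```python
import gzip
import struct

CODEC_NONE, CODEC_GZIP = 0, 1

def encode_message(payload: bytes, codec: int) -> bytes:
    # Toy Message: one attributes byte holding the compression codec,
    # followed by the payload.
    return bytes([codec]) + payload

def encode_message_set(messages: list) -> bytes:
    # A MessageSet is length-prefixed Messages laid back to back.
    return b"".join(struct.pack(">I", len(m)) + m for m in messages)

def decode_message_set(buf: bytes) -> list:
    out, pos = [], 0
    while pos < len(buf):
        (n,) = struct.unpack_from(">I", buf, pos)
        out.append(buf[pos + 4 : pos + 4 + n])
        pos += 4 + n
    return out

def compress_batch(payloads: list) -> bytes:
    # Each message is serialized with its headers (uncompressed)...
    inner = encode_message_set([encode_message(p, CODEC_NONE) for p in payloads])
    # ...the whole buffer is gzipped and wrapped in a new Message
    # flagged with the gzip codec: the "double wrapping" in the report.
    return encode_message(gzip.compress(inner), CODEC_GZIP)

def consume(wrapped: bytes) -> list:
    codec, payload = wrapped[0], wrapped[1:]     # unwrap the outer Message
    assert codec == CODEC_GZIP
    inner_set = gzip.decompress(payload)         # un-gzip the payload
    # ...then unwrap each embedded Message again to reach the payloads.
    return [m[1:] for m in decode_message_set(inner_set)]

payloads = [b"hello", b"kafka", b"gzip"]
round_tripped = consume(compress_batch(payloads))
print(round_tripped)
```

The consumer-side `consume` makes the reporter's point concrete: it must peel the outer Message, decompress, and then peel each inner Message a second time.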