[ https://issues.apache.org/jira/browse/KAFKA-406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418376#comment-13418376 ]
Jay Kreps commented on KAFKA-406:
---------------------------------

I actually don't see how the compression flag can flip to 1 unnoticed in a checksummed message sent over TCP. You could incorrectly set it to 1, but a client implementation can incorrectly set all kinds of things, including the message contents. Not sure if I am missing something...

I actually share your distaste for some of the details of the compression implementation. But I think batch compression is fundamentally an invasive feature for a message-at-a-time system, so I am not sure we can do better. We would be open to hearing alternative approaches, if fully thought through, though it would likely be a big change. Here are the basic requirements from my point of view:

1. It must be batch compression. Single-message compression doesn't buy much for concise serialization formats, and in any case it can be implemented in the client.

2. Compression must be maintained both in the on-disk format and over the network. On disk, compression triples the effective per-node cache size in our usage. We always produce to a local cluster, which then replicates cross-datacenter, so for us it is the consumer that really must be compressed over the network. However, I am not sure that this is a universal design, so ideally both producer and consumer should allow compression, although I think it is okay if the server decompresses and re-compresses in a different batch size.

3. It should not break the message-at-a-time API.

The alternative we considered was a paged log (e.g. stuffing messages into fixed-size pages). I am not sure if this is better or worse, but we ended up rejecting it due to the complexity of implementation, which would require splitting messages over pages on overflow, etc.
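The first requirement above can be illustrated with a quick measurement. This is a minimal Python sketch, not Kafka code: the message contents and counts are invented for illustration. It compares gzipping each small message individually against gzipping the whole batch at once, showing why batch compression wins for concise, redundant messages:

```python
import gzip

# Hypothetical small, similar messages standing in for a produce batch.
messages = [f'{{"user": {i}, "event": "click", "page": "/home"}}'.encode()
            for i in range(1000)]

# Per-message compression: every message pays the fixed gzip header cost
# and cannot share a dictionary with its neighbours, so small payloads
# barely shrink (and can even grow).
per_message = sum(len(gzip.compress(m)) for m in messages)

# Batch compression: one gzip stream over the concatenated batch lets the
# compressor exploit redundancy *across* messages.
batch = len(gzip.compress(b"".join(messages)))

print(f"per-message: {per_message} bytes, batch: {batch} bytes")
```

On runs like this, the batched stream is typically an order of magnitude smaller than the sum of the individually compressed messages.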
> Gzipped payload is a fully wrapped Message (with headers), not just payload
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-406
>                 URL: https://issues.apache.org/jira/browse/KAFKA-406
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.7.1
>         Environment: N/A
>            Reporter: Lorenzo Alberton
>
> When creating a gzipped MessageSet, the collection of Messages is passed to
> CompressionUtils.compress(), where each message is serialized [1] into a
> buffer (not just the payload, but the full Message with headers, uncompressed),
> then gzipped, and finally wrapped into another Message [2].
> In other words, the consumer has to unwrap the Message flagged as gzipped,
> unzip the payload, and unwrap the unzipped payload again as a non-compressed
> Message.
> Is this double-wrapping the intended behaviour?
>
> [1] messages.foreach(m => m.serializeTo(messageByteBuffer))
> [2] new Message(outputStream.toByteArray, compressionCodec)
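The double-wrapping described above can be sketched end to end. This is a toy Python model, not Kafka's actual wire format: the Message layout here (a single attributes byte carrying the codec, then the payload) and all helper names are invented for illustration, and the real format's magic byte and CRC are omitted:

```python
import gzip
import struct

CODEC_NONE, CODEC_GZIP = 0, 1

def encode_message(payload: bytes, codec: int) -> bytes:
    # Toy Message: one attributes byte holding the compression codec,
    # followed by the payload.
    return bytes([codec]) + payload

def encode_message_set(messages: list) -> bytes:
    # A MessageSet is length-prefixed Messages laid back to back.
    return b"".join(struct.pack(">I", len(m)) + m for m in messages)

def decode_message_set(buf: bytes) -> list:
    out, pos = [], 0
    while pos < len(buf):
        (n,) = struct.unpack_from(">I", buf, pos)
        out.append(buf[pos + 4 : pos + 4 + n])
        pos += 4 + n
    return out

def compress_batch(payloads: list) -> bytes:
    # Each message is serialized with its headers (uncompressed)...
    inner = encode_message_set([encode_message(p, CODEC_NONE) for p in payloads])
    # ...the whole buffer is gzipped and wrapped in a new Message
    # flagged with the gzip codec: the "double wrapping" in the report.
    return encode_message(gzip.compress(inner), CODEC_GZIP)

def consume(wrapped: bytes) -> list:
    codec, payload = wrapped[0], wrapped[1:]     # unwrap the outer Message
    assert codec == CODEC_GZIP
    inner_set = gzip.decompress(payload)         # un-gzip the payload
    # ...then unwrap each embedded Message again to reach the payloads.
    return [m[1:] for m in decode_message_set(inner_set)]

payloads = [b"hello", b"kafka", b"gzip"]
round_tripped = consume(compress_batch(payloads))
print(round_tripped)
```

The consumer-side `consume` makes the reporter's point concrete: it must peel the outer Message, decompress, and then peel each inner Message a second time.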