[ 
https://issues.apache.org/jira/browse/KAFKA-406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418163#comment-13418163
 ] 

Lorenzo Alberton commented on KAFKA-406:
----------------------------------------

After sleeping on this, I think it's a really bad design decision. I 
appreciate that gzipping multiple messages together can lead to significant 
space savings, but I'm not convinced this is the right way. Since a compressed 
message can contain a *collection* of messages, the symmetry with the 
non-compressed message interface is broken, and a linear log is turned into an 
odd tree structure. This can't even be classified as normal iterator 
polymorphism. 
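
To make the shape concrete, here is a minimal sketch in Python of what a consumer ends up doing. It is a toy model, not Kafka's actual classes: `Message`, `flatten` and `max_depth` are names I made up for illustration. A compressed message holds a list of child messages, so reading the "linear" log becomes a tree walk, and the walk needs an explicit depth limit to stay bounded.

```python
from dataclasses import dataclass
from typing import Iterator, List, Union


@dataclass
class Message:
    # Hypothetical stand-in for a log entry: a plain payload, or, when
    # `compressed` is True, a list of nested messages.
    compressed: bool
    payload: Union[bytes, List["Message"]]


def flatten(messages: List[Message], depth: int = 0,
            max_depth: int = 1) -> Iterator[bytes]:
    """Yield plain payloads in log order, descending into compressed messages."""
    for m in messages:
        if not m.compressed:
            yield m.payload
        elif depth >= max_depth:
            # A compressed child inside a compressed message: refuse to
            # descend further instead of recursing without bound.
            raise ValueError("compressed message nested too deeply")
        else:
            yield from flatten(m.payload, depth + 1, max_depth)


log = [
    Message(False, b"a"),
    Message(True, [Message(False, b"b"), Message(False, b"c")]),
    Message(False, b"d"),
]
print(list(flatten(log)))
```

With `max_depth=1`, a compressed message whose child is also flagged as compressed raises `ValueError` rather than descending again.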

Three other very good reasons to rethink this design decision:

- as Michal also noted on the kafka-dev mailing list [1], the compression flag 
of a child of a compressed message could easily end up set to 1, leading to 
endless recursion.

- the collection within a compressed message can't be partially consumed, i.e. 
you can't save the offset within the inner collection, as it would result in an 
invalid offset for the kafka log. The inner collection has to be consumed as a 
whole and the offset needs to be advanced to the next Message in the outer 
collection, breaking another important Kafka property.

- even if we only allow one single message (instead of a collection) as 
compressed payload of an outer Message, I don't see the need for the extra 
wrapping: the outer message has a CRC to verify that the gzipped payload is 
valid, and gzip itself has a CRC on the content, so there is no need for a 
third CRC on the uncompressed message (a waste of space and CPU).
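
As a quick check of the last point, here is a small Python sketch (standard library only, nothing Kafka-specific) showing that a gzip stream already carries a CRC32 of the uncompressed content in its trailer. The helper names `gzip_trailer_crc` and `gzip_detects_corruption` are mine, chosen for illustration.

```python
import gzip
import struct
import zlib


def gzip_trailer_crc(blob: bytes) -> int:
    """Return the CRC32 stored in the trailer of a single-member gzip stream."""
    # Per RFC 1952 the last 8 bytes are CRC32 and ISIZE, both little-endian.
    crc, _isize = struct.unpack("<II", blob[-8:])
    return crc


def gzip_detects_corruption(blob: bytes) -> bool:
    """Flip one bit of the stored CRC and report whether decompression fails."""
    damaged = bytearray(blob)
    damaged[-8] ^= 0x01
    try:
        gzip.decompress(bytes(damaged))
    except OSError:  # gzip.BadGzipFile is a subclass of OSError
        return True
    return False


payload = b"uncompressed message bytes " * 8
blob = gzip.compress(payload)

# The checksum gzip stores is the CRC32 of the uncompressed content.
print(gzip_trailer_crc(blob) == zlib.crc32(payload))
# A mismatch between content and stored CRC is caught by gzip itself.
print(gzip_detects_corruption(blob))
```

So a checksum over the compressed bytes in the outer message, plus gzip's own trailer, already covers both the compressed and the uncompressed view of the data.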

Thoughts?

Best,
-- 
Lorenzo Alberton
Chief Tech Architect
DataSift, Inc.


[1] 
http://mail-archives.apache.org/mod_mbox/incubator-kafka-dev/201207.mbox/%3CCAP5ZrEiDjUyhYuNpmh7Xck1dzdCROG_pEgdKbZdDV2yXrXxQAg%40mail.gmail.com%3Ec
                
> Gzipped payload is a fully wrapped Message (with headers), not just payload
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-406
>                 URL: https://issues.apache.org/jira/browse/KAFKA-406
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.7.1
>         Environment: N/A
>            Reporter: Lorenzo Alberton
>
> When creating a gzipped MessageSet, the collection of Messages is passed to 
> CompressionUtils.compress(), where each message is serialised [1] into a 
> buffer (not just the payload, the full Message with headers, uncompressed), 
> then gzipped, and finally wrapped into another Message [2].
> In other words, the consumer has to unwrap the Message flagged as gzipped, 
> unzip the payload, and unwrap the unzipped payload again as a non-compressed 
> Message. 
> Is this double-wrapping the intended behaviour? 
> [1] messages.foreach(m => m.serializeTo(messageByteBuffer))
> [2] new Message(outputStream.toByteArray, compressionCodec) 

