[ https://issues.apache.org/jira/browse/KAFKA-406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418163#comment-13418163 ]
Lorenzo Alberton commented on KAFKA-406:
----------------------------------------

After sleeping on this, I think it's a really bad design decision.

I appreciate that gzipping multiple messages together can lead to significant space savings, but I'm not convinced this is the right way. Since a compressed message can contain a *collection* of messages, the symmetry with the non-compressed message interface is broken, and a linear log is turned into an odd tree structure. This can't even be classified as normal iterator polymorphism.

Three other very good reasons to rethink this design decision:

- as Michal also noted on the kafka-dev mailing list [1], the compression flag of a child of a compressed message could easily slip to 1, leading to endless recursive calls.
- the collection within a compressed message can't be partially consumed, i.e. you can't save an offset within the inner collection, as it would be an invalid offset for the Kafka log. The inner collection has to be consumed as a whole, and the offset then needs to be advanced to the next Message in the outer collection, breaking another important Kafka property.
- even if we only allowed a single message (instead of a collection) as the compressed payload of an outer Message, I don't see the need for the extra wrapping: the outer message has a CRC to verify that the gzipped payload is valid, and gzip itself has a CRC on the content, so there is no need for a third CRC on the uncompressed message (a waste of space and CPU).

Thoughts?

Best,
--
Lorenzo Alberton
Chief Tech Architect
DataSift, Inc.
[1] http://mail-archives.apache.org/mod_mbox/incubator-kafka-dev/201207.mbox/%3CCAP5ZrEiDjUyhYuNpmh7Xck1dzdCROG_pEgdKbZdDV2yXrXxQAg%40mail.gmail.com%3Ec

> Gzipped payload is a fully wrapped Message (with headers), not just payload
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-406
>                 URL: https://issues.apache.org/jira/browse/KAFKA-406
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.7.1
>         Environment: N/A
>            Reporter: Lorenzo Alberton
>
> When creating a gzipped MessageSet, the collection of Messages is passed to
> CompressionUtils.compress(), where each message is serialised [1] into a
> buffer (not just the payload, the full Message with headers, uncompressed),
> then gzipped, and finally wrapped into another Message [2].
> In other words, the consumer has to unwrap the Message flagged as gzipped,
> unzip the payload, and unwrap the unzipped payload again as a non-compressed
> Message.
> Is this double-wrapping the intended behaviour?
> [1] messages.foreach(m => m.serializeTo(messageByteBuffer))
> [2] new Message(outputStream.toByteArray, compressionCodec)
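
For illustration only, the double-wrapping and the two hazards raised in the comment (nested compression flags and offsets inside the inner collection) can be sketched with a toy model in Python. This is not Kafka's code or its real wire format: the 9-byte header layout and the names `wrap`, `unwrap_all`, `compress_set`, `consume` and `max_depth` are all assumptions made up for this sketch.

```python
import gzip
import struct
import zlib

NO_COMPRESSION = 0
GZIP = 1

# Hypothetical toy header, NOT Kafka's real format:
# 1-byte codec flag, 4-byte CRC32 of the payload, 4-byte payload length.
HEADER = struct.Struct(">BII")


def wrap(payload: bytes, codec: int = NO_COMPRESSION) -> bytes:
    """Frame a payload as one toy 'Message' (header + payload)."""
    return HEADER.pack(codec, zlib.crc32(payload), len(payload)) + payload


def unwrap_all(buf: bytes):
    """Yield (codec, payload) for each frame laid end to end in buf."""
    pos = 0
    while pos < len(buf):
        codec, crc, size = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        payload = buf[pos:pos + size]
        pos += size
        if zlib.crc32(payload) != crc:
            raise ValueError("CRC mismatch")
        yield codec, payload


def compress_set(payloads) -> bytes:
    """Mimic the behaviour described in the issue: every message is
    serialised in full (header included), the concatenation is gzipped,
    and the result is wrapped in one more outer message."""
    inner = b"".join(wrap(p) for p in payloads)
    return wrap(gzip.compress(inner), GZIP)


def consume(buf: bytes, depth: int = 0, max_depth: int = 1):
    """Flatten the log into plain payloads. The depth guard is one
    possible defence against a child whose flag says 'compressed'."""
    out = []
    for codec, payload in unwrap_all(buf):
        if codec == GZIP:
            if depth >= max_depth:
                raise ValueError("nested compressed message")
            out.extend(consume(gzip.decompress(payload), depth + 1, max_depth))
        else:
            out.append(payload)
    return out


log = wrap(b"a") + compress_set([b"b", b"c"]) + wrap(b"d")
outer_frames = list(unwrap_all(log))
messages = consume(log)
```

In this toy log there are three outer frames but four delivered messages, so a byte offset that points at `b"c"` only exists inside the decompressed buffer and has no counterpart in the outer log. Each inner message also carries its own CRC on top of the outer frame's CRC and gzip's own checksum.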