Jay Kreps created KAFKA-527:
-------------------------------

             Summary: Compression support does numerous byte copies
                 Key: KAFKA-527
                 URL: https://issues.apache.org/jira/browse/KAFKA-527
             Project: Kafka
          Issue Type: Bug
            Reporter: Jay Kreps


The data path for compressing or decompressing messages is extremely 
inefficient. We do something like 7 (?) complete copies of the data, often for 
simple things like adding a 4-byte size to the front. I am not sure how this 
went by unnoticed.

This is likely the root cause of the performance issues we saw in doing bulk 
recompression of data in mirror maker.

Many of these copies come from the mismatch between the InputStream and 
OutputStream interfaces and the Message/MessageSet interfaces, which are based 
on byte buffers.
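
To make the mismatch concrete, here is a minimal sketch of how a stream-based 
path ends up copying the payload several times just to compress it and add a 
size prefix. This is an illustration, not the actual Kafka code: the class 
StreamCopySketch and the method compressViaStreams are hypothetical names, and 
the copy count is for this sketch only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration of a stream-based compression path.
final class StreamCopySketch {

    // Compresses the remaining bytes of `message` and returns them behind a 4-byte size prefix.
    static ByteBuffer compressViaStreams(ByteBuffer message) {
        // Copy 1: the stream API wants a byte[], so the buffer contents are copied out.
        byte[] in = new byte[message.remaining()];
        message.duplicate().get(in);

        // Copy 2: compressed output accumulates in the stream's internal, growable array.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Copy 3: toByteArray() returns a fresh copy of the internal array.
        byte[] compressed = bos.toByteArray();

        // Copy 4: adding the 4-byte size means copying everything into yet another buffer.
        ByteBuffer framed = ByteBuffer.allocate(4 + compressed.length);
        framed.putInt(compressed.length);
        framed.put(compressed);
        framed.flip();
        return framed;
    }
}
```

Each step is reasonable on its own, which is probably how the copies pile up: 
every boundary between a stream and a byte buffer costs one full pass over the 
data.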

I believe the right thing to do is to rework the compression code so that it 
doesn't use the Stream interface. Snappy supports ByteBuffers directly. GZIP in 
Java doesn't seem to, but I think GZIP is the wrong thing to be using. If I 
understand correctly, GZIP = HEADER + DEFLATE + FOOTER. The header contains 
things like a magic number and the compression method, and the footer contains 
a checksum and the uncompressed length. Since we already record the 
compression type, using GZIP is redundant, and we should just be using 
DEFLATE, which has direct support for byte arrays. With this change I think it 
should be possible to eliminate all copying in the common case.
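
As a rough sketch of the DEFLATE-on-byte-arrays idea, the following uses 
java.util.zip.Deflater and Inflater with nowrap = true, which produces raw 
DEFLATE data without a GZIP or ZLIB wrapper. The class RawDeflateSketch, its 
method names, the output sizing, and the framing (a 4-byte size followed by 
the payload) are assumptions made for illustration, not the proposed Kafka 
implementation. The point is that the deflater writes straight into the output 
array at offset 4, so the size prefix can be filled in afterwards without 
another copy.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: raw DEFLATE on byte arrays with an in-place size prefix.
final class RawDeflateSketch {

    // Compresses src[0..len) and returns [4-byte size][raw DEFLATE bytes].
    static ByteBuffer compress(byte[] src, int len) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // true = no wrapper
        try {
            deflater.setInput(src, 0, len);
            deflater.finish();
            // Assumed generous bound for incompressible input; 4 bytes reserved for the size.
            byte[] out = new byte[4 + len + len / 8 + 64];
            int n = 0;
            while (!deflater.finished()) {
                int written = deflater.deflate(out, 4 + n, out.length - 4 - n);
                if (written == 0 && !deflater.finished()) {
                    throw new IllegalStateException("output buffer too small");
                }
                n += written;
            }
            ByteBuffer framed = ByteBuffer.wrap(out, 0, 4 + n);
            framed.putInt(0, n); // size prefix written into the same array, no extra copy
            return framed;
        } finally {
            deflater.end();
        }
    }

    // Reverses compress(); the caller supplies the original length (an assumption of this sketch).
    static byte[] decompress(ByteBuffer framed, int originalLen) {
        Inflater inflater = new Inflater(true);
        try {
            int start = framed.arrayOffset() + framed.position();
            int n = framed.getInt(framed.position());
            inflater.setInput(framed.array(), start + 4, n); // reads the backing array directly
            byte[] out = new byte[originalLen];
            int read = 0;
            while (read < originalLen) {
                int r = inflater.inflate(out, read, originalLen - read);
                if (r == 0) {
                    throw new IllegalStateException("truncated or corrupt input");
                }
                read += r;
            }
            return out;
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello hello hello hello hello hello".getBytes(StandardCharsets.UTF_8);
        ByteBuffer framed = compress(data, data.length);
        byte[] back = decompress(framed, data.length);
        System.out.println("framed bytes: " + framed.remaining()
                + ", round trip ok: " + Arrays.equals(data, back));
    }
}
```

The sketch still allocates an output array per call; a real implementation 
would presumably write into a buffer it already owns. It also requires an 
array-backed ByteBuffer on the decompression side, since Inflater takes a 
byte array as input here.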



