[ 
https://issues.apache.org/jira/browse/KAFKA-527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Kreps updated KAFKA-527:
----------------------------

    Description: 
The data path for compressing or decompressing messages is extremely 
inefficient. We do something like 7 (?) complete copies of the data, often for 
simple things like adding a 4-byte size to the front. I am not sure how this 
went unnoticed.

This is likely the root cause of the performance issues we saw in doing bulk 
recompression of data in mirror maker.

The mismatch between the InputStream and OutputStream interfaces and the 
Message/MessageSet interfaces, which are based on byte buffers, is the cause of 
many of these copies.



  was:
The data path for compressing or decompressing messages is extremely 
inefficient. We do something like 7 (?) complete copies of the data, often for 
simple things like adding a 4-byte size to the front. I am not sure how this 
went unnoticed.

This is likely the root cause of the performance issues we saw in doing bulk 
recompression of data in mirror maker.

The mismatch between the InputStream and OutputStream interfaces and the 
Message/MessageSet interfaces, which are based on byte buffers, is the cause of 
many of these copies.

I believe the right thing to do is to rework the compression code so that it 
doesn't use the Stream interface. Snappy supports ByteBuffers directly. GZIP in 
Java doesn't seem to, but I think GZIP is the wrong thing to be using. If I 
understand correctly, GZIP = HEADER + DEFLATE + FOOTER. The header contains 
things like a magic number and the compression method, and the footer holds a 
checksum. Since we already record the compression type, using GZIP is 
redundant, and we should just be using DEFLATE, which has direct support for 
byte arrays. With this change I think it should be possible to optimize the 
compression down to eliminate all copying in the common case.
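
As a rough illustration of the raw-DEFLATE idea, here is a minimal sketch in 
Java. It is not Kafka code: the class RawDeflateSketch, its method names, and 
the buffer sizes are made up for this example. It relies only on 
java.util.zip.Deflater and Inflater with nowrap=true, which selects raw 
DEFLATE without the zlib or GZIP header and footer. The compressor writes at 
offset 4 of the output array, so the 4-byte size can be filled in afterwards 
without another copy of the data.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper for illustration only; not part of Kafka.
class RawDeflateSketch {
    static final int SIZE_PREFIX = 4;

    // Returns a buffer laid out as [4-byte compressed length][raw DEFLATE bytes].
    static ByteBuffer compressWithSizePrefix(byte[] input) {
        // nowrap=true: raw DEFLATE, no zlib/GZIP header or footer.
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        try {
            deflater.setInput(input);
            deflater.finish();
            // Assumed starting capacity; grown only if the data does not compress.
            byte[] out = new byte[SIZE_PREFIX + Math.max(64, input.length)];
            int pos = SIZE_PREFIX; // leave room for the size so no later copy is needed
            while (!deflater.finished()) {
                if (pos == out.length) {
                    out = Arrays.copyOf(out, out.length * 2);
                }
                pos += deflater.deflate(out, pos, out.length - pos);
            }
            ByteBuffer framed = ByteBuffer.wrap(out, 0, pos);
            framed.putInt(0, pos - SIZE_PREFIX); // fill in the size in place
            return framed;
        } finally {
            deflater.end();
        }
    }

    // Reads the size prefix and inflates directly from the buffer's backing array.
    // Assumes a heap buffer (framed.hasArray() is true).
    static ByteBuffer decompress(ByteBuffer framed) {
        int compressedLength = framed.getInt(framed.position());
        Inflater inflater = new Inflater(true);
        try {
            inflater.setInput(framed.array(),
                    framed.arrayOffset() + framed.position() + SIZE_PREFIX,
                    compressedLength);
            byte[] out = new byte[Math.max(64, compressedLength * 4)];
            int pos = 0;
            boolean fedDummyByte = false;
            while (!inflater.finished()) {
                if (pos == out.length) {
                    out = Arrays.copyOf(out, out.length * 2);
                }
                int n = inflater.inflate(out, pos, out.length - pos);
                pos += n;
                if (n == 0 && inflater.needsInput()) {
                    if (fedDummyByte) {
                        throw new IllegalArgumentException("truncated DEFLATE data");
                    }
                    // Older Inflater Javadoc asks for one extra dummy byte in nowrap mode.
                    inflater.setInput(new byte[1]);
                    fedDummyByte = true;
                }
            }
            return ByteBuffer.wrap(out, 0, pos);
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt DEFLATE data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] input = "kafka kafka kafka kafka".getBytes(StandardCharsets.UTF_8);
        ByteBuffer framed = compressWithSizePrefix(input);
        ByteBuffer restored = decompress(framed);
        System.out.println("round trip ok: " + restored.equals(ByteBuffer.wrap(input)));
    }
}
```

The sketch still copies when an output array has to grow, so it only shows the 
shape of the approach and does not eliminate every copy.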



    
> Compression support does numerous byte copies
> ---------------------------------------------
>
>                 Key: KAFKA-527
>                 URL: https://issues.apache.org/jira/browse/KAFKA-527
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jay Kreps
>
> The data path for compressing or decompressing messages is extremely 
> inefficient. We do something like 7 (?) complete copies of the data, often 
> for simple things like adding a 4-byte size to the front. I am not sure how 
> this went unnoticed.
> This is likely the root cause of the performance issues we saw in doing bulk 
> recompression of data in mirror maker.
> The mismatch between the InputStream and OutputStream interfaces and the 
> Message/MessageSet interfaces, which are based on byte buffers, is the cause 
> of many of these copies.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
