Jay Kreps created KAFKA-527:
-------------------------------
Summary: Compression support does numerous byte copies
Key: KAFKA-527
URL: https://issues.apache.org/jira/browse/KAFKA-527
Project: Kafka
Issue Type: Bug
Reporter: Jay Kreps
The data path for compressing or decompressing messages is extremely
inefficient. We do something like 7 (?) complete copies of the data, often for
simple things like adding a 4-byte size to the front. I am not sure how this went by
unnoticed.
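As an illustration of how a size prefix can be added without a copy, here is a
minimal sketch: reserve the 4 bytes up front, let the producer write straight
into the buffer, then back-fill the size with an absolute put. The class name
SizePrefixSketch, the method writeSized, and the callback shape are hypothetical
and are not taken from the Kafka code base.

```java
import java.nio.ByteBuffer;
import java.util.function.Consumer;

class SizePrefixSketch {
    // Hypothetical helper: the body callback writes its bytes directly into
    // the buffer after a reserved 4-byte slot, so no second buffer is needed.
    public static ByteBuffer writeSized(int capacity, Consumer<ByteBuffer> body) {
        ByteBuffer buf = ByteBuffer.allocate(capacity);
        buf.position(4);                 // leave room for the size field
        body.accept(buf);                // producer writes the payload in place
        int size = buf.position() - 4;   // bytes the producer actually wrote
        buf.putInt(0, size);             // absolute put; position is unchanged
        buf.flip();                      // ready for reading: [size][payload]
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer framed = writeSized(64, b -> b.put(new byte[] {1, 2, 3}));
        System.out.println("framed bytes: " + framed.remaining());
    }
}
```

The point of the sketch is only that the size can be written last into space
reserved first, rather than by allocating a new array and copying the payload
behind the size.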
This is likely the root cause of the performance issues we saw in doing bulk
recompression of data in mirror maker.
The mismatch between the InputStream and OutputStream interfaces and the
Message/MessageSet interfaces, which are based on byte buffers, is the cause of
many of these copies.
I believe the right thing to do is to rework the compression code so that it
doesn't use the Stream interface. Snappy supports ByteBuffers directly. GZIP in
Java doesn't seem to, but I think GZIP is the wrong thing to be using. If I
understand correctly GZIP = DEFLATE + HEADER + FOOTER. The header and footer contain
things like a magic number, a timestamp, and a CRC-32 checksum. Since we already record the compression
type, using GZIP is redundant, and we should just be using DEFLATE which has
direct support for byte arrays. With this change I think it should be possible
to optimize the compression down to eliminate all copying in the common case.
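
A minimal sketch of what using DEFLATE directly on byte arrays could look like
with java.util.zip. Passing nowrap = true to Deflater and Inflater produces and
consumes raw DEFLATE data with no header or footer. The class and method names
are hypothetical, the caller is assumed to supply an output array that is large
enough, and this is not the actual Kafka implementation.

```java
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class RawDeflateSketch {
    // Compresses in[off, off+len) into out starting at outOff.
    // Returns the number of compressed bytes, or -1 if out was too small.
    public static int compress(byte[] in, int off, int len, byte[] out, int outOff) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // nowrap: raw DEFLATE
        try {
            deflater.setInput(in, off, len);
            deflater.finish();
            int written = deflater.deflate(out, outOff, out.length - outOff);
            return deflater.finished() ? written : -1;
        } finally {
            deflater.end(); // release native zlib memory
        }
    }

    // Decompresses raw DEFLATE data from in[off, off+len) into out at outOff.
    // Returns the number of decompressed bytes, or -1 if out was too small.
    // Note: the Inflater javadoc mentions supplying an extra dummy input byte
    // in nowrap mode; this sketch assumes a zlib build that does not need it.
    public static int decompress(byte[] in, int off, int len, byte[] out, int outOff) {
        Inflater inflater = new Inflater(true); // nowrap: expect raw DEFLATE
        try {
            inflater.setInput(in, off, len);
            int written = inflater.inflate(out, outOff, out.length - outOff);
            return inflater.finished() ? written : -1;
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt DEFLATE data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] src = new byte[1000];
        for (int i = 0; i < src.length; i++) src[i] = (byte) (i % 7);
        byte[] packed = new byte[2048];
        int n = compress(src, 0, src.length, packed, 0);
        byte[] restored = new byte[2048];
        int m = decompress(packed, 0, n, restored, 0);
        System.out.println("compressed " + src.length + " -> " + n + ", restored " + m);
    }
}
```

Because both methods take an output offset, a caller could compress straight
into the backing array of the destination buffer, after whatever bytes it has
reserved for its own header, without an intermediate stream or array.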