Jay Kreps created KAFKA-527:
-------------------------------

Summary: Compression support does numerous byte copies
Key: KAFKA-527
URL: https://issues.apache.org/jira/browse/KAFKA-527
Project: Kafka
Issue Type: Bug
Reporter: Jay Kreps

The data path for compressing or decompressing messages is extremely inefficient. We do something like 7 (?) complete copies of the data, often for simple things like adding a 4-byte size to the front. I am not sure how this went unnoticed.

This is likely the root cause of the performance issues we saw when doing bulk recompression of data in mirror maker.

Many of these copies come from the mismatch between the InputStream and OutputStream interfaces and the Message/MessageSet interfaces, which are based on byte buffers. I believe the right thing to do is to rework the compression code so that it doesn't use the Stream interface. Snappy supports ByteBuffers directly. GZIP in Java doesn't seem to, but I think GZIP is the wrong thing to be using. If I understand correctly, GZIP = DEFLATE + HEADER + FOOTER. The header and footer contain things like a format identifier and a checksum. Since we already record the compression type, using GZIP is redundant, and we should just be using DEFLATE, which has direct support for byte arrays.

With this change I think it should be possible to optimize the compression path to eliminate all copying in the common case.
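
The suggestion to use raw DEFLATE on byte arrays instead of the GZIP stream classes could look roughly like the sketch below. It is only an illustration, not Kafka's code: the class and method names (RawDeflateSketch, deflateInto, inflateInto, gzipSize) are hypothetical, the output buffer sizing is an assumption that suits small compressible payloads, and it assumes a reasonably recent JDK. The `nowrap = true` argument asks java.util.zip for raw DEFLATE with no header or trailer, and both directions write straight into a caller-supplied array without an intermediate stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;

// Hypothetical helper, not part of Kafka.
class RawDeflateSketch {

    // Compress src directly into dst as raw DEFLATE; returns the number of bytes written.
    static int deflateInto(byte[] src, byte[] dst) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // true = no header/trailer
        try {
            deflater.setInput(src);
            deflater.finish();
            int total = 0;
            while (!deflater.finished() && total < dst.length) {
                total += deflater.deflate(dst, total, dst.length - total);
            }
            if (!deflater.finished()) {
                throw new IllegalStateException("output buffer too small");
            }
            return total;
        } finally {
            deflater.end(); // release native zlib memory
        }
    }

    // Decompress the first srcLen bytes of src directly into dst; returns the number of bytes written.
    static int inflateInto(byte[] src, int srcLen, byte[] dst) throws DataFormatException {
        Inflater inflater = new Inflater(true); // true = expect raw DEFLATE
        try {
            inflater.setInput(src, 0, srcLen);
            int total = 0;
            while (!inflater.finished() && total < dst.length) {
                int n = inflater.inflate(dst, total, dst.length - total);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // no further progress is possible
                }
                total += n;
            }
            return total;
        } finally {
            inflater.end();
        }
    }

    // For comparison only: size of the same payload when written through GZIPOutputStream.
    static int gzipSize(byte[] src) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(src);
        }
        return bos.size();
    }

    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50; i++) {
            sb.append("kafka message payload ");
        }
        byte[] message = sb.toString().getBytes(StandardCharsets.UTF_8);

        byte[] compressed = new byte[message.length + 64]; // assumed to be large enough here
        int compressedLen = deflateInto(message, compressed);

        byte[] restored = new byte[message.length];
        int restoredLen = inflateInto(compressed, compressedLen, restored);

        System.out.println("raw DEFLATE bytes: " + compressedLen);
        System.out.println("GZIP bytes:        " + gzipSize(message));
        System.out.println("round trip ok:     "
                + (restoredLen == message.length && Arrays.equals(message, restored)));
    }
}
```

The GZIP size printed for comparison includes the framing defined by the GZIP format (a header of at least 10 bytes and an 8-byte trailer carrying a CRC32 and the uncompressed length). That is the part described above as redundant once the compression type is recorded elsewhere.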