[ https://issues.apache.org/jira/browse/KAFKA-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895443#comment-15895443 ]

ASF GitHub Bot commented on KAFKA-3995:
---------------------------------------

GitHub user becketqin opened a pull request:

    https://github.com/apache/kafka/pull/2638

    KAFKA-3995: fix compression ratio estimation.

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/becketqin/kafka KAFKA-3995

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/2638.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2638
    
----
commit cee4d58cb0afaa526bddd372246894bab82c82c2
Author: Jiangjie Qin <becket....@gmail.com>
Date:   2017-03-04T01:40:12Z

    KAFKA-3995: fix compression ratio estimation.

----


> Add a new configuration "enable.compression.ratio.estimation" to the producer 
> config
> ------------------------------------------------------------------------------------
>
>                 Key: KAFKA-3995
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3995
>             Project: Kafka
>          Issue Type: Improvement
>          Components: clients
>    Affects Versions: 0.10.0.0
>            Reporter: Jiangjie Qin
>            Assignee: Mayuresh Gharat
>
> We recently saw a few cases where RecordTooLargeException was thrown because 
> the compressed message sent by KafkaProducer exceeded the max message size.
> The root cause is that the compressor estimates the batch size using an 
> estimated compression ratio derived from heuristic compression ratio 
> statistics. This does not work well for traffic with highly variable 
> compression ratios (see the first sketch below). 
> For example, suppose the batch size is set to 1MB and the max message size 
> is also 1MB. Initially the producer is sending messages (each message is 
> 1MB) to topic_1, whose data can be compressed to 1/10 of the original size. 
> After a while the estimated compression ratio in the compressor will be 
> trained to 1/10, and the producer will put 10 messages into one batch. Now 
> the producer starts to send messages (each message is also 1MB) to topic_2, 
> whose messages can only be compressed to 1/5 of the original size. The 
> producer would still use 1/10 as the estimated compression ratio and put 10 
> messages into a batch. That batch would be 2MB after compression, which 
> exceeds the maximum message size. In this case the user does not have many 
> options other than resending everything, or closing the producer if they 
> care about ordering.
> This is especially an issue for services like MirrorMaker, whose producer 
> is shared by many different topics.
> To solve this issue, we can probably add a configuration 
> "enable.compression.ratio.estimation" to the producer, so that when this 
> configuration is set to false, we stop estimating the compressed size and 
> instead close the batch once the uncompressed bytes in the batch reach the 
> batch size (see the second sketch below).


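And a sketch of how the proposed opt-out might be used; note that
"enable.compression.ratio.estimation" is the name proposed in this issue,
not an existing producer config (today the producer would only warn about an
unknown property):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;

    public class ProposedConfigSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("compression.type", "gzip");
            props.put("batch.size", String.valueOf(1 << 20));
            // Proposed in KAFKA-3995: when false, stop estimating the
            // compressed size and close the batch once its UNCOMPRESSED
            // bytes reach batch.size, so the compressed batch can never
            // exceed it (assuming compression never inflates the data).
            props.put("enable.compression.ratio.estimation", "false");
            try (KafkaProducer<byte[], byte[]> producer =
                    new KafkaProducer<>(props)) {
                // produce as usual; batches are bounded by uncompressed size
            }
        }
    }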

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
