[jira] [Updated] (KAFKA-5781) Frequent long produce latency periods that result in reduced produce rate.

Raoufeh Hashemian (JIRA) Thu, 24 Aug 2017 09:02:23 -0700

     [ 
https://issues.apache.org/jira/browse/KAFKA-5781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Raoufeh Hashemian updated KAFKA-5781:
-------------------------------------
    Description: 
When we upgraded from Kafka 0.10.2 to 0.11.0, I started to see frequent 
throughput drops with a predictable pattern (the attached file shows the 
pattern over a 14-hour period). This resulted in a degradation of up to 30% in 
our overall produce throughput.

The drops correlate with a significant increase in 99th percentile 
latency (up to 4 seconds). We have a cluster of 6 brokers and a single topic. 
The problem happens both with and without consumers running, so I only included 
a case without consumers.

There is no specific message in the broker logs when the latency surge happens. 
However, I found a correlation between the log rotation messages in the log 
and the longer cycles in the pattern (details shown in the first attached 
graph).

Each increased latency period takes 5 to 20 minutes to finish (shown in the 
zoomed graph in the attached files). 

The broker CPU utilization goes down during this time and some disk read 
activity is observed (see attached graph).

This pattern started to appear in our environment exactly at the time when we 
switched to Kafka 0.11.0. We kept idempotence set to false and didn't make any 
configuration changes as we switched. So I was wondering whether this could be 
a bug, or whether some configuration needs to be changed after the upgrade?

  was:
When we upgraded from Kafka 0.10.2 to 0.11.0, I started to see frequent 
throughput drops with a predictable pattern (the attached file shows the 
pattern over a 14-hour period). This resulted in an overall degradation of up 
to 30% in our overall produce throughput.

The drops correlate with a significant increase in 99th percentile 
latency (up to 4 seconds). We have a cluster of 6 brokers and a single topic. 
The problem happens both with and without consumers running, so I only included 
a case without consumers.

There is no specific message in the broker logs when the latency surge happens. 
However, I found a correlation between the log rotation messages in the log 
and the longer cycles in the pattern (details shown in the first attached 
graph).

Each increased latency period takes 5 to 20 minutes to finish (shown in the 
zoomed graph in the attached files). 

The broker CPU utilization goes down during this time and some disk read 
activity is observed (see attached graph).

This pattern started to appear in our environment exactly at the time when we 
switched to Kafka 0.11.0. We kept idempotence set to false and didn't make any 
configuration changes as we switched. So I was wondering whether this could be 
a bug, or whether some configuration needs to be changed after the upgrade?


> Frequent long produce latency periods that result in reduced produce rate.
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-5781
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5781
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.11.0.0
>         Environment: CentOS Linux release 7.3.1611 , Kernel 3.10, java 
> version "1.8.0_121"
>            Reporter: Raoufeh Hashemian
>         Attachments: frequent_latency_increase_diskactivity.png, 
> frequent_latency_increase.png, frequent_latency_increase_zoomed.png
>
>
> When we upgraded from Kafka 0.10.2 to 0.11.0, I started to see frequent 
> throughput drops with a predictable pattern (the attached file shows the 
> pattern over a 14-hour period). This resulted in a degradation of up to 30% 
> in our overall produce throughput.
> The drops correlate with a significant increase in 99th percentile 
> latency (up to 4 seconds). We have a cluster of 6 brokers and a single topic. 
> The problem happens both with and without consumers running, so I only 
> included a case without consumers.
> There is no specific message in the broker logs when the latency surge 
> happens. However, I found a correlation between the log rotation messages in 
> the log and the longer cycles in the pattern (details shown in the first 
> attached graph).
> Each increased latency period takes 5 to 20 minutes to finish (shown in the 
> zoomed graph in the attached files). 
> The broker CPU utilization goes down during this time and some disk read 
> activity is observed (see attached graph).
> This pattern started to appear in our environment exactly at the time when we 
> switched to Kafka 0.11.0. We kept idempotence set to false and didn't make 
> any configuration changes as we switched. So I was wondering whether this 
> could be a bug, or whether some configuration needs to be changed after the 
> upgrade?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
