[ https://issues.apache.org/jira/browse/KAFKA-5781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150090#comment-16150090 ]
Raoufeh Hashemian commented on KAFKA-5781:
------------------------------------------

Just attached the log files and a plot of the produce latency. The times in the plot are 6 hours behind the UTC time in the logs, so the peaks happened at 05:22, 05:36, and 05:49 UTC.

> Frequent long produce latency periods that result in reduced produce rate.
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-5781
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5781
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.11.0.0
>         Environment: CentOS Linux release 7.3.1611, Kernel 3.10, java version "1.8.0_121"
>            Reporter: Raoufeh Hashemian
>         Attachments: controler.log, frequent_latency_increase_diskactivity.png, frequent_latency_increase.png, frequent_latency_increase_zoomed.png, gc0.log, GC time.png, produce_delay.png, server.log, state-change.log.zip
>
> When we upgraded from Kafka 0.10.2 to 0.11.0, I started to see frequent throughput drops with a predictable pattern (the attached file shows the pattern over a 14-hour period). This resulted in a degradation of up to 30% in our overall produce throughput.
> The drops can be correlated with a significant increase in 99th-percentile latency (up to 4 seconds). We have a cluster of 6 brokers and a single topic. The problem happens both with and without consumers running, so I only included a case without consumers.
> There is no specific message in the broker logs when the latency surge happens. However, I found a correlation between the log rotation messages in the log and the longer cycles in the pattern (details shown in the attached graph: frequent_latency_increase.png).
> Each increased latency period takes 5 to 20 minutes to finish (shown in the zoomed graph in the attached files).
> The broker CPU utilization goes down during this time and some read disk activity is observed (see attached graph).
> This pattern started to appear in our environment exactly at the time when we switched to Kafka 0.11.0. We kept idempotence set to false and didn't make any configuration changes as we switched. So I was wondering if it could be a bug or a configuration that needs to be changed after the upgrade?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
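The reported correlation between log-rotation messages and the longer latency cycles can be checked mechanically by pulling the segment-roll timestamps out of server.log and shifting them onto the plot's time axis. A minimal sketch follows; the exact log-line format ("Rolled new log segment ...") and the fixed 6-hour plot offset are assumptions based on the comment above, not confirmed details of this environment:

```python
import re
from datetime import datetime, timedelta

# Matches a broker log line such as:
#   [2017-08-31 05:22:10,123] INFO Rolled new log segment for 'mytopic-0' in 2 ms. (kafka.log.Log)
# The exact wording is an assumption; adjust the pattern to the actual server.log.
ROLL_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\].*Rolled new log segment"
)

def roll_times(lines, plot_offset_hours=-6):
    """Return segment-roll timestamps shifted by the plot's UTC offset."""
    times = []
    for line in lines:
        m = ROLL_RE.match(line)
        if m:
            t = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            times.append(t + timedelta(hours=plot_offset_hours))
    return times

sample = [
    "[2017-08-31 05:22:10,123] INFO Rolled new log segment for "
    "'mytopic-0' in 2 ms. (kafka.log.Log)"
]
print(roll_times(sample))  # roll time shifted from UTC onto the plot's axis
```

The shifted timestamps can then be overlaid on produce_delay.png to see whether each roll event lines up with the start of a latency peak.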