Can you collect garbage collection stats and verify there isn't a long GC 
happening at the same time?
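
For example (a rough sketch, not tied to any particular tooling): you could enable GC logging on the broker JVM (-Xloggc:<file> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps), or poll the JVM's GarbageCollectorMXBeans and diff the cumulative pause times around a slow roll. Something like this hypothetical helper would do:

    import java.lang.management.ManagementFactory
    import scala.collection.JavaConverters._

    object GcSnapshot {
      // Print cumulative GC count and accumulated pause time per collector.
      // Snapshot before and after a slow roll; a big jump in timeMs means a
      // long GC overlapped the roll.
      def print(): Unit =
        for (gc <- ManagementFactory.getGarbageCollectorMXBeans.asScala)
          println(s"${gc.getName}: count=${gc.getCollectionCount}, timeMs=${gc.getCollectionTime}")
    }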

-Dave

-----Original Message-----
From: Stephen Powis [mailto:spo...@salesforce.com]
Sent: Thursday, January 12, 2017 8:34 AM
To: users@kafka.apache.org
Subject: Re: Taking a long time to roll a new log segment (~1 min)

So per the Kafka docs I upped our FD limit to 100k, and we are no longer seeing 
the process die, which is good.
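
If anyone wants to do the same: on most Linux systems this is set via 
/etc/security/limits.conf, with entries roughly like the following (assuming 
the broker runs as a "kafka" user; a broker restart is typically needed for 
the new limit to take effect):

    # /etc/security/limits.conf
    kafka  soft  nofile  100000
    kafka  hard  nofile  100000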

Unfortunately we're still seeing very high log segment roll times, and I'm 
unsure if this is considered 'normal', as it tends to block producers during 
this period.

We are running Kafka 0.10.0.1, but I patched some additional timing 
statements into the kafka.log.Log roll() method to narrow down exactly which 
part of that method is taking so long.

Again, typically the process to roll a new log file takes only 1-2ms tops, but 
several times a day it takes 30-60+ seconds, across all of our brokers.  I've 
narrowed it down to this bit of code causing the issue:
https://github.com/apache/kafka/blob/0.10.0/core/src/main/scala/kafka/log/Log.scala#L652-L658
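
The patch itself is nothing fancy, just wall-clock timing around the suspect 
block. Roughly this kind of wrapper (a reconstruction for illustration; the 
Timing helper below is hypothetical, not actual Kafka code):

    object Timing {
      // Run a block, log how long it took under the given label, and
      // return the block's result.
      def timed[T](label: String)(block: => T): T = {
        val start = System.nanoTime()
        try block
        finally println(s"$label took ${(System.nanoTime() - start) / 1000000} ms")
      }
    }

    // Used conceptually inside Log.roll(), e.g.:
    //   Timing.timed("LastEntry match") { /* the linked segment bookkeeping */ }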

Here's an example of output w/ my additional timing log statements:

[2017-01-12 07:17:58,199] INFO Rolled new log segment for 'MyTopic-4' in 28028 ms. (kafka.log.Log)
[2017-01-12 07:17:58,200] INFO LastEntry match time 'EventStream-4' in 28027 ms. (kafka.log.Log)

Any thoughts?  Help?  Is this considered normal?

Thanks
Stephen


On Tue, Jan 10, 2017 at 2:37 AM, Ewen Cheslack-Postava <e...@confluent.io>
wrote:

> I can't speak to the exact details of why fds would be kept open
> longer in that specific case, but are you aware that the
> recommendation for production clusters for open fd limits is much
> higher? It's been suggested to be 100,000 as a starting point for
> quite a while: http://kafka.apache.org/documentation.html#os
>
> -Ewen
>
> On Mon, Jan 9, 2017 at 12:45 PM, Stephen Powis <spo...@salesforce.com>
> wrote:
>
> > Hey!
> >
> > I've run into something concerning in our production cluster... I
> > believe I've posted this question to the mailing list previously
> > (http://mail-archives.apache.org/mod_mbox/kafka-users/201609.mbox/browser)
> > but the problem has become considerably more serious.
> >
> > We've been fighting issues where Kafka 0.10.0.1 hits its max file
> > descriptor limit.  Our limit is set to ~16k, and under normal
> > operation it holds steady around 4k open files.
> >
> > But occasionally Kafka will roll a new log segment, which typically
> > takes on the order of a few milliseconds.  However... sometimes it
> > will take a considerable amount of time, anywhere from 40 seconds up
> > to over a minute.  When this happens, it seems like connections are
> > not released by Kafka, and we end up with thousands of client
> > connections stuck in CLOSE_WAIT, which pile up and exceed our max
> > file descriptor limit.  This happens all in the span of about a
> > minute.
> >
> > Our logs look like this:
> >
> > > [2017-01-08 01:10:17,117] INFO Rolled new log segment for 'MyTopic-8' in 41122 ms. (kafka.log.Log)
> > > [2017-01-08 01:10:32,550] INFO Rolled new log segment for 'MyTopic-4' in 1 ms. (kafka.log.Log)
> > > [2017-01-08 01:11:10,039] INFO [Group Metadata Manager on Broker 4]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
> > > [2017-01-08 01:19:02,877] ERROR Error while accepting connection (kafka.network.Acceptor)
> > > java.io.IOException: Too many open files
> > >         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> > >         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> > >         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> > >         at kafka.network.Acceptor.accept(SocketServer.scala:323)
> > >         at kafka.network.Acceptor.run(SocketServer.scala:268)
> > >         at java.lang.Thread.run(Thread.java:745)
> > > [2017-01-08 01:19:02,877] ERROR Error while accepting connection (kafka.network.Acceptor)
> > > java.io.IOException: Too many open files
> > >         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> > >         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> > >         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> > >         at kafka.network.Acceptor.accept(SocketServer.scala:323)
> > >         at kafka.network.Acceptor.run(SocketServer.scala:268)
> > >         at java.lang.Thread.run(Thread.java:745)
> > > .....
> >
> > And then Kafka crashes.
> >
> > Has anyone seen this behavior of slow log segments being rolled?
> > Any ideas of how to track down what could be causing this?
> >
> > Thanks!
> > Stephen
> >
>