You may be running into this bug: https://issues.apache.org/jira/browse/KAFKA-4614

On Thu, 12 Jan 2017 at 23:38 Stephen Powis <spo...@salesforce.com> wrote:

> Per my email to the list in September, when I reviewed the GC logs then I
> didn't see anything out of the ordinary. (
>
> http://mail-archives.apache.org/mod_mbox/kafka-users/201609.mbox/%3CCABQB-gS7h4Nuq3TKgHoAVeRHPWnBNs2B0Tz0kCjmdB9c0SDcLQ%40mail.gmail.com%3E
> )
>
> Reviewing the GC logs from this morning around the timestamp in my previous
> email, I see GC running roughly every 10-12 seconds, with total times
> similar to the following:
>
> > 2017-01-12T07:16:46.867-0500: 46891.844: Total time for which application threads were stopped: 0.0141281 seconds, Stopping threads took: 0.0002171 seconds
> >
> >
> Here is a link to a GCEasy report:
>
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMDEvMTIvLS10b3RhbEdDLWthZmthMS00LmxvZy5nei0tMTUtMzQtNTk=
>
>
> Currently using the G1 collector with the following settings:
>
> -Xmx12G -Xms12G -server -XX:MaxPermSize=48M -verbose:gc
> -Xloggc:/var/log/kafka/gc.log -XX:+PrintGCDateStamps -XX:+PrintGCDetails
> -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintTLAB -XX:+DisableExplicitGC -XX:+UseGCLogFileRotation
> -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -XX:+UseCompressedOops
> -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:MaxGCPauseMillis=20
> -XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=/var/log/kafka/heapDump.log
> -Xloggc:/opt/kafka/current/bin/../logs/kafkaServer-gc.log -verbose:gc
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
>
>
>
> On Thu, Jan 12, 2017 at 9:55 AM, Tauzell, Dave <dave.tauz...@surescripts.com>
> wrote:
>
> > Can you collect garbage collection stats and verify there isn't a long GC
> > happening at the same time?
> >
> > -Dave
> >
> > -----Original Message-----
> > From: Stephen Powis [mailto:spo...@salesforce.com]
> > Sent: Thursday, January 12, 2017 8:34 AM
> > To: users@kafka.apache.org
> > Subject: Re: Taking a long time to roll a new log segment (~1 min)
> >
> > So per the Kafka docs I upped our FD limit to 100k, and we are no longer
> > seeing the process die, which is good.
> >
> > Unfortunately we're still seeing very high log segment roll times, and I'm
> > unsure if this is considered 'normal', as it tends to block producers
> > during this period.
> >
> > We are running Kafka 0.10.0.1, but I patched some additional timing
> > statements into the kafka.log.Log roll() method to narrow down exactly
> > which part of that method is taking so long.
> >
> > Again, typically the process to roll a new log file takes only 1-2ms tops,
> > but several times a day it takes 30-60+ seconds, across all of our
> > brokers.  I've narrowed it down to this bit of code causing the issue:
> > https://github.com/apache/kafka/blob/0.10.0/core/src/main/scala/kafka/log/Log.scala#L652-L658
> >
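> > For anyone curious, the extra timing is nothing fancy: just wall-clock
> > deltas around that block inside Log.roll(). Roughly like this sketch (from
> > memory, not the exact patch; the trimToValidSize()/trim() calls are my
> > reading of what those linked 0.10.0 lines do):
> >
> >     // inside kafka.log.Log.roll(), around the segments.lastEntry() match block
> >     val matchStart = time.milliseconds   // `time` is the kafka.utils.Time instance Log already holds
> >     segments.lastEntry() match {
> >       case null =>
> >       case entry =>
> >         entry.getValue.index.trimToValidSize()  // shrink the memory-mapped offset index to its used size
> >         entry.getValue.log.trim()               // trim the segment file down to its valid size
> >     }
> >     info("LastEntry match time '%s' in %d ms.".format(name, time.milliseconds - matchStart))
> >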
> > Here's an example of output w/ my additional timing log statements:
> >
> > > [2017-01-12 07:17:58,199] INFO Rolled new log segment for 'MyTopic-4' in 28028 ms. (kafka.log.Log)
> > > [2017-01-12 07:17:58,200] INFO LastEntry match time 'EventStream-4' in 28027 ms. (kafka.log.Log)
> >
> > Any thoughts?  Help?  Is this considered normal?
> >
> > Thanks
> > Stephen
> >
> >
> > On Tue, Jan 10, 2017 at 2:37 AM, Ewen Cheslack-Postava <e...@confluent.io>
> > wrote:
> >
> > > I can't speak to the exact details of why fds would be kept open
> > > longer in that specific case, but are you aware that the
> > > recommendation for production clusters for open fd limits is much
> > > higher? It's been suggested to be 100,000 as a starting point for
> > > quite a while: http://kafka.apache.org/documentation.html#os
> > >
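> > > If you want to watch how close the broker is actually getting to that
> > > limit from inside the JVM, something like this works on HotSpot/OpenJDK
> > > on Linux (a rough sketch, not Kafka code):
> > >
> > >     import java.lang.management.ManagementFactory
> > >     import com.sun.management.UnixOperatingSystemMXBean
> > >
> > >     // Report the process's current and maximum open file descriptor counts.
> > >     ManagementFactory.getOperatingSystemMXBean match {
> > >       case os: UnixOperatingSystemMXBean =>
> > >         println("open fds = %d / max = %d".format(
> > >           os.getOpenFileDescriptorCount, os.getMaxFileDescriptorCount))
> > >       case _ =>
> > >         println("FD counts not exposed by this JVM")
> > >     }
> > >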
> > > -Ewen
> > >
> > > On Mon, Jan 9, 2017 at 12:45 PM, Stephen Powis <spo...@salesforce.com>
> > > wrote:
> > >
> > > > Hey!
> > > >
> > > > I've run into something concerning in our production cluster... I
> > > > believe I've posted this question to the mailing list previously (
> > > > http://mail-archives.apache.org/mod_mbox/kafka-users/201609.mbox/browser
> > > > ) but the problem has become considerably more serious.
> > > >
> > > > We've been fighting issues where Kafka 0.10.0.1 hits its max file
> > > > descriptor limit.  Our limit is set to ~16k, and under normal operation
> > > > it holds steady around 4k open files.
> > > >
> > > > But occasionally Kafka will roll a new log segment, which typically
> > > > takes on the order of a few milliseconds.  However, sometimes it will
> > > > take a considerable amount of time, anywhere from 40 seconds up to
> > > > over a minute.  When this happens, it seems like connections are not
> > > > released by Kafka, and we end up with thousands of client connections
> > > > stuck in CLOSE_WAIT, which pile up and exceed our max file descriptor
> > > > limit.  This all happens in the span of about a minute.
> > > >
> > > > Our logs look like this:
> > > >
> > > > > [2017-01-08 01:10:17,117] INFO Rolled new log segment for 'MyTopic-8' in 41122 ms. (kafka.log.Log)
> > > > > [2017-01-08 01:10:32,550] INFO Rolled new log segment for 'MyTopic-4' in 1 ms. (kafka.log.Log)
> > > > > [2017-01-08 01:11:10,039] INFO [Group Metadata Manager on Broker 4]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
> > > > > [2017-01-08 01:19:02,877] ERROR Error while accepting connection (kafka.network.Acceptor)
> > > > > java.io.IOException: Too many open files
> > > > >         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> > > > >         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> > > > >         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> > > > >         at kafka.network.Acceptor.accept(SocketServer.scala:323)
> > > > >         at kafka.network.Acceptor.run(SocketServer.scala:268)
> > > > >         at java.lang.Thread.run(Thread.java:745)
> > > > > [2017-01-08 01:19:02,877] ERROR Error while accepting connection (kafka.network.Acceptor)
> > > > > java.io.IOException: Too many open files
> > > > >         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> > > > >         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> > > > >         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> > > > >         at kafka.network.Acceptor.accept(SocketServer.scala:323)
> > > > >         at kafka.network.Acceptor.run(SocketServer.scala:268)
> > > > >         at java.lang.Thread.run(Thread.java:745)
> > > > > .....
> > > >
> > > >
> > > > And then Kafka crashes.
> > > >
> > > > Has anyone seen this behavior of log segments being rolled slowly?
> > > > Any ideas on how to track down what could be causing this?
> > > >
> > > > Thanks!
> > > > Stephen
> > > >
> > >
> >
>
