Hey Marcos,

Thanks for the report. Can you check out
https://issues.apache.org/jira/browse/KAFKA-3994 and see if it matches? At
a glance, it looks like the same problem. We tried pretty hard to get the
fix into the release, but it didn't quite make it. A few questions:

1. Did you not see this in prior releases? As far as we can tell, it is
possible going back to 0.9.0.0, but there could be a subtle difference in
0.10.1.0 which increases its likelihood.
2. How often are you hitting this problem? You might try the patch that's
available if it's occurring frequently and you can't downgrade. I think the
existing patch is incomplete, but it should still reduce the likelihood.
3. Out of curiosity, what is the size of your cluster and how many
consumers do you have in your cluster?

Thanks!
Jason

On Thu, Nov 3, 2016 at 1:32 PM, Marcos Juarez <mjua...@gmail.com> wrote:

> Just to expand on Lawrence's answer:  The increase in file descriptor usage
> goes from 2-3K under normal conditions, to 64K+ under deadlock, which it
> hits within a couple of hours, at which point the broker goes down, because
> that's our OS-defined limit.
>
> If it was only a 33% increase from the new timestamp indexes, we should be
> going to max 4K-5K file descriptors in use, not 64K+.
>
> Marcos
>
>
> On Thu, Nov 3, 2016 at 1:53 PM, Lawrence Weikum <lwei...@pandora.com>
> wrote:
>
> > We saw this increase when upgrading from 0.9.0.1 to 0.10.0.1.
> > We’re now running on 0.10.1.0, and the FD increase is due to a deadlock,
> > not functionality or new features.
> >
> > Lawrence Weikum | Software Engineer | Pandora
> > 1426 Pearl Street, Suite 100, Boulder CO 80302
> > m 720.203.1578 | lwei...@pandora.com
> >
> > On 11/3/16, 12:42 PM, "Hans Jespersen" <h...@confluent.io> wrote:
> >
> >     The 0.10.1 broker will use more file descriptor than previous
> releases
> >     because of the new timestamp indexes. You should expect and plan for
> > ~33%
> >     more file descriptors to be open.
> >
> >     -hans
> >
> >     /**
> >      * Hans Jespersen, Principal Systems Engineer, Confluent Inc.
> >      * h...@confluent.io (650)924-2670
> >      */
> >
> >     On Thu, Nov 3, 2016 at 10:02 AM, Marcos Juarez <mjua...@gmail.com>
> > wrote:
> >
> >     > We're running into a recurrent deadlock issue in both our
> production
> > and
> >     > staging clusters, both using the latest 0.10.1 release.  The
> symptom
> > we
> >     > noticed was that, in servers in which kafka producer connections
> are
> > short
> >     > lived, every other day or so, we'd see file descriptors being
> > exhausted,
> >     > until the broker is restarted, or the broker runs out of file
> > descriptors,
> >     > and it goes down.  None of the clients are on 0.10.1 kafka jars,
> > they're
> >     > all using previous versions.
> >     >
> >     > When diagnosing the issue, we found that when the system is in that
> > state,
> >     > using up file descriptors at a really fast rate, the JVM is
> actually
> > in a
> >     > deadlock.  Did a thread dump from both jstack and visualvm, and
> > attached
> >     > those to this email.
> >     >
> >     > This is the interesting bit from the jstack thread dump:
> >     >
> >     >
> >     > Found one Java-level deadlock:
> >     > =============================
> >     > "executor-Heartbeat":
> >     >   waiting to lock monitor 0x00000000016c8138 (object
> > 0x000000062732a398, a
> >     > kafka.coordinator.GroupMetadata),
> >     >   which is held by "group-metadata-manager-0"
> >     >
> >     > "group-metadata-manager-0":
> >     >   waiting to lock monitor 0x00000000011ddaa8 (object
> > 0x000000063f1b0cc0, a
> >     > java.util.LinkedList),
> >     >   which is held by "kafka-request-handler-3"
> >     >
> >     > "kafka-request-handler-3":
> >     >   waiting to lock monitor 0x00000000016c8138 (object
> > 0x000000062732a398, a
> >     > kafka.coordinator.GroupMetadata),
> >     >   which is held by "group-metadata-manager-0"
> >     >
> >     >
> >     > I also noticed the background heartbeat thread (I'm guessing the
> one
> >     > called "executor-Heartbeat" above) is new for this release, under
> >     > KAFKA-3888 ticket - https://urldefense.proofpoint.
> > com/v2/url?u=https-3A__issues.apache.org_jira_browse_KAFKA-
> > 2D3888&d=CwIBaQ&c=gFTBenQ7Vj71sUi1A4CkFnmPzqwDo07QsHw-JRepxyw&r=
> > VSog3hHkqzZLadc6n_6BPH1OAPc78b24WpAbuhVZI0E&m=zJ2wVkapVi8N-
> > jmDGRxM8a16nchqtjTfs20lhBw5xB0&s=nEcLEnYWPyaDuPDI5vSSKPWoljoXYb
> > vNriVw0wrEegk&e=
> >     >
> >     > We haven't noticed this problem with earlier Kafka broker versions,
> > so I'm
> >     > guessing maybe this new background heartbeat thread is what
> > introduced the
> >     > deadlock problem.
> >     >
> >     > That same broker is still in the deadlock scenario, we haven't
> > restarted
> >     > it, so let me know if you'd like more info/log/stats from the
> system
> > before
> >     > we restart it.
> >     >
> >     > Thanks,
> >     >
> >     > Marcos Juarez
> >     >
> >
> >
> >
>

Reply via email to