We ran into this issue several more times over the weekend.  Basically, FDs
are now exhausted so fast that we can't even get to the server in time; the
JVM goes down in less than 5 minutes.
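
For anyone who wants to keep an eye on this before the broker dies, here's a
minimal sketch we can run on the broker host to log the open FD count over
time (just an illustration: it assumes Linux /proc, that the broker PID is
passed on the command line, and that we have permission to read
/proc/<pid>/fd; the names are made up):

import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FdWatch {
    public static void main(String[] args) throws Exception {
        // Broker PID, passed as the first argument.
        Path fdDir = Paths.get("/proc", args[0], "fd");
        while (true) {
            long open = 0;
            // Each entry under /proc/<pid>/fd is one open file descriptor.
            try (DirectoryStream<Path> fds = Files.newDirectoryStream(fdDir)) {
                for (Path ignored : fds) {
                    open++;
                }
            }
            System.out.printf("%tT open FDs: %d%n",
                    System.currentTimeMillis(), open);
            Thread.sleep(10_000);
        }
    }
}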

I can send the full thread dumps if needed, but for brevity's sake I just
copied the relevant deadlock segments and concatenated them in the attached
text file.
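
If it's easier than grepping the dumps, the same deadlock can also be
confirmed over JMX with something like the rough sketch below (host and port
are placeholders, and it assumes remote JMX is enabled on the broker, e.g. by
setting JMX_PORT):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DeadlockCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder broker host and JMX port.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME,
                    ThreadMXBean.class);

            long[] ids = threads.findDeadlockedThreads();
            if (ids == null) {
                System.out.println("No deadlocked threads");
            } else {
                // Print each deadlocked thread, the lock it is waiting on,
                // and the thread currently holding that lock.
                for (ThreadInfo info : threads.getThreadInfo(ids, true, true)) {
                    System.out.printf("%s is waiting on %s held by %s%n",
                            info.getThreadName(), info.getLockName(),
                            info.getLockOwnerName());
                }
            }
        } finally {
            connector.close();
        }
    }
}

jstack's deadlock section gives the same information; this is just quicker to
script.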

Do you think this is something I should add to the KAFKA-3994 ticket?  Or is
the information in that ticket enough for now?

Thanks,

Marcos

On Fri, Nov 4, 2016 at 2:05 PM, Marcos Juarez <mjua...@gmail.com> wrote:

> That's great, thanks Jason.
>
> We'll try to apply the patch in the meantime, and wait for the official
> 0.10.1.1 release.
>
> Please let us know if you need more details about the deadlocks on our
> side.
>
> Thanks again!
>
> Marcos
>
> On Fri, Nov 4, 2016 at 1:02 PM, Jason Gustafson <ja...@confluent.io>
> wrote:
>
>> Hi Marcos,
>>
>> I think we'll try to get this into 0.10.1.1 (I updated the JIRA). Since
>> we're now seeing users hit this in practice, I'll definitely bump up the
>> priority on a fix. I can't say for sure when the release will be, but we'll
>> merge the fix into the 0.10.1 branch and you can build from there if you
>> need something in a hurry.
>>
>> Thanks,
>> Jason
>>
>> On Fri, Nov 4, 2016 at 9:48 AM, Marcos Juarez <mjua...@gmail.com> wrote:
>>
>> > Jason,
>> >
>> > Thanks for that link.  It does appear to be a very similar issue, if not
>> > identical.  In our case, the deadlock is reported across 3 threads, one
>> > of them being a group_metadata_manager thread. Otherwise, it looks the
>> > same.
>> >
>> > On your questions:
>> >
>> > - We did not see this in prior releases, but we have been ramping up
>> > usage of our Kafka clusters lately, so maybe we didn't have the volume
>> > needed to trigger it before.
>> >
>> > - Across our multiple staging and production clusters, we're seeing the
>> > problem roughly once or twice a day.
>> >
>> > - Our clusters are small at the moment.  The two that are experiencing
>> > the issue are 5 and 8 brokers, respectively.  The number of consumers is
>> > small, I'd say less than 20 at the moment.  The amount of data being
>> > produced is small also, in the tens of megabytes per second range, but
>> > the number of connects/disconnects is really high, because the producers
>> > are usually short-lived processes.  Our guess at the moment is that this
>> > is what's triggering the bug.  We have a separate cluster where we don't
>> > have short-lived producers, and that one has been rock solid.
>> >
>> >
>> > We'll look into applying the patch, which could help reduce the problem.
>> > The ticket says it's being targeted for the 0.10.2 release.  Any rough
>> > estimate of a timeline for that to come out?
>> >
>> > Thanks!
>> >
>> > Marcos
>> >
>> >
>> > On Thu, Nov 3, 2016 at 5:19 PM, Jason Gustafson <ja...@confluent.io>
>> > wrote:
>> >
>> > > Hey Marcos,
>> > >
>> > > Thanks for the report. Can you check out
>> > > https://issues.apache.org/jira/browse/KAFKA-3994 and see if it matches?
>> > > At a glance, it looks like the same problem. We tried pretty hard to get
>> > > the fix into the release, but it didn't quite make it. A few questions:
>> > >
>> > > 1. Did you not see this in prior releases? As far as we can tell, it is
>> > > possible going back to 0.9.0.0, but there could be a subtle difference
>> > > in 0.10.1.0 which increases its likelihood.
>> > > 2. How often are you hitting this problem? You might try the patch
>> > > that's available if it's occurring frequently and you can't downgrade.
>> > > I think the existing patch is incomplete, but it should still reduce
>> > > the likelihood.
>> > > 3. Out of curiosity, what is the size of your cluster and how many
>> > > consumers do you have in your cluster?
>> > >
>> > > Thanks!
>> > > Jason
>> > >
>> > > On Thu, Nov 3, 2016 at 1:32 PM, Marcos Juarez <mjua...@gmail.com>
>> > > wrote:
>> > >
>> > > > Just to expand on Lawrence's answer:  The increase in file descriptor
>> > > > usage goes from 2-3K under normal conditions, to 64K+ under deadlock,
>> > > > which it hits within a couple of hours, at which point the broker
>> > > > goes down, because that's our OS-defined limit.
>> > > >
>> > > > If it was only a 33% increase from the new timestamp indexes, we
>> > > > should be going to max 4K-5K file descriptors in use, not 64K+.
>> > > >
>> > > > Marcos
>> > > >
>> > > >
>> > > > > On Thu, Nov 3, 2016 at 1:53 PM, Lawrence Weikum <lwei...@pandora.com>
>> > > > > wrote:
>> > > >
>> > > > > We saw this increase when upgrading from 0.9.0.1 to 0.10.0.1.
>> > > > > We’re now running on 0.10.1.0, and the FD increase is due to a
>> > > > > deadlock, not functionality or new features.
>> > > > >
>> > > > > Lawrence Weikum | Software Engineer | Pandora
>> > > > > 1426 Pearl Street, Suite 100, Boulder CO 80302
>> > > > > m 720.203.1578 | lwei...@pandora.com
>> > > > >
>> > > > > On 11/3/16, 12:42 PM, "Hans Jespersen" <h...@confluent.io> wrote:
>> > > > >
>> > > > >     The 0.10.1 broker will use more file descriptors than previous
>> > > > >     releases because of the new timestamp indexes. You should expect
>> > > > >     and plan for ~33% more file descriptors to be open.
>> > > > >
>> > > > >     -hans
>> > > > >
>> > > > >     /**
>> > > > >      * Hans Jespersen, Principal Systems Engineer, Confluent Inc.
>> > > > >      * h...@confluent.io (650)924-2670
>> > > > >      */
>> > > > >
>> > > > >     On Thu, Nov 3, 2016 at 10:02 AM, Marcos Juarez <mjua...@gmail.com>
>> > > > >     wrote:
>> > > > >
>> > > > >     > We're running into a recurrent deadlock issue in both our
>> > > > >     > production and staging clusters, both using the latest 0.10.1
>> > > > >     > release.  The symptom we noticed was that, on servers where
>> > > > >     > Kafka producer connections are short-lived, every other day or
>> > > > >     > so we'd see file descriptors being exhausted until either the
>> > > > >     > broker is restarted, or it runs out of file descriptors and
>> > > > >     > goes down.  None of the clients are on 0.10.1 Kafka jars;
>> > > > >     > they're all using previous versions.
>> > > > >     >
>> > > > >     > When diagnosing the issue, we found that when the system is in
>> > > > >     > that state, using up file descriptors at a really fast rate,
>> > > > >     > the JVM is actually in a deadlock.  We did a thread dump from
>> > > > >     > both jstack and VisualVM, and attached those to this email.
>> > > > >     >
>> > > > >     > This is the interesting bit from the jstack thread dump:
>> > > > >     >
>> > > > >     >
>> > > > >     > Found one Java-level deadlock:
>> > > > >     > =============================
>> > > > >     > "executor-Heartbeat":
>> > > > >     >   waiting to lock monitor 0x00000000016c8138 (object
>> > > > >     > 0x000000062732a398, a kafka.coordinator.GroupMetadata),
>> > > > >     >   which is held by "group-metadata-manager-0"
>> > > > >     >
>> > > > >     > "group-metadata-manager-0":
>> > > > >     >   waiting to lock monitor 0x00000000011ddaa8 (object
>> > > > >     > 0x000000063f1b0cc0, a java.util.LinkedList),
>> > > > >     >   which is held by "kafka-request-handler-3"
>> > > > >     >
>> > > > >     > "kafka-request-handler-3":
>> > > > >     >   waiting to lock monitor 0x00000000016c8138 (object
>> > > > >     > 0x000000062732a398, a kafka.coordinator.GroupMetadata),
>> > > > >     >   which is held by "group-metadata-manager-0"
>> > > > >     >
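>> > > > >     > In case it helps to see the cycle in isolation, here's a toy
>> > > > >     > sketch (not the broker's code, just two plain locks standing in
>> > > > >     > for the GroupMetadata object and the LinkedList) that deadlocks
>> > > > >     > the same way when the two threads take the locks in opposite
>> > > > >     > order, with a third thread stuck behind them like the heartbeat
>> > > > >     > executor:
>> > > > >     >
>> > > > >     > public class DeadlockToy {
>> > > > >     >     // Stand-ins for the GroupMetadata object and the
>> > > > >     >     // delayed-operation LinkedList from the dump above.
>> > > > >     >     static final Object groupLock = new Object();
>> > > > >     >     static final Object listLock = new Object();
>> > > > >     >
>> > > > >     >     public static void main(String[] args) {
>> > > > >     >         // Holds the group lock, then wants the list lock.
>> > > > >     >         new Thread(() -> {
>> > > > >     >             synchronized (groupLock) {
>> > > > >     >                 pause();
>> > > > >     >                 synchronized (listLock) { }
>> > > > >     >             }
>> > > > >     >         }, "group-metadata-manager-0").start();
>> > > > >     >
>> > > > >     >         // Holds the list lock, then wants the group lock.
>> > > > >     >         new Thread(() -> {
>> > > > >     >             synchronized (listLock) {
>> > > > >     >                 pause();
>> > > > >     >                 synchronized (groupLock) { }
>> > > > >     >             }
>> > > > >     >         }, "kafka-request-handler-3").start();
>> > > > >     >
>> > > > >     >         // Just wants the group lock; blocks behind the cycle.
>> > > > >     >         new Thread(() -> {
>> > > > >     >             synchronized (groupLock) { }
>> > > > >     >         }, "executor-Heartbeat").start();
>> > > > >     >     }
>> > > > >     >
>> > > > >     >     private static void pause() {
>> > > > >     >         try {
>> > > > >     >             Thread.sleep(1000);
>> > > > >     >         } catch (InterruptedException e) {
>> > > > >     >             Thread.currentThread().interrupt();
>> > > > >     >         }
>> > > > >     >     }
>> > > > >     > }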
>> > > > >     >
>> > > > >     > I also noticed the background heartbeat thread (I'm guessing
>> > > > >     > the one called "executor-Heartbeat" above) is new for this
>> > > > >     > release, under the KAFKA-3888 ticket -
>> > > > >     > https://issues.apache.org/jira/browse/KAFKA-3888
>> > > > >     >
>> > > > >     > We haven't noticed this problem with earlier Kafka broker
>> > > > >     > versions, so I'm guessing maybe this new background heartbeat
>> > > > >     > thread is what introduced the deadlock problem.
>> > > > >     >
>> > > > >     > That same broker is still in the deadlock scenario; we haven't
>> > > > >     > restarted it, so let me know if you'd like more info/log/stats
>> > > > >     > from the system before we restart it.
>> > > > >     >
>> > > > >     > Thanks,
>> > > > >     >
>> > > > >     > Marcos Juarez
>> > > > >     >
>> > > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
