Hi Marcos,

Is it possible for you to apply the patch from KAFKA-3994 and see if the
issue is still there? The current patch should work; the only reason we
haven't checked it in yet is that our stress tests showed a noticeable
performance impact when producers are producing with acks=all. So if you
are blocked on this issue, you could pick up the patch as a short-term
solution. Meanwhile, we will prioritize the ticket.

Thanks,

Jiangjie (Becket) Qin

On Mon, Nov 7, 2016 at 9:47 AM, Marcos Juarez <mjua...@gmail.com> wrote:

> We ran into this issue several more times over the weekend.  Basically,
> FDs are exhausted so fast now that we can't even get to the server in
> time; the JVM goes down in less than 5 minutes.
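>
> In case it helps with monitoring, here is a rough, hypothetical sketch of
> how the open descriptor count can be polled from inside a JVM
> (illustrative only; the com.sun.management cast is HotSpot/Unix-specific,
> and the poll interval is arbitrary).  It watches the process it runs in,
> so it would have to be attached to the broker, or the same
> OperatingSystemMXBean attributes read remotely over JMX:
>
> import java.lang.management.ManagementFactory;
> import com.sun.management.UnixOperatingSystemMXBean;
>
> public class FdWatcher {
>     public static void main(String[] args) throws InterruptedException {
>         // HotSpot on Linux exposes descriptor counts via this MXBean.
>         UnixOperatingSystemMXBean os = (UnixOperatingSystemMXBean)
>                 ManagementFactory.getOperatingSystemMXBean();
>         while (true) {
>             System.out.printf("open FDs: %d / max %d%n",
>                     os.getOpenFileDescriptorCount(),
>                     os.getMaxFileDescriptorCount());
>             Thread.sleep(5000L);  // arbitrary poll interval
>         }
>     }
> }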
>
> I can send the full thread dumps if needed, but for brevity's sake I just
> copied over the relevant deadlock segments and concatenated them all
> together in the attached text file.
>
> Do you think this is something I should add to the KAFKA-3994 ticket?  Or
> is the information in that ticket enough for now?
>
> Thanks,
>
> Marcos
>
> On Fri, Nov 4, 2016 at 2:05 PM, Marcos Juarez <mjua...@gmail.com> wrote:
>
>> That's great, thanks Jason.
>>
>> We'll try to apply the patch in the meantime, and wait for the official
>> 0.10.1.1 release.
>>
>> Please let us know if you need more details about the deadlocks on our
>> side.
>>
>> Thanks again!
>>
>> Marcos
>>
>> On Fri, Nov 4, 2016 at 1:02 PM, Jason Gustafson <ja...@confluent.io>
>> wrote:
>>
>>> Hi Marcos,
>>>
>>> I think we'll try to get this into 0.10.1.1 (I updated the JIRA). Since
>>> we're now seeing users hit this in practice, I'll definitely bump up the
>>> priority on a fix. I can't say for sure when the release will be, but
>>> we'll
>>> merge the fix into the 0.10.1 branch and you can build from there if you
>>> need something in a hurry.
>>>
>>> Thanks,
>>> Jason
>>>
>>> On Fri, Nov 4, 2016 at 9:48 AM, Marcos Juarez <mjua...@gmail.com> wrote:
>>>
>>> > Jason,
>>> >
>>> > Thanks for that link.  It does appear to be a very similar issue, if
>>> > not identical.  In our case, the deadlock is reported across 3 threads,
>>> > one of them being a group_metadata_manager thread.  Otherwise, it looks
>>> > the same.
>>> >
>>> > On your questions:
>>> >
>>> > - We did not see this in prior releases, but we have been ramping up
>>> > usage of our Kafka clusters lately, so maybe we didn't have the volume
>>> > needed to trigger it before.
>>> >
>>> > - Across our multiple staging and production clusters, we're seeing the
>>> > problem roughly once or twice a day.
>>> >
>>> > - Our clusters are small at the moment.  The two that are experiencing
>>> > the issue are 5 and 8 brokers, respectively.  The number of consumers
>>> > is small; I'd say fewer than 20 at the moment.  The amount of data
>>> > being produced is also small, in the tens of megabytes per second
>>> > range, but the number of connects/disconnects is really high, because
>>> > the producers are usually short-lived processes.  Our guess at the
>>> > moment is that this is triggering the bug.  We have a separate cluster
>>> > where we don't have short-lived producers, and that one has been rock
>>> > solid.
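>>> >
>>> > To be concrete about the churn: each of those short-lived processes
>>> > does something roughly like the sketch below (a minimal illustration,
>>> > not our actual code; the broker address and topic name are
>>> > placeholders), so the broker sees a constant stream of new connections
>>> > and disconnects.
>>> >
>>> > import java.util.Properties;
>>> > import org.apache.kafka.clients.producer.KafkaProducer;
>>> > import org.apache.kafka.clients.producer.ProducerRecord;
>>> > import org.apache.kafka.common.serialization.StringSerializer;
>>> >
>>> > public class ShortLivedProducer {
>>> >     public static void main(String[] args) {
>>> >         Properties props = new Properties();
>>> >         props.put("bootstrap.servers", "broker1:9092");  // placeholder
>>> >         props.put("key.serializer", StringSerializer.class.getName());
>>> >         props.put("value.serializer", StringSerializer.class.getName());
>>> >         // One connection per process: open, send a few records, close.
>>> >         try (KafkaProducer<String, String> producer =
>>> >                  new KafkaProducer<>(props)) {
>>> >             producer.send(new ProducerRecord<>("events", "key", "value"));
>>> >             producer.flush();
>>> >         }  // close() here tears the connection down again
>>> >     }
>>> > }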
>>> >
>>> >
>>> > We'll look into applying the patch, which could help reduce the
>>> problem.
>>> > The ticket says it's being targeted for the 0.10.2 release.  Any rough
>>> > estimate of a timeline for that to come out?
>>> >
>>> > Thanks!
>>> >
>>> > Marcos
>>> >
>>> >
>>> > On Thu, Nov 3, 2016 at 5:19 PM, Jason Gustafson <ja...@confluent.io>
>>> > wrote:
>>> >
>>> > > Hey Marcos,
>>> > >
>>> > > Thanks for the report. Can you check out
>>> > > https://issues.apache.org/jira/browse/KAFKA-3994 and see if it
>>> matches?
>>> > At
>>> > > a glance, it looks like the same problem. We tried pretty hard to
>>> get the
>>> > > fix into the release, but it didn't quite make it. A few questions:
>>> > >
>>> > > 1. Did you not see this in prior releases? As far as we can tell, it
>>> is
>>> > > possible going back to 0.9.0.0, but there could be a subtle
>>> difference in
>>> > > 0.10.1.0 which increases its likelihood.
>>> > > 2. How often are you hitting this problem? You might try the patch
>>> that's
>>> > > available if it's occurring frequently and you can't downgrade. I
>>> think
>>> > the
>>> > > existing patch is incomplete, but it should still reduce the
>>> likelihood.
>>> > > 3. Out of curiosity, what is the size of your cluster and how many
>>> > > consumers do you have in your cluster?
>>> > >
>>> > > Thanks!
>>> > > Jason
>>> > >
>>> > > On Thu, Nov 3, 2016 at 1:32 PM, Marcos Juarez <mjua...@gmail.com>
>>> wrote:
>>> > >
>>> > > > Just to expand on Lawrence's answer: file descriptor usage goes
>>> > > > from 2-3K under normal conditions to 64K+ under deadlock, which it
>>> > > > hits within a couple of hours; at that point the broker goes down,
>>> > > > because that's our OS-defined limit.
>>> > > >
>>> > > > If it were only a 33% increase from the new timestamp indexes, we
>>> > > > should be seeing at most 4K-5K file descriptors in use, not 64K+.
>>> > > >
>>> > > > Marcos
>>> > > >
>>> > > >
>>> > > > On Thu, Nov 3, 2016 at 1:53 PM, Lawrence Weikum <
>>> lwei...@pandora.com>
>>> > > > wrote:
>>> > > >
>>> > > > > We saw this increase when upgrading from 0.9.0.1 to 0.10.0.1.
>>> > > > > We’re now running on 0.10.1.0, and the FD increase is due to a
>>> > > deadlock,
>>> > > > > not functionality or new features.
>>> > > > >
>>> > > > > Lawrence Weikum | Software Engineer | Pandora
>>> > > > > 1426 Pearl Street, Suite 100, Boulder CO 80302
>>> > > > > m 720.203.1578 | lwei...@pandora.com
>>> > > > >
>>> > > > > On 11/3/16, 12:42 PM, "Hans Jespersen" <h...@confluent.io>
>>> wrote:
>>> > > > >
>>> > > > >     The 0.10.1 broker will use more file descriptors than previous
>>> > > > >     releases because of the new timestamp indexes. You should
>>> > > > >     expect and plan for ~33% more file descriptors to be open.
>>> > > > >
>>> > > > >     -hans
>>> > > > >
>>> > > > >     /**
>>> > > > >      * Hans Jespersen, Principal Systems Engineer, Confluent Inc.
>>> > > > >      * h...@confluent.io (650)924-2670
>>> > > > >      */
>>> > > > >
>>> > > > >     On Thu, Nov 3, 2016 at 10:02 AM, Marcos Juarez <
>>> > mjua...@gmail.com>
>>> > > > > wrote:
>>> > > > >
>>> > > > >     > We're running into a recurrent deadlock issue in both our
>>> > > > >     > production and staging clusters, both using the latest
>>> > > > >     > 0.10.1 release.  The symptom we noticed was that, on servers
>>> > > > >     > where Kafka producer connections are short-lived, every other
>>> > > > >     > day or so we'd see file descriptors being exhausted until
>>> > > > >     > either the broker was restarted or it ran out of file
>>> > > > >     > descriptors and went down.  None of the clients are on 0.10.1
>>> > > > >     > Kafka jars; they're all using previous versions.
>>> > > > >     >
>>> > > > >     > When diagnosing the issue, we found that when the system is
>>> > > > >     > in that state, using up file descriptors at a really fast
>>> > > > >     > rate, the JVM is actually deadlocked.  We took thread dumps
>>> > > > >     > with both jstack and VisualVM, and attached them to this
>>> > > > >     > email.
>>> > > > >     >
>>> > > > >     > This is the interesting bit from the jstack thread dump:
>>> > > > >     >
>>> > > > >     >
>>> > > > >     > Found one Java-level deadlock:
>>> > > > >     > =============================
>>> > > > >     > "executor-Heartbeat":
>>> > > > >     >   waiting to lock monitor 0x00000000016c8138 (object
>>> > > > > 0x000000062732a398, a
>>> > > > >     > kafka.coordinator.GroupMetadata),
>>> > > > >     >   which is held by "group-metadata-manager-0"
>>> > > > >     >
>>> > > > >     > "group-metadata-manager-0":
>>> > > > >     >   waiting to lock monitor 0x00000000011ddaa8 (object
>>> > > > > 0x000000063f1b0cc0, a
>>> > > > >     > java.util.LinkedList),
>>> > > > >     >   which is held by "kafka-request-handler-3"
>>> > > > >     >
>>> > > > >     > "kafka-request-handler-3":
>>> > > > >     >   waiting to lock monitor 0x00000000016c8138 (object
>>> > > > > 0x000000062732a398, a
>>> > > > >     > kafka.coordinator.GroupMetadata),
>>> > > > >     >   which is held by "group-metadata-manager-0"
>>> > > > >     >
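>>> > > > >     > For anyone not used to reading these: group-metadata-manager-0
>>> > > > >     > holds the GroupMetadata monitor and wants the LinkedList
>>> > > > >     > monitor, while kafka-request-handler-3 holds the LinkedList
>>> > > > >     > monitor and wants the GroupMetadata one; executor-Heartbeat is
>>> > > > >     > just stuck behind the same GroupMetadata lock.  A stripped-down
>>> > > > >     > Java sketch of that lock-ordering pattern (illustrative only,
>>> > > > >     > not Kafka's actual code) would be:
>>> > > > >     >
>>> > > > >     > public class DeadlockSketch {
>>> > > > >     >     static final Object groupMetadata = new Object();
>>> > > > >     >     static final Object delayedOps = new Object(); // the list
>>> > > > >     >
>>> > > > >     >     public static void main(String[] args) {
>>> > > > >     >         new Thread(() -> {      // "group-metadata-manager-0"
>>> > > > >     >             synchronized (groupMetadata) {
>>> > > > >     >                 pause();
>>> > > > >     >                 synchronized (delayedOps) { }
>>> > > > >     >             }
>>> > > > >     >         }).start();
>>> > > > >     >         new Thread(() -> {      // "kafka-request-handler-3"
>>> > > > >     >             synchronized (delayedOps) {
>>> > > > >     >                 pause();
>>> > > > >     >                 synchronized (groupMetadata) { }
>>> > > > >     >             }
>>> > > > >     >         }).start();
>>> > > > >     >     }
>>> > > > >     >
>>> > > > >     >     static void pause() {       // widen the race window
>>> > > > >     >         try { Thread.sleep(100); } catch (InterruptedException e) { }
>>> > > > >     >     }
>>> > > > >     > }
>>> > > > >     >
>>> > > > >     > Run it and take a jstack, and you get the same "Found one
>>> > > > >     > Java-level deadlock" report, with the two monitors held in
>>> > > > >     > opposite order.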
>>> > > > >     >
>>> > > > >     > I also noticed that the background heartbeat thread (I'm
>>> > > > >     > guessing the one called "executor-Heartbeat" above) is new in
>>> > > > >     > this release, added under the KAFKA-3888 ticket -
>>> > > > >     > https://issues.apache.org/jira/browse/KAFKA-3888
>>> > > > >     >
>>> > > > >     > We haven't noticed this problem with earlier Kafka broker
>>> > > versions,
>>> > > > > so I'm
>>> > > > >     > guessing maybe this new background heartbeat thread is what
>>> > > > > introduced the
>>> > > > >     > deadlock problem.
>>> > > > >     >
>>> > > > >     > That same broker is still in the deadlock scenario; we
>>> > > > >     > haven't restarted it, so let me know if you'd like more
>>> > > > >     > info/logs/stats from the system before we restart it.
>>> > > > >     >
>>> > > > >     > Thanks,
>>> > > > >     >
>>> > > > >     > Marcos Juarez
>>> > > > >     >
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>
