Thanks Becket. We should get a full thread dump the next time, so I'll send it as soon as that happens.

Marcos
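For anyone else chasing this: since the broker exhausts its descriptors before a human can reach it, one option for capturing that dump is a small watchdog that polls the JVM for deadlocked threads and prints their stacks the moment a cycle appears. This is only a minimal sketch using the standard ThreadMXBean API; it assumes you can run it inside the broker JVM (e.g. via a javaagent) or adapt it to trigger an external jstack instead.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Minimal deadlock watchdog sketch: polls the JVM for monitor deadlocks
// and prints the stacks of the threads involved as soon as one appears.
public class DeadlockWatchdog {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        while (true) {
            long[] deadlocked = threads.findDeadlockedThreads(); // null if no deadlock
            if (deadlocked != null) {
                for (ThreadInfo info : threads.getThreadInfo(deadlocked, Integer.MAX_VALUE)) {
                    if (info != null) {
                        System.err.println(info); // thread name, held/wanted locks, stack
                    }
                }
                break;
            }
            Thread.sleep(5_000L); // poll every 5 seconds
        }
    }
}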
On Fri, Nov 11, 2016 at 11:27 AM, Becket Qin <becket....@gmail.com> wrote:

Hi Marcos,

Thanks for the update. It looks like the deadlock you saw was another one. Do you mind sending us a full stack trace after this happens?

Regarding the downgrade, the steps would be the following:
1. change the inter.broker.protocol to 0.10.0
2. rolling bounce the cluster
3. deploy the 0.10.0.1 code

There might be a bunch of .timeindex files left over, but that should be fine.

Thanks,

Jiangjie (Becket) Qin
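For step 1 above, the full broker property is inter.broker.protocol.version in server.properties; Becket's shorthand refers to the same setting. A minimal sketch of the change, assuming an otherwise unmodified broker config:

# server.properties (excerpt)
# Pin the inter-broker protocol back to 0.10.0 so downgraded 0.10.0.1
# brokers can keep talking to the rest of the cluster during the bounce.
inter.broker.protocol.version=0.10.0

With that in place, bounce one broker at a time (step 2), and only deploy the 0.10.0.1 code (step 3) once the whole cluster is speaking the old protocol.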
On Fri, Nov 11, 2016 at 9:51 AM, Marcos Juarez <mjua...@gmail.com> wrote:

Becket/Jason,

We deployed a jar with the base 0.10.1.0 release plus the KAFKA-3994 patch, but we're seeing the exact same issue. It doesn't seem like the patch fixes the problem we're seeing.

At this point, we're considering downgrading our prod clusters back to 0.10.0.1. Are there any concerns/issues we should be aware of when downgrading the cluster like that?

Thanks,

Marcos Juarez

On Mon, Nov 7, 2016 at 5:47 PM, Marcos Juarez <mjua...@gmail.com> wrote:

Thanks Becket.

I was working on that today. I have a working jar, created from the 0.10.1.0 branch with that specific KAFKA-3994 patch applied to it. I've left it running on one test broker today, and will try tomorrow to trigger the issue with both the patched and un-patched versions.

I'll let you know what we find.

Thanks,

Marcos

On Mon, Nov 7, 2016 at 11:25 AM, Becket Qin <becket....@gmail.com> wrote:

Hi Marcos,

Is it possible for you to apply the patch from KAFKA-3994 and see if the issue is still there? The current patch for KAFKA-3994 should work; the only reason we haven't checked it in is that when we ran a stress test, it showed a noticeable performance impact when producers are producing with acks=all. So if you are blocked on this issue, maybe you can pick up the patch as a short-term solution. Meanwhile, we will prioritize the ticket.

Thanks,

Jiangjie (Becket) Qin

On Mon, Nov 7, 2016 at 9:47 AM, Marcos Juarez <mjua...@gmail.com> wrote:

We ran into this issue several more times over the weekend. Basically, FDs are exhausted so fast now that we can't even get to the server in time; the JVM goes down in less than 5 minutes.

I can send the whole thread dumps if needed, but for brevity's sake, I just copied over the relevant deadlock segments and concatenated them all together in the attached text file.

Do you think this is something I should add to the KAFKA-3994 ticket? Or is the information in that ticket enough for now?

Thanks,

Marcos

On Fri, Nov 4, 2016 at 2:05 PM, Marcos Juarez <mjua...@gmail.com> wrote:

That's great, thanks Jason.

We'll try to apply the patch in the meantime, and wait for the official 0.10.1.1 release.

Please let us know if you need more details about the deadlocks on our side.

Thanks again!

Marcos

On Fri, Nov 4, 2016 at 1:02 PM, Jason Gustafson <ja...@confluent.io> wrote:

Hi Marcos,

I think we'll try to get this into 0.10.1.1 (I updated the JIRA). Since we're now seeing users hit this in practice, I'll definitely bump up the priority on a fix. I can't say for sure when the release will be, but we'll merge the fix into the 0.10.1 branch and you can build from there if you need something in a hurry.

Thanks,
Jason

On Fri, Nov 4, 2016 at 9:48 AM, Marcos Juarez <mjua...@gmail.com> wrote:

Jason,

Thanks for that link. It does appear to be a very similar issue, if not identical. In our case, the deadlock is reported across 3 threads, one of them being a group_metadata_manager thread. Otherwise, it looks the same.

On your questions:

- We did not see this in prior releases, but we have been ramping up usage of our kafka clusters lately, so maybe we didn't have the volume needed to trigger it before.

- Across our multiple staging and production clusters, we're seeing the problem roughly once or twice a day.

- Our clusters are small at the moment. The two that are experiencing the issue are 5 and 8 brokers, respectively. The number of consumers is small, I'd say less than 20 at the moment. The amount of data being produced is small also, in the tens of megabytes per second range, but the number of connects/disconnects is really high, because the producers are usually short-lived processes. Our guess at the moment is that this is what triggers the bug. We have a separate cluster where we don't have short-lived producers, and that one has been rock solid.

We'll look into applying the patch, which could help reduce the problem. The ticket says it's being targeted for the 0.10.2 release. Any rough estimate of a timeline for that to come out?

Thanks!

Marcos
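To make the connect/disconnect churn Marcos describes concrete: a short-lived producer process is created, sends a handful of records, and exits, so every run sets up and then tears down broker connections (and the broker-side file descriptors that go with them). A minimal sketch of that pattern; the bootstrap server address and topic name below are hypothetical:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch of a short-lived producer process: connect, send a few records, exit.
// Each run opens fresh broker connections and closes them again, which is
// the kind of churn described in the message above.
public class ShortLivedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key", "value")); // hypothetical topic
        } // close() flushes and tears the connections down again
    }
}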
On Thu, Nov 3, 2016 at 5:19 PM, Jason Gustafson <ja...@confluent.io> wrote:

Hey Marcos,

Thanks for the report. Can you check out https://issues.apache.org/jira/browse/KAFKA-3994 and see if it matches? At a glance, it looks like the same problem. We tried pretty hard to get the fix into the release, but it didn't quite make it. A few questions:

1. Did you not see this in prior releases? As far as we can tell, it is possible going back to 0.9.0.0, but there could be a subtle difference in 0.10.1.0 which increases its likelihood.
2. How often are you hitting this problem? You might try the patch that's available if it's occurring frequently and you can't downgrade. I think the existing patch is incomplete, but it should still reduce the likelihood.
3. Out of curiosity, what is the size of your cluster, and how many consumers do you have in your cluster?

Thanks!
Jason

On Thu, Nov 3, 2016 at 1:32 PM, Marcos Juarez <mjua...@gmail.com> wrote:

Just to expand on Lawrence's answer: the increase in file descriptor usage goes from 2-3K under normal conditions to 64K+ under deadlock, which it hits within a couple of hours, at which point the broker goes down, because that's our OS-defined limit.

If it was only a 33% increase from the new timestamp indexes, we should be going to at most 4K-5K file descriptors in use, not 64K+.

Marcos

On Thu, Nov 3, 2016 at 1:53 PM, Lawrence Weikum <lwei...@pandora.com> wrote:

We saw this increase when upgrading from 0.9.0.1 to 0.10.0.1. We’re now running on 0.10.1.0, and the FD increase is due to a deadlock, not functionality or new features.

Lawrence Weikum | Software Engineer | Pandora
1426 Pearl Street, Suite 100, Boulder CO 80302
m 720.203.1578 | lwei...@pandora.com

On 11/3/16, 12:42 PM, "Hans Jespersen" <h...@confluent.io> wrote:

The 0.10.1 broker will use more file descriptors than previous releases because of the new timestamp indexes. You should expect and plan for ~33% more file descriptors to be open.

-hans

/**
 * Hans Jespersen, Principal Systems Engineer, Confluent Inc.
 * h...@confluent.io (650)924-2670
 */
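A note on watching the descriptor growth Marcos and Lawrence describe: the JVM can report its own FD usage, so an in-process check can alert well before the OS limit is hit. This is a minimal sketch, assuming a HotSpot/OpenJDK broker on Linux; com.sun.management.UnixOperatingSystemMXBean is a HotSpot-specific API, and the check would need to run inside the broker JVM, or be adapted to count /proc/<pid>/fd entries externally.

import java.lang.management.ManagementFactory;
import com.sun.management.UnixOperatingSystemMXBean;

// Sketch: poll open file descriptor usage from inside the JVM and warn
// when it approaches the OS-defined limit (64K in the cluster above).
public class FdMonitor {
    public static void main(String[] args) throws InterruptedException {
        UnixOperatingSystemMXBean os =
            (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        while (true) {
            long open = os.getOpenFileDescriptorCount();
            long max = os.getMaxFileDescriptorCount();
            if (open > max * 0.8) { // warn at 80% of the limit
                System.err.println("FD usage high: " + open + "/" + max);
            }
            Thread.sleep(10_000L); // poll every 10 seconds
        }
    }
}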
On Thu, Nov 3, 2016 at 10:02 AM, Marcos Juarez <mjua...@gmail.com> wrote:

We're running into a recurrent deadlock issue in both our production and staging clusters, both using the latest 0.10.1 release. The symptom we noticed was that, on servers where kafka producer connections are short-lived, every other day or so we'd see file descriptors being exhausted, until the broker is restarted, or the broker runs out of file descriptors and goes down. None of the clients are on 0.10.1 kafka jars; they're all using previous versions.

When diagnosing the issue, we found that when the system is in that state, using up file descriptors at a really fast rate, the JVM is actually in a deadlock. We did a thread dump with both jstack and VisualVM, and attached those to this email.

This is the interesting bit from the jstack thread dump:

Found one Java-level deadlock:
=============================
"executor-Heartbeat":
  waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a kafka.coordinator.GroupMetadata),
  which is held by "group-metadata-manager-0"

"group-metadata-manager-0":
  waiting to lock monitor 0x00000000011ddaa8 (object 0x000000063f1b0cc0, a java.util.LinkedList),
  which is held by "kafka-request-handler-3"

"kafka-request-handler-3":
  waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a kafka.coordinator.GroupMetadata),
  which is held by "group-metadata-manager-0"

I also noticed the background heartbeat thread (I'm guessing the one called "executor-Heartbeat" above) is new for this release, added under the KAFKA-3888 ticket - https://issues.apache.org/jira/browse/KAFKA-3888

We haven't noticed this problem with earlier Kafka broker versions, so I'm guessing maybe this new background heartbeat thread is what introduced the deadlock problem.

That same broker is still in the deadlock scenario, and we haven't restarted it, so let me know if you'd like more info/logs/stats from the system before we restart it.

Thanks,

Marcos Juarez
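For readers skimming the jstack output above: the cycle is a textbook lock-ordering inversion. "group-metadata-manager-0" holds the GroupMetadata monitor and wants the LinkedList monitor, while "kafka-request-handler-3" holds the LinkedList monitor and wants the GroupMetadata monitor ("executor-Heartbeat" is queued behind the same GroupMetadata lock). Below is a stripped-down illustration of the same shape; the classes and thread names are stand-ins, not Kafka's actual code.

import java.util.LinkedList;

// Illustration of the lock cycle in the jstack output above: one thread
// locks group -> queue while the other locks queue -> group. The names
// stand in for kafka.coordinator.GroupMetadata and the java.util.LinkedList
// monitor; this is not Kafka's actual code.
public class LockOrderInversion {
    private static final Object group = new Object();        // stands in for GroupMetadata
    private static final LinkedList<Runnable> queue = new LinkedList<>();

    public static void main(String[] args) {
        Thread metadataManager = new Thread(() -> {
            synchronized (group) {        // holds the group's monitor...
                synchronized (queue) {    // ...and now wants the queue's monitor
                    queue.poll();
                }
            }
        }, "group-metadata-manager-0");

        Thread requestHandler = new Thread(() -> {
            synchronized (queue) {        // holds the queue's monitor...
                synchronized (group) {    // ...and now wants the group's monitor
                    queue.poll();
                }
            }
        }, "kafka-request-handler-3");

        metadataManager.start();
        requestHandler.start();           // with unlucky timing, both block forever
    }
}

The usual remedy is a single global lock-acquisition order, which is the kind of change being discussed in KAFKA-3994.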