Thanks Becket. We should get a full thread dump the next time, so I'll send it as soon as that happens.

Marcos
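For anyone else chasing this: since the broker exhausts its descriptors before a human can reach it, one option for capturing that dump is a small watchdog that polls the JVM for deadlocked threads and prints their stacks the moment a cycle appears. This is only a minimal sketch using the standard ThreadMXBean API; it assumes you can run it inside the broker JVM (e.g. via a javaagent) or adapt it to trigger an external jstack instead.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Minimal deadlock watchdog sketch: polls the JVM for monitor deadlocks
// and prints the stacks of the threads involved as soon as one appears.
public class DeadlockWatchdog {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        while (true) {
            long[] deadlocked = threads.findDeadlockedThreads(); // null if no deadlock
            if (deadlocked != null) {
                for (ThreadInfo info : threads.getThreadInfo(deadlocked, Integer.MAX_VALUE)) {
                    if (info != null) {
                        System.err.println(info); // thread name, held/wanted locks, stack
                    }
                }
                break;
            }
            Thread.sleep(5_000L); // poll every 5 seconds
        }
    }
}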
On Fri, Nov 11, 2016 at 11:27 AM, Becket Qin <becket....@gmail.com> wrote:

Hi Marcos,

Thanks for the update. It looks like the deadlock you saw was another one. Do you mind sending us a full stack trace after this happens?

Regarding the downgrade, the steps would be the following:
1. change the inter.broker.protocol to 0.10.0
2. rolling bounce the cluster
3. deploy the 0.10.0.1 code

There might be a bunch of .timeindex files left over, but that should be fine.

Thanks,

Jiangjie (Becket) Qin
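For step 1 above, the full broker property is inter.broker.protocol.version in server.properties; Becket's shorthand refers to the same setting. A minimal sketch of the change, assuming an otherwise unmodified broker config:

# server.properties (excerpt)
# Pin the inter-broker protocol back to 0.10.0 so downgraded 0.10.0.1
# brokers can keep talking to the rest of the cluster during the bounce.
inter.broker.protocol.version=0.10.0

With that in place, bounce one broker at a time (step 2), and only deploy the 0.10.0.1 code (step 3) once the whole cluster is speaking the old protocol.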
On Fri, Nov 11, 2016 at 9:51 AM, Marcos Juarez <mjua...@gmail.com> wrote:

Becket/Jason,

We deployed a jar with the base 0.10.1.0 release plus the KAFKA-3994 patch, but we're seeing the exact same issue. It doesn't seem like the patch fixes the problem we're seeing.

At this point, we're considering downgrading our prod clusters back to 0.10.0.1. Are there any concerns/issues we should be aware of when downgrading the cluster like that?

Thanks,

Marcos Juarez

On Mon, Nov 7, 2016 at 5:47 PM, Marcos Juarez <mjua...@gmail.com> wrote:

Thanks Becket.

I was working on that today. I have a working jar, created from the 0.10.1.0 branch with that specific KAFKA-3994 patch applied to it. I've left it running on one test broker today, and will try tomorrow to trigger the issue with both the patched and un-patched versions.

I'll let you know what we find.

Thanks,

Marcos

On Mon, Nov 7, 2016 at 11:25 AM, Becket Qin <becket....@gmail.com> wrote:

Hi Marcos,

Is it possible for you to apply the patch from KAFKA-3994 and see if the issue is still there? The current patch for KAFKA-3994 should work; the only reason we haven't checked it in is that when we ran a stress test, it showed a noticeable performance impact when producers are producing with acks=all. So if you are blocked on this issue, maybe you can pick up the patch as a short-term solution. Meanwhile, we will prioritize the ticket.

Thanks,

Jiangjie (Becket) Qin

On Mon, Nov 7, 2016 at 9:47 AM, Marcos Juarez <mjua...@gmail.com> wrote:

We ran into this issue several more times over the weekend. Basically, FDs are exhausted so fast now that we can't even get to the server in time; the JVM goes down in less than 5 minutes.

I can send the whole thread dumps if needed, but for brevity's sake, I just copied over the relevant deadlock segments and concatenated them all together in the attached text file.

Do you think this is something I should add to the KAFKA-3994 ticket? Or is the information in that ticket enough for now?

Thanks,

Marcos

On Fri, Nov 4, 2016 at 2:05 PM, Marcos Juarez <mjua...@gmail.com> wrote:

That's great, thanks Jason.

We'll try to apply the patch in the meantime, and wait for the official 0.10.1.1 release.

Please let us know if you need more details about the deadlocks on our side.

Thanks again!

Marcos

On Fri, Nov 4, 2016 at 1:02 PM, Jason Gustafson <ja...@confluent.io> wrote:

Hi Marcos,

I think we'll try to get this into 0.10.1.1 (I updated the JIRA). Since we're now seeing users hit this in practice, I'll definitely bump up the priority on a fix. I can't say for sure when the release will be, but we'll merge the fix into the 0.10.1 branch and you can build from there if you need something in a hurry.

Thanks,
Jason

On Fri, Nov 4, 2016 at 9:48 AM, Marcos Juarez <mjua...@gmail.com> wrote:

Jason,

Thanks for that link. It does appear to be a very similar issue, if not identical. In our case, the deadlock is reported across 3 threads, one of them being a group_metadata_manager thread. Otherwise, it looks the same.

On your questions:

- We did not see this in prior releases, but we have been ramping up usage of our kafka clusters lately, so maybe we didn't have the volume needed to trigger it before.

- Across our multiple staging and production clusters, we're seeing the problem roughly once or twice a day.

- Our clusters are small at the moment. The two that are experiencing the issue are 5 and 8 brokers, respectively. The number of consumers is small, I'd say less than 20 at the moment. The amount of data being produced is small also, in the tens of megabytes per second range, but the number of connects/disconnects is really high, because the producers are usually short-lived processes. Our guess at the moment is that this is what triggers the bug. We have a separate cluster where we don't have short-lived producers, and that one has been rock solid.

We'll look into applying the patch, which could help reduce the problem. The ticket says it's being targeted for the 0.10.2 release. Any rough estimate of a timeline for that to come out?

Thanks!

Marcos
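To make the connect/disconnect churn Marcos describes concrete: a short-lived producer process is created, sends a handful of records, and exits, so every run sets up and then tears down broker connections (and the broker-side file descriptors that go with them). A minimal sketch of that pattern; the bootstrap server address and topic name below are hypothetical:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch of a short-lived producer process: connect, send a few records, exit.
// Each run opens fresh broker connections and closes them again, which is
// the kind of churn described in the message above.
public class ShortLivedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key", "value")); // hypothetical topic
        } // close() flushes and tears the connections down again
    }
}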
On Thu, Nov 3, 2016 at 5:19 PM, Jason Gustafson <ja...@confluent.io> wrote:

Hey Marcos,

Thanks for the report. Can you check out https://issues.apache.org/jira/browse/KAFKA-3994 and see if it matches? At a glance, it looks like the same problem. We tried pretty hard to get the fix into the release, but it didn't quite make it. A few questions:

1. Did you not see this in prior releases? As far as we can tell, it is possible going back to 0.9.0.0, but there could be a subtle difference in 0.10.1.0 which increases its likelihood.
2. How often are you hitting this problem? You might try the patch that's available if it's occurring frequently and you can't downgrade. I think the existing patch is incomplete, but it should still reduce the likelihood.
3. Out of curiosity, what is the size of your cluster, and how many consumers do you have in your cluster?

Thanks!
Jason

On Thu, Nov 3, 2016 at 1:32 PM, Marcos Juarez <mjua...@gmail.com> wrote:

Just to expand on Lawrence's answer: the increase in file descriptor usage goes from 2-3K under normal conditions to 64K+ under deadlock, which it hits within a couple of hours, at which point the broker goes down, because that's our OS-defined limit.

If it was only a 33% increase from the new timestamp indexes, we should be going to at most 4K-5K file descriptors in use, not 64K+.

Marcos

On Thu, Nov 3, 2016 at 1:53 PM, Lawrence Weikum <lwei...@pandora.com> wrote:

We saw this increase when upgrading from 0.9.0.1 to 0.10.0.1. We’re now running on 0.10.1.0, and the FD increase is due to a deadlock, not functionality or new features.

Lawrence Weikum | Software Engineer | Pandora
1426 Pearl Street, Suite 100, Boulder CO 80302
m 720.203.1578 | lwei...@pandora.com

On 11/3/16, 12:42 PM, "Hans Jespersen" <h...@confluent.io> wrote:

The 0.10.1 broker will use more file descriptors than previous releases because of the new timestamp indexes. You should expect and plan for ~33% more file descriptors to be open.

-hans

/**
 * Hans Jespersen, Principal Systems Engineer, Confluent Inc.
 * h...@confluent.io (650)924-2670
 */
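A note on watching the descriptor growth Marcos and Lawrence describe: the JVM can report its own FD usage, so an in-process check can alert well before the OS limit is hit. This is a minimal sketch, assuming a HotSpot/OpenJDK broker on Linux; com.sun.management.UnixOperatingSystemMXBean is a HotSpot-specific API, and the check would need to run inside the broker JVM, or be adapted to count /proc/<pid>/fd entries externally.

import java.lang.management.ManagementFactory;
import com.sun.management.UnixOperatingSystemMXBean;

// Sketch: poll open file descriptor usage from inside the JVM and warn
// when it approaches the OS-defined limit (64K in the cluster above).
public class FdMonitor {
    public static void main(String[] args) throws InterruptedException {
        UnixOperatingSystemMXBean os =
            (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        while (true) {
            long open = os.getOpenFileDescriptorCount();
            long max = os.getMaxFileDescriptorCount();
            if (open > max * 0.8) { // warn at 80% of the limit
                System.err.println("FD usage high: " + open + "/" + max);
            }
            Thread.sleep(10_000L); // poll every 10 seconds
        }
    }
}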
On Thu, Nov 3, 2016 at 10:02 AM, Marcos Juarez <mjua...@gmail.com> wrote:

We're running into a recurrent deadlock issue in both our production and staging clusters, both using the latest 0.10.1 release. The symptom we noticed was that, on servers where kafka producer connections are short-lived, every other day or so we'd see file descriptors being exhausted, until the broker is restarted, or the broker runs out of file descriptors and goes down. None of the clients are on 0.10.1 kafka jars; they're all using previous versions.

When diagnosing the issue, we found that when the system is in that state, using up file descriptors at a really fast rate, the JVM is actually in a deadlock. We did a thread dump with both jstack and VisualVM, and attached those to this email.

This is the interesting bit from the jstack thread dump:

Found one Java-level deadlock:
=============================
"executor-Heartbeat":
  waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a kafka.coordinator.GroupMetadata),
  which is held by "group-metadata-manager-0"

"group-metadata-manager-0":
  waiting to lock monitor 0x00000000011ddaa8 (object 0x000000063f1b0cc0, a java.util.LinkedList),
  which is held by "kafka-request-handler-3"

"kafka-request-handler-3":
  waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a kafka.coordinator.GroupMetadata),
  which is held by "group-metadata-manager-0"

I also noticed the background heartbeat thread (I'm guessing the one called "executor-Heartbeat" above) is new for this release, added under the KAFKA-3888 ticket - https://issues.apache.org/jira/browse/KAFKA-3888

We haven't noticed this problem with earlier Kafka broker versions, so I'm guessing maybe this new background heartbeat thread is what introduced the deadlock problem.

That same broker is still in the deadlock scenario, and we haven't restarted it, so let me know if you'd like more info/logs/stats from the system before we restart it.

Thanks,

Marcos Juarez
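For readers skimming the jstack output above: the cycle is a textbook lock-ordering inversion. "group-metadata-manager-0" holds the GroupMetadata monitor and wants the LinkedList monitor, while "kafka-request-handler-3" holds the LinkedList monitor and wants the GroupMetadata monitor ("executor-Heartbeat" is queued behind the same GroupMetadata lock). Below is a stripped-down illustration of the same shape; the classes and thread names are stand-ins, not Kafka's actual code.

import java.util.LinkedList;

// Illustration of the lock cycle in the jstack output above: one thread
// locks group -> queue while the other locks queue -> group. The names
// stand in for kafka.coordinator.GroupMetadata and the java.util.LinkedList
// monitor; this is not Kafka's actual code.
public class LockOrderInversion {
    private static final Object group = new Object();        // stands in for GroupMetadata
    private static final LinkedList<Runnable> queue = new LinkedList<>();

    public static void main(String[] args) {
        Thread metadataManager = new Thread(() -> {
            synchronized (group) {        // holds the group's monitor...
                synchronized (queue) {    // ...and now wants the queue's monitor
                    queue.poll();
                }
            }
        }, "group-metadata-manager-0");

        Thread requestHandler = new Thread(() -> {
            synchronized (queue) {        // holds the queue's monitor...
                synchronized (group) {    // ...and now wants the group's monitor
                    queue.poll();
                }
            }
        }, "kafka-request-handler-3");

        metadataManager.start();
        requestHandler.start();           // with unlucky timing, both block forever
    }
}

The usual remedy is a single global lock-acquisition order, which is the kind of change being discussed in KAFKA-3994.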