Hi Marcos,

Thanks for the update. It looks like the deadlock you saw this time is a different one. Do you mind sending us a full stack trace the next time it happens?
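If the broker is going down before you can get to it with jstack, one option is a small watcher that polls the broker over JMX and dumps any deadlocked threads it finds automatically. This is only a rough sketch: the host/port are placeholders, and it assumes the broker was started with remote JMX enabled (e.g. by setting JMX_PORT).

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DeadlockWatcher {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint: point this at the broker's JMX port.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            while (true) {
                // findDeadlockedThreads() returns null until a deadlock cycle exists.
                long[] deadlocked = threads.findDeadlockedThreads();
                if (deadlocked != null) {
                    // Print the deadlocked threads, including which locks they hold and wait on.
                    for (ThreadInfo info : threads.getThreadInfo(deadlocked, Integer.MAX_VALUE)) {
                        System.out.println(info);
                    }
                    break;
                }
                Thread.sleep(10_000L); // poll every 10 seconds
            }
        } finally {
            connector.close();
        }
    }
}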
Regarding the downgrade, the steps would be the following:

1. Change inter.broker.protocol.version to 0.10.0.
2. Rolling bounce the cluster.
3. Deploy the 0.10.0.1 code.

There might be a bunch of .timeindex files left over, but that should be fine.

Thanks,

Jiangjie (Becket) Qin

On Fri, Nov 11, 2016 at 9:51 AM, Marcos Juarez <mjua...@gmail.com> wrote:

Becket/Jason,

We deployed a jar with the base 0.10.1.0 release plus the KAFKA-3994 patch, but we're seeing the exact same issue. It doesn't seem like the patch fixes the problem we're seeing.

At this point, we're considering downgrading our prod clusters back to 0.10.0.1. Are there any concerns/issues we should be aware of when downgrading the cluster like that?

Thanks,

Marcos Juarez

On Mon, Nov 7, 2016 at 5:47 PM, Marcos Juarez <mjua...@gmail.com> wrote:

Thanks Becket.

I was working on that today. I have a working jar, created from the 0.10.1.0 branch with that specific KAFKA-3994 patch applied. I've left it running on one test broker today, and tomorrow I'll try to trigger the issue with both the patched and un-patched versions.

I'll let you know what we find.

Thanks,

Marcos

On Mon, Nov 7, 2016 at 11:25 AM, Becket Qin <becket....@gmail.com> wrote:

Hi Marcos,

Is it possible for you to apply the patch from KAFKA-3994 and see if the issue is still there? The current patch should work; the only reason we haven't checked it in is that our stress tests showed a noticeable performance impact when producers are producing with acks=all. So if you are blocked on this issue, you could pick up the patch as a short-term solution. Meanwhile, we will prioritize the ticket.
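Just for context on the acks=all point: the performance hit only showed up for producers configured to wait on the full ISR, i.e. roughly like the sketch below (the bootstrap servers and topic are placeholders, not anything from our test setup).

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksAllProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092"); // placeholder brokers
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // leader waits for the full set of in-sync replicas
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        try {
            producer.send(new ProducerRecord<>("test-topic", "key", "value")); // placeholder topic
        } finally {
            producer.close();
        }
    }
}

With acks=all the leader has to wait for the full set of in-sync replicas before acknowledging each send, which is why that configuration is the one that showed the impact in our stress test.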
Thanks,

Jiangjie (Becket) Qin

On Mon, Nov 7, 2016 at 9:47 AM, Marcos Juarez <mjua...@gmail.com> wrote:

We ran into this issue several more times over the weekend. Basically, FDs are now exhausted so fast that we can't even get to the server in time; the JVM goes down in less than 5 minutes.

I can send the whole thread dumps if needed, but for brevity's sake, I just copied over the relevant deadlock segments and concatenated them all together in the attached text file.

Do you think this is something I should add to the KAFKA-3994 ticket? Or is the information in that ticket enough for now?

Thanks,

Marcos

On Fri, Nov 4, 2016 at 2:05 PM, Marcos Juarez <mjua...@gmail.com> wrote:

That's great, thanks Jason.

We'll try to apply the patch in the meantime, and wait for the official 0.10.1.1 release.

Please let us know if you need more details about the deadlocks on our side.

Thanks again!

Marcos

On Fri, Nov 4, 2016 at 1:02 PM, Jason Gustafson <ja...@confluent.io> wrote:

Hi Marcos,

I think we'll try to get this into 0.10.1.1 (I updated the JIRA). Since we're now seeing users hit this in practice, I'll definitely bump up the priority on a fix. I can't say for sure when the release will be, but we'll merge the fix into the 0.10.1 branch and you can build from there if you need something in a hurry.

Thanks,
Jason

On Fri, Nov 4, 2016 at 9:48 AM, Marcos Juarez <mjua...@gmail.com> wrote:

Jason,

Thanks for that link. It does appear to be a very similar issue, if not identical. In our case, the deadlock is reported across three threads, one of them being a group_metadata_manager thread. Otherwise, it looks the same.

On your questions:

- We did not see this in prior releases, but we have been ramping up usage of our Kafka clusters lately, so maybe we didn't have the volume needed to trigger it before.

- Across our multiple staging and production clusters, we're seeing the problem roughly once or twice a day.

- Our clusters are small at the moment. The two experiencing the issue are 5 and 8 brokers, respectively. The number of consumers is also small, I'd say fewer than 20 at the moment. The amount of data being produced is small as well, in the tens of megabytes per second range, but the number of connects/disconnects is really high, because the producers are usually short-lived processes. Our guess at the moment is that this is what triggers the bug. We have a separate cluster with no short-lived producers, and that one has been rock solid.
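For what it's worth, each of those short-lived jobs boils down to something like the sketch below on every run (simplified; the broker address and topic name are made up), so the broker sees a constant churn of fresh connections and disconnects rather than a steady set of long-lived clients.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ShortLivedJob {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // made-up broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // The whole process lives only long enough to push a handful of records,
        // then closes the producer and exits.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                producer.send(new ProducerRecord<>("events", "record-" + i)); // made-up topic
            }
        }
    }
}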
We'll look into applying the patch, which could help reduce the problem. The ticket says it's being targeted for the 0.10.2 release. Any rough estimate of a timeline for that to come out?

Thanks!

Marcos

On Thu, Nov 3, 2016 at 5:19 PM, Jason Gustafson <ja...@confluent.io> wrote:

Hey Marcos,

Thanks for the report. Can you check out https://issues.apache.org/jira/browse/KAFKA-3994 and see if it matches? At a glance, it looks like the same problem. We tried pretty hard to get the fix into the release, but it didn't quite make it. A few questions:

1. Did you not see this in prior releases? As far as we can tell, it is possible going back to 0.9.0.0, but there could be a subtle difference in 0.10.1.0 which increases its likelihood.
2. How often are you hitting this problem? You might try the patch that's available if it's occurring frequently and you can't downgrade. I think the existing patch is incomplete, but it should still reduce the likelihood.
3. Out of curiosity, what is the size of your cluster and how many consumers do you have in your cluster?

Thanks!
Jason

On Thu, Nov 3, 2016 at 1:32 PM, Marcos Juarez <mjua...@gmail.com> wrote:

Just to expand on Lawrence's answer: file descriptor usage goes from 2-3K under normal conditions to 64K+ under deadlock, which it reaches within a couple of hours, at which point the broker goes down, because that's our OS-defined limit.

If it were only a 33% increase from the new timestamp indexes, we should be going to at most 4K-5K file descriptors in use, not 64K+.

Marcos

On Thu, Nov 3, 2016 at 1:53 PM, Lawrence Weikum <lwei...@pandora.com> wrote:

We saw this increase when upgrading from 0.9.0.1 to 0.10.0.1. We're now running on 0.10.1.0, and the FD increase is due to a deadlock, not functionality or new features.

Lawrence Weikum | Software Engineer | Pandora
1426 Pearl Street, Suite 100, Boulder CO 80302
m 720.203.1578 | lwei...@pandora.com

On 11/3/16, 12:42 PM, "Hans Jespersen" <h...@confluent.io> wrote:

The 0.10.1 broker will use more file descriptors than previous releases because of the new timestamp indexes. You should expect and plan for ~33% more file descriptors to be open.
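If you want to watch the real number rather than estimate it, the broker JVM exposes the live count through the standard OperatingSystem MXBean. Here is a rough polling sketch over remote JMX; the host/port are placeholders, and it assumes remote JMX is enabled on the broker (e.g. by setting JMX_PORT).

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class FdWatcher {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint: point this at the broker's JMX port.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ObjectName os = new ObjectName("java.lang:type=OperatingSystem");
            while (true) {
                // On HotSpot/Unix these attributes report the broker process's
                // open file descriptors and its OS-imposed limit.
                long open = (Long) conn.getAttribute(os, "OpenFileDescriptorCount");
                long max = (Long) conn.getAttribute(os, "MaxFileDescriptorCount");
                System.out.printf("open=%d max=%d (%.1f%%)%n", open, max, 100.0 * open / max);
                Thread.sleep(30_000L); // poll every 30 seconds
            }
        } finally {
            connector.close();
        }
    }
}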
-hans

/**
 * Hans Jespersen, Principal Systems Engineer, Confluent Inc.
 * h...@confluent.io (650)924-2670
 */

On Thu, Nov 3, 2016 at 10:02 AM, Marcos Juarez <mjua...@gmail.com> wrote:

We're running into a recurrent deadlock issue in both our production and staging clusters, both using the latest 0.10.1 release. The symptom we noticed was that, on servers where Kafka producer connections are short-lived, every other day or so file descriptors would be exhausted until the broker was restarted, or until it ran out of file descriptors and went down. None of the clients are on 0.10.1 Kafka jars; they're all using previous versions.

When diagnosing the issue, we found that when the system is in that state, using up file descriptors at a really fast rate, the JVM is actually in a deadlock. We took thread dumps from both jstack and VisualVM, and attached those to this email.

This is the interesting bit from the jstack thread dump:

Found one Java-level deadlock:
=============================
"executor-Heartbeat":
  waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a kafka.coordinator.GroupMetadata),
  which is held by "group-metadata-manager-0"

"group-metadata-manager-0":
  waiting to lock monitor 0x00000000011ddaa8 (object 0x000000063f1b0cc0, a java.util.LinkedList),
  which is held by "kafka-request-handler-3"

"kafka-request-handler-3":
  waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a kafka.coordinator.GroupMetadata),
  which is held by "group-metadata-manager-0"

I also noticed that the background heartbeat thread (I'm guessing the one called "executor-Heartbeat" above) is new for this release, added under the KAFKA-3888 ticket - https://issues.apache.org/jira/browse/KAFKA-3888

We haven't noticed this problem with earlier Kafka broker versions, so I'm guessing this new background heartbeat thread may be what introduced the deadlock problem.

That same broker is still in the deadlock scenario, and we haven't restarted it, so let me know if you'd like more info/logs/stats from the system before we restart it.

Thanks,

Marcos Juarez
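One more note on the dump quoted above: the core of the cycle is group-metadata-manager-0 and kafka-request-handler-3 taking the GroupMetadata monitor and the LinkedList monitor in opposite orders, with executor-Heartbeat simply queued up behind the same GroupMetadata lock. Stripped of the Kafka specifics, the shape of the problem is the classic lock-ordering inversion sketched below (toy code only, not the actual broker implementation).

// Toy illustration only -- not Kafka's code. Two threads take the same two
// locks in opposite orders, which is the cycle the jstack output shows.
public class LockOrderDemo {
    private static final Object groupLock = new Object();    // stands in for the GroupMetadata monitor
    private static final Object watchersLock = new Object(); // stands in for the LinkedList monitor

    public static void main(String[] args) {
        new Thread(() -> {
            synchronized (groupLock) {           // thread A: group lock first...
                sleep(100);
                synchronized (watchersLock) { }  // ...then the list lock
            }
        }, "request-handler").start();

        new Thread(() -> {
            synchronized (watchersLock) {        // thread B: list lock first...
                sleep(100);
                synchronized (groupLock) { }     // ...then the group lock -> deadlock
            }
        }, "metadata-manager").start();
        // The usual fixes are a single consistent acquisition order, or never
        // calling into code that needs the second lock while holding the first.
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}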