Thanks for the reply, Guozhang! But I think we are talking about two different issues here. KAFKA-5167 is about a LockException. We face that issue intermittently, but not very often.
There is also another issue, where a particular broker is marked as dead for a group id and the Streams process never recovers from it.

On Mon, May 15, 2017 at 11:28 PM, Guozhang Wang <wangg...@gmail.com> wrote:

> I'm wondering if it is possibly due to KAFKA-5167? In that case, the "other
> thread" will keep retrying on grabbing the lock.
>
> Guozhang
>
> On Sat, May 13, 2017 at 7:30 PM, Mahendra Kariya <mahendra.kar...@go-jek.com> wrote:
>
> > Hi,
> >
> > There is no missing data. But the INFO level logs are endless and the
> > streams app practically stops. For the messages that I posted, we got
> > these INFO logs for around 20 mins. After that we got an alert about no
> > data being produced in the sink topic and we had to restart the streams
> > processes.
> >
> > On Sun, May 14, 2017 at 1:01 AM, Matthias J. Sax <matth...@confluent.io> wrote:
> >
> > > Hi,
> > >
> > > I just dug a little bit. The messages are logged at INFO level and thus
> > > should not be a problem if they go away by themselves after some time.
> > > Compare:
> > > https://groups.google.com/forum/#!topic/confluent-platform/A14dkPlDlv4
> > >
> > > Do you still see missing data?
> > >
> > > -Matthias
> > >
> > > On 5/11/17 2:39 AM, Mahendra Kariya wrote:
> > > > Hi Matthias,
> > > >
> > > > We faced the issue again. The logs are below.
> > > >
> > > > 16:13:16.527 [StreamThread-7] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > > > 16:13:16.543 [StreamThread-3] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > > > 16:13:16.543 [StreamThread-3] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > > > 16:13:16.547 [StreamThread-6] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > > > 16:13:16.547 [StreamThread-6] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > > > 16:13:16.551 [StreamThread-1] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > > > 16:13:16.551 [StreamThread-1] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > > > 16:13:16.572 [StreamThread-4] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > > > 16:13:16.572 [StreamThread-4] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > > > 16:13:16.573 [StreamThread-2] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > > >
> > > > On Tue, May 9, 2017 at 3:40 AM, Matthias J. Sax <matth...@confluent.io> wrote:
> > > >
> > > > > Great! Glad 0.10.2.1 fixes it for you!
> > > > >
> > > > > -Matthias
> > > > >
> > > > > On 5/7/17 8:57 PM, Mahendra Kariya wrote:
> > > > > > Upgrading to 0.10.2.1 seems to have fixed the issue.
> > > > > >
> > > > > > Until now, we were looking at a random hour of data to analyse
> > > > > > the issue. Over the weekend, we have written a simple test that
> > > > > > will continuously check for inconsistencies in real time and
> > > > > > report if there is any issue.
> > > > > >
> > > > > > No issues have been reported for the last 24 hours. Will update
> > > > > > this thread if we find any issue.
> > > > > >
> > > > > > Thanks for all the support!
> > > > > >
> > > > > > On Fri, May 5, 2017 at 3:55 AM, Matthias J. Sax <matth...@confluent.io> wrote:
> > > > > >
> > > > > > > About
> > > > > > >
> > > > > > > > 07:44:08.493 [StreamThread-10] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 for group group-2.
> > > > > > >
> > > > > > > Please upgrade to Streams 0.10.2.1 -- we fixed a couple of bugs
> > > > > > > and I would assume this issue is fixed, too. If not, please
> > > > > > > report back.
> > > > > > >
> > > > > > > > Another question that I have is, is there a way for us to
> > > > > > > > detect how many messages have come out of order? And if
> > > > > > > > possible, what is the delay?
> > > > > > >
> > > > > > > There is no metric or API for this. What you could do, though,
> > > > > > > is use #transform() to forward each record unchanged and, as a
> > > > > > > side task, extract the timestamp via `context#timestamp()` and
> > > > > > > do some bookkeeping to compute whether a record is out of order
> > > > > > > and what the delay was.
> > > > > > >
> > > > > > > > > > - same for .mapValues()
> > > > > > > >
> > > > > > > > I am not sure how to check this.
> > > > > > >
> > > > > > > The same way as you do for filter()?
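[For anyone landing on this thread later: the bookkeeping Matthias describes can be kept separate from the Kafka API. Below is a minimal, dependency-free sketch; the class and method names are illustrative, not part of Kafka. Inside your Transformer you would call record() with the value of context.timestamp() and forward the record unchanged.]

```java
// Tracks how many records arrive with a timestamp smaller than the largest
// timestamp seen so far (i.e. out of order), and the maximum observed delay.
class OutOfOrderTracker {
    private long maxTimestampSeen = Long.MIN_VALUE;
    private long outOfOrderCount = 0;
    private long maxDelayMs = 0;

    /** Returns true if the given record timestamp is out of order. */
    boolean record(long timestampMs) {
        if (timestampMs < maxTimestampSeen) {
            outOfOrderCount++;
            maxDelayMs = Math.max(maxDelayMs, maxTimestampSeen - timestampMs);
            return true;
        }
        maxTimestampSeen = timestampMs;
        return false;
    }

    long outOfOrderCount() { return outOfOrderCount; }

    long maxDelayMs() { return maxDelayMs; }
}
```

[Note this is per-thread bookkeeping: each StreamThread runs its own Transformer instances, so you would aggregate the counters (or expose them as metrics) yourself.]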
> > > > > > >
> > > > > > > -Matthias
> > > > > > >
> > > > > > > On 5/4/17 10:29 AM, Mahendra Kariya wrote:
> > > > > > > > Hi Matthias,
> > > > > > > >
> > > > > > > > Please find the answers below.
> > > > > > > >
> > > > > > > > > I would recommend to double check the following:
> > > > > > > > >
> > > > > > > > > - can you confirm that the filter does not remove all data
> > > > > > > > > for those time periods?
> > > > > > > >
> > > > > > > > The filter does not remove all data. There is a lot of data
> > > > > > > > coming in even after the filter stage.
> > > > > > > >
> > > > > > > > > - I would also check the input for your AggregatorFunction()
> > > > > > > > > -- does it receive everything?
> > > > > > > >
> > > > > > > > Yes. The aggregate function seems to be receiving everything.
> > > > > > > >
> > > > > > > > > - same for .mapValues()
> > > > > > > >
> > > > > > > > I am not sure how to check this.
>
> --
> -- Guozhang
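[On the "how to check that .mapValues() receives everything" question: the same counting trick works for any stage. Wrap the function you pass to mapValues() (or filter()) so it counts every record it sees, then compare the counters across stages; a stage that drops data shows a smaller count. A minimal, dependency-free sketch; the names are illustrative:]

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Wraps any single-argument function so that it counts how many records
// actually reach it, without changing its behavior.
class CountingWrapper {
    static <V, R> Function<V, R> counted(Function<V, R> delegate, AtomicLong counter) {
        return value -> {
            counter.incrementAndGet(); // side effect: count every record seen by this stage
            return delegate.apply(value);
        };
    }
}
```

[In a topology you would keep one AtomicLong per stage and periodically log or export the counts; equal counts before and after a stage mean the stage receives everything.]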