Hi Gyula,

I have observed a similar issue with FlinkKafkaConsumer09 and FlinkKafkaConsumer010 and posted it to the mailing list as well. The issue does not occur consistently, but whenever it happens, checkpoints either fail or take a long time to complete.
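
To make Stephan's lock hypothesis (quoted below) concrete, here is a minimal sketch in plain Java. It is not Flink code: the class name, the method name and the `checkpointLock` object are all made up for illustration, and the `Thread.sleep` only stands in for an emit call that blocks under backpressure. It shows how a snapshot that is fast in itself can appear to take as long as the emitting thread holds the shared lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class CheckpointLockSketch {

    /**
     * Returns how long (in ms) a simulated checkpoint thread had to wait for a
     * lock that a simulated source thread held for roughly holdMillis.
     * All names here are hypothetical; this is an illustration, not Flink's API.
     */
    static long measureCheckpointWaitMillis(long holdMillis) {
        final Object checkpointLock = new Object(); // stand-in for the lock shared by emitting and snapshotting
        final CountDownLatch lockTaken = new CountDownLatch(1);

        Thread source = new Thread(() -> {
            synchronized (checkpointLock) {
                lockTaken.countDown();
                try {
                    // Stand-in for an emit call that blocks because downstream is backpressured.
                    Thread.sleep(holdMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "source-thread");
        source.start();

        try {
            lockTaken.await(); // make sure the source thread already holds the lock
            long start = System.nanoTime();
            synchronized (checkpointLock) {
                // The actual snapshot (e.g. copying a few offsets) would happen here and is fast.
            }
            long waitedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            source.join();
            return waitedMillis;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        }
    }

    public static void main(String[] args) {
        long waited = measureCheckpointWaitMillis(500);
        System.out.println("checkpoint waited ~" + waited + " ms for the lock");
    }
}
```

If something like this is what is happening, logging timestamps just before and just after the lock is acquired should show almost all of the checkpoint time being spent waiting for the lock rather than in the snapshot itself.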

Regards,
Vinay Patil

On Wed, Jul 12, 2017 at 7:00 PM, Gyula Fóra [via Apache Flink User Mailing List archive.] <ml+s2336050n14210...@n4.nabble.com> wrote:

> I have added logging that will help determine this as well; next time this
> happens I will post the results. (Although there doesn't seem to be high
> backpressure.)
>
> Thanks for the tips,
> Gyula
>
> Stephan Ewen <[hidden email]> wrote (on Wed, Jul 12, 2017, 15:27):
>
>> Can it be that the checkpoint thread is waiting to grab the lock, which
>> is held by the chain under backpressure?
>>
>> On Wed, Jul 12, 2017 at 12:23 PM, Gyula Fóra <[hidden email]> wrote:
>>
>>> Yes, that's definitely what I am about to do next, but I just thought
>>> maybe someone had seen this before.
>>>
>>> Will post info next time it happens. (Not guaranteed to happen soon, as
>>> it didn't happen for a long time before.)
>>>
>>> Gyula
>>>
>>> On Wed, Jul 12, 2017, 12:13 Stefan Richter <[hidden email]> wrote:
>>>
>>>> Hi,
>>>>
>>>> could you introduce some logging to figure out from which method call
>>>> the delay is introduced?
>>>>
>>>> Best,
>>>> Stefan
>>>>
>>>> On 12.07.2017 at 11:37, Gyula Fóra <[hidden email]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> We are using the latest 1.3.1
>>>>
>>>> Gyula
>>>>
>>>> Urs Schoenenberger <[hidden email]> wrote (on Wed, Jul 12, 2017, 10:44):
>>>>
>>>>> Hi Gyula,
>>>>>
>>>>> I don't know the cause unfortunately, but we observed a similar issue
>>>>> on Flink 1.1.3. The problem seems to be gone after upgrading to 1.2.1.
>>>>> Which version are you running on?
>>>>>
>>>>> Urs
>>>>>
>>>>> On 12.07.2017 09:48, Gyula Fóra wrote:
>>>>> > Hi,
>>>>> >
>>>>> > I have noticed a strange behavior in one of our jobs: every once in a
>>>>> > while the Kafka source checkpointing time becomes extremely large
>>>>> > compared to what it usually is. (To be very specific, it is a Kafka
>>>>> > source chained with a stateless map operator.)
>>>>> >
>>>>> > To be more specific, checkpointing the offsets usually takes around
>>>>> > 10ms, which sounds reasonable, but in some checkpoints this goes into
>>>>> > the 3-5 minute range, practically blocking the job for that period of
>>>>> > time. Yesterday I observed even 10 minute delays. First I thought that
>>>>> > some sources might trigger checkpoints later than others, but after
>>>>> > adding some logging and comparing, it seems that the triggerCheckpoint
>>>>> > was received at the same time.
>>>>> >
>>>>> > Interestingly, only one of the 3 Kafka sources in the job seems to be
>>>>> > affected (last time I checked at least). We are still using the 0.8
>>>>> > consumer with commit on checkpoints. Also, I don't see this happen in
>>>>> > other jobs.
>>>>> >
>>>>> > Any clue on what might cause this?
>>>>> >
>>>>> > Thanks :)
>>>>> > Gyula
>>>>>
>>>>> --
>>>>> Urs Schönenberger - [hidden email]
>>>>>
>>>>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>>>>> Managing Directors: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
>>>>> Registered office: Unterföhring * Amtsgericht München * HRB 135082

--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Why-would-a-kafka-source-checkpoint-take-so-long-tp14193p14232.html
Sent from the Apache Flink User Mailing List archive at Nabble.com.