Re: Data loss in Flink Kafka Pipeline

2017-12-08 Thread Nishu
Hi Fabian, I actually found a JIRA ticket for a similar issue: https://issues.apache.org/jira/browse/BEAM-3225. It is much like what I am facing too. I have 4 Kafka topics as input sources. Those are read using a GlobalWindow and processingTime triggers, and further joined based on a common
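
A minimal sketch of one such input in the Beam Java SDK, for illustration only; the broker address, topic name, element types, and trigger delay are assumptions, not taken from the thread (imports from org.apache.beam.sdk.* and the Kafka string serializers omitted):

    // One of the four Kafka inputs, windowed as described above.
    PCollection<KV<String, String>> topicA = pipeline
        .apply(KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092")        // placeholder broker
            .withTopic("topic-a")                      // placeholder topic name
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())
        .apply(Window.<KV<String, String>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(30))))
            .withAllowedLateness(Duration.ZERO)
            .accumulatingFiredPanes());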

Re: Data loss in Flink Kafka Pipeline

2017-12-08 Thread Aljoscha Krettek
Hi, Could you maybe post your pipeline code? That way I could have a look. Best, Aljoscha > On 8. Dec 2017, at 12:31, Fabian Hueske wrote: > > Hmm, I see... > I'm running out of ideas. > > You might be right with your assumption about a bug in the Beam Flink runner. > In

Re: Data loss in Flink Kafka Pipeline

2017-12-08 Thread Fabian Hueske
Hmm, I see... I'm running out of ideas. You might be right with your assumption about a bug in the Beam Flink runner. In that case, this would be an issue for the Beam project, which hosts the Flink runner. But it might also be an issue on the Flink side. Maybe Aljoscha (in CC), one of the

Re: Data loss in Flink Kafka Pipeline

2017-12-07 Thread Fabian Hueske
Hi Nishu, the data loss might be caused by the fact that processing-time triggers do not fire when the program terminates. So, if your program has records stored in a window and the program terminates because the input was fully consumed, the window operator won't process the remaining windows but
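
To make that failure mode concrete: a processing-time trigger of the kind discussed here (a sketch; the delay value is assumed) only fires some delay after an element arrives, so whatever arrives after the last firing stays buffered:

    .triggering(Repeatedly.forever(
        AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1))))
    // Elements arriving after the most recent firing sit in window state
    // until the next processing-time firing; if the job terminates first,
    // that final pane is never emitted downstream.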

Re: Data loss in Flink Kafka Pipeline

2017-12-07 Thread Nishu
Hi Fabian, the program runs until I manually stop it. The trigger is also firing as expected, because I read the entire data after the trigger fires to see what is captured, and pass that data over to GroupByKey as input. It's using a Global window, so I accumulate the entire data each time the

Re: Data loss in Flink Kafka Pipeline

2017-12-07 Thread Nishu
Hi, thanks for your inputs. I am reading Kafka topics in Global windows and have defined some ProcessingTime triggers, hence there are no late records. The program performs a join between multiple Kafka topics. The transformation sequence is something like: 1. Read
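
A hedged sketch of such a multi-topic join using Beam's CoGroupByKey; the tags, key and value types, and the keyed inputs are assumptions, not from the thread:

    final TupleTag<String> topicATag = new TupleTag<String>() {};
    final TupleTag<String> topicBTag = new TupleTag<String>() {};

    // keyedTopicA / keyedTopicB: assumed PCollection<KV<String, String>> inputs
    PCollection<KV<String, CoGbkResult>> joined =
        KeyedPCollectionTuple.of(topicATag, keyedTopicA)
            .and(topicBTag, keyedTopicB)
            .apply(CoGroupByKey.<String>create());

    joined.apply(ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Iterable<String> fromA = c.element().getValue().getAll(topicATag);
        Iterable<String> fromB = c.element().getValue().getAll(topicBTag);
        // combine the two sides of the join here
      }
    }));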

Re: Data loss in Flink Kafka Pipeline

2017-12-07 Thread Chen Qin
Nishu, you might consider a side output with metrics, at least after the window. I would suggest having that in all Flink jobs to catch data skew or partition skew, and to amend if needed. Chen On Thu, Dec 7, 2017 at 9:48 AM Fabian Hueske wrote: > Is it possible that the data is
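
A sketch of that idea with Beam's side outputs and Metrics API; the threshold, names, and element type are assumptions. Hot keys are routed to a side output for inspection, and a counter tracks everything leaving the window:

    final TupleTag<KV<String, Long>> mainTag = new TupleTag<KV<String, Long>>() {};
    final TupleTag<KV<String, Long>> skewTag = new TupleTag<KV<String, Long>>() {};

    // countsAfterWindow: assumed per-key counts produced after the window
    PCollectionTuple tagged = countsAfterWindow.apply(ParDo
        .of(new DoFn<KV<String, Long>, KV<String, Long>>() {
          private final Counter seen = Metrics.counter("debug", "after-window");

          @ProcessElement
          public void processElement(ProcessContext c) {
            seen.inc();                                 // visible in the runner's metrics
            if (c.element().getValue() > 100_000L) {    // assumed skew threshold
              c.output(skewTag, c.element());           // inspect hot keys separately
            } else {
              c.output(c.element());
            }
          }
        })
        .withOutputTags(mainTag, TupleTagList.of(skewTag)));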

Re: Data loss in Flink Kafka Pipeline

2017-12-07 Thread Fabian Hueske
Is it possible that the data is dropped due to being late, i.e., records with timestamps behind the current watermark? What kind of operations does your program consist of? Best, Fabian 2017-12-07 10:20 GMT+01:00 Sendoh : > I would recommend also printing the input
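
One hedged way to check for that in a Beam pipeline: allow some lateness on the windowing (.withAllowedLateness(...)) so late records are not silently dropped, then count late panes downstream. In the sketch below, windowed is an assumed grouped stream with such a windowing, and the names are placeholders:

    windowed.apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, Void>() {
      private final Counter latePanes = Metrics.counter("debug", "late-panes");

      @ProcessElement
      public void processElement(ProcessContext c) {
        if (c.pane().getTiming() == PaneInfo.Timing.LATE) {
          latePanes.inc();   // this pane fired for data behind the watermark
        }
      }
    }));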

Re: Data loss in Flink Kafka Pipeline

2017-12-07 Thread Sendoh
I would recommend also printing the input and output counts of each operator using an Accumulator. Cheers, Sendoh -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
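
In a plain Flink job, that would look roughly like this minimal sketch (the stream and names are assumptions; the Beam pipeline in this thread would use Beam's Metrics API instead):

    DataStream<String> counted = stream.map(new RichMapFunction<String, String>() {
      private final IntCounter inputCount = new IntCounter();

      @Override
      public void open(Configuration parameters) {
        getRuntimeContext().addAccumulator("input-count", inputCount);
      }

      @Override
      public String map(String value) {
        inputCount.add(1);   // final value appears in the job's accumulator results
        return value;
      }
    });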

Data loss in Flink Kafka Pipeline

2017-12-07 Thread Nishu
Hi, I am running a streaming pipeline (written with the Beam framework) on Flink. The *operator sequence* is: read the JSON data, parse the JSON string into an object, and group the objects based on a common key. I noticed that the GroupByKey operator throws away some data in between, and hence I don't get all
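
A hypothetical skeleton of that operator sequence in the Beam Java SDK; the Event type, parseJson helper, topic, broker, key field, and trigger delay are placeholders, not from the thread:

    pipeline
        .apply("ReadKafka", KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092")
            .withTopic("events")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())
        .apply("ParseJson", ParDo.of(new DoFn<KV<String, String>, KV<String, Event>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            Event e = parseJson(c.element().getValue());   // hypothetical parser
            c.output(KV.of(e.getKey(), e));
          }
        }))
        .apply("Windowing", Window.<KV<String, Event>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(30))))
            .withAllowedLateness(Duration.ZERO)
            .accumulatingFiredPanes())
        .apply("GroupByKey", GroupByKey.<String, Event>create());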