Re: Data loss when restoring from savepoint

2019-02-14 Thread Konstantin Knauf
Hi Juho, * does the output of the streaming job contain any data, which is not >> contained in the batch > > > No. > > * do you know if all lost records are contained in the last savepoint you >> took before the window fired? This would mean that no records are lost >> after the last restore. >

Re: Data loss when restoring from savepoint

2019-02-14 Thread Juho Autio
Thanks Konstantin! I'll try to see if I can prepare code & conf to be shared as fully as possible. In the meantime: * does the output of the streaming job contain any data, which is not > contained in the batch No. * do you know if all lost records are contained in the last savepoint you >

Re: Data loss when restoring from savepoint

2019-02-14 Thread Konstantin Knauf
Hi Juho, you are right the problem has actually been narrowed down quite a bit over time. Nevertheless, sharing the code (incl. flink-conf.yaml) might be a good idea. Maybe something strikes the eye, that we have not thought about so far. If you don't feel comfortable sharing the code on the ML,

Re: Data loss when restoring from savepoint

2019-02-13 Thread Gyula Fóra
Sorry not posting on the mail list was my mistake :/ On Wed, 13 Feb 2019 at 15:01, Juho Autio wrote: > Thanks for stepping in, did you post outside of the mailing list on > purpose btw? > > This I did long time ago: > > To rule out for good any questions about sink behaviour, the job was >>

Re: Data loss when restoring from savepoint

2019-02-13 Thread Juho Autio
Stefan (or anyone!), please, could I have some feedback on the findings that I reported on Dec 21, 2018? This is still a major blocker.. On Thu, Jan 31, 2019 at 11:46 AM Juho Autio wrote: > Hello, is there anyone that could help with this? > > On Fri, Jan 11, 2019 at 8:14 AM Juho Autio wrote:

Re: Data loss when restoring from savepoint

2019-01-31 Thread Juho Autio
Hello, is there anyone that could help with this? On Fri, Jan 11, 2019 at 8:14 AM Juho Autio wrote: > Stefan, would you have time to comment? > > On Wednesday, January 2, 2019, Juho Autio wrote: > >> Bump – does anyone know if Stefan will be available to comment the latest >> findings? Thanks.

Re: Data loss when restoring from savepoint

2019-01-10 Thread Juho Autio
Stefan, would you have time to comment? On Wednesday, January 2, 2019, Juho Autio wrote: > Bump – does anyone know if Stefan will be available to comment the latest > findings? Thanks. > > On Fri, Dec 21, 2018 at 2:33 PM Juho Autio wrote: > >> Stefan, I managed to analyze savepoint with bravo.

Re: Data loss when restoring from savepoint

2019-01-02 Thread Juho Autio
Bump – does anyone know if Stefan will be available to comment the latest findings? Thanks. On Fri, Dec 21, 2018 at 2:33 PM Juho Autio wrote: > Stefan, I managed to analyze savepoint with bravo. It seems that the data > that's missing from output *is* found in savepoint. > > I simplified my

Re: Data loss when restoring from savepoint

2018-12-21 Thread Juho Autio
Stefan, I managed to analyze savepoint with bravo. It seems that the data that's missing from output *is* found in savepoint. I simplified my test case to the following: - job 1 has bee running for ~10 days - savepoint X created & job 1 cancelled - job 2 started with restore from savepoint X

Re: Data loss when restoring from savepoint

2018-11-19 Thread Juho Autio
Hi Stefan, Bravo doesn't currently support reading a reducer state. I gave it a try but couldn't get to a working implementation yet. If anyone can provide some insight on how to make this work, please share at github: https://github.com/king/bravo/pull/11 Thanks. On Tue, Oct 23, 2018 at 3:32

Re: Data loss when restoring from savepoint

2018-10-23 Thread Juho Autio
I was glad to find that bravo had now been updated to support installing bravo to a local maven repo. I was able to load a checkpoint created by my job, thanks to the example provided in bravo README, but I'm still missing the essential piece. My code was: OperatorStateReader reader =

Re: Data loss when restoring from savepoint

2018-10-15 Thread Juho Autio
Hi Stefan, Sorry but it doesn't seem immediately clear to me what's a good way to use https://github.com/king/bravo. How are people using it? Would you for example modify build.gradle somehow to publish the bravo as a library locally/internally? Or add code directly in the bravo project

Re: Data loss when restoring from savepoint

2018-10-04 Thread Stefan Richter
Hi, > Am 04.10.2018 um 16:08 schrieb Juho Autio : > > > you could take a look at Bravo [1] to query your savepoints and to check if > > the state in the savepoint complete w.r.t your expectations > > Thanks. I'm not 100% if this is the case, but to me it seemed like the missed > ids were

Re: Data loss when restoring from savepoint

2018-10-04 Thread Juho Autio
> you could take a look at Bravo [1] to query your savepoints and to check if the state in the savepoint complete w.r.t your expectations Thanks. I'm not 100% if this is the case, but to me it seemed like the missed ids were being logged by the reducer soon after the job had started (after

Re: Data loss when restoring from savepoint

2018-10-04 Thread Stefan Richter
Hi, you could take a look at Bravo [1] to query your savepoints and to check if the state in the savepoint complete w.r.t your expectations. I somewhat doubt that there is a general problem with the state/savepoints because many users are successfully running it on a large state and I am not

Re: Data loss when restoring from savepoint

2018-10-04 Thread Juho Autio
Thanks for the suggestions! > In general, it would be tremendously helpful to have a minimal working example which allows to reproduce the problem. Definitely. The problem with reproducing has been that this only seems to happen in the bigger production data volumes. That's why I'm hoping to

Re: Data loss when restoring from savepoint

2018-10-04 Thread Till Rohrmann
Hi Juho, another idea to further narrow down the problem could be to simplify the job to not use a reduce window but simply a time window which outputs the window events. Then counting the input and output events should allow you to verify the results. If you are not seeing missing events, then

Re: Data loss when restoring from savepoint

2018-10-04 Thread Andrey Zagrebin
Hi Juho, can you try to reduce the job to minimal reproducible example and share the job and input? For example: - some simple records as input, e.g. tuples of primitive types saved as cvs - minimal deduplication job which processes them and misses records - check if it happens for shorter

Re: Data loss when restoring from savepoint

2018-10-04 Thread Juho Autio
Sorry to insist, but we seem to be blocked for any serious usage of state in Flink if we can't rely on it to not miss data in case of restore. Would anyone have suggestions for how to troubleshoot this? So far I have verified with DEBUG logs that our reduce function gets to process also the data

Re: Data loss when restoring from savepoint

2018-10-01 Thread Juho Autio
Hi Andrey, To rule out for good any questions about sink behaviour, the job was killed and started with an additional Kafka sink. The same number of ids were missed in both outputs: KafkaSink & BucketingSink. I wonder what would be the next steps in debugging? On Fri, Sep 21, 2018 at 3:49 PM

Re: Data loss when restoring from savepoint

2018-09-21 Thread Juho Autio
Thanks, Andrey. > so it means that the savepoint does not loose at least some dropped records. I'm not sure what you mean by that? I mean, it was known from the beginning, that not everything is lost before/after restoring a savepoint, just some records around the time of restoration. It's not

Re: Data loss when restoring from savepoint

2018-09-21 Thread Andrey Zagrebin
Hi Juho, so it means that the savepoint does not loose at least some dropped records. If it is feasible for your setup, I suggest to insert one more map function after reduce and before sink. The map function should be called right after window is triggered but before flushing to s3. The

Re: Data loss when restoring from savepoint

2018-09-20 Thread Juho Autio
Hi Andrey! I was finally able to gather the DEBUG logs that you suggested. In short, the reducer logged that it processed at least some of the ids that were missing from the output. "At least some", because I didn't have the job running with DEBUG logs for the full 24-hour window period. So I

Re: Data loss when restoring from savepoint

2018-08-29 Thread Andrey Zagrebin
Hi Juho, > only when the 24-hour window triggers, BucketingSink gets a burst of input This is of course totally true, my understanding is the same. We cannot exclude problem there for sure, just savepoints are used a lot w/o problem reports and BucketingSink is known to be problematic with s3.

Re: Data loss when restoring from savepoint

2018-08-29 Thread Juho Autio
Andrey, thank you very much for the debugging suggestions, I'll try them. In the meanwhile two more questions, please: > Just to keep in mind this problem with s3 and exclude it for sure. I would also check whether the size of missing events is around the batch size of BucketingSink or not.

Re: Data loss when restoring from savepoint

2018-08-27 Thread Andrey Zagrebin
Hi, true, StreamingFileSink does not support s3 in 1.6.0, it is planned for the next 1.7 release, sorry for confusion. The old BucketingSink has in general problem with s3. Internally BucketingSink queries s3 as a file system to list already written file parts (batches) and determine index of

Re: Data loss when restoring from savepoint

2018-08-24 Thread Juho Autio
Hi, Using StreamingFileSink is not a convenient option for production use for us as it doesn't support s3*. I could use StreamingFileSink just to verify, but I don't see much point in doing so. Please consider my previous comment: > I realized that BucketingSink must not play any role in this

Re: Data loss when restoring from savepoint

2018-08-24 Thread Andrey Zagrebin
Ok, I think before further debugging the window reduced state, could you try the new ‘StreamingFileSink’ [1] introduced in Flink 1.6.0 instead of the previous 'BucketingSink’? Cheers, Andrey [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/streamfile_sink.html

Re: Data loss when restoring from savepoint

2018-08-24 Thread Juho Autio
Yes, sorry for my confusing comment. I just meant that it seems like there's a bug somewhere now that the output is missing some data. > I would wait and check the actual output in s3 because it is the main result of the job Yes, and that's what I have already done. There seems to be always some

Re: Data loss when restoring from savepoint

2018-08-24 Thread Andrey Zagrebin
Hi Juho, So it is a per key deduplication job. Yes, I would wait and check the actual output in s3 because it is the main result of the job and > The late data around the time of taking savepoint might be not included into > the savepoint but it should be behind the snapshotted offset in

Re: Data loss when restoring from savepoint

2018-08-24 Thread Juho Autio
Thanks for your answer! I check for the missed data from the final output on s3. So I wait until the next day, then run the same thing re-implemented in batch, and compare the output. > The late data around the time of taking savepoint might be not included into the savepoint but it should be

Re: Data loss when restoring from savepoint

2018-08-24 Thread Andrey Zagrebin
Hi Juho, Where exactly does the data miss? When do you notice that? Do you check it: - debugging `DistinctFunction.reduce` right after resume in the middle of the day or - some distinct records miss in the final output of BucketingSink in s3 after window result is actually triggered and

Re: Data loss when restoring from savepoint

2018-08-23 Thread Juho Autio
I changed to allowedLateness=0, no change, still missing data when restoring from savepoint. On Tue, Aug 21, 2018 at 10:43 AM Juho Autio wrote: > I realized that BucketingSink must not play any role in this problem. This > is because only when the 24-hour window triggers, BucketinSink gets a

Re: Data loss when restoring from savepoint

2018-08-21 Thread Juho Autio
I realized that BucketingSink must not play any role in this problem. This is because only when the 24-hour window triggers, BucketinSink gets a burst of input. Around the state restoring point (middle of the day) it doesn't get any input, so it can't lose anything either (right?). I will next

Data loss when restoring from savepoint

2018-08-17 Thread Juho Autio
Some data is silently lost on my Flink stream job when state is restored from a savepoint. Do you have any debugging hints to find out where exactly the data gets dropped? My job gathers distinct values using a 24-hour window. It doesn't have any custom state management. When I cancel the job