Hi Juho,
* does the output of the streaming job contain any data, which is not
>> contained in the batch
>
>
> No.
>
> * do you know if all lost records are contained in the last savepoint you
>> took before the window fired? This would mean that no records are lost
>> after the last restore.
>
>
Thanks Konstantin!
I'll try to see if I can prepare code & conf to be shared as fully as
possible.
In the meantime:
* does the output of the streaming job contain any data, which is not
> contained in the batch
No.
* do you know if all lost records are contained in the last savepoint you
> to
Hi Juho,
you are right, the problem has actually been narrowed down quite a bit over
time. Nevertheless, sharing the code (incl. flink-conf.yaml) might be a
good idea. Maybe something strikes the eye, that we have not thought about
so far. If you don't feel comfortable sharing the code on the ML, f
Sorry, not posting on the mailing list was my mistake :/
On Wed, 13 Feb 2019 at 15:01, Juho Autio wrote:
> Thanks for stepping in. Did you post outside of the mailing list on
> purpose, btw?
>
> This I did long time ago:
>
> To rule out for good any questions about sink behaviour, the job was
>> kil
Stefan (or anyone!), please, could I have some feedback on the findings
that I reported on Dec 21, 2018? This is still a major blocker...
On Thu, Jan 31, 2019 at 11:46 AM Juho Autio wrote:
> Hello, is there anyone that could help with this?
>
> On Fri, Jan 11, 2019 at 8:14 AM Juho Autio wrote:
>
Hello, is there anyone that could help with this?
On Fri, Jan 11, 2019 at 8:14 AM Juho Autio wrote:
> Stefan, would you have time to comment?
>
> On Wednesday, January 2, 2019, Juho Autio wrote:
>
>> Bump – does anyone know if Stefan will be available to comment on the latest
>> findings? Thanks.
Stefan, would you have time to comment?
On Wednesday, January 2, 2019, Juho Autio wrote:
> Bump – does anyone know if Stefan will be available to comment on the latest
> findings? Thanks.
>
> On Fri, Dec 21, 2018 at 2:33 PM Juho Autio wrote:
>
>> Stefan, I managed to analyze savepoint with bravo.
Bump – does anyone know if Stefan will be available to comment on the latest
findings? Thanks.
On Fri, Dec 21, 2018 at 2:33 PM Juho Autio wrote:
> Stefan, I managed to analyze savepoint with bravo. It seems that the data
> that's missing from output *is* found in savepoint.
>
> I simplified my test
Stefan, I managed to analyze savepoint with bravo. It seems that the data
that's missing from output *is* found in savepoint.
I simplified my test case to the following:
- job 1 has been running for ~10 days
- savepoint X created & job 1 cancelled
- job 2 started with restore from savepoint X
The
Hi Stefan,
Bravo doesn't currently support reading a reducer state. I gave it a try
but couldn't get to a working implementation yet. If anyone can provide
some insight on how to make this work, please share at github:
https://github.com/king/bravo/pull/11
Thanks.
On Tue, Oct 23, 2018 at 3:32 PM
I was glad to find that bravo had now been updated to support installing
it to a local maven repo.
I was able to load a checkpoint created by my job, thanks to the example
provided in bravo README, but I'm still missing the essential piece.
My code was:
OperatorStateReader reader = ne
Hi Stefan,
Sorry but it doesn't seem immediately clear to me what's a good way to use
https://github.com/king/bravo.
How are people using it? Would you, for example, modify build.gradle somehow
to publish bravo as a library locally/internally? Or add code directly
in the bravo project (locally)
Good then, I'll try to analyze the savepoints with Bravo. Thanks!
> How would you assume that backpressure would influence your updates?
Updates to each local state still happen event-by-event, in a single
reading/writing thread.
Sure, just an ignorant guess by me. I'm not familiar with most of Fl
Hi,
> Am 04.10.2018 um 16:08 schrieb Juho Autio :
>
> > you could take a look at Bravo [1] to query your savepoints and to check if
> > the state in the savepoint is complete w.r.t. your expectations
>
> Thanks. I'm not 100% sure if this is the case, but to me it seemed like the missed
> ids were being
> you could take a look at Bravo [1] to query your savepoints and to check
if the state in the savepoint is complete w.r.t. your expectations
Thanks. I'm not 100% sure if this is the case, but to me it seemed like the
missed ids were being logged by the reducer soon after the job had started
(after restori
Hi,
you could take a look at Bravo [1] to query your savepoints and to check if the
state in the savepoint is complete w.r.t. your expectations. I somewhat doubt that
there is a general problem with the state/savepoints because many users are
successfully running it on a large state and I am not aw
Thanks for the suggestions!
> In general, it would be tremendously helpful to have a minimal working
example that allows reproducing the problem.
Definitely. The problem with reproducing has been that this only seems to
happen in the bigger production data volumes.
That's why I'm hoping to fin
Hi Juho,
another idea to further narrow down the problem could be to simplify the
job to not use a reduce window but simply a time window which outputs the
window events. Then counting the input and output events should allow you
to verify the results. If you are not seeing missing events, then it
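A rough sketch of that simplification, assuming string records and the same keying as the real job: a window function that simply forwards every buffered element when the window fires, so input and output event counts can be compared.

import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class ForwardingWindowFunction implements WindowFunction<String, String, String, TimeWindow> {
    @Override
    public void apply(String key, TimeWindow window, Iterable<String> input, Collector<String> out) {
        // Emit every buffered element unchanged when the window fires.
        for (String element : input) {
            out.collect(element);
        }
    }
}

// Usage, replacing the reduce: ids.keyBy(...).timeWindow(Time.hours(24)).apply(new ForwardingWindowFunction())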
Hi Juho,
can you try to reduce the job to a minimal reproducible example and share the job
and input?
For example:
- some simple records as input, e.g. tuples of primitive types saved as csv
- minimal deduplication job which processes them and misses records
- check if it happens for shorter windo
Sorry to insist, but we seem to be blocked from any serious usage of state
in Flink if we can't rely on it not to miss data in case of a restore.
Would anyone have suggestions for how to troubleshoot this? So far I have
verified with DEBUG logs that our reduce function gets to process also the
data t
Hi Andrey,
To rule out for good any questions about sink behaviour, the job was killed
and started with an additional Kafka sink.
The same number of ids were missed in both outputs: KafkaSink &
BucketingSink.
I wonder what would be the next steps in debugging?
On Fri, Sep 21, 2018 at 3:49 PM Ju
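A sketch of what such a dual-sink wiring could look like; the broker address, topic name and the Kafka 0.11 connector are assumptions made for illustration:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;

public class DualSinkWiring {
    public static void attachSinks(DataStream<String> windowOutput) {
        // Original sink.
        windowOutput.addSink(new BucketingSink<String>("s3://my-bucket/distinct-ids"));
        // Additional sink used only to cross-check the output.
        windowOutput.addSink(new FlinkKafkaProducer011<>(
                "kafka-broker:9092", "distinct-ids-check", new SimpleStringSchema()));
    }
}

Since the same ids were missing from both outputs, this points at the loss happening before either sink.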
Thanks, Andrey.
> so it means that the savepoint does not lose at least some dropped
records.
I'm not sure what you mean by that? I mean, it was known from the
beginning that not everything is lost before/after restoring a savepoint,
just some records around the time of restoration. It's not 10
Hi Juho,
so it means that the savepoint does not lose at least some dropped records.
If it is feasible for your setup, I suggest inserting one more map function
after the reduce and before the sink.
The map function should be called right after the window is triggered but before
flushing to s3.
The resul
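A sketch of such a map function, assuming string records and an SLF4J logger; it just logs whatever the window emitted on its way to the sink, so the sink can be ruled in or out as the place where records disappear:

import org.apache.flink.api.common.functions.MapFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class WindowResultLogger implements MapFunction<String, String> {

    private static final Logger LOG = LoggerFactory.getLogger(WindowResultLogger.class);

    @Override
    public String map(String windowResult) {
        // Called right after the window fires, before the record reaches the sink.
        LOG.info("window emitted: {}", windowResult);
        return windowResult;
    }
}

// Usage: ...reduce(new DistinctFunction()).map(new WindowResultLogger()).addSink(...)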
Hi Andrey!
I was finally able to gather the DEBUG logs that you suggested. In short,
the reducer logged that it processed at least some of the ids that were
missing from the output.
"At least some", because I didn't have the job running with DEBUG logs for
the full 24-hour window period. So I was
Hi Juho,
> only when the 24-hour window triggers, BucketingSink gets a burst of input
This is of course totally true, my understanding is the same. We cannot exclude a
problem there for sure; it's just that savepoints are used a lot without problem reports,
and BucketingSink is known to be problematic with s3.
Andrey, thank you very much for the debugging suggestions, I'll try them.
In the meanwhile two more questions, please:
> Just to keep in mind this problem with s3 and exclude it for sure. I
would also check whether the size of missing events is around the batch
size of BucketingSink or not.
Fair
Hi,
true, StreamingFileSink does not support s3 in 1.6.0; it is planned for the
next 1.7 release, sorry for the confusion.
The old BucketingSink has a general problem with s3. Internally BucketingSink
queries s3 as a file system
to list already written file parts (batches) and determine index of t
Hi,
Using StreamingFileSink is not a convenient option for us in production, as
it doesn't support s3*. I could use StreamingFileSink just to verify,
but I don't see much point in doing so. Please consider my previous comment:
> I realized that BucketingSink must not play any role in this pro
Ok, I think before further debugging the window reduce state,
could you try the new 'StreamingFileSink' [1] introduced in Flink 1.6.0 instead
of the previous 'BucketingSink'?
Cheers,
Andrey
[1]
https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/streamfile_sink.html
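A minimal sketch of what swapping in StreamingFileSink might look like; since s3 is noted above as unsupported by this sink in 1.6.0, the path here is a non-s3 placeholder, and the encoder is an assumption:

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class StreamingFileSinkFactory {
    public static StreamingFileSink<String> create() {
        // Row-format sink writing one record per line.
        return StreamingFileSink
                .forRowFormat(new Path("hdfs:///tmp/distinct-ids"),
                              new SimpleStringEncoder<String>("UTF-8"))
                .build();
    }
}

// Usage: ...reduce(new DistinctFunction()).addSink(StreamingFileSinkFactory.create())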
Yes, sorry for my confusing comment. I just meant that it seems like
there's a bug somewhere now that the output is missing some data.
> I would wait and check the actual output in s3 because it is the main
result of the job
Yes, and that's what I have already done. There always seems to be some
Hi Juho,
So it is a per-key deduplication job.
Yes, I would wait and check the actual output in s3 because it is the main
result of the job and
> The late data around the time of taking the savepoint might not be included in
> the savepoint, but it should be behind the snapshotted offset in Kafka
Thanks for your answer!
I check for the missing data in the final output on s3. So I wait until
the next day, then run the same thing re-implemented in batch, and compare
the output.
> The late data around the time of taking the savepoint might not be included
in the savepoint but it should be beh
Hi Juho,
Where exactly does the data go missing? When do you notice that?
Do you check it:
- by debugging `DistinctFunction.reduce` right after resume in the middle of the
day
or
- by seeing that some distinct records are missing from the final output of BucketingSink in s3 after
the window result is actually triggered and saved
I changed to allowedLateness=0, no change, still missing data when
restoring from savepoint.
On Tue, Aug 21, 2018 at 10:43 AM Juho Autio wrote:
> I realized that BucketingSink must not play any role in this problem. This
> is because only when the 24-hour window triggers, BucketingSink gets a bur
I realized that BucketingSink must not play any role in this problem. This
is because only when the 24-hour window triggers, BucketingSink gets a burst
of input. Around the state restoring point (middle of the day) it doesn't
get any input, so it can't lose anything either (right?).
I will next try
Some data is silently lost in my Flink streaming job when state is restored
from a savepoint.
Do you have any debugging hints to find out where exactly the data gets
dropped?
My job gathers distinct values using a 24-hour window. It doesn't have any
custom state management.
When I cancel the job wi
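For reference, a rough sketch of the shape of job described here, assuming a keyed 24-hour window with a reducing "DistinctFunction" and a BucketingSink writing to s3; the record type, key selector, source and paths are illustrative assumptions, not the actual code:

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;

public class DistinctIdsJobSketch {

    // Keeps one record per key within the window (the "DistinctFunction" role).
    public static class DistinctFunction implements ReduceFunction<String> {
        @Override
        public String reduce(String first, String second) {
            return first;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the real source (a Kafka consumer in the actual setup).
        DataStream<String> ids = env.fromElements("id-1", "id-2", "id-1");

        ids.keyBy(new KeySelector<String, String>() {
                @Override
                public String getKey(String id) {
                    return id;
                }
            })
            .timeWindow(Time.hours(24))
            .reduce(new DistinctFunction())
            .addSink(new BucketingSink<String>("s3://my-bucket/distinct-ids"));

        env.execute("distinct-ids-sketch");
    }
}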