Hi,

Thanks for creating the Jira issue.
I'm not sure I would consider this a blocker, but it is certainly an
important problem to fix.

Anyway, in the original version, Flink checkpoints the modification
timestamp up to which all files have been read (or, in the case of S3, up
to which it *thinks* everything has been processed).
In case of recovery, the timestamp is reset to the checkpointed value and
all files with a larger mod timestamp are processed again. This reset of
the read position, together with resetting the state of all operators,
results in exactly-once state consistency.
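To make the mechanism concrete, here is a minimal sketch (plain Java, not the actual Flink source code; the class and method names are made up for illustration) of checkpointing and restoring such a read position:

```java
// Hypothetical sketch: tracking the modification timestamp up to which
// all files have been processed, as a checkpointable read position.
class ModTsPosition {
    long maxModTs = Long.MIN_VALUE;

    // Called after a file's data has been fully emitted.
    void fileProcessed(long modTs) {
        maxModTs = Math.max(maxModTs, modTs);
    }

    // The value stored in the checkpoint.
    long snapshot() {
        return maxModTs;
    }

    // On recovery, the position is reset to the checkpointed value.
    void restore(long checkpointedTs) {
        maxModTs = checkpointedTs;
    }

    // After recovery, a listing pass re-emits every file with a larger mod ts.
    boolean shouldProcess(long modTs) {
        return modTs > maxModTs;
    }
}
```

Resetting this position together with the operator state is what yields exactly-once state consistency: any file emitted after the checkpoint is simply emitted again.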

To avoid adding the data of a file to an operator's state twice, the
monitoring source must ensure that it does not re-read data that was
already processed before the checkpoint was committed.
So, if you add an offset to the mod timestamp and track processed files
with a modification timestamp larger than the checkpointed one by file
name, these names must be included in the checkpoint as well.
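A sketch of that deduplication logic (again plain Java, not Flink API; the class name, the offset value, and the method names are assumptions for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a boundary timestamp plus the names of files at or
// after it. Both fields would go into the checkpoint; restoring them lets
// the source skip files it already read, even on eventually listing stores
// like S3 where files may appear late with older mod timestamps.
class DedupState {
    static final long OFFSET_MS = 60_000L; // assumed lateness bound

    long boundaryTs = Long.MIN_VALUE;      // files older than this are done
    Map<String, Long> seenSinceBoundary = new HashMap<>(); // name -> modTs

    // Returns true if the file should be processed, false if it is a duplicate.
    boolean accept(String name, long modTs) {
        if (modTs < boundaryTs) {
            return false; // covered by the boundary timestamp
        }
        if (seenSinceBoundary.containsKey(name)) {
            return false; // already read, deduplicated by file name
        }
        seenSinceBoundary.put(name, modTs);
        long newBoundary = modTs - OFFSET_MS;
        if (newBoundary > boundaryTs) {
            boundaryTs = newBoundary;
            // Names below the new boundary are covered by the timestamp alone,
            // so they no longer need to be tracked (keeps the checkpoint small).
            seenSinceBoundary.values().removeIf(ts -> ts < boundaryTs);
        }
        return true;
    }
}
```

The offset keeps the boundary lagging behind the newest file seen, so a late-appearing file with a slightly older timestamp is still accepted once, while the name set catches exact repeats.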

Best, Fabian


2018-07-25 6:34 GMT+02:00 Averell <lvhu...@gmail.com>:

> Hello Fabian,
>
> I created the JIRA bug https://issues.apache.org/jira/browse/FLINK-9940
> BTW, I have one more question: Is it worth to checkpoint that list of
> processed files? Does the current implementation of file-source guarantee
> exactly-once?
>
> Thanks for your support.
>
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>
