Hi,

Thanks for creating the Jira issue. I'm not sure I would consider this a blocker, but it is certainly an important problem to fix.

Anyway, in the original version Flink checkpoints the modification timestamp up to which all files have been read (or, in the case of S3, up to which point it *thinks* everything has been processed). In case of a recovery, the timestamp is reset to the checkpointed value and all files with a larger mod timestamp are processed again. This reset of the read position, together with resetting the state of all operators, results in exactly-once state consistency. To avoid adding the data of a file to the state of an operator twice, the monitoring source must ensure that it does not read data again that was processed before the checkpoint was committed. So, if you add an offset to the mod timestamp and track processed files with a timestamp larger than the checkpointed mod timestamp by file name, these names must be included in the checkpoint as well.

Best, Fabian

2018-07-25 6:34 GMT+02:00 Averell <lvhu...@gmail.com>:
> Hello Fabian,
>
> I created the JIRA bug https://issues.apache.org/jira/browse/FLINK-9940
> BTW, I have one more question: Is it worth checkpointing that list of
> processed files? Does the current implementation of the file source
> guarantee exactly-once?
>
> Thanks for your support.
>
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
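For illustration, the bookkeeping described above (an offset on the mod timestamp plus a checkpointed set of file names inside that offset window) could be sketched roughly as below. All names here (`FileReadPosition`, `should_process`, etc.) are hypothetical and only model the idea; this is not Flink's actual `ContinuousFileMonitoringFunction` API:

```python
class FileReadPosition:
    """Sketch of the checkpointed read position: a mod-timestamp watermark
    plus the names of files whose timestamps fall inside the offset window.
    Hypothetical helper for illustration only, not Flink's implementation."""

    def __init__(self, offset_ms=0):
        self.offset_ms = offset_ms
        self.max_mod_time = -1   # largest mod timestamp seen so far
        self.processed = {}      # file name -> mod time, within the offset window

    def should_process(self, name, mod_time):
        # Files older than (watermark - offset) are definitely done; files
        # inside the window are deduplicated by name.
        if mod_time <= self.max_mod_time - self.offset_ms:
            return False
        return name not in self.processed

    def mark_processed(self, name, mod_time):
        self.processed[name] = mod_time
        self.max_mod_time = max(self.max_mod_time, mod_time)
        # Prune names that fell out of the offset window; they can no longer
        # reappear with a qualifying mod timestamp.
        threshold = self.max_mod_time - self.offset_ms
        self.processed = {n: t for n, t in self.processed.items() if t > threshold}

    def snapshot(self):
        # Crucially, both the timestamp AND the file names go into the checkpoint.
        return self.max_mod_time, dict(self.processed)

    def restore(self, snap):
        self.max_mod_time, self.processed = snap[0], dict(snap[1])
```

On recovery, `restore()` resets the read position to the checkpointed value, so files listed late (as can happen with S3) but with timestamps inside the offset window are still picked up, while files recorded by name in the checkpoint are not read twice.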
Anyway, in the original version Flink checkpoints the modification timestamp up to which all files have been read (or at least up to which point it *thinks* to have everything processed in case of S3). In case of a recovery, the timestamp is reset to the checkpointed value and all files with a larger mod timestamp are processed again. This reset of the read position together with resetting the state of all operators results in exactly-once state consistency. In order to avoid that the data of files is added twice to the state of an operator, an the monitoring sink must ensure that it does not read data again that was processed before the checkpoint was committed. So, if you add an offset to the mod timestamp and track processed files with a ts larger than the checkpointed mod timestamp by file name, these names must be included in the checkpoint as well. Best, Fabian 2018-07-25 6:34 GMT+02:00 Averell <lvhu...@gmail.com>: > Hello Fabian, > > I created the JIRA bug https://issues.apache.org/jira/browse/FLINK-9940 > BTW, I have one more question: Is it worth to checkpoint that list of > processed files? Does the current implementation of file-source guarantee > exactly-once? > > Thanks for your support. > > > > > -- > Sent from: http://apache-flink-user-mailing-list-archive.2336050. > n4.nabble.com/ >