Hi,
StreamingFileSink does not remove committed files, so if you restore state from
a checkpoint other than the latest one, you may need to manually clean up the
files committed after that checkpoint.
Regarding the part id issue, StreamingFileSink tracks the global max part
number and uses this value + 1 as the new id upon restoring. This avoids file
name conflicts with the previous execution (see [1]).
[1]
https://github.com/apache/flink/blob/93dfdd05a84f933473c7b22437e12c03239f9462/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/Buckets.java#L276
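To illustrate the idea, here is a minimal sketch in plain Java. It is a
simplified model of the behaviour described above, not the actual code in
Buckets.java. The class and method names (PartCounterSketch, nextPartId,
partFileName) are hypothetical, and the counter values 88 and 40 are made up
to show one way a gap like part-6-71 -> part-6-89 could arise, assuming
another subtask had already reached a higher part number.

```java
import java.util.List;

// Simplified model only: NOT the real Flink implementation.
public class PartCounterSketch {

    // Hypothetical helper: take the max over all restored part counters
    // and continue from max + 1, as described in the reply above.
    static long nextPartId(List<Long> restoredCounters) {
        long max = 0L;
        for (long counter : restoredCounters) {
            max = Math.max(max, counter);
        }
        return max + 1;
    }

    // Part files are named part-<subtaskIndex>-<partId>.
    static String partFileName(int subtaskIndex, long partId) {
        return "part-" + subtaskIndex + "-" + partId;
    }

    public static void main(String[] args) {
        // Assumed counters from three subtasks at checkpoint time.
        // Subtask 6 last wrote part 71; another subtask reached 88.
        List<Long> restored = List.of(71L, 88L, 40L);

        long next = nextPartId(restored);
        System.out.println(partFileName(6, next)); // prints "part-6-89"
    }
}
```

Because the new id is derived from the max over all counters, a restored job
never reuses an existing file name, but it also cannot overwrite files that
were committed by the previous execution.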
Best,
Paul Lam
> On Nov 21, 2019, at 10:01, Lei Nie wrote:
>
> Hello,
> I would like clarification on the StreamingFileSink, thank you.
>
> From my testing, it seems that resuming a job from a checkpoint does not
> restore the rolling part counter.
>
> E.g., the job may have stopped with this last file:
> part-6-71
>
> But when resuming from the most recent checkpoint, it writes:
> part-6-89
> (There is an unexplained gap.)
>
> This is a problem if I have an issue with my job and need to roll back more
> than one checkpoint. After rolling back to the 4th last checkpoint, for
> example, the data will be written to different part file names, causing
> duplication.
> -
> For example, checkpoints:
> chk-17, chk-18, chk-19, chk-20
>
> Original data:
> part-1-5, part-1-6, part-1-7
>
> Rolling back to chk-17 writes part-1-18, but with the same data as
> part-1-5! This is a duplicate.
> --
> Am I correct? How can I avoid this?