Re: StreamingFileSink duplicate data

2019-11-21 Thread Paul Lam
Hi,

StreamingFileSink would not remove committed files, so if you use a non-latest 
checkpoint to restore state, you may need to perform a manual cleanup.

WRT the part id issue, StreamingFileSink will track the global max part number, 
and use this value + 1 as the new id upon restoring. In this way, we avoid file 
name conflicts with the previous execution (see[1]).

[1] 
https://github.com/apache/flink/blob/93dfdd05a84f933473c7b22437e12c03239f9462/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/Buckets.java#L276

Best,
Paul Lam

> 在 2019年11月21日,10:01,Lei Nie  写道:
> 
> Hello,
> I would like clarification on the StreamingFileSink, thank you.
> 
> From my testing, it seems that resuming job from checkpoint does not also 
> restore the rolling part counter.
> 
> E.g, job may have stopped with last file:
> part-6-71
> 
> But when resuming from most recent checkpoint:
> part-6-89
> (There is unexplained gap).
> 
> This is a problem if I am having an issue with my job, and need to roll back 
> more than one checkpoint. After rolling back to the 4th last checkpoint, e.g, 
> the data will be written into different part file names, causing duplication.
> -
> For example, checkpoints:
> chk-17, chk-18, chk-19, chk-20
> 
> Original data:
> part-1-5, part-1-6, part-1-7
> 
> Rollback to chk-17, which writes part-1-18, but with the same data as 
> part-1-5! This is duplicate.
> --
> Am I correct? How to avoid this?



StreamingFileSink duplicate data

2019-11-20 Thread Lei Nie
Hello,
I would like clarification on the StreamingFileSink, thank you.

>From my testing, it seems that resuming job from checkpoint does *not* also
restore the rolling part counter.

E.g, job may have stopped with last file:
*part-6-71*

But when resuming from most recent checkpoint:
*part-6-89*
(There is unexplained gap).

This is a problem if I am having an issue with my job, and need to roll
back *more than one checkpoint*. After rolling back to the 4th last
checkpoint, e.g, the data will be written into *different part file names*,
causing duplication.
-
For example, checkpoints:
*chk-17, chk-18, chk-19, chk-20*

Original data:
*part-1-5, part-1-6, part-1-7*

Rollback to *chk-17*, which writes *part-1-18*, but with the same data as
*part-1-5*! This is duplicate.
--
Am I correct? How to avoid this?