Hi Shachar, I think you could refer to [1] for the directory structure of checkpoints. The '_metadata' file records which data files belong to a checkpoint, e.g. the file paths under the 'shared' folder. As I said before, you need to call Checkpoints#loadCheckpointMetadata to load '_metadata' and find out which files belong to that checkpoint.
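Once the referenced paths have been extracted from the loaded metadata, finding leftovers reduces to a set difference against the contents of the 'shared' folder. A minimal sketch in plain Java (the class and method names here are hypothetical, and it assumes you have already walked the loaded metadata to collect the referenced paths):

```java
import java.util.Set;
import java.util.stream.Collectors;

public class SharedFolderAudit {
    // Given the file paths referenced by the latest _metadata (e.g. collected
    // after calling Checkpoints#loadCheckpointMetadata) and the files actually
    // present under shared/, return the files no checkpoint references.
    public static Set<String> unreferenced(Set<String> referencedByMetadata,
                                           Set<String> filesInShared) {
        return filesInShared.stream()
                .filter(f -> !referencedByMetadata.contains(f))
                .collect(Collectors.toSet());
    }
}
```

Note that an unreferenced file is only a deletion *candidate*; as discussed below, it should also be older than the latest completed checkpoint before it is safe to remove.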
[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#directory-structure

Best
Yun Tang
________________________________
From: Shachar Carmeli <carmeli....@gmail.com>
Sent: Sunday, April 12, 2020 15:32
To: user@flink.apache.org <user@flink.apache.org>
Subject: Re: Flink incremental checkpointing - how long does data is kept in the share folder

Thank you for the quick response.

Your answer relates to the checkpoint folder that contains the _metadata file, e.g. chk-1829. What about the "shared" folder? How do I know which files in that folder are still relevant and which are left over from a failed checkpoint? They are not directly related to the _metadata of a checkpoint, or am I missing something?

On 2020/04/07 18:37:57, Yun Tang <myas...@live.com> wrote:
> Hi Shachar
>
> Why do we see data that is older than the lateness configuration?
> There are three possible reasons:
>
>   1. RocksDB really does still need that file in the current checkpoint. If we uploaded a file named 42.sst on 2/4 as part of some old checkpoint, the current checkpoint could still include that 42.sst file if it has never been compacted since then. This is possible in theory.
>   2. Your checkpoint size is large, and the checkpoint coordinator could not remove files fast enough before exiting.
>   3. The file was created by a crashed task manager and is not known to the checkpoint coordinator.
>
> How do I know which files belong to a valid checkpoint and not to a checkpoint of a crashed job, so we can delete those files?
> You have to call Checkpoints#loadCheckpointMetadata [1] to load the latest _metadata in the checkpoint directory and compare the file paths it references with the files currently in the checkpoint directory. The files that are not in the checkpoint metadata and are older than the latest checkpoint can be removed. You could follow this approach to debug, or maybe I could write a tool later to help identify which files can be deleted.
>
> [1] https://github.com/apache/flink/blob/693cb6adc42d75d1db720b45013430a4c6817d4a/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L96
>
> Best
> Yun Tang
>
> ________________________________
> From: Shachar Carmeli <carmeli....@gmail.com>
> Sent: Tuesday, April 7, 2020 16:19
> To: user@flink.apache.org <user@flink.apache.org>
> Subject: Flink incremental checkpointing - how long does data is kept in the share folder
>
> We are using Flink 1.6.3, keeping the checkpoints in CEPH, retaining only one checkpoint at a time, using incremental checkpointing with RocksDB.
>
> We run windows with a lateness of 3 days, which means we expect no data in the checkpoint shared folder to be kept after 3-4 days. Still, we see data older than that, e.g. if today is 7/4 there are some files from 2/4.
>
> Sometimes we see checkpoints that we assume belong to a job that crashed (because their index numbers are not in sync) and whose checkpoint was not used to restore the job.
>
> My questions are:
>
> Why do we see data that is older than the lateness configuration?
> How do I know which files belong to a valid checkpoint and not to a checkpoint of a crashed job, so we can delete those files?
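The cleanup rule suggested in this thread (remove only files that are both unreferenced by the latest _metadata and older than the latest completed checkpoint) could be sketched as follows. This is a plain-Java illustration, not a Flink utility; the class and method names are hypothetical, and it assumes the referenced paths were already collected from the loaded checkpoint metadata:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OrphanFinder {
    // List files directly under sharedDir that are NOT referenced by the
    // latest checkpoint metadata AND were last modified before the latest
    // checkpoint completed. Only such files are safe-to-delete candidates;
    // newer unreferenced files may belong to an in-progress checkpoint.
    public static Set<Path> deletableCandidates(Path sharedDir,
                                                Set<Path> referenced,
                                                FileTime latestCheckpointTime)
            throws IOException {
        try (Stream<Path> files = Files.list(sharedDir)) {
            return files
                    .filter(f -> !referenced.contains(f))
                    .filter(f -> {
                        try {
                            return Files.getLastModifiedTime(f)
                                    .compareTo(latestCheckpointTime) < 0;
                        } catch (IOException e) {
                            return false; // skip entries we cannot stat
                        }
                    })
                    .collect(Collectors.toSet());
        }
    }
}
```

Deleting the resulting candidates is deliberately left to the operator: the reasons listed above (in-progress uploads, files from crashed task managers) are why the age check matters in addition to the metadata check.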