[ 
https://issues.apache.org/jira/browse/FLINK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337224#comment-17337224
 ] 

Nicolaus Weidner commented on FLINK-5763:
-----------------------------------------

[~sewen] [~klion26] We stumbled across this ticket while working on savepoint 
deletion for VVP. The ticket description sounds like a savepoint may reference 
files outside its own directory (see "Scattered Checkpoint Files", where I 
assume <target> to be the "savepoint directory"). However, the discussion on 
this ticket and the linked commits appear to be only about relative vs. absolute 
paths, to enable relocation. Can you clarify which of the two is the case?

Our motivation: we want to enable VVP users to trigger savepoint/checkpoint 
deletions.
 * If files in a given checkpoint directory can be referenced by other 
savepoints or checkpoints, it would be dangerous to just delete the directory, 
because those referencing savepoints would be corrupted.
 * If a savepoint references files outside its directory, and those files are 
not used by any other savepoint/checkpoint, deletion would not be dangerous, 
but simply deleting the savepoint directory would not remove everything.
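
To make the two cases above concrete, here is a minimal Python sketch of the 
check we have in mind. It is an illustration only, not Flink or VVP code: the 
function name `classify_deletion` is hypothetical, and it assumes the set of 
files each savepoint/checkpoint references is already known (how to obtain 
that from the metadata is exactly the open question) and that paths are plain 
POSIX-style strings.

```python
from pathlib import PurePosixPath


def classify_deletion(snapshot_dir, referenced_files, other_references):
    """Classify the effect of recursively deleting `snapshot_dir`.

    snapshot_dir     -- directory of the savepoint/checkpoint to delete
    referenced_files -- files this snapshot references (assumed to be known)
    other_references -- files referenced by all *other* savepoints/checkpoints
    """
    root = PurePosixPath(snapshot_dir)
    mine = {PurePosixPath(p) for p in referenced_files}
    others = {PurePosixPath(p) for p in other_references}

    # Case 1: another snapshot references a file inside this directory,
    # so deleting the directory would corrupt that snapshot.
    unsafe_shared = sorted(str(p) for p in others if root in p.parents)

    # Case 2: this snapshot references files outside its directory that
    # nobody else uses; deleting the directory alone would leave them behind.
    leftover_outside = sorted(
        str(p) for p in mine if root not in p.parents and p not in others
    )

    return {"unsafe_shared": unsafe_shared, "leftover_outside": leftover_outside}
```

A directory delete would only be both safe and complete when both lists are 
empty, which is what a fully self-contained savepoint would guarantee.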

> Make savepoints self-contained and relocatable
> ----------------------------------------------
>
>                 Key: FLINK-5763
>                 URL: https://issues.apache.org/jira/browse/FLINK-5763
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Ufuk Celebi
>            Assignee: Congxian Qiu
>            Priority: Critical
>              Labels: pull-request-available, usability
>             Fix For: 1.11.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> After a user has triggered a savepoint, a single savepoint file will be 
> returned as a handle to the savepoint. A savepoint to {{<target>}} creates a 
> savepoint file like {{<target>/savepoint-<randomSuffix>}}.
> This file contains the metadata of the corresponding checkpoint, but not the 
> actual program state. While this works well for short term management 
> (pause-and-resume a job), it makes it hard to manage savepoints over longer 
> periods of time.
> h4. Problems
> h5. Scattered Checkpoint Files
> For file system based checkpoints (FsStateBackend, RocksDBStateBackend) this 
> results in the savepoint referencing files from the checkpoint directory 
> (usually different than <target>). For users, it is virtually impossible to 
> tell which checkpoint files belong to a savepoint and which are lingering 
> around. This can easily lead to accidentally invalidating a savepoint by 
> deleting checkpoint files.
> h5. Savepoints Not Relocatable
> Even if a user is able to figure out which checkpoint files belong to a 
> savepoint, moving these files will invalidate the savepoint as well, because 
> the metadata file references absolute file paths.
> h5. Forced to Use CLI for Disposal
> Because of the scattered files, the user is in practice forced to use Flink’s 
> CLI to dispose a savepoint. This should be possible to handle in the scope of 
> the user’s environment via a file system delete operation.
> h4. Proposal
> In order to solve the described problems, savepoints should contain all their 
> state, both metadata and program state, inside a single directory. 
> Furthermore the metadata must only hold relative references to the checkpoint 
> files. This makes it obvious which files make up the state of a savepoint and 
> it is possible to move savepoints around by moving the savepoint directory.
> h5. Desired File Layout
> Triggering a savepoint to {{<target>}} creates a directory as follows:
> {code}
> <target>/savepoint-<jobId>-<randomSuffix>
>   +-- _metadata
>   +-- data-<randomSuffix> [1 or more]
> {code}
> We include the JobID in the savepoint directory name in order to give some 
> hints about which job a savepoint belongs to.
> h5. CLI
> - Trigger: When triggering a savepoint to {{<target>}} the savepoint 
> directory will be returned as the handle to the savepoint.
> - Restore: Users can restore by pointing to the directory or the _metadata 
> file. The data files should be required to be in the same directory as the 
> _metadata file.
> - Dispose: The disposal command should be deprecated and eventually removed. 
> While deprecated, disposal can happen by specifying the directory or the 
> _metadata file (same as restore).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
