[ 
https://issues.apache.org/jira/browse/FLINK-12296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824491#comment-16824491
 ] 

seed Z commented on FLINK-12296:
--------------------------------

Unchaining the operators allows the new job to save its state via 
checkpoints.

Unfortunately, the new job cannot resume from the old checkpoint (a 
RocksDB-backed incremental checkpoint), likely because of the way the path is 
composed.

> Data loss silently in RocksDBStateBackend when more than one operator(has 
> states) chained in a single task 
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-12296
>                 URL: https://issues.apache.org/jira/browse/FLINK-12296
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>    Affects Versions: 1.6.3, 1.6.4, 1.7.2, 1.8.0
>            Reporter: Congxian Qiu(klion26)
>            Assignee: Congxian Qiu(klion26)
>            Priority: Critical
>
> As discussed on the mailing list [1], there may be a problem when more than 
> one operator is chained in a single task and all of the operators have 
> state: data is lost silently.
> Currently, the local directory we use looks like this:
> ../local_state_root_1/allocation_id/job_id/vertex_id_subtask_idx/chk_1/(state)
>  
> If more than one operator is chained in a single task and all of the 
> operators have state, then all of them share the same local directory 
> (because the vertex_id is the same), which leads to data loss.
>  
> The path generation logic is below:
> {code:java}
> // LocalRecoveryDirectoryProviderImpl.java
> @Override
> public File subtaskSpecificCheckpointDirectory(long checkpointId) {
>    return new File(subtaskBaseDirectory(checkpointId), checkpointDirString(checkpointId));
> }
> @VisibleForTesting
> String subtaskDirString() {
>    return Paths.get("jid_" + jobID, "vtx_" + jobVertexID + "_sti_" + subtaskIndex).toString();
> }
> @VisibleForTesting
> String checkpointDirString(long checkpointId) {
>    return "chk_" + checkpointId;
> }
> {code}
> [1] 
> [https://app.smartmailcloud.com/web-share/MDkE4DArUT2eoSv86xq772I1HDgMNTVhLEmsnbQ7]
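
For illustration only, here is a minimal standalone sketch of the collision 
described in the issue. It mirrors the string composition from the quoted 
snippet but is not Flink code: the class name, the plain String/int 
parameters, and the "op_" path segment are all assumptions made for this 
example, and the per-operator variant is just one conceivable layout, not 
necessarily the fix that was applied.

```java
import java.nio.file.Paths;

public class LocalStatePathSketch {

    // Mirrors subtaskDirString() from the quoted snippet: only the job ID,
    // the job vertex ID, and the subtask index go into the directory name.
    static String subtaskDirString(String jobId, String jobVertexId, int subtaskIndex) {
        return Paths.get("jid_" + jobId, "vtx_" + jobVertexId + "_sti_" + subtaskIndex).toString();
    }

    // Mirrors checkpointDirString() from the quoted snippet.
    static String checkpointDirString(long checkpointId) {
        return "chk_" + checkpointId;
    }

    // Path composed without any operator-specific component, as in the issue.
    static String checkpointDir(String jobId, String jobVertexId, int subtaskIndex, long checkpointId) {
        return Paths.get(
                subtaskDirString(jobId, jobVertexId, subtaskIndex),
                checkpointDirString(checkpointId)).toString();
    }

    // Hypothetical variant (assumption, not Flink's layout): add an
    // operator-specific segment so chained operators get separate directories.
    static String checkpointDirPerOperator(
            String jobId, String jobVertexId, int subtaskIndex, long checkpointId, String operatorId) {
        return Paths.get(
                checkpointDir(jobId, jobVertexId, subtaskIndex, checkpointId),
                "op_" + operatorId).toString();
    }

    public static void main(String[] args) {
        // Two chained operators live in the same task, so they see the same
        // job ID, vertex ID, and subtask index.
        String dirForOperatorA = checkpointDir("job", "vertex", 0, 1L);
        String dirForOperatorB = checkpointDir("job", "vertex", 0, 1L);
        System.out.println("same directory: " + dirForOperatorA.equals(dirForOperatorB));

        // With the hypothetical operator segment the paths differ.
        System.out.println(checkpointDirPerOperator("job", "vertex", 0, 1L, "opA"));
        System.out.println(checkpointDirPerOperator("job", "vertex", 0, 1L, "opB"));
    }
}
```

The sketch also shows why any change to how the path is composed yields a 
different directory than the one an older checkpoint was written under, which 
may be related to the resume problem mentioned above.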


