[ https://issues.apache.org/jira/browse/FLINK-12296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Congxian Qiu(klion26) updated FLINK-12296:
------------------------------------------
    Description:
As reported on the mailing list [1], state can be lost silently when more than one operator is chained in a single task and all of the chained operators have state.

Currently, the local directory we use looks like this:
../local_state_root_1/allocation_id/job_id/vertex_id_subtask_idx/chk_1/(state)

If more than one operator is chained in a single task and all of them have state, they all share the same local directory (because the vertex_id is the same), which leads to data loss.

The path generation logic is below:
{code:java}
// LocalRecoveryDirectoryProviderImpl.java

@Override
public File subtaskSpecificCheckpointDirectory(long checkpointId) {
    return new File(subtaskBaseDirectory(checkpointId), checkpointDirString(checkpointId));
}

@VisibleForTesting
String subtaskDirString() {
    return Paths.get("jid_" + jobID, "vtx_" + jobVertexID + "_sti_" + subtaskIndex).toString();
}

@VisibleForTesting
String checkpointDirString(long checkpointId) {
    return "chk_" + checkpointId;
}
{code}

[1] [https://app.smartmailcloud.com/web-share/MDkE4DArUT2eoSv86xq772I1HDgMNTVhLEmsnbQ7]

was:
As the mail list said[1], there may be a problem when more than one operator chained in a single task, and all the operators have states, this will be data loss silently.
[1] https://app.smartmailcloud.com/web-share/MDkE4DArUT2eoSv86xq772I1HDgMNTVhLEmsnbQ7

> Data loss silently in RocksDBStateBackend when more than one operator chained
> in a single task
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-12296
>                 URL: https://issues.apache.org/jira/browse/FLINK-12296
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>            Reporter: Congxian Qiu(klion26)
>            Assignee: Congxian Qiu(klion26)
>            Priority: Major

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
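
The directory collision described in this issue can be illustrated outside of Flink. The following is a minimal sketch, not the Flink implementation: the class `LocalStatePathCollision` and the helper `checkpointDirectory` are hypothetical, and only the two string-building methods mirror the snippet quoted in the description. It assumes, as the description states, that chained operators share the same job vertex ID and subtask index.

```java
import java.io.File;
import java.nio.file.Paths;

public class LocalStatePathCollision {

    // Mirrors subtaskDirString() from the issue: only the job ID, the job
    // vertex ID and the subtask index enter the path.
    public static String subtaskDirString(String jobID, String jobVertexID, int subtaskIndex) {
        return Paths.get("jid_" + jobID, "vtx_" + jobVertexID + "_sti_" + subtaskIndex).toString();
    }

    // Mirrors checkpointDirString() from the issue.
    public static String checkpointDirString(long checkpointId) {
        return "chk_" + checkpointId;
    }

    // Hypothetical helper (not a Flink method): composes the full local
    // checkpoint directory below an allocation-specific base directory.
    public static File checkpointDirectory(
            File allocationBaseDir, String jobID, String jobVertexID, int subtaskIndex, long checkpointId) {
        File subtaskBase = new File(allocationBaseDir, subtaskDirString(jobID, jobVertexID, subtaskIndex));
        return new File(subtaskBase, checkpointDirString(checkpointId));
    }

    public static void main(String[] args) {
        File base = new File("local_state_root_1", "allocation_id");

        // Two operators chained into the same task: same job, same vertex,
        // same subtask index, same checkpoint. Nothing operator-specific
        // is part of the inputs, so the two paths cannot differ.
        File dirOfFirstOperator = checkpointDirectory(base, "job_id", "vertex_id", 0, 1L);
        File dirOfSecondOperator = checkpointDirectory(base, "job_id", "vertex_id", 0, 1L);

        System.out.println(dirOfFirstOperator);
        System.out.println("same directory: " + dirOfFirstOperator.equals(dirOfSecondOperator));
    }
}
```

Because no operator-specific value is among the inputs, both operators resolve to the same `chk_1` directory, which matches the shared-directory situation the description reports.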