[ https://issues.apache.org/jira/browse/FLINK-28843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576147#comment-17576147 ]
Yun Tang commented on FLINK-28843:
----------------------------------

Thanks for reporting this bug!

The root cause is that a native savepoint can contain relative file state handles (all files under the {{chk-x}} folder are {{RelativeFileStateHandle}}), and a snapshot taken on the changelog state backend might not trigger materialization, so the newly created {{chk-y}} folder does not contain the files of the previous snapshot. Thus, when restoring from {{chk-y}}, the relocatable {{chk-x/file-1}} is resolved as {{chk-y/file-1}}, resulting in a file-not-found exception.

Since the docs already state that native savepoints are relocatable (refer to [https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/checkpoints_vs_savepoints/#capabilities-and-limitations]), we might have to let the changelog state backend trigger materialization on the first checkpoint if the restored snapshot contains relative file state handles.

cc [~roman] [~ym] [~Yanfei Lei]

> Failed to restore from changelog checkpoint in claim mode
> ---------------------------------------------------------
>
>                 Key: FLINK-28843
>                 URL: https://issues.apache.org/jira/browse/FLINK-28843
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>    Affects Versions: 1.15.0, 1.15.1
>            Reporter: Lihe Ma
>            Priority: Critical
>
> # When native checkpoint is enabled and incremental checkpointing is enabled
> in the rocksdb state backend, state data greater than
> state.storage.fs.memory-threshold is stored in a data file
> (FileStateHandle, RelativeFileStateHandle, etc.), like
> base-path1/chk-1/file1, rather than in a ByteStreamStateHandle in the
> checkpoint metadata.
> # Then restore the job from base-path1/chk-1 in claim mode using the
> changelog state backend, with the checkpoint path set to base-path2. The new
> checkpoint is saved in base-path2/chk-2, but the previous checkpoint file
> (base-path1/chk-1/file1) is still needed.
> # Then restore the job from base-path2/chk-2 with the changelog state
> backend. Flink tries to read base-path2/chk-2/file1 rather than the actual
> file location base-path1/chk-1/file1, which leads to a
> FileNotFoundException and the job fails.
>
> How to reproduce?
> # Set state.storage.fs.memory-threshold to a small value, like '20b'.
> # Run
> {{org.apache.flink.test.checkpointing.ChangelogPeriodicMaterializationSwitchStateBackendITCase#testSwitchFromDisablingToEnablingInClaimMode}}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
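
For illustration only, the path mix-up described in the comment can be sketched in plain Java. This is a simplified model, not Flink code: the classes {{RelativeHandle}} and {{Snapshot}} and the method {{checkpointWithoutMaterialization}} are hypothetical names made up for this sketch. It assumes a relative handle stores only a file name and is resolved against the directory of whichever snapshot is being restored.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

/** Hypothetical model of relocatable state handles; not Flink's actual classes. */
public class RelativeHandleDemo {

    /** A handle that stores only a file name, relative to a snapshot directory. */
    static final class RelativeHandle {
        final String fileName;

        RelativeHandle(String fileName) {
            this.fileName = fileName;
        }

        // Resolved against the directory of the snapshot being restored.
        Path resolve(Path snapshotDir) {
            return snapshotDir.resolve(fileName);
        }
    }

    /** Snapshot metadata: its own directory plus the handles it references. */
    static final class Snapshot {
        final Path dir;
        final List<RelativeHandle> handles;

        Snapshot(Path dir, List<RelativeHandle> handles) {
            this.dir = dir;
            this.handles = handles;
        }
    }

    /**
     * Models a checkpoint that skips materialization: the new metadata reuses
     * the restored handles, but no file is written into the new directory.
     */
    static Snapshot checkpointWithoutMaterialization(Snapshot restored, Path newDir) {
        return new Snapshot(newDir, restored.handles);
    }

    public static void main(String[] args) {
        Snapshot chk1 = new Snapshot(
                Paths.get("base-path1", "chk-1"),
                List.of(new RelativeHandle("file1")));
        // Where the state file was actually written.
        Path actual = chk1.handles.get(0).resolve(chk1.dir);

        Snapshot chk2 = checkpointWithoutMaterialization(
                chk1, Paths.get("base-path2", "chk-2"));
        // Where a restore from chk-2 would look for the same file.
        Path lookedUp = chk2.handles.get(0).resolve(chk2.dir);

        System.out.println("file written at: " + actual);
        System.out.println("restore reads:   " + lookedUp);
    }
}
```

In this model the two printed paths differ, which corresponds to the FileNotFoundException in the report. Forcing materialization on the first checkpoint after restore, as proposed in the comment, would correspond to writing the state into the new directory before the reused handles are resolved against it.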