[ https://issues.apache.org/jira/browse/FLINK-27127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chesnay Schepler reassigned FLINK-27127:
----------------------------------------

    Assignee: Chesnay Schepler

> Local recovery is not triggered on task manager process restart
> ---------------------------------------------------------------
>
>                 Key: FLINK-27127
>                 URL: https://issues.apache.org/jira/browse/FLINK-27127
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.15.0
>            Reporter: Abdullah alkhawatrah
>            Assignee: Chesnay Schepler
>            Priority: Blocker
>
> Hey,
> I am experimenting with the support of local recovery after process restart
> introduced in 1.15. I am trying this on minikube.
> So far, it seems that every time a pod restarts, remote recovery is triggered.
> I have created a repo with everything needed to test it locally with
> minikube: [https://github.com/akhawatrahTW/flink-local-recovery-test].
> The readme contains the steps to reproduce.
>
> Based on the documentation, I was expecting to have local recovery triggered
> on pod restarts since the needed configs are set:
> [https://github.com/akhawatrahTW/flink-local-recovery-test/blob/bfef14e45f475ba953a05b50b8829d9d33bdcec6/k8s/flink-configuration-configmap.yaml#L27.]
> So I was expecting to see something similar to this in the logs of the
> recreated task manager pod:
> *Expected:*
> {code:java}
> 2022-04-07 09:17:17,637 INFO
> org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation
> [] - Starting to restore from state handle:
> IncrementalLocalKeyedStateHandle{metaDataState=File State:
> file:/pv/tm_flink-taskmanager-2/localState/aid_e56a834e076a6d8f9dc1a2997e97a91a/jid_f88542b420546fadbc94db66b00cb5a0/vtx_20ba6b65f97481d5570070de90e4e791_sti_2/chk_1208/c2756339-8938-4949-84ff-d7ee3f4c55cf
> [479 bytes]}
> DirectoryKeyedStateHandle{directoryStateHandle=DirectoryStateHandle{directory=/pv/tm_flink-taskmanager-2/localState/aid_e56a834e076a6d8f9dc1a2997e97a91a/jid_f88542b420546fadbc94db66b00cb5a0/vtx_20ba6b65f97481d5570070de90e4e791_sti_2/chk_1208/5455302ce9554a1f81365aee368f267e},
> keyGroupRange=KeyGroupRange{startKeyGroup=86, endKeyGroup=127}} without
> rescaling.{code}
>
>
> But for some reason, remote recovery is triggered:
> *Actual:*
> {code:java}
> 2022-04-07 09:17:18,405 INFO
> org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation
> [] - Finished restoring from state handle:
> IncrementalRemoteKeyedStateHandle{backendIdentifier=544f3300-36bd-40a6-9ee3-f78b0e47dfd6,
> stateHandleId=c2753d01-2f6b-49f0-9ca1-df6b54c61490,
> keyGroupRange=KeyGroupRange{startKeyGroup=0, endKeyGroup=42},
> checkpointId=1208,
> sharedState={001526.sst=ByteStreamStateHandle{handleName='f5a113d0-8094-40e7-a1b1-adc4cfc690c2',
> dataBytes=23107},
> 001527.sst=ByteStreamStateHandle{handleName='3806411e-8213-406a-bbd8-e498ab19d118',
> dataBytes=15579},
> 001528.sst=ByteStreamStateHandle{handleName='4fef6ead-1522-4f61-a6ad-399b334b41ca',
> dataBytes=15839},
> 001529.sst=ByteStreamStateHandle{handleName='f1324a0c-3eae-46b0-acc2-c03d32b0c24a',
> dataBytes=16055}},
> privateState={OPTIONS-001237=ByteStreamStateHandle{handleName='2e36d07b-5f91-4c9d-9778-5a16bb6254d5',
> dataBytes=9924},
> MANIFEST-001234=ByteStreamStateHandle{handleName='4c95b38a-4afa-4154-9c89-9518d6384a25',
> dataBytes=27356},
> CURRENT=ByteStreamStateHandle{handleName='17bd5bab-c369-470a-bf29-e76279cef2ba',
> dataBytes=16}},
> metaStateHandle=ByteStreamStateHandle{handleName='15827f44-0ab2-4562-b8eb-812b8d260206',
> dataBytes=479}, registered=false} without rescaling.{code}
>
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)