[ https://issues.apache.org/jira/browse/FLINK-26274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502183#comment-17502183 ]
Dawid Wysakowicz edited comment on FLINK-26274 at 3/7/22, 10:28 AM:
--------------------------------------------------------------------

I tried testing the feature in the following way:
# I set up a minikube Flink cluster following the documentation, plus MinIO (as the DFS implementation).
# I ran the StateMachineExample job: {{./bin/flink run -p 6 -m localhost:8081 ./examples/streaming/StateMachineExample.jar --backend rocks --incremental-checkpoints true --checkpoint-dir s3://checkpoints/checkpoints}}
# I scaled the taskmanagers down from 3 to 2, then back up from 2 to 3. I checked the new taskmanagers' logs to see if they tried restoring from the local state, which they did.
# I scaled the taskmanagers down from 3 to 1, then back up from 1 to 3 to check slot allocation. I got into an infinite restart loop. I am attaching the jobmanager's logs ([^jobmanager_local_restore_2.log]).

Have I done something wrong? Why can't it restore?

* flink-conf in ConfigMap:
{code}
"flink-conf.yaml": "state.backend.local-recovery: true
  process.taskmanager.working-dir: /pv
  jobmanager.rpc.address: flink-jobmanager
  taskmanager.numberOfTaskSlots: 2
  blob.server.port: 6124
  jobmanager.rpc.port: 6123
  taskmanager.rpc.port: 6122
  queryable-state.proxy.ports: 6125
  jobmanager.memory.process.size: 1600m
  taskmanager.memory.process.size: 1728m
  parallelism.default: 2
  s3.endpoint: http://minio-hl:9000
  s3.path.style.access: true
  s3.access.key: flinkflink
  s3.secret.key: flinkflink
  state.storage.fs.memory-threshold: 0
  ",
{code}
* taskmanager config:
{code}
containers:
  - name: taskmanager
    image: dawidwys/flink:1.15-SN-s3
    args:
      - taskmanager
      - '-Dtaskmanager.resource-id=$(POD_NAME)'
{code}
* commit id: 26d7c09
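For reference, the scale-down/scale-up steps above can be sketched with kubectl against a standalone Kubernetes setup. This is a sketch, not the exact commands from the test: the deployment names {{flink-taskmanager}} and {{flink-jobmanager}} are assumptions taken from the typical reference setup in the Flink Kubernetes documentation.

{code}
# Step 3: scale taskmanagers from 3 to 2, then back up to 3
# ("flink-taskmanager" / "flink-jobmanager" names are assumed)
kubectl scale deployment/flink-taskmanager --replicas=2
kubectl rollout status deployment/flink-taskmanager
kubectl scale deployment/flink-taskmanager --replicas=3
kubectl rollout status deployment/flink-taskmanager

# Step 4: scale from 3 to 1, then back to 3, and inspect the
# jobmanager logs for the observed restart loop
kubectl scale deployment/flink-taskmanager --replicas=1
kubectl scale deployment/flink-taskmanager --replicas=3
kubectl logs deployment/flink-jobmanager --tail=200
{code}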
> Test local recovery works across TaskManager process restarts
> -------------------------------------------------------------
>
>                 Key: FLINK-26274
>                 URL: https://issues.apache.org/jira/browse/FLINK-26274
>             Project: Flink
>          Issue Type: Technical Debt
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Till Rohrmann
>            Assignee: Dawid Wysakowicz
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.15.0
>
>         Attachments: jobmanager_local_restore_2.log
>
>
> This ticket is a testing task for
> [FLIP-201|https://cwiki.apache.org/confluence/x/wJuqCw].
> When enabling local recovery and configuring a working directory that can be
> re-read after a process failure, Flink should now be able to recover locally.
> We should test whether this is the case. Please take a look at the
> documentation [1, 2] to see how to configure Flink to make use of it.
>
> [1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/working_directory/
> [2] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes/#enabling-local-recovery-across-pod-restarts

--
This message was sent by Atlassian Jira
(v8.20.1#820001)