[ https://issues.apache.org/jira/browse/FLINK-26274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502183#comment-17502183 ]
Dawid Wysakowicz edited comment on FLINK-26274 at 3/7/22, 10:28 AM:
--------------------------------------------------------------------

I tried testing the feature in the following way:
# I set up a minikube Flink cluster following the documentation, plus MinIO (as the DFS implementation).
# I ran the StateMachineExample job: {{./bin/flink run -p 6 -m localhost:8081 ./examples/streaming/StateMachineExample.jar --backend rocks --incremental-checkpoints true --checkpoint-dir s3://checkpoints/checkpoints}}
# I scaled the taskmanagers down from 3 to 2, then back up from 2 to 3. I checked the new taskmanagers' logs to see if they tried restoring from the local state, which they did.
# I scaled the taskmanagers down from 3 to 1, then back up from 1 to 3 to check slot allocation. I got into an infinite restart loop. I am attaching the jobmanager's logs ([^jobmanager_local_restore_2.log]).

Have I done something wrong? Why can't it restore?

* flink-conf in ConfigMap:
{code}
"flink-conf.yaml": "state.backend.local-recovery: true
  process.taskmanager.working-dir: /pv
  jobmanager.rpc.address: flink-jobmanager
  taskmanager.numberOfTaskSlots: 2
  blob.server.port: 6124
  jobmanager.rpc.port: 6123
  taskmanager.rpc.port: 6122
  queryable-state.proxy.ports: 6125
  jobmanager.memory.process.size: 1600m
  taskmanager.memory.process.size: 1728m
  parallelism.default: 2
  s3.endpoint: http://minio-hl:9000
  s3.path.style.access: true
  s3.access.key: flinkflink
  s3.secret.key: flinkflink
  state.storage.fs.memory-threshold: 0
  ",
{code}
* taskmanager config:
{code}
containers:
  - name: taskmanager
    image: dawidwys/flink:1.15-SN-s3
    args:
      - taskmanager
      - '-Dtaskmanager.resource-id=$(POD_NAME)'
{code}
* commit id: 26d7c09
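For reference, the scale-down/scale-up steps above can be sketched with kubectl against a standalone Kubernetes setup. This is a sketch, not the exact commands from the test: the deployment names {{flink-taskmanager}} and {{flink-jobmanager}} are assumptions taken from the typical reference setup in the Flink Kubernetes documentation.

{code}
# Step 3: scale taskmanagers from 3 to 2, then back up to 3
# ("flink-taskmanager" / "flink-jobmanager" names are assumed)
kubectl scale deployment/flink-taskmanager --replicas=2
kubectl rollout status deployment/flink-taskmanager
kubectl scale deployment/flink-taskmanager --replicas=3
kubectl rollout status deployment/flink-taskmanager

# Step 4: scale from 3 to 1, then back to 3, and inspect the
# jobmanager logs for the observed restart loop
kubectl scale deployment/flink-taskmanager --replicas=1
kubectl scale deployment/flink-taskmanager --replicas=3
kubectl logs deployment/flink-jobmanager --tail=200
{code}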
> Test local recovery works across TaskManager process restarts
> -------------------------------------------------------------
>
>                 Key: FLINK-26274
>                 URL: https://issues.apache.org/jira/browse/FLINK-26274
>             Project: Flink
>          Issue Type: Technical Debt
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Till Rohrmann
>            Assignee: Dawid Wysakowicz
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.15.0
>
>         Attachments: jobmanager_local_restore_2.log
>
>
> This ticket is a testing task for
> [FLIP-201|https://cwiki.apache.org/confluence/x/wJuqCw].
> When enabling local recovery and configuring a working directory that can be
> re-read after a process failure, Flink should now be able to recover locally.
> We should test whether this is the case. Please take a look at the
> documentation [1, 2] to see how to configure Flink to make use of it.
>
> [1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/working_directory/
> [2] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes/#enabling-local-recovery-across-pod-restarts

--
This message was sent by Atlassian Jira
(v8.20.1#820001)