[ https://issues.apache.org/jira/browse/FLINK-35857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
chenyuzhi updated FLINK-35857:
------------------------------
Description:
Using the Flink Kubernetes operator with the config:
{code:java}
kubernetes.operator.job.restart.failed=true {code}
we observed different restart results for failed jobs in two cases.

Case 1:
A job with periodic checkpointing enabled and an initial checkpoint path: when it fails, the operator automatically redeploys the deployment with the same job_id and the latest checkpoint path.

!image-2024-07-17-15-03-29-618.png|width=763,height=301!

Case 2:
A job with periodic checkpointing enabled but no initial checkpoint: when it fails, the operator automatically redeploys the deployment with a different job_id and no initial checkpoint path.

!image-2024-07-17-15-04-32-913.png|width=759,height=287!

In case 2, the redeploy behaviour may cause data inconsistency; for example, the Kafka source connector may start consuming from the earliest/latest offset.

Therefore I think a job with periodic checkpointing enabled but no initial checkpoint should be restarted with the same job_id and the latest checkpoint path, just like case 1.
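For illustration, a minimal sketch of the case-2 deployment is shown below. It assumes the standard FlinkDeployment CRD of flink-kubernetes-operator 1.6.x; the deployment name, image, jar URI, checkpoint directory and resource values are placeholders, not taken from the actual job:
{code:yaml}
# Sketch of the "case 2" setup: periodic checkpointing enabled, but no
# initialSavepointPath. Names, paths and resources are illustrative only.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: checkpoint-no-initial-path   # hypothetical name
spec:
  image: flink:1.15.2
  flinkVersion: v1_15
  flinkConfiguration:
    execution.checkpointing.interval: "60s"           # periodic checkpointing enabled
    state.checkpoints.dir: "s3://bucket/checkpoints"  # illustrative checkpoint storage path
    execution.shutdown-on-application-finish: "false"
    kubernetes.operator.job.restart.failed: "true"    # operator redeploys the job on failure
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/usrlib/job.jar  # illustrative
    parallelism: 2
    upgradeMode: last-state
    # No initialSavepointPath set: as described above, on a failed-job restart the
    # operator currently redeploys with a new job_id and no checkpoint path (case 2). {code}
Case 1 would be the same manifest with spec.job.initialSavepointPath pointing at an existing checkpoint, which is the setup where the operator already restores from the latest checkpoint.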
> Operator restart failed job without latest checkpoint
> -----------------------------------------------------
>
>                 Key: FLINK-35857
>                 URL: https://issues.apache.org/jira/browse/FLINK-35857
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.6.1
>        Environment: flink kubernetes operator version: 1.6.1
> flink version 1.15.2
> flink job config:
> *execution.shutdown-on-application-finish=false*
>            Reporter: chenyuzhi
>            Priority: Major
>         Attachments: image-2024-07-17-15-03-29-618.png, image-2024-07-17-15-04-32-913.png
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)