[ https://issues.apache.org/jira/browse/FLINK-35857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chenyuzhi updated FLINK-35857:
------------------------------
    Description: 
Using the Flink Kubernetes operator with the following config:
{code:java}
kubernetes.operator.job.restart.failed=true {code}
We observed different restart behaviour for failed jobs in the following two cases.

Case 1:

When a job with periodic checkpointing enabled and an initial checkpoint path failed, the operator automatically redeployed it with the same job_id and the latest checkpoint path.

 

!image-2024-07-17-15-03-29-618.png|width=763,height=301!
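In Flink configuration terms, this is equivalent to resubmitting the job with a restore path pointing at the latest retained checkpoint, roughly like the following (the path is a placeholder; this mirrors the effective behaviour rather than the operator's exact mechanism):
{code:java}
execution.savepoint.path=s3://my-bucket/checkpoints/<job_id>/chk-1234 {code}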

 

Case 2:

When a job with periodic checkpointing enabled but no initial checkpoint path failed, the operator automatically redeployed it with a different job_id and no initial checkpoint path.

!image-2024-07-17-15-04-32-913.png|width=759,height=287!

 

In case 2, the redeploy behaviour may cause data inconsistency. For example, the Kafka source connector may consume data from the earliest/latest offset instead of the offsets stored in the latest checkpoint.
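As an illustration, here is a minimal sketch of such a source using the Flink 1.15 DataStream KafkaSource API (topic, bootstrap servers and group id are hypothetical): the configured starting offsets only apply when the job is not restored from a checkpoint, so a redeploy with a fresh job_id and no checkpoint path falls back to them.
{code:java}
// Minimal sketch (hypothetical topic/servers): if the job is NOT restored from a
// checkpoint, the KafkaSource starts from the OffsetsInitializer below; if it IS
// restored, the offsets stored in the checkpoint take precedence.
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaOffsetFallbackExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // periodic checkpointing, as in the report

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")   // hypothetical
                .setTopics("orders")                 // hypothetical
                .setGroupId("flink-consumer")        // hypothetical
                // Used only when there is no checkpoint/savepoint to restore from:
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source").print();
        env.execute("kafka-offset-fallback-example");
    }
}
{code}
With OffsetsInitializer.latest() the restarted job skips whatever was produced during the outage; with earliest() it reprocesses already-handled records. Either way the result differs from a restart that restores the offsets from the latest checkpoint.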

 

Thus I think a job with periodic checkpointing enabled but no initial checkpoint path should be restarted with the same job_id and the latest checkpoint path, just like case 1.
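For reference, the latest retained checkpoint path can be looked up through the JobManager REST API before the cluster is torn down. The sketch below only illustrates that lookup (host, port and job id are placeholders, and it assumes externalized/retained checkpoints and a still-reachable REST endpoint); it is not the operator's actual implementation:
{code:java}
// Sketch only: query the JobManager REST API for the latest completed checkpoint's
// external path, which could then be used as the restore path when redeploying the
// failed job with the same job_id.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LatestCheckpointLookup {
    public static void main(String[] args) throws Exception {
        String jobManager = "http://flink-jobmanager:8081";  // placeholder
        String jobId = "00000000000000000000000000000000";   // placeholder

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(jobManager + "/jobs/" + jobId + "/checkpoints"))
                .GET()
                .build();
        String body = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();

        // The response JSON contains latest.completed.external_path; a JSON library
        // would normally be used here, a regex keeps the sketch dependency-free.
        Matcher m = Pattern.compile("\"external_path\"\\s*:\\s*\"([^\"]+)\"").matcher(body);
        if (m.find()) {
            System.out.println("Latest externalized checkpoint: " + m.group(1));
        } else {
            System.out.println("No externalized checkpoint found for job " + jobId);
        }
    }
}
{code}
The returned external path could then be supplied as the restore path when the operator redeploys the failed job, giving case 2 the same behaviour as case 1.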



> Operator restart failed job without latest checkpoint
> -----------------------------------------------------
>
>                 Key: FLINK-35857
>                 URL: https://issues.apache.org/jira/browse/FLINK-35857
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.6.1
>         Environment:  flink kubernetes operator version: 1.6.1
> flink version 1.15.2
> flink job config:
> *execution.shutdown-on-application-finish=false*
>            Reporter: chenyuzhi
>            Priority: Major
>         Attachments: image-2024-07-17-15-03-29-618.png, 
> image-2024-07-17-15-04-32-913.png
>


