[ 
https://issues.apache.org/jira/browse/FLINK-29109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17642849#comment-17642849
 ] 

Thomas Weise commented on FLINK-29109:
--------------------------------------

[~gyfora] thanks for catching this. Because the jobId assigned by Flink is 
deterministic (derived from HighAvailabilityOptions.HA_CLUSTER_ID), we will 
also need to apply the random jobId in stateless upgrade mode for Flink 
versions >= 1.16 to avoid checkpoint path collisions. 

https://github.com/apache/flink/blob/e70fe68dea764606180ca3728184c00fc63ea0ff/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L227
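
To illustrate the collision, here is a minimal standalone sketch. It does not use 
Flink classes; the class name, the helper methods and the bucket path are 
assumptions made up for the example, and the id format only approximates a Flink 
JobID. It assumes checkpoints are written under <base>/<jobId>/chk-<n>.

```java
import java.security.SecureRandom;

/** Illustrative only: emulates a deterministic vs. a rotated job id without Flink classes. */
public class JobIdSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Formats two longs as a 32-character hex string, similar in shape to a Flink JobID. */
    public static String toHex(long lower, long upper) {
        return String.format("%016x%016x", lower, upper);
    }

    /** Deterministic id: depends only on the cluster id, so every redeploy gets the same value. */
    public static String deterministicJobId(String clusterId) {
        return toHex(clusterId.hashCode(), 0L);
    }

    /** Rotated id: fresh random bits for each deployment. */
    public static String randomJobId() {
        return toHex(RANDOM.nextLong(), RANDOM.nextLong());
    }

    /** Assumed checkpoint layout: <base>/<jobId>/chk-<checkpointId>. */
    public static String checkpointPath(String baseDir, String jobId, long checkpointId) {
        return baseDir + "/" + jobId + "/chk-" + checkpointId;
    }

    public static void main(String[] args) {
        String base = "s3://bucket/checkpoints"; // hypothetical location
        String cluster = "my-deployment";        // hypothetical cluster id

        // Stateless upgrade: the checkpoint id starts at 1 again after each redeploy.
        String firstDeploy = checkpointPath(base, deterministicJobId(cluster), 1);
        String secondDeploy = checkpointPath(base, deterministicJobId(cluster), 1);
        System.out.println("deterministic id collides: " + firstDeploy.equals(secondDeploy));

        // Rotating the job id per deployment keeps the paths apart.
        String rotatedFirst = checkpointPath(base, randomJobId(), 1);
        String rotatedSecond = checkpointPath(base, randomJobId(), 1);
        System.out.println("rotated id collides: " + rotatedFirst.equals(rotatedSecond));
    }
}
```

With the deterministic id both deployments resolve to the same chk-1 directory, 
which is the conflict described in this issue; with a rotated id they do not.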

> Checkpoint path conflict with stateless upgrade mode
> ----------------------------------------------------
>
>                 Key: FLINK-29109
>                 URL: https://issues.apache.org/jira/browse/FLINK-29109
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.1.0
>            Reporter: Thomas Weise
>            Assignee: Thomas Weise
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: kubernetes-operator-1.2.0
>
>
> A stateful job with stateless upgrade mode (yes, there are such use cases) 
> fails with a checkpoint path conflict due to the constant jobId and 
> FLINK-19358 (applies to Flink < 1.16). Since the checkpoint id resets on 
> restart in stateless upgrade mode, the job writes to previously used 
> locations and fails. The workaround is to rotate the jobId on every redeploy 
> when the upgrade mode is stateless. While this can be worked around 
> externally, it is best done in the operator itself, because reconciliation 
> determines when a restart is actually required, whereas rotating the jobId 
> externally may trigger unnecessary restarts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
