[ https://issues.apache.org/jira/browse/FLINK-28187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557513#comment-17557513 ]
Aitozi edited comment on FLINK-28187 at 6/22/22 3:16 PM:
---------------------------------------------------------

I'm afraid I did not express my meaning clearly, so let me give an example of what I have in mind:

1. The job is submitted with {{Generation1}}, and the JobID is generated as {{ns/name@Generation1}}.
2. The submission times out but actually succeeds, and the last reconciled spec is not updated.
3. The user changes the spec and the generation becomes {{Generation2}} (before the observer has synced the job status and updated the last reconciled spec).
4. The observer looks for the job with JobID {{ns/name@Generation2}}, which does not match the first job.
5. The reconciler reconciles and submits the job with {{Generation2}}.

In this sequence, the job {{ns/name@Generation1}} will be orphaned.

> Duplicate job submission for FlinkSessionJob
> --------------------------------------------
>
>                 Key: FLINK-28187
>                 URL: https://issues.apache.org/jira/browse/FLINK-28187
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.0.0
>            Reporter: Jeesmon Jacob
>            Priority: Critical
>         Attachments: flink-operator-log.txt
>
>
> During a session job submission, if a deployment error (e.g.
> concurrent.TimeoutException) is hit, the operator will submit the job again.
> But the first submission could have succeeded on the JobManager side, and
> the second submission could then result in a duplicate job. Operator log
> attached.
> Per [~gyfora]:
> The problem is that when a deployment error is hit, the
> SessionJobObserver will not be able to tell whether it has submitted the job
> or not, so it will simply try to submit it again. We have to find a mechanism
> to correlate jobs on the cluster with the SessionJob CR itself. Maybe we
> could override the job name itself for this purpose, or something like that.
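
The five-step sequence in the comment can be sketched as a small standalone simulation. This is only an illustration under stated assumptions: {{OrphanedJobSketch}}, {{jobIdFor}} and {{simulate}} are hypothetical names, the "cluster" is just an in-memory set, and hashing {{ns/name@generation}} with a name-based UUID is an assumed stand-in for whatever deterministic JobID scheme the operator might use. It is not the operator's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

class OrphanedJobSketch {

    // Hypothetical helper: derive a deterministic 32-hex-char ID from
    // "namespace/name@generation" (assumed scheme, for illustration only).
    static String jobIdFor(String namespace, String name, long generation) {
        String key = namespace + "/" + name + "@" + generation;
        UUID uuid = UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8));
        return uuid.toString().replace("-", "");
    }

    // Replays steps 1-5 and returns the set of job IDs left on the "cluster".
    static Set<String> simulate(String namespace, String name) {
        Set<String> runningOnCluster = new HashSet<>();

        // Steps 1-2: the submission for generation 1 times out on the client
        // side but succeeds on the cluster; the last reconciled spec is
        // never updated.
        runningOnCluster.add(jobIdFor(namespace, name, 1));

        // Step 3: the user edits the spec, so the generation becomes 2.
        String expectedId = jobIdFor(namespace, name, 2);

        // Step 4: the observer looks for the generation-2 ID and finds nothing.
        boolean observed = runningOnCluster.contains(expectedId);

        // Step 5: the reconciler therefore submits the generation-2 job.
        if (!observed) {
            runningOnCluster.add(expectedId);
        }
        return runningOnCluster;
    }

    public static void main(String[] args) {
        Set<String> jobs = simulate("ns", "name");
        System.out.println("jobs on cluster: " + jobs);
        System.out.println("generation-1 job still running: "
                + jobs.contains(jobIdFor("ns", "name", 1)));
    }
}
```

In this sketch both jobs end up on the cluster, and nothing still refers to the generation-1 ID, which is the orphaning described in the comment.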