[ 
https://issues.apache.org/jira/browse/FLINK-38106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18011971#comment-18011971
 ] 

vikasap commented on FLINK-38106:
---------------------------------

We are seeing similar "Job Not Found" events, except that in our case the job 
is not actually stuck. We see that the problem is that job IDs are not 
reconciled cleanly. 

> Job gets indefinitely stuck with "Job Not Found" events
> -------------------------------------------------------
>
>                 Key: FLINK-38106
>                 URL: https://issues.apache.org/jira/browse/FLINK-38106
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.12.1
>            Reporter: Rasmus Bilgram
>            Priority: Major
>
> We are running Flink jobs using last-state upgradeMode. We have found that 
> upgrading a job with a different job graph leaves it in an undesirable 
> state where we only see "Job Not Found" events, there is no HA metadata, 
> and restoring is only possible from the latest savepoint.
> According to the logs, Flink does not allow changing the job graph when 
> restoring from a checkpoint. Such an upgrade is only possible using 
> upgradeMode: savepoint, and we have used that to reproduce the issue.
> Steps:
> 1. Upgrade a job with a job graph change using last-state upgradeMode.
> 2. Job manager pod gets "Caused by: java.lang.IllegalStateException: There is 
> no operator for the state [id]" and restarts
> 3. When the Job manager starts, /overview returns an empty list of jobs to 
> the operator
> 4. Operator sets the status to RECONCILING; since it is not FAILED, no 
> redeployments are attempted
> 5. Operator starts producing "Job Not Found" events
> 6. We observed that the HA metadata is also missing
> 7. Job is stuck until we manually restore from savepoint
> We are a little concerned if this can be caused by other issues, maybe OOM 
> on job manager then perhaps FAILED state is better to trigger retry 
> mechanisms.
> Alternatively, it would be great if rollback from the latest checkpoint 
> (with the former job graph) were possible. We tried the rollback mechanism, 
> but it complained about no HA metadata.
> It seems similar to: https://issues.apache.org/jira/browse/FLINK-32631
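
For readers following the thread, steps 3 to 5 and the suggestion to report 
FAILED instead can be sketched as a toy model. This is not the operator's 
code: the function names, the status enum and the ha_metadata_present flag 
are all hypothetical and only mirror the behaviour as the description 
reports it.

```python
from enum import Enum
from typing import List


class Status(Enum):
    RUNNING = "RUNNING"
    RECONCILING = "RECONCILING"
    FAILED = "FAILED"


def status_as_reported(jobs: List[str], expected_job_id: str) -> Status:
    """Behaviour as described in steps 3-5 (hypothetical model).

    An empty or non-matching job list leaves the status at RECONCILING,
    so no redeployment is attempted.
    """
    if expected_job_id in jobs:
        return Status.RUNNING
    return Status.RECONCILING


def status_as_suggested(
    jobs: List[str], expected_job_id: str, ha_metadata_present: bool
) -> Status:
    """Alternative raised in the description (hypothetical model).

    If the job is missing and there is no HA metadata to recover from,
    report FAILED so that retry mechanisms can be triggered.
    """
    if expected_job_id in jobs:
        return Status.RUNNING
    if not ha_metadata_present:
        return Status.FAILED
    return Status.RECONCILING
```

In this model, an empty job list combined with missing HA metadata stays 
RECONCILING under the reported behaviour and becomes FAILED under the 
suggested one.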



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
