[
https://issues.apache.org/jira/browse/FLINK-38106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18011971#comment-18011971
]
vikasap commented on FLINK-38106:
---------------------------------
We are seeing similar "Job Not Found" events, except that in our case the job
is not actually stuck. The problem appears to be that the reconciliation of
job IDs does not happen cleanly.
> Job gets indefinitely stuck with "Job Not Found" events
> -------------------------------------------------------
>
> Key: FLINK-38106
> URL: https://issues.apache.org/jira/browse/FLINK-38106
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: 1.12.1
> Reporter: Rasmus Bilgram
> Priority: Major
>
> We are running Flink jobs using last-state upgradeMode. We have experienced
> that when upgrading the job with a different job graph, the job ends up in an
> undesirable state where we only see "Job Not Found" events, there is no HA
> metadata, and restoring is only possible from the latest savepoint.
> From the logs, Flink does not allow changing the job graph when restoring
> from a checkpoint. Such an upgrade is only possible using upgradeMode:
> savepoint, and we have used that to reproduce the issue.
> Steps:
> 1. Upgrade a job with a job graph change using last-state upgradeMode.
> 2. Job manager pod gets "Caused by: java.lang.IllegalStateException: There is
> no operator for the state [id]" and restarts
> 3. When the Job manager starts, /overview returns an empty list of jobs to
> the operator
> 4. Operator sets the status to RECONCILING; since it is not FAILED, no
> redeployments are attempted
> 5. Operator starts producing "Job Not Found" events
> 6. We observed that the HA metadata is also missing
> 7. Job is stuck until we manually restore from savepoint
> We are a little concerned if this can be caused by other issues, maybe OOM on
> job manager then perhaps FAILED state is better to trigger retry mechanisms.
> Alternatively, it would be great if rollback from the latest checkpoint (with
> the former job graph) were possible. We tried the rollback mechanism, but it
> complained about no HA metadata.
> It seems similar to: https://issues.apache.org/jira/browse/FLINK-32631
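
To make steps 3-5 of the quoted description concrete, here is a minimal Python
sketch of the status decision as the report describes it. It is not the
operator's actual code: the names ObservedStatus, observe and
fail_when_unrecoverable are hypothetical, and the FAILED branch only
illustrates the alternative the reporter suggests.

```python
from enum import Enum
from typing import Iterable


class ObservedStatus(Enum):
    RUNNING = "RUNNING"
    RECONCILING = "RECONCILING"
    FAILED = "FAILED"


def observe(
    job_ids: Iterable[str],
    expected_job_id: str,
    ha_metadata_available: bool,
    fail_when_unrecoverable: bool = False,
) -> ObservedStatus:
    """Hypothetical status decision for one observation cycle.

    job_ids stands in for the job list the JobManager reports
    (empty in step 3 of the report).
    """
    if expected_job_id in set(job_ids):
        return ObservedStatus.RUNNING

    # Assumption: this models the reporter's suggestion, not existing
    # operator behaviour. With no job listed and no HA metadata there is
    # nothing to recover from, so FAILED could trigger retry mechanisms.
    if fail_when_unrecoverable and not ha_metadata_available:
        return ObservedStatus.FAILED

    # Behaviour described in the report: the job is missing but the status
    # stays RECONCILING, so no redeployment is attempted.
    return ObservedStatus.RECONCILING


if __name__ == "__main__":
    print(observe([], "job-a", ha_metadata_available=False))
    print(observe([], "job-a", ha_metadata_available=False,
                  fail_when_unrecoverable=True))
```

With the default arguments, the sketch stays in RECONCILING for an empty job
list, which matches the stuck state in the report. The opt-in flag shows how a
FAILED result could be limited to the case where no HA metadata is left.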
--
This message was sent by Atlassian Jira
(v8.20.10#820010)