[ 
https://issues.apache.org/jira/browse/FLINK-38106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rasmus Bilgram updated FLINK-38106:
-----------------------------------
    Description: 
We are running Flink jobs using upgradeMode: last-state. We have found that 
when a job is upgraded with a changed job graph, it ends up in an undesirable 
state in which we only see "Job Not Found" events, there is no HA metadata, and 
restoring is only possible from the latest savepoint.
From the logs, Flink does not allow changing the job graph when restoring from 
a checkpoint; such an upgrade is only possible using upgradeMode: savepoint, 
and we have used that to reproduce the issue.
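
For context, the upgrade mode is set under spec.job of the FlinkDeployment 
resource. The fragment below is only a minimal sketch: the resource name and 
the HA storage path are placeholders, not values from the affected deployment, 
and the other fields a real deployment needs (image, resources, jarURI and so 
on) are omitted.

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-job                # placeholder name
spec:
  flinkConfiguration:
    # last-state upgrades rely on HA metadata being available
    high-availability.type: kubernetes
    high-availability.storageDir: s3://example-bucket/flink-ha   # placeholder path
  job:
    upgradeMode: last-state        # the report contrasts this with "savepoint"
```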

Steps:
1. Upgrade a job with a job graph change using upgradeMode: last-state.
2. The JobManager pod fails with "Caused by: java.lang.IllegalStateException: 
There is no operator for the state [id]" and restarts.
3. When the JobManager starts again, /overview returns an empty list of jobs to 
the operator.
4. The operator sets the status to RECONCILING; since the status is not FAILED, 
no redeployments are attempted.
5. The operator starts producing "Job Not Found" events.
6. We observed that the HA metadata is also missing.
7. The job is stuck until we manually restore it from a savepoint.

We are a little concerned that this could also be caused by other issues, such 
as an OOM on the JobManager; in that case a FAILED state would perhaps be 
better, since it would trigger the retry mechanisms.
Alternatively, it would be great if rollback to the latest checkpoint (with the 
former job graph) were possible. We tried the rollback mechanism, but it 
complained that there was no HA metadata.

It seems similar to: https://issues.apache.org/jira/browse/FLINK-32631

> Job gets indefinitely stuck with "Job Not Found" events
> -------------------------------------------------------
>
>                 Key: FLINK-38106
>                 URL: https://issues.apache.org/jira/browse/FLINK-38106
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.12.1
>            Reporter: Rasmus Bilgram
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
