[ https://issues.apache.org/jira/browse/FLINK-30305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643494#comment-17643494 ]

Alexis Sarda-Espinosa edited comment on FLINK-30305 at 12/5/22 6:17 PM:
------------------------------------------------------------------------

Since I saw "Deleting JobManager deployment and HA metadata" in the logs, I 
thought it had been the operator, but if that's done by Flink itself, then I 
guess there's not much the operator can do. I prefer savepoint upgrades because 
they shut down the job cleanly, but if it's impossible to detect these 
scenarios, I'd have to use an alternative. Nevertheless, couldn't the operator 
detect that there was a successful savepoint and that the job hasn't started 
since, allowing further spec changes?
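
For reference, this is the kind of deployment being discussed; a minimal sketch 
of a FlinkDeployment with Kubernetes-based HA and savepoint upgrades (the 
resource name, jar URI, and storage paths below are placeholders, not taken 
from the issue):
{noformat}
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-job                                    # placeholder name
spec:
  flinkConfiguration:
    high-availability: kubernetes                      # Kubernetes-based HA services
    high-availability.storageDir: s3://bucket/ha       # placeholder path
    state.savepoints.dir: s3://bucket/savepoints       # placeholder path
  job:
    jarURI: local:///opt/flink/usrlib/job.jar          # placeholder jar
    parallelism: 1
    upgradeMode: savepoint   # take a savepoint and stop the job cleanly before upgrading
{noformat}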


> Operator deletes HA metadata during stateful upgrade, preventing potential 
> manual rollback
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-30305
>                 URL: https://issues.apache.org/jira/browse/FLINK-30305
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.2.0
>            Reporter: Alexis Sarda-Espinosa
>            Priority: Major
>
> I was testing resiliency of jobs with Kubernetes-based HA enabled, upgrade 
> mode = {{savepoint}}, and with _automatic_ rollback _disabled_ in the 
> operator. Once the job was running, I purposely created an erroneous spec by 
> changing my pod template to include an entry in {{envFrom -> secretRef}} with 
> a name that doesn't exist. Schema validation passed, so the operator tried to 
> upgrade the job, but the new pod hung with {{CreateContainerConfigError}}, 
> and I saw this in the operator logs:
> {noformat}
> >>> Status | Info    | UPGRADING       | The resource is being upgraded
> Deleting deployment with terminated application before new deployment
> Deleting JobManager deployment and HA metadata.
> {noformat}
> Afterwards, even after I remove the non-existent entry from my pod template, the 
> operator can no longer propagate the new spec because "Job is not running yet 
> and HA metadata is not available, waiting for upgradeable state".
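
For illustration, the erroneous pod-template change described above would look 
roughly like this; a sketch with a hypothetical secret name (referencing a 
Secret that does not exist in the namespace is what produces 
{{CreateContainerConfigError}}):
{noformat}
spec:
  podTemplate:
    spec:
      containers:
        - name: flink-main-container
          envFrom:
            - secretRef:
                name: missing-secret   # hypothetical; no Secret with this name exists
{noformat}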



