[ https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451893#comment-17451893 ]
Adrian Vasiliu edited comment on FLINK-25098 at 12/1/21, 3:43 PM:
------------------------------------------------------------------

[~trohrmann] Again, we are not killing any process with our code. The use case is:
1. Flink gets deployed in Kubernetes.
2. The user decides to uninstall (then, possibly, reinstall). For that, the K8s way is to delete the K8s custom resource which deployed Flink. => The Flink ConfigMaps remain (which, as you point out, is intentional).

Thanks for the doc pointer.

> The problem is that you are using storage that is not persistent as Flink would need it to be.

[https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/ha/kubernetes_ha/#high-availability-data-clean-up] says: "To keep HA data while restarting the Flink cluster, simply delete the deployment (via {{kubectl delete deployment <cluster-id>}}). All the Flink cluster related resources will be deleted (e.g. JobManager Deployment, TaskManager pods, services, Flink conf ConfigMap). HA related ConfigMaps will be retained because they do not set the owner reference. When restarting the cluster, all previously running jobs will be recovered and restarted from the latest successful checkpoint."

I would think there are two distinct use cases for uninstallation:
1. The user wants to uninstall, then reinstall while preserving data from the previous install. In this case, per the Flink constraint, if persistent storage is enabled, the PV holding it MUST NOT be removed, otherwise Flink will break at reinstall (as reported here).
2. The user wants a full uninstall, with no data left behind, including the persistent volume. They may then decide to reinstall from scratch.

From your description and from the doc, it looks to me that Flink HA supports the first use case well, but not the second. Do I understand this correctly? I would think there should be a way to configure Flink HA to tell it whether or not to perform a full cleanup at uninstallation.
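For illustration, the two cleanup modes discussed above can be sketched with `kubectl` commands. The deployment deletion and the ConfigMap label selector come from the Flink 1.13 Kubernetes HA documentation; `<cluster-id>` stands for the actual cluster id, and the PVC name `flink-ha-storage` is a hypothetical example, not a real resource name.

```shell
# Restart-preserving uninstall: delete only the deployment.
# The HA ConfigMaps survive (they set no owner reference), so previously
# running jobs recover from the latest checkpoint on reinstall.
kubectl delete deployment <cluster-id>

# Full uninstall: additionally remove the retained HA ConfigMaps,
# using the label selector documented for Flink's Kubernetes HA services.
kubectl delete configmaps --selector='app=<cluster-id>,configmap-type=high-availability'

# For a truly complete wipe, also delete the PVC/PV backing
# high-availability.storageDir (claim name is an assumption for this sketch).
kubectl delete pvc flink-ha-storage
```

The second form matches use case 2 above: nothing is left behind, at the cost of losing the ability to recover jobs on reinstall.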
That's because a typical requirement for uninstalls in enterprise environments is to have nothing left behind, including the deletion of persistent storage. If a user needs the persistent storage to be kept, they can arrange that through the configuration of the persistent volume claim / persistent volume, but that's optional.
> Jobmanager CrashLoopBackOff in HA configuration
> -----------------------------------------------
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
> Reporter: Adrian Vasiliu
> Priority: Critical
> Attachments: iaf-insights-engine--7fc4-eve-29ee-ep-jobmanager-1-jobmanager.log, jm-flink-ha-jobmanager-log.txt, jm-flink-ha-tls-proxy-log.txt
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to CrashLoopBackOff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of the jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
> * Persistent jobs storage provided by the {{rocks-cephfs}} storage class (shared by all replicas - ReadWriteMany) and mount path set via {{high-availability.storageDir: file///<dir>}}.
> * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not a "one-shot" trouble.
> Remarks:
> * This is a follow-up of https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
> * Picked Critical severity as HA is critical for our product.
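As a side note on the {{high-availability.storageDir}} value quoted in the report above, a minimal sketch of the relevant Flink 1.13 Kubernetes HA settings might look as follows. The path `/flink-ha` is an assumption for this example, and note that the storage URI takes a scheme such as `file://` (with the colon):

```shell
# Sketch: append the Kubernetes HA entries to flink-conf.yaml.
# <cluster-id> is a placeholder; /flink-ha is an assumed shared
# ReadWriteMany mount path, not taken from the report.
cat >> flink-conf.yaml <<'EOF'
kubernetes.cluster-id: <cluster-id>
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: file:///flink-ha
EOF
```

With such a setup, the HA metadata lives on the shared mount, which is exactly why the PV must survive an uninstall if jobs are expected to recover on reinstall.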
-- This message was sent by Atlassian Jira (v8.20.1#820001)