[ https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451893#comment-17451893 ]
Adrian Vasiliu edited comment on FLINK-25098 at 12/1/21, 3:43 PM:
------------------------------------------------------------------

[~trohrmann] Again, we are not killing any process with our code. The use case is:
1. Flink gets deployed in Kubernetes.
2. The user decides to uninstall (then, possibly, reinstall). For that, the K8s way is to delete the K8s custom resource which deployed Flink. => The Flink ConfigMaps remain (which, as you point out, is intentional).

Thanks for the doc pointer.

> The problem is that you are using storage that is not persistent as Flink would need it to be.

[https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/ha/kubernetes_ha/#high-availability-data-clean-up] says: "To keep HA data while restarting the Flink cluster, simply delete the deployment (via {{kubectl delete deployment <cluster-id>}}). All the Flink cluster related resources will be deleted (e.g. JobManager Deployment, TaskManager pods, services, Flink conf ConfigMap). HA related ConfigMaps will be retained because they do not set the owner reference. When restarting the cluster, all previously running jobs will be recovered and restarted from the latest successful checkpoint."

I would think there are two distinct use cases for uninstallation:
1. The user wants to uninstall, then reinstall while preserving data from the previous install. In this case, per the Flink constraint, if persistent storage is enabled, the PV holding it MUST NOT be removed, otherwise Flink will break at reinstall (as reported here).
2. The user wants a full uninstall, with no data left behind, including the persistent volume. They may then decide to reinstall from scratch.

From your description and from the doc, it looks to me that Flink HA supports the first use case well, but not the second. Do I understand this correctly? I would think there should be a way to configure Flink HA to tell it whether or not to perform a full cleanup at uninstallation.
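For illustration, the two cleanup modes discussed above can be sketched with `kubectl` commands. The deployment deletion and the ConfigMap label selector come from the Flink 1.13 Kubernetes HA documentation; `<cluster-id>` stands for the actual cluster id, and the PVC name `flink-ha-storage` is a hypothetical example, not a real resource name.

```shell
# Restart-preserving uninstall: delete only the deployment.
# The HA ConfigMaps survive (they set no owner reference), so previously
# running jobs recover from the latest checkpoint on reinstall.
kubectl delete deployment <cluster-id>

# Full uninstall: additionally remove the retained HA ConfigMaps,
# using the label selector documented for Flink's Kubernetes HA services.
kubectl delete configmaps --selector='app=<cluster-id>,configmap-type=high-availability'

# For a truly complete wipe, also delete the PVC/PV backing
# high-availability.storageDir (claim name is an assumption for this sketch).
kubectl delete pvc flink-ha-storage
```

The second form matches use case 2 above: nothing is left behind, at the cost of losing the ability to recover jobs on reinstall.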
That's because a typical requirement for uninstalls in enterprise environments is to have nothing left behind, including the deletion of persistent storage. If a user needs the persistent storage to be kept, they can arrange that through the configuration of the persistent volume claim / persistent volume, but that's optional.
> Jobmanager CrashLoopBackOff in HA configuration
> -----------------------------------------------
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
> Reporter: Adrian Vasiliu
> Priority: Critical
> Attachments: iaf-insights-engine--7fc4-eve-29ee-ep-jobmanager-1-jobmanager.log, jm-flink-ha-jobmanager-log.txt, jm-flink-ha-tls-proxy-log.txt
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to CrashLoopBackOff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of the jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
> * Persistent jobs storage provided by the {{rocks-cephfs}} storage class (shared by all replicas - ReadWriteMany) and mount path set via {{high-availability.storageDir: file///<dir>}}.
> * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not a "one-shot" trouble.
> Remarks:
> * This is a follow-up of https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
> * Picked Critical severity as HA is critical for our product.
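As a side note on the {{high-availability.storageDir}} value quoted in the report above, a minimal sketch of the relevant Flink 1.13 Kubernetes HA settings might look as follows. The path `/flink-ha` is an assumption for this example, and note that the storage URI takes a scheme such as `file://` (with the colon):

```shell
# Sketch: append the Kubernetes HA entries to flink-conf.yaml.
# <cluster-id> is a placeholder; /flink-ha is an assumed shared
# ReadWriteMany mount path, not taken from the report.
cat >> flink-conf.yaml <<'EOF'
kubernetes.cluster-id: <cluster-id>
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: file:///flink-ha
EOF
```

With such a setup, the HA metadata lives on the shared mount, which is exactly why the PV must survive an uninstall if jobs are expected to recover on reinstall.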
-- This message was sent by Atlassian Jira (v8.20.1#820001)