[jira] [Commented] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

Adrian Vasiliu (Jira) Wed, 01 Dec 2021 01:46:34 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451670#comment-17451670
 ]


Adrian Vasiliu commented on FLINK-25098:
----------------------------------------

[~trohrmann] 
> How exactly are you tearing down the initial cluster?

AFAIK we just rely on Flink's own teardown, which happens when removing the 
Custom Resource triggers the removal of the K8S deployment.

> When tearing down the initial cluster, are you also deleting the PVC or the 
> PV?

Not explicitly, but Kubernetes does remove both the PVC and the PV. When we try 
to list the PV previously bound to the deleted PVC, we see that it no longer 
exists, so it could not be reused by the new PVC after redeployment.
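
For reference, the check described above can be sketched as follows. This is 
only an illustration, not the exact procedure we ran: it assumes the output of 
`kubectl get pv -o json` is available as a string, and the function name, the 
namespace and the claim name are hypothetical. It returns the names of the PVs 
whose `spec.claimRef` still points at a given PVC; an empty result means no 
such PV is left.

```python
import json


def pvs_bound_to_claim(pv_list_json, namespace, claim_name):
    """Return the names of PVs whose claimRef points at the given PVC.

    pv_list_json is expected to be the JSON text printed by
    `kubectl get pv -o json` (a List object with an "items" array).
    """
    items = json.loads(pv_list_json).get("items", [])
    matches = []
    for pv in items:
        # claimRef is absent (or null) on PVs that were never bound.
        ref = pv.get("spec", {}).get("claimRef") or {}
        if ref.get("namespace") == namespace and ref.get("name") == claim_name:
            matches.append(pv["metadata"]["name"])
    return matches
```

Calling it with, say, a namespace of "flink" and a claim name of "flink-ha-pvc" 
(both placeholders) after the teardown should return an empty list if the PV 
was indeed removed together with the PVC.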

> Jobmanager CrashLoopBackOff in HA configuration
> -----------------------------------------------
>
>                 Key: FLINK-25098
>                 URL: https://issues.apache.org/jira/browse/FLINK-25098
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.13.2, 1.13.3
>         Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>            Reporter: Adrian Vasiliu
>            Priority: Critical
>         Attachments: jm-flink-ha-jobmanager-log.txt, 
> jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 
> 1.13.3), switching to Flink HA by using 3 replicas of the jobmanager leads to 
> CrashLoopBackOff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
> the jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
>  * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
> (shared by all replicas - ReadWriteMany), with the mount path set via 
> {{high-availability.storageDir: file:///<dir>}}.
>  * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, so it is 
> not a one-off problem.
> Remarks:
>  * This is a follow-up of 
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>  
>  * Picked Critical severity as HA is critical for our product.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
