[ https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451929#comment-17451929 ]

Till Rohrmann commented on FLINK-25098:
---------------------------------------

Deleting a K8s deployment should effectively lead to the Flink processes 
running in the terminated pods being killed with a {{SIGTERM}} signal. The 
problem is that Flink cannot distinguish between the failure case and the 
shutdown case because both look the same to the process. Therefore, we cannot 
easily tell Flink to clean things up when it receives a {{SIGTERM}}.
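
To illustrate the ambiguity, an intentional teardown and an involuntary 
eviction deliver exactly the same signal to the processes in the pod (a 
sketch; the deployment name matching the {{<cluster-id>}} is an assumption):

{code}
# Intentional teardown: each container's PID 1 receives SIGTERM, followed
# by SIGKILL once terminationGracePeriodSeconds (30s by default) expires.
kubectl delete deployment <cluster-id>

# An involuntary removal, e.g. a node drain, delivers the very same
# SIGTERM, so the JVM cannot tell a shutdown apart from a failure.
kubectl drain <node-name> --ignore-daemonsets
{code}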

There is a discussion about adding support for a shutdown command for a Flink 
cluster that would cancel all running jobs, clean up the related HA 
information and then shut down. This has not been implemented yet, though.

If you want to uninstall everything, then I would suggest explicitly removing 
the config maps for now. They should have the {{high-availability}} label and 
the {{<cluster-id>}}.
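
For reference, a minimal sketch of that cleanup with {{kubectl}} (the {{app}} 
and {{configmap-type}} label keys follow the Flink documentation for 
Kubernetes HA; verify them on your cluster first):

{code}
# Inspect the ConfigMaps and their labels before deleting anything.
kubectl get configmaps --show-labels

# Remove all HA ConfigMaps that belong to the given Flink cluster.
kubectl delete configmaps \
  --selector='app=<cluster-id>,configmap-type=high-availability'
{code}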

[~neeraj.laad], the reason why there is no checkpoint 2373 might be that it 
failed. The attached logs are from the run that failed; in order to fully 
debug the problem I would need the logs from the run that produced checkpoint 
2372.



> Jobmanager CrashLoopBackOff in HA configuration
> -----------------------------------------------
>
>                 Key: FLINK-25098
>                 URL: https://issues.apache.org/jira/browse/FLINK-25098
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.13.2, 1.13.3
>         Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>            Reporter: Adrian Vasiliu
>            Priority: Critical
>         Attachments: 
> iaf-insights-engine--7fc4-eve-29ee-ep-jobmanager-1-jobmanager.log, 
> jm-flink-ha-jobmanager-log.txt, jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 
> 1.13.3), turning on Flink HA with 3 replicas of the jobmanager leads to 
> CrashLoopBackOff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers 
> of the jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
>  * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
> (shared by all replicas - ReadWriteMany) and mount path set via 
> {{high-availability.storageDir: file:///<dir>}} (see the configuration 
> sketch right after this list).
>  * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, so it 
> is not a one-off problem.
> Remarks:
>  * This is a follow-up of 
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>  
>  * Picked Critical severity as HA is critical for our product.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
