[ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450968#comment-17450968
 ] 

Till Rohrmann commented on FLINK-25098:
---------------------------------------

Thanks for reporting this issue [~adrianalexvasiliu]. From the attached logs I 
cannot see what went wrong: the JM process cannot read 
{{/mnt/pv/flink-ha-storage/default/submittedJobGrapha600b0596ee1}} because the 
file does not exist. To better understand the problem, I would need the logs 
of the other JM processes, covering what happened before 
{{eventprocessor--7fc4-eve-29ee-ep-jobmanager-0}} took over the leadership. 
Ideally, you could provide the logs for the whole lifetime of the job 
{{609559678972cbfee4830395f4c47e3f}}. It would also be great if you could 
turn on the {{DEBUG}} log level.

You could also check whether the same problem occurs when using S3, GCS or 
HDFS as the persistent storage. I would like to rule out that the problem is 
related to the ReadWriteMany PV using {{rocks-cephfs}}.
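
As a rough sketch of that experiment, only the storage directory would need to 
change in {{flink-conf.yaml}}. The bucket and path below are placeholders, not 
values from this deployment, and an S3 filesystem plugin (for example 
{{flink-s3-fs-hadoop}} or {{flink-s3-fs-presto}}) is assumed to be enabled:

{code:yaml}
# Sketch only: <bucket> and <path> are hypothetical placeholders.
# All other high-availability.* settings stay as they are today.
high-availability.storageDir: s3://<bucket>/<path>
{code}

If the JMs stop crash-looping with an object store or HDFS behind this path, 
that would suggest the shared CephFS volume is involved rather than Flink 
itself.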

> Jobmanager CrashLoopBackOff in HA configuration
> -----------------------------------------------
>
>                 Key: FLINK-25098
>                 URL: https://issues.apache.org/jira/browse/FLINK-25098
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.13.2, 1.13.3
>         Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>            Reporter: Adrian Vasiliu
>            Priority: Critical
>         Attachments: jm-flink-ha-jobmanager-log.txt, 
> jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 
> 1.13.3), enabling Flink HA with 3 replicas of the jobmanager leads to 
> CrashLoopBackOff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
> the jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
>  * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
> (shared by all replicas - ReadWriteMany) and mount path set via 
> {{high-availability.storageDir: file///<dir>}}.
>  * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, so this 
> is not a one-off problem.
> Remarks:
>  * This is a follow-up of 
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>  * Picked Critical severity because HA is critical for our product.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
