[jira] [Commented] (FLINK-20110) Support 'merge' method for first_value and last_value UDAF

2024-06-07 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853131#comment-17853131
 ] 

Adrian Vasiliu commented on FLINK-20110:


We are also affected by this issue. Is there any update on the chances of this 
getting fixed?

> Support 'merge' method for first_value and last_value UDAF
> --
>
> Key: FLINK-20110
> URL: https://issues.apache.org/jira/browse/FLINK-20110
> Project: Flink
>  Issue Type: Improvement
>  Components: Table SQL / Runtime
>Affects Versions: 1.12.0
>Reporter: hailong wang
>Priority: Major
>  Labels: auto-unassigned, pull-request-available
> Fix For: 1.20.0
>
>
> From the user-zh mailing list: when the first_value function is used in a hop 
> window, it throws an exception because first_value does not implement the 
> merge method. We can support the 'merge' method for the first_value and 
> last_value UDAFs.
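For illustration, the merge semantics being requested can be sketched as a standalone model (a hypothetical sketch, not Flink's actual FIRST_VALUE implementation): each per-pane accumulator keeps the value together with the order (e.g. row time) at which it was seen, and merging two accumulators keeps the state with the smaller order.

```java
// Hypothetical standalone model of FIRST_VALUE accumulator merging --
// a sketch of the semantics requested in this issue, not Flink's code.
final class FirstValueAcc {
    Long value;  // current first value; null while the accumulator is empty
    long order;  // order (e.g. row timestamp) at which 'value' was seen

    void accumulate(long v, long ord) {
        // Keep the value with the smallest order seen so far.
        if (value == null || ord < order) {
            value = v;
            order = ord;
        }
    }

    // The 'merge' method that hop (sliding) windows require: combining two
    // per-pane accumulators keeps the state with the smaller order.
    void merge(FirstValueAcc other) {
        if (other.value != null) {
            accumulate(other.value, other.order);
        }
    }
}
```

A LAST_VALUE accumulator would be symmetric, keeping the state with the larger order.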



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-20110) Support 'merge' method for first_value and last_value UDAF

2024-06-07 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853131#comment-17853131
 ] 

Adrian Vasiliu edited comment on FLINK-20110 at 6/7/24 12:36 PM:
-

We are also affected by this issue. Is there any update on the chances of this 
getting fixed?
Or does the *Fix Version/s:* field being set to 
[1.20.0|https://issues.apache.org/jira/issues/?jql=project+%3D+FLINK+AND+fixVersion+%3D+1.20.0] 
mean the fix is already scheduled for 1.20?


was (Author: JIRAUSER280892):
We are also hurt by this issue. Any update regarding the chances this gets 
fixed?






[jira] [Comment Edited] (FLINK-34575) Vulnerabilities in commons-compress 1.24.0; upgrade to 1.26.0 needed.

2024-03-19 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824327#comment-17824327
 ] 

Adrian Vasiliu edited comment on FLINK-34575 at 3/19/24 4:39 PM:
-

Thanks Marijn for getting 
[https://github.com/apache/flink-connector-hbase/pull/41] merged. I suppose at 
some point this will be complemented by 
[https://github.com/apache/flink/pull/24352].



> I don't think a security fix is necessary, since Flink isn't affected 
> directly by it.

Good to know. However, vulnerability scanners don't know that, and in an 
enterprise context flagged vulnerabilities create trouble and extra process 
even when they can't actually be exploited.


was (Author: JIRAUSER280892):
Thanks Marijn for getting 
[https://github.com/apache/flink-connector-hbase/pull/41] merged. I suppose at 
some point this will be complemented by 
[https://github.com/apache/flink/pull/24352.]

> I don't think a security fix is necessary, since Flink isn't affected 
> directly by it.

Good to know. Now, vulnerability scanners don't know it, and in enterprise 
context vulnerabilities create trouble / extra processes even when the 
vulnerability can't really be exploited.

> Vulnerabilities in commons-compress 1.24.0; upgrade to 1.26.0 needed.
> -
>
> Key: FLINK-34575
> URL: https://issues.apache.org/jira/browse/FLINK-34575
> Project: Flink
>  Issue Type: Technical Debt
>Affects Versions: 1.18.1
>Reporter: Adrian Vasiliu
>Priority: Major
> Fix For: hbase-4.0.0
>
>
> Since Feb. 19, medium/high CVEs have been published for commons-compress 1.24.0:
> [https://nvd.nist.gov/vuln/detail/CVE-2024-25710]
> [https://nvd.nist.gov/vuln/detail/CVE-2024-26308]
> [https://github.com/apache/flink/pull/24352] was opened automatically on 
> Feb. 21 by dependabot to bump commons-compress to v1.26.0, which fixes the 
> CVEs, but two CI checks are red on the PR.
> Flink's dependency on commons-compress was upgraded to v1.24.0 in Oct. 2023 
> ([https://issues.apache.org/jira/browse/FLINK-33329]).
> v1.24.0 is the version currently in the master branch:
> [https://github.com/apache/flink/blob/master/pom.xml#L727-L729]
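For context, the dependency bump the PR performs is of this shape (a sketch only; the exact property name in Flink's parent pom.xml is an assumption to verify against the actual PR):

```xml
<!-- Sketch of the version bump in flink-parent pom.xml; the property
     name is assumed, verify against apache/flink PR 24352. -->
<properties>
  <commons-compress.version>1.26.0</commons-compress.version>
</properties>
```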





[jira] [Updated] (FLINK-31966) Flink Kubernetes operator lacks TLS support

2024-04-16 Thread Adrian Vasiliu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Vasiliu updated FLINK-31966:
---
Attachment: image-2024-04-16-16-33-39-644.png

> Flink Kubernetes operator lacks TLS support 
> 
>
> Key: FLINK-31966
> URL: https://issues.apache.org/jira/browse/FLINK-31966
> Project: Flink
>  Issue Type: New Feature
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.4.0
>Reporter: Adrian Vasiliu
>Assignee: Tony Garrard
>Priority: Major
> Fix For: kubernetes-operator-1.8.0
>
> Attachments: image-2024-04-16-16-33-39-644.png
>
>
> *Summary*
> The Flink Kubernetes operator lacks support inside the FlinkDeployment 
> operand for configuring Flink with TLS (both one-way and mutual) for the 
> internal communication between jobmanagers and taskmanagers, and for the 
> external REST endpoint. Although a workaround exists to configure the job and 
> task managers, this breaks the operator and renders it unable to reconcile.
> *Additional information*
>  * The Apache Flink operator supports passing through custom flink 
> configuration to be applied to job and task managers.
>  * If you supply SSL-based properties, the operator can no longer speak to 
> the deployed job manager: it reads the Flink configuration and uses it to 
> create a connection to the job manager REST endpoint, but the truststore 
> file paths in flink-conf.yaml are unresolvable from the operator pod. This 
> leaves the operator hanging in a pending state, as it cannot complete a 
> reconcile.
> *Proposal*
> Our proposal is to make changes to the operator code. A simple change exists 
> that would be enough to enable anonymous SSL at the REST endpoint, but more 
> invasive changes would be required to enable full mTLS throughout.
> The simple change to enable anonymous SSL would be for the operator to parse 
> flink-conf and podTemplate to identify the Kubernetes resource that contains 
> the certificate from the job manager keystore and use it inside the 
> operator’s trust store.
> In the case of mutual TLS, further changes are required: the operator would 
> need to generate a certificate signed by the same issuing authority as the 
> job manager’s certificates and then use it in a keystore when challenged by 
> that job manager. We propose that the operator becomes responsible for making 
> CertificateSigningRequests to generate certificates for job manager, task 
> manager and operator. The operator can then coordinate deploying the job and 
> task managers with the correct flink-conf and volume mounts. This would also 
> work for anonymous SSL.
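For concreteness, the kind of SSL configuration that triggers the reconcile problem looks like this (a sketch using documented Flink options; the keystore/truststore paths and passwords are hypothetical mount points that exist in the job manager pod but not in the operator pod):

```yaml
# Sketch of flink-conf.yaml SSL settings (documented Flink options);
# the file paths and passwords below are hypothetical.
security.ssl.rest.enabled: true
security.ssl.rest.keystore: /certs/rest-keystore.jks
security.ssl.rest.keystore-password: changeit
security.ssl.rest.key-password: changeit
security.ssl.rest.truststore: /certs/rest-truststore.jks   # unresolvable from the operator pod
security.ssl.rest.truststore-password: changeit
security.ssl.internal.enabled: true
security.ssl.internal.keystore: /certs/internal-keystore.jks
security.ssl.internal.keystore-password: changeit
security.ssl.internal.key-password: changeit
security.ssl.internal.truststore: /certs/internal-truststore.jks
security.ssl.internal.truststore-password: changeit
```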





[jira] [Updated] (FLINK-31966) Flink Kubernetes operator lacks TLS support

2024-04-16 Thread Adrian Vasiliu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Vasiliu updated FLINK-31966:
---
Attachment: (was: image-2024-04-16-16-33-39-644.png)






[jira] [Comment Edited] (FLINK-22014) Flink JobManager failed to restart after failure in kubernetes HA setup

2021-11-28 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450147#comment-17450147
 ] 

Adrian Vasiliu edited comment on FLINK-22014 at 11/28/21, 10:50 PM:


[~trohrmann] [~mlushchytski] or anyone who knows:

In our company, we observe symptoms similar to those reported in this issue 
while using Flink 1.13.2: job managers in CrashLoopBackOff with similar 
errors in their logs. The storage is a ReadWriteMany PV using the rook-cephfs 
storage class.

The Flink version information from the job manager log:
Starting StandaloneSessionClusterEntrypoint (Version: 1.13.2, Scala: 2.11, 
Rev:5f007ff, Date:2021-07-23T04:35:55+02:00)

The issue occurs in a non-systematic manner, but we have observed it in at 
least 3 deployments on different OpenShift clusters. 
Reducing the number of Flink job managers from 3 (HA) to 1 (non-HA) avoids 
the CrashLoopBackOff.

For some reason, while the "Fix version(s)" field of this issue was 
previously assigned various values, it presently shows "None". 

Has this been observed by others with Flink 1.13.2? Should this issue be 
reopened, or should a new one be opened?


was (Author: JIRAUSER280892):
[~mlushchytski] or anyone knowing:

In our company, we observe symptoms similar to those reported in this issue 
while using Flink 1.13.2. That is, job managers in CrashLoopbackOff with 
similar errors in their logs. The storage is in a ReadWriteMany PV using 
rook-cephfs storage class.

The Flink version information from the log of job manager:
Starting StandaloneSessionClusterEntrypoint (Version: 1.13.2, Scala: 2.11, 
Rev:5f007ff, Date:2021-07-23T04:35:55+02:00

The issue happens in a non-systematic manner, but observed it in at least 3 
deployments on different OpenShift clusters. 
Reducing the number of Flink job managers from 3 (HA) to 1 (non-HA) avoids the 
CrashloopbackOff.

For some reason, while previously the "Fix version(s)" field of this issue has 
been assigned different values, it presently shows "None". 

Has this been observed by others with Flink 1.13.2? Should this issue be reopen 
or should a new issue be open?
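The manual workaround described in the issue (deleting the HA config map so the cluster can start fresh) can be sketched as shell commands; the ConfigMap name is taken from the log excerpt in the issue description, and the namespace is hypothetical:

```shell
# Hedged sketch of the manual recovery workaround described in this issue:
# delete the Kubernetes HA config map so the JobManager can start fresh.
# ConfigMap name from the log excerpt; namespace "flink" is hypothetical.
# WARNING: this discards HA recovery state; jobs must be resubmitted
# (ideally from a savepoint, if one exists).

kubectl -n flink get configmaps | grep dispatcher-leader
kubectl -n flink delete configmap stellar-flink-cluster-dispatcher-leader
kubectl -n flink delete pod -l component=jobmanager   # force a fresh start
```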

> Flink JobManager failed to restart after failure in kubernetes HA setup
> ---
>
> Key: FLINK-22014
> URL: https://issues.apache.org/jira/browse/FLINK-22014
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.11.3, 1.12.2, 1.13.0
>Reporter: Mikalai Lushchytski
>Priority: Major
>  Labels: k8s-ha, pull-request-available
> Attachments: flink-logs.txt.zip, image-2021-04-19-11-17-58-215.png, 
> scalyr-logs (1).txt
>
>
> After the JobManager pod failed and a new one started, it was not able to 
> recover jobs due to the absence of recovery data in storage - the config map 
> pointed at a non-existing file.
>   
>  Due to this, the JobManager pod entered the `CrashLoopBackOff` state and 
> was not able to recover - each attempt failed with the same error, so the 
> whole cluster became unrecoverable and non-operational.
>   
>  I had to manually delete the config map and start the jobs again without the 
> savepoint.
>   
>  When I tried to emulate the failure further by deleting the job manager pod 
> manually, the new pod recovered well every time and the issue was no longer 
> reproducible artificially.
>   
>  Below is the failure log:
> {code:java}
> 2021-03-26 08:22:57,925 INFO 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - 
> Starting the SlotManager.
>  2021-03-26 08:22:57,928 INFO 
> org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - 
> Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver
> {configMapName='stellar-flink-cluster-dispatcher-leader'}.
>  2021-03-26 08:22:57,931 INFO 
> org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Retrieved job 
> ids [198c46bac791e73ebcc565a550fa4ff6, 344f5ebc1b5c3a566b4b2837813e4940, 
> 96c4603a0822d10884f7fe536703d811, d9ded24224aab7c7041420b3efc1b6ba] from 
> KubernetesStateHandleStore{configMapName='stellar-flink-cluster-dispatcher-leader'}
> 2021-03-26 08:22:57,933 INFO 
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] 
> - Trying to recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
>  2021-03-26 08:22:58,029 INFO 
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] 
> - Stopping SessionDispatcherLeaderProcess.
>  2021-03-26 08:28:22,677 INFO 
> org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Stopping 
> DefaultJobGraphStore. 2021-03-26 08:28:22,681 ERROR 
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error 
> occurred in the cluster entrypoint. java.util.concurrent.

[jira] [Commented] (FLINK-22014) Flink JobManager failed to restart after failure in kubernetes HA setup

2021-11-28 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450147#comment-17450147
 ] 

Adrian Vasiliu commented on FLINK-22014:


[~mlushchytski] or anyone who knows:

In our company, we observe symptoms similar to those reported in this issue 
while using Flink 1.13.2: job managers in CrashLoopBackOff with similar 
errors in their logs. The storage is a ReadWriteMany PV using the rook-cephfs 
storage class.

The Flink version information from the job manager log:
Starting StandaloneSessionClusterEntrypoint (Version: 1.13.2, Scala: 2.11, 
Rev:5f007ff, Date:2021-07-23T04:35:55+02:00)

The issue occurs in a non-systematic manner, but we have observed it in at 
least 3 deployments on different OpenShift clusters. 
Reducing the number of Flink job managers from 3 (HA) to 1 (non-HA) avoids 
the CrashLoopBackOff.

For some reason, while the "Fix version(s)" field of this issue was 
previously assigned various values, it presently shows "None". 

Has this been observed by others with Flink 1.13.2? Should this issue be 
reopened, or should a new one be opened?

> Flink JobManager failed to restart after failure in kubernetes HA setup
> ---
>
> Key: FLINK-22014
> URL: https://issues.apache.org/jira/browse/FLINK-22014
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.11.3, 1.12.2, 1.13.0
>Reporter: Mikalai Lushchytski
>Priority: Major
>  Labels: k8s-ha, pull-request-available
> Attachments: flink-logs.txt.zip, image-2021-04-19-11-17-58-215.png, 
> scalyr-logs (1).txt
>
>
> After the JobManager pod failed and a new one started, it was not able to 
> recover jobs due to the absence of recovery data in storage - the config map 
> pointed at a non-existing file.
>   
>  Due to this, the JobManager pod entered the `CrashLoopBackOff` state and 
> was not able to recover - each attempt failed with the same error, so the 
> whole cluster became unrecoverable and non-operational.
>   
>  I had to manually delete the config map and start the jobs again without the 
> savepoint.
>   
>  When I tried to emulate the failure further by deleting the job manager pod 
> manually, the new pod recovered well every time and the issue was no longer 
> reproducible artificially.
>   
>  Below is the failure log:
> {code:java}
> 2021-03-26 08:22:57,925 INFO 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - 
> Starting the SlotManager.
>  2021-03-26 08:22:57,928 INFO 
> org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - 
> Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver
> {configMapName='stellar-flink-cluster-dispatcher-leader'}.
>  2021-03-26 08:22:57,931 INFO 
> org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Retrieved job 
> ids [198c46bac791e73ebcc565a550fa4ff6, 344f5ebc1b5c3a566b4b2837813e4940, 
> 96c4603a0822d10884f7fe536703d811, d9ded24224aab7c7041420b3efc1b6ba] from 
> KubernetesStateHandleStore{configMapName='stellar-flink-cluster-dispatcher-leader'}
> 2021-03-26 08:22:57,933 INFO 
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] 
> - Trying to recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
>  2021-03-26 08:22:58,029 INFO 
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] 
> - Stopping SessionDispatcherLeaderProcess.
>  2021-03-26 08:28:22,677 INFO 
> org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Stopping 
> DefaultJobGraphStore. 2021-03-26 08:28:22,681 ERROR 
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error 
> occurred in the cluster entrypoint. java.util.concurrent.CompletionException: 
> org.apache.flink.util.FlinkRuntimeException: Could not recover job with job 
> id 198c46bac791e73ebcc565a550fa4ff6.
>at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) 
> ~[?:?]
>at java.util.concurrent.CompletableFuture.completeThrowable(Unknown 
> Source) [?:?]
>at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) 
> [?:?]
>at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
>at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
>at java.lang.Thread.run(Unknown Source) [?:?] Caused by: 
> org.apache.flink.util.FlinkRuntimeException: Could not recover job with job 
> id 198c46bac791e73ebcc565a550fa4ff6.
>at 
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:144
>  undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>at 
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDisp

[jira] [Comment Edited] (FLINK-22014) Flink JobManager failed to restart after failure in kubernetes HA setup

2021-11-28 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450147#comment-17450147
 ] 

Adrian Vasiliu edited comment on FLINK-22014 at 11/28/21, 10:51 PM:


[~trohrmann] [~mlushchytski] or anyone who knows:

In our company, we observe symptoms similar to those reported in this issue 
while using Flink 1.13.2: job managers in CrashLoopBackOff with similar 
errors in their logs. The storage is a ReadWriteMany PV using the rook-cephfs 
storage class.

The Flink version information from the job manager log:
Starting StandaloneSessionClusterEntrypoint (Version: 1.13.2, Scala: 2.11, 
Rev:5f007ff, Date:2021-07-23T04:35:55+02:00)

The issue occurs in a non-systematic manner, but we have observed it in at 
least 3 deployments on different OpenShift clusters. 
Reducing the number of Flink job managers from 3 (HA) to 1 (non-HA) avoids 
the CrashLoopBackOff.

For some reason, while the "Fix version(s)" field of this issue was 
previously assigned various values, it presently shows "None". 

Has this been observed by others with Flink 1.13.2? Should this issue be 
reopened, or should a new one be opened?


was (Author: JIRAUSER280892):
[~trohrmann] [~mlushchytski] or anyone knowing:

In our company, we observe symptoms similar to those reported in this issue 
while using Flink 1.13.2. That is, job managers in CrashLoopbackOff with 
similar errors in their logs. The storage is in a ReadWriteMany PV using 
rook-cephfs storage class.

The Flink version information from the log of job manager:
Starting StandaloneSessionClusterEntrypoint (Version: 1.13.2, Scala: 2.11, 
Rev:5f007ff, Date:2021-07-23T04:35:55+02:00

The issue happens in a non-systematic manner, but observed it in at least 3 
deployments on different OpenShift clusters. 
Reducing the number of Flink job managers from 3 (HA) to 1 (non-HA) avoids the 
CrashloopbackOff.

For some reason, while previously the "Fix version(s)" field of this issue has 
been assigned different values, it presently shows "None". 

Has this been observed by others with Flink 1.13.2? Should this issue be reopen 
or should a new issue be open?


[jira] [Comment Edited] (FLINK-22014) Flink JobManager failed to restart after failure in kubernetes HA setup

2021-11-28 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450147#comment-17450147
 ] 

Adrian Vasiliu edited comment on FLINK-22014 at 11/28/21, 10:52 PM:


Hello [~trohrmann] [~mlushchytski], or anyone who knows:

In our company, we observe symptoms similar to those reported in this issue 
while using Flink 1.13.2: job managers in CrashLoopBackOff with similar 
errors in their logs. The storage is a ReadWriteMany PV using the rook-cephfs 
storage class.

The Flink version information from the job manager log:
Starting StandaloneSessionClusterEntrypoint (Version: 1.13.2, Scala: 2.11, 
Rev:5f007ff, Date:2021-07-23T04:35:55+02:00)

The issue occurs in a non-systematic manner, but we have observed it in at 
least 3 deployments on different OpenShift clusters. 
Reducing the number of Flink job managers from 3 (HA) to 1 (non-HA) avoids 
the CrashLoopBackOff.

For some reason, while the "Fix version(s)" field of this issue was 
previously assigned various values, it presently shows "None". 

Has this been observed by others with Flink 1.13.2? Should this issue be 
reopened, or should a new one be opened?


was (Author: JIRAUSER280892):
[~trohrmann] [~mlushchytski] or anyone knowing:

In our company, we observe symptoms similar to those reported in this issue 
while using Flink 1.13.2. That is, job managers in CrashLoopbackOff with 
similar errors in their logs. The storage is in a ReadWriteMany PV using 
rook-cephfs storage class.

The Flink version information from the log of job manager:
Starting StandaloneSessionClusterEntrypoint (Version: 1.13.2, Scala: 2.11, 
Rev:5f007ff, Date:2021-07-23T04:35:55+02:00)

The issue happens in a non-systematic manner, but observed it in at least 3 
deployments on different OpenShift clusters. 
Reducing the number of Flink job managers from 3 (HA) to 1 (non-HA) avoids the 
CrashloopbackOff.

For some reason, while previously the "Fix version(s)" field of this issue has 
been assigned different values, it presently shows "None". 

Has this been observed by others with Flink 1.13.2? Should this issue be reopen 
or should a new issue be open?


[jira] [Comment Edited] (FLINK-22014) Flink JobManager failed to restart after failure in kubernetes HA setup

2021-11-28 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450147#comment-17450147
 ] 

Adrian Vasiliu edited comment on FLINK-22014 at 11/28/21, 10:53 PM:


Hello [~trohrmann] [~mlushchytski], or anyone who knows:

In our company, we observe symptoms similar to those reported in this issue 
while using Flink 1.13.2: job managers in CrashLoopBackOff with similar 
errors in their logs. The storage is a ReadWriteMany PV using the rook-cephfs 
storage class.

The Flink version information from the job manager log:
Starting StandaloneSessionClusterEntrypoint (Version: 1.13.2, Scala: 2.11, 
Rev:5f007ff, Date:2021-07-23T04:35:55+02:00)

The issue occurs in a non-systematic manner, but we have observed it in at 
least 3 deployments on different OpenShift clusters. 
Reducing the number of Flink job managers from 3 (HA) to 1 (non-HA) avoids 
the CrashLoopBackOff.

For some reason, while the "Fix version(s)" field of this issue was 
previously assigned various values, it presently shows "None". 

Has this been observed by others with Flink 1.13.2? Should this issue be 
reopened, or should a new one be opened?



> Flink JobManager failed to restart after failure in kubernetes HA setup
> ---
>
> Key: FLINK-22014
> URL: https://issues.apache.org/jira/browse/FLINK-22014
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.11.3, 1.12.2, 1.13.0
>Reporter: Mikalai Lushchytski
>Priority: Major
>  Labels: k8s-ha, pull-request-available
> Attachments: flink-logs.txt.zip, image-2021-04-19-11-17-58-215.png, 
> scalyr-logs (1).txt
>
>
> After the JobManager pod failed and the new one started, it was not able to 
> recover jobs due to the absence of recovery data in storage - the config map 
> pointed at a non-existing file.
>   
>  Due to this the JobManager pod entered the `CrashLoopBackOff` state and 
> was not able to recover - each attempt failed with the same error, so the 
> whole cluster became unrecoverable and stopped operating.
>   
>  I had to manually delete the config map and start the jobs again without the 
> save point.
>   
>  If I tried to emulate the failure further by deleting the job manager pod 
> manually, the new pod recovered well every time and the issue was no longer 
> reproducible artificially.
>   
>  Below is the failure log:
> {code:java}
> 2021-03-26 08:22:57,925 INFO 
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - 
> Starting the SlotManager.
>  2021-03-26 08:22:57,928 INFO 
> org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - 
> Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver
> {configMapName='stellar-flink-cluster-dispatcher-leader'}.
>  2021-03-26 08:22:57,931 INFO 
> org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Retrieved job 
> ids [198c46bac791e73ebcc565a550fa4ff6, 344f5ebc1b5c3a566b4b2837813e4940, 
> 96c4603a0822d10884f7fe536703d811, d9ded24224aab7c7041420b3efc1b6ba] from 
> KubernetesStateHandleStore{configMapName='stellar-flink-cluster-dispatcher-leader'}
> 2021-03-26 08:22:57,933 INFO 
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] 
> - Trying to recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
>  2021-03-26 08:22:58,029 INFO 
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] 
> - Stopping SessionDispatcherLeaderProcess.
>  2021-03-26 08:28:22,677 INFO 
> org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Stopping 
> DefaultJobGraphStore. 2021-03-26 08:28:22,681 ERROR 
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error 
> occurred in the cluster entrypoint.
> {code}
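
The failure mode quoted above (an HA ConfigMap entry pointing at a state-handle file that no longer exists) can be illustrated with a small sketch. Note that the key names and directory layout below are simplified assumptions for illustration, not Flink's exact on-disk format:

```python
from pathlib import Path
import tempfile

def find_stale_job_graphs(config_map_data: dict) -> list:
    """Return the jobGraph-* keys whose referenced state-handle file is missing."""
    return [
        key
        for key, path in config_map_data.items()
        if key.startswith("jobGraph-") and not Path(path).exists()
    ]

# Simulate the reported situation: two entries, one pointing at a deleted file.
with tempfile.TemporaryDirectory() as ha_dir:
    ok_handle = Path(ha_dir) / "submittedJobGraph-ok"
    ok_handle.write_text("serialized job graph")             # this handle exists
    missing_handle = Path(ha_dir) / "submittedJobGraph-gone" # never written

    data = {
        "jobGraph-198c46bac791e73ebcc565a550fa4ff6": str(missing_handle),
        "jobGraph-344f5ebc1b5c3a566b4b2837813e4940": str(ok_handle),
    }
    stale = find_stale_job_graphs(data)
    # Recovery fails for exactly the entry whose backing file disappeared.
    print(stale)  # ['jobGraph-198c46bac791e73ebcc565a550fa4ff6']
```

This mirrors why "cleaning the state handle store" (here, deleting the stale ConfigMap entries) lets the cluster start again, at the cost of losing those jobs.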

[jira] [Comment Edited] (FLINK-22014) Flink JobManager failed to restart after failure in kubernetes HA setup

2021-11-28 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450147#comment-17450147
 ] 

Adrian Vasiliu edited comment on FLINK-22014 at 11/28/21, 10:54 PM:


Hello [~trohrmann] [~mlushchytski], or anyone who knows:

In our company, we observe symptoms similar to those reported in this issue 
while using Flink 1.13.2: job managers in CrashLoopBackOff with similar errors 
in their logs. The storage is a ReadWriteMany PV using the rook-cephfs storage 
class.

The Flink version information from the job manager log:
Starting StandaloneSessionClusterEntrypoint (Version: 1.13.2, Scala: 2.11, 
Rev:5f007ff, Date:2021-07-23T04:35:55+02:00)

The issue happens in a non-systematic manner, but we have observed it in at 
least 3 deployments on different OpenShift clusters. 
Reducing the number of Flink job managers from 3 (HA) to 1 (non-HA) avoids the 
CrashLoopBackOff.

For some reason, while the "Fix version(s)" field of this issue has previously 
been assigned different values, it presently shows "None". 

Has this been observed by others with Flink 1.13.2? Should this issue be 
reopened or should we open a new issue?




[jira] [Commented] (FLINK-22014) Flink JobManager failed to restart after failure in kubernetes HA setup

2021-11-29 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450640#comment-17450640
 ] 

Adrian Vasiliu commented on FLINK-22014:


[~trohrmann] OK, thanks. I'll open a new issue with the job manager logs. We 
reproduced it with Flink 1.13.2 and Flink 1.13.3; we have not yet tried 
Flink 1.14.


[jira] [Created] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

2021-11-29 Thread Adrian Vasiliu (Jira)
Adrian Vasiliu created FLINK-25098:
--

 Summary: Jobmanager CrashLoopBackOff in HA configuration
 Key: FLINK-25098
 URL: https://issues.apache.org/jira/browse/FLINK-25098
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Affects Versions: 1.13.3, 1.13.2
 Environment: Reproduced with:
* Persistent jobs storage provided by the rocks-cephfs storage class.
* OpenShift 4.9.5.
Reporter: Adrian Vasiliu


In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), 
turning on Flink HA with 3 replicas of the jobmanager leads to 
CrashLoopBackOff for all replicas.

Attaching the full logs of the `jobmanager` and `tls-proxy` containers of the 
jobmanager pod:
[^jm-flink-ha-jobmanager-log.txt]
[^jm-flink-ha-tls-proxy-log.txt]

Remarks:
* This is a follow-up of 
https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
 
* Picked Critical severity as HA is critical for our product.
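
For reference, the kind of 3-replica jobmanager HA setup described above is typically enabled through a few flink-conf.yaml entries along the following lines. The cluster id and storage path below are placeholders for illustration, not values taken from this deployment:

```yaml
# Sketch of Flink 1.13 Kubernetes HA settings (placeholder values).
kubernetes.cluster-id: my-flink-cluster
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
# Must point at storage shared by all jobmanager replicas,
# e.g. the ReadWriteMany PV mentioned above.
high-availability.storageDir: file:///mnt/flink-ha
```

With this in place, the leading jobmanager writes job graphs and checkpoint metadata under the storage directory and records pointers to them in ConfigMaps (such as the `<cluster-id>-dispatcher-leader` ConfigMap seen in the logs); the CrashLoopBackOff reported here occurs when such a pointer refers to a file that is no longer present.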



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

2021-11-29 Thread Adrian Vasiliu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Vasiliu updated FLINK-25098:
---
Attachment: jm-flink-ha-jobmanager-log.txt

> Jobmanager CrashLoopBackOff in HA configuration
> ---
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>Reporter: Adrian Vasiliu
>Priority: Critical
> Attachments: jm-flink-ha-jobmanager-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 
> 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to 
> CrashLoopBackoff for all replicas.
> Attaching the full logs of the `jobmanager` and tls-proxy` containers of 
> jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Remarks:
> * This is a follow-up of 
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>  
> * Picked Critical severity as HA is critical for our product.





[jira] [Commented] (FLINK-22014) Flink JobManager failed to restart after failure in kubernetes HA setup

2021-11-29 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450716#comment-17450716
 ] 

Adrian Vasiliu commented on FLINK-22014:


Opened https://issues.apache.org/jira/browse/FLINK-25098 .


[jira] [Updated] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

2021-11-29 Thread Adrian Vasiliu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Vasiliu updated FLINK-25098:
---
Description: 
In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), 
turning to Flink HA by using 3 replicas of the jobmanager leads to 
CrashLoopBackoff for all replicas.

Attaching the full logs of the `jobmanager` and tls-proxy` containers of 
jobmanager pod:
[^jm-flink-ha-jobmanager-log.txt]
[^jm-flink-ha-tls-proxy-log.txt]

Remarks:
 * This is a follow-up of 
https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
 
 * Picked Critical severity as HA is critical for our product.

  was:
In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), 
turning to Flink HA by using 3 replicas of the jobmanager leads to 
CrashLoopBackoff for all replicas.

Attaching the full logs of the `jobmanager` and tls-proxy` containers of 
jobmanager pod:
[^jm-flink-ha-jobmanager-log.txt]
[^jm-flink-ha-tls-proxy-log.txt]

Remarks:
* This is a follow-up of 
https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
 
* Picked Critical severity as HA is critical for our product.







[jira] [Updated] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

2021-11-29 Thread Adrian Vasiliu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Vasiliu updated FLINK-25098:
---
Attachment: jm-flink-ha-tls-proxy-log.txt

> Jobmanager CrashLoopBackOff in HA configuration
> ---
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>Reporter: Adrian Vasiliu
>Priority: Critical
> Attachments: jm-flink-ha-jobmanager-log.txt, 
> jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 
> 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to 
> CrashLoopBackoff for all replicas.
> Attaching the full logs of the `jobmanager` and tls-proxy` containers of 
> jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Remarks:
> * This is a follow-up of 
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>  
> * Picked Critical severity as HA is critical for our product.





[jira] [Updated] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

2021-11-29 Thread Adrian Vasiliu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Vasiliu updated FLINK-25098:
---
Description: 
In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), 
turning to Flink HA by using 3 replicas of the jobmanager leads to 
CrashLoopBackoff for all replicas.

Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
jobmanager pod:
[^jm-flink-ha-jobmanager-log.txt]
[^jm-flink-ha-tls-proxy-log.txt]



Remarks:
 * This is a follow-up of 
https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
 
 * Picked Critical severity as HA is critical for our product.

  was:
In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), 
turning to Flink HA by using 3 replicas of the jobmanager leads to 
CrashLoopBackoff for all replicas.

Attaching the full logs of the `jobmanager` and tls-proxy` containers of 
jobmanager pod:
[^jm-flink-ha-jobmanager-log.txt]
[^jm-flink-ha-tls-proxy-log.txt]

Remarks:
 * This is a follow-up of 
https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
 
 * Picked Critical severity as HA is critical for our product.







[jira] [Updated] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

2021-11-29 Thread Adrian Vasiliu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Vasiliu updated FLINK-25098:
---
Description: 
In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), 
turning to Flink HA by using 3 replicas of the jobmanager leads to 
CrashLoopBackoff for all replicas.

Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
jobmanager pod:
[^jm-flink-ha-jobmanager-log.txt]
[^jm-flink-ha-tls-proxy-log.txt]

Reproduced with:
 * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
(shared by all replicas - ReadWriteMany).
 * OpenShift 4.9.5.

Remarks:
 * This is a follow-up of 
https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
 
 * Picked Critical severity as HA is critical for our product.

  was:
In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), 
turning to Flink HA by using 3 replicas of the jobmanager leads to 
CrashLoopBackoff for all replicas.

Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
jobmanager pod:
[^jm-flink-ha-jobmanager-log.txt]
[^jm-flink-ha-tls-proxy-log.txt]



Remarks:
 * This is a follow-up of 
https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
 
 * Picked Critical severity as HA is critical for our product.







[jira] [Updated] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

2021-11-29 Thread Adrian Vasiliu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Vasiliu updated FLINK-25098:
---
Description: 
In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), 
turning to Flink HA by using 3 replicas of the jobmanager leads to 
CrashLoopBackoff for all replicas.

Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
jobmanager pod:
[^jm-flink-ha-jobmanager-log.txt]
[^jm-flink-ha-tls-proxy-log.txt]

Reproduced with:
 * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
(shared by all replicas - ReadWriteMany).
 * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not a 
"one-shot" trouble.

Remarks:
 * This is a follow-up of 
https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
 
 * Picked Critical severity as HA is critical for our product.

  was:
In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), 
turning to Flink HA by using 3 replicas of the jobmanager leads to 
CrashLoopBackoff for all replicas.

Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
jobmanager pod:
[^jm-flink-ha-jobmanager-log.txt]
[^jm-flink-ha-tls-proxy-log.txt]

Reproduced with:
 * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
(shared by all replicas - ReadWriteMany).
 * OpenShift 4.9.5.

Remarks:
 * This is a follow-up of 
https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
 
 * Picked Critical severity as HA is critical for our product.


> Jobmanager CrashLoopBackOff in HA configuration
> ---
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>Reporter: Adrian Vasiliu
>Priority: Critical
> Attachments: jm-flink-ha-jobmanager-log.txt, 
> jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 
> 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to 
> CrashLoopBackoff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
> jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
>  * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
> (shared by all replicas - ReadWriteMany).
>  * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not 
> a "one-shot" trouble.
> Remarks:
>  * This is a follow-up of 
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>  
>  * Picked Critical severity as HA is critical for our product.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

2021-11-29 Thread Adrian Vasiliu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Vasiliu updated FLINK-25098:
---
Description: 
In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), 
turning to Flink HA by using 3 replicas of the jobmanager leads to 
CrashLoopBackoff for all replicas.

Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
jobmanager pod:
[^jm-flink-ha-jobmanager-log.txt]
[^jm-flink-ha-tls-proxy-log.txt]

Reproduced with:
 * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
(shared by all replicas - ReadWriteMany) and mount path set via 
{{high-availability.storageDir: file///}}.
 * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not a 
"one-shot" trouble.

Remarks:
 * This is a follow-up of 
https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
 
 * Picked Critical severity as HA is critical for our product.

  was:
In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), 
turning to Flink HA by using 3 replicas of the jobmanager leads to 
CrashLoopBackoff for all replicas.

Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
jobmanager pod:
[^jm-flink-ha-jobmanager-log.txt]
[^jm-flink-ha-tls-proxy-log.txt]

Reproduced with:
 * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
(shared by all replicas - ReadWriteMany).
 * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not a 
"one-shot" trouble.

Remarks:
 * This is a follow-up of 
https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
 
 * Picked Critical severity as HA is critical for our product.


> Jobmanager CrashLoopBackOff in HA configuration
> ---
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>Reporter: Adrian Vasiliu
>Priority: Critical
> Attachments: jm-flink-ha-jobmanager-log.txt, 
> jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 
> 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to 
> CrashLoopBackoff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
> jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
>  * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
> (shared by all replicas - ReadWriteMany) and mount path set via 
> {{high-availability.storageDir: file///}}.
>  * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not 
> a "one-shot" trouble.
> Remarks:
>  * This is a follow-up of 
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>  
>  * Picked Critical severity as HA is critical for our product.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

2021-11-29 Thread Adrian Vasiliu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Vasiliu updated FLINK-25098:
---
Description: 
In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), 
turning to Flink HA by using 3 replicas of the jobmanager leads to 
CrashLoopBackoff for all replicas.

Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
jobmanager pod:
[^jm-flink-ha-jobmanager-log.txt]
[^jm-flink-ha-tls-proxy-log.txt]

Reproduced with:
 * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
(shared by all replicas - ReadWriteMany) and mount path set via 
{{high-availability.storageDir: file///}}.
 * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not a 
"one-shot" trouble.

Remarks:
 * This is a follow-up of 
https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
 
 * Picked Critical severity as HA is critical for our product.

  was:
In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), 
turning to Flink HA by using 3 replicas of the jobmanager leads to 
CrashLoopBackoff for all replicas.

Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
jobmanager pod:
[^jm-flink-ha-jobmanager-log.txt]
[^jm-flink-ha-tls-proxy-log.txt]

Reproduced with:
 * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
(shared by all replicas - ReadWriteMany) and mount path set via 
{{high-availability.storageDir: file///}}.
 * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not a 
"one-shot" trouble.

Remarks:
 * This is a follow-up of 
https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
 
 * Picked Critical severity as HA is critical for our product.


> Jobmanager CrashLoopBackOff in HA configuration
> ---
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>Reporter: Adrian Vasiliu
>Priority: Critical
> Attachments: jm-flink-ha-jobmanager-log.txt, 
> jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 
> 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to 
> CrashLoopBackoff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
> jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
>  * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
> (shared by all replicas - ReadWriteMany) and mount path set via 
> {{high-availability.storageDir: file///}}.
>  * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not 
> a "one-shot" trouble.
> Remarks:
>  * This is a follow-up of 
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>  
>  * Picked Critical severity as HA is critical for our product.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

2021-11-29 Thread Adrian Vasiliu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Vasiliu updated FLINK-25098:
---
Description: 
In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), 
turning to Flink HA by using 3 replicas of the jobmanager leads to 
CrashLoopBackoff for all replicas.

Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
jobmanager pod:
[^jm-flink-ha-jobmanager-log.txt]
[^jm-flink-ha-tls-proxy-log.txt]

Reproduced with:
 * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
(shared by all replicas - ReadWriteMany) and mount path set via 
{{high-availability.storageDir: file///}}.
 * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not a 
"one-shot" trouble.

Remarks:
 * This is a follow-up of 
https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
 
 * Picked Critical severity as HA is critical for our product.

  was:
In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), 
turning to Flink HA by using 3 replicas of the jobmanager leads to 
CrashLoopBackoff for all replicas.

Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
jobmanager pod:
[^jm-flink-ha-jobmanager-log.txt]
[^jm-flink-ha-tls-proxy-log.txt]

Reproduced with:
 * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
(shared by all replicas - ReadWriteMany) and mount path set via 
{{high-availability.storageDir: file///}}.
 * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not a 
"one-shot" trouble.

Remarks:
 * This is a follow-up of 
https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
 
 * Picked Critical severity as HA is critical for our product.


> Jobmanager CrashLoopBackOff in HA configuration
> ---
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>Reporter: Adrian Vasiliu
>Priority: Critical
> Attachments: jm-flink-ha-jobmanager-log.txt, 
> jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 
> 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to 
> CrashLoopBackoff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
> jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
>  * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
> (shared by all replicas - ReadWriteMany) and mount path set via 
> {{high-availability.storageDir: file///}}.
>  * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not 
> a "one-shot" trouble.
> Remarks:
>  * This is a follow-up of 
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>  
>  * Picked Critical severity as HA is critical for our product.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

2021-11-30 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451072#comment-17451072
 ] 

Adrian Vasiliu commented on FLINK-25098:


Yes, since then we also identified the K8S configmaps as being the cause. 
The scenario is:
1. Flink cluster deployed and receiving a Flink job. All good.
2. Uninstall - all K8S objects go away, except the Flink configmaps. 
3. Reinstall => crashloopbackoff.

I hear now from colleagues that the issue with Flink CMs being left behind at 
uninstall time has already been raised on the user list. 
Your take on it?

> Jobmanager CrashLoopBackOff in HA configuration
> ---
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>Reporter: Adrian Vasiliu
>Priority: Critical
> Attachments: jm-flink-ha-jobmanager-log.txt, 
> jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 
> 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to 
> CrashLoopBackoff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
> jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
>  * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
> (shared by all replicas - ReadWriteMany) and mount path set via 
> {{high-availability.storageDir: file///}}.
>  * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not 
> a "one-shot" trouble.
> Remarks:
>  * This is a follow-up of 
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>  
>  * Picked Critical severity as HA is critical for our product.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

2021-11-30 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451072#comment-17451072
 ] 

Adrian Vasiliu edited comment on FLINK-25098 at 11/30/21, 11:36 AM:


Yes, since then we also identified the K8S configmaps as being the cause. 
The scenario is:
1. Flink cluster deployed and receiving a Flink job. All good.
2. Uninstall - all K8S objects go away, except the Flink configmaps. 
3. Reinstall => crashloopbackoff.

I hear now from colleagues that the issue with Flink CMs being left behind at 
uninstall time has already been raised, see 
https://lists.apache.org/thread/ml9dp9jqytnn303wypqoor7b32o1y32y. 
Your take on it?


was (Author: JIRAUSER280892):
Yes, since then we also identified the K8S configmaps as being the cause. 
The scenario is:
1. Flink cluster deployed and receiving a Flink job. All good.
2. Uninstall - all K8S objects go away, except the Flink configmaps. 
3. Reinstall => crashloopbackoff.

I hear now from colleagues that the issue with Flink CMs being left behind at 
uninstall time has already been raised on the user list. 
Your take on it?

> Jobmanager CrashLoopBackOff in HA configuration
> ---
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>Reporter: Adrian Vasiliu
>Priority: Critical
> Attachments: jm-flink-ha-jobmanager-log.txt, 
> jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 
> 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to 
> CrashLoopBackoff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
> jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
>  * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
> (shared by all replicas - ReadWriteMany) and mount path set via 
> {{high-availability.storageDir: file///}}.
>  * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not 
> a "one-shot" trouble.
> Remarks:
>  * This is a follow-up of 
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>  
>  * Picked Critical severity as HA is critical for our product.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

2021-11-30 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451072#comment-17451072
 ] 

Adrian Vasiliu edited comment on FLINK-25098 at 11/30/21, 11:37 AM:


Yes, since then we also identified the K8S configmaps as being the cause. 
The scenario is:
1. Flink cluster deployed and receiving a Flink job. All good.
2. Uninstall - all K8S objects go away, except the Flink configmaps. 
3. Reinstall => crashloopbackoff.

Manually deleting the configmaps before reinstalling fixes the issue. But we do 
need an automatic cleanup...

I hear now from colleagues that the issue with Flink CMs being left behind at 
uninstall time has already been raised, see 
[https://lists.apache.org/thread/ml9dp9jqytnn303wypqoor7b32o1y32y]. 
Your take on it?


was (Author: JIRAUSER280892):
Yes, since then we also identified the K8S configmaps as being the cause. 
The scenario is:
1. Flink cluster deployed and receiving a Flink job. All good.
2. Uninstall - all K8S objects go away, except the Flink configmaps. 
3. Reinstall => crashloopbackoff.

I hear now from colleagues that the issue with Flink CMs being left behind at 
uninstall time has already been raised, see 
https://lists.apache.org/thread/ml9dp9jqytnn303wypqoor7b32o1y32y. 
Your take on it?

> Jobmanager CrashLoopBackOff in HA configuration
> ---
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>Reporter: Adrian Vasiliu
>Priority: Critical
> Attachments: jm-flink-ha-jobmanager-log.txt, 
> jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 
> 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to 
> CrashLoopBackoff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
> jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
>  * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
> (shared by all replicas - ReadWriteMany) and mount path set via 
> {{high-availability.storageDir: file///}}.
>  * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not 
> a "one-shot" trouble.
> Remarks:
>  * This is a follow-up of 
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>  
>  * Picked Critical severity as HA is critical for our product.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

2021-12-01 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451670#comment-17451670
 ] 

Adrian Vasiliu commented on FLINK-25098:


[~trohrmann] 
> How exactly are you tearing down the initial cluster?

AFAIK we just rely on Flink's own teardown when the removal of the K8S 
deployment is triggered by the removal of the Custom Resource.

> When tearing down the initial cluster, are you also deleting the PVC or the 
> PV?

Not explicitly, but Kubernetes does remove both the PVC and the PV (when 
listing the PV previously bound to the deleted PVC, we see it no longer exists, 
so it couldn't be reused by the new PVC after redeployment).
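For reference, whether the PV survives the PVC deletion is governed by its reclaim policy; a hedged sketch (the PV name `flink-ha-pv` is a placeholder, and the patch command is only printed, not executed):

```shell
#!/bin/sh
# Sketch: a PV bound to a deleted PVC is removed when its reclaim policy is
# "Delete" (the default for dynamically provisioned volumes); "Retain" keeps
# the volume for manual reuse. "flink-ha-pv" is a placeholder name.
PV_NAME="flink-ha-pv"
PATCH='{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
# Printed rather than executed, so the sketch runs without a cluster:
echo "kubectl patch pv ${PV_NAME} -p '${PATCH}'"
```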

> Jobmanager CrashLoopBackOff in HA configuration
> ---
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>Reporter: Adrian Vasiliu
>Priority: Critical
> Attachments: jm-flink-ha-jobmanager-log.txt, 
> jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 
> 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to 
> CrashLoopBackoff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
> jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
>  * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
> (shared by all replicas - ReadWriteMany) and mount path set via 
> {{high-availability.storageDir: file///}}.
>  * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not 
> a "one-shot" trouble.
> Remarks:
>  * This is a follow-up of 
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>  
>  * Picked Critical severity as HA is critical for our product.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

2021-12-01 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451893#comment-17451893
 ] 

Adrian Vasiliu commented on FLINK-25098:


[~trohrmann] Again, we are not killing any process with our code. The use-case 
is:
1. Flink gets deployed in Kubernetes.
2. The user decides to uninstall (then, possibly, reinstall). For that, the K8S 
way is to delete the K8S custom resource which deployed Flink.
=> Flink configmaps remain (which, as you point out, is intentional).

Thanks for the doc pointer. 

> The problem is that you are using storage that is not persistent as Flink 
> would need it to be.

Now,  
[https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/ha/kubernetes_ha/#high-availability-data-clean-up]
 says:

"To keep HA data while restarting the Flink cluster, simply delete the 
deployment (via {{kubectl delete deployment}}). All the Flink 
cluster related resources will be deleted (e.g. JobManager Deployment, 
TaskManager pods, services, Flink conf ConfigMap). HA related ConfigMaps will 
be retained because they do not set the owner reference. When restarting the 
cluster, all previously running jobs will be recovered and restarted from the 
latest successful checkpoint."

I would think there are two distinct use-cases for uninstallation:

1. The user wants to uninstall, then reinstall while preserving data from the 
previous install. In this case, per Flink constraint, if persistent storage is 
enabled, the PV holding it MUST not be removed, otherwise Flink will break at 
reinstall (as reported here).
2. The user wants a full uninstall, no data left behind, including the 
persistent volume. Then he may decide to reinstall from scratch.

From your description and from the doc, it looks to me that Flink HA supports 
the first use-case well, but not so much the latter. Do I get that right?

I would think there should be a way to configure Flink HA to tell whether we 
want it to do a full cleanup at uninstallation, or not. That's because a 
typical requirement for uninstalls in enterprise env. is to have nothing left 
behind...

> Jobmanager CrashLoopBackOff in HA configuration
> ---
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>Reporter: Adrian Vasiliu
>Priority: Critical
> Attachments: 
> iaf-insights-engine--7fc4-eve-29ee-ep-jobmanager-1-jobmanager.log, 
> jm-flink-ha-jobmanager-log.txt, jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 
> 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to 
> CrashLoopBackoff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of 
> jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
>  * Persistent jobs storage provided by the {{rocks-cephfs}} storage class 
> (shared by all replicas - ReadWriteMany) and mount path set via 
> {{high-availability.storageDir: file///}}.
>  * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not 
> a "one-shot" trouble.
> Remarks:
>  * This is a follow-up of 
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>  
>  * Picked Critical severity as HA is critical for our product.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

2021-12-01 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451893#comment-17451893
 ] 

Adrian Vasiliu edited comment on FLINK-25098 at 12/1/21, 3:43 PM:
--

[~trohrmann] Again, we are not killing any process with our code. The use-case 
is:
1. Flink gets deployed in Kubernetes.
2. The user decides to uninstall (then, possibly, reinstall). For that, the K8S 
way is to delete the K8S custom resource which deployed Flink.
=> Flink configmaps remain (which, as you point out, is intentional).

Thanks for the doc pointer.

> The problem is that you are using storage that is not persistent as Flink 
> would need it to be.

[https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/ha/kubernetes_ha/#high-availability-data-clean-up]
 says:

"To keep HA data while restarting the Flink cluster, simply delete the 
deployment (via {{kubectl delete deployment}}). All the Flink 
cluster related resources will be deleted (e.g. JobManager Deployment, 
TaskManager pods, services, Flink conf ConfigMap). HA related ConfigMaps will 
be retained because they do not set the owner reference. When restarting the 
cluster, all previously running jobs will be recovered and restarted from the 
latest successful checkpoint."

I would think there are two distinct use-cases for uninstallation:

1. The user wants to uninstall, then reinstall while preserving data from the 
previous install. In this case, per Flink constraint, if persistent storage is 
enabled, the PV holding it MUST not be removed, otherwise Flink will break at 
reinstall (as reported here).
2. The user wants a full uninstall, no data left behind, including the 
persistent volume. Then he may decide to reinstall from scratch.

From your description and from the doc, it looks to me that Flink HA supports 
the first use-case well, but not so much the latter. Do I get that right?

I would think there should be a way to configure Flink HA to tell whether we 
want it to do a full cleanup at uninstallation, or not. That's because a 
typical requirement for uninstalls in enterprise env. is to have nothing left 
behind, including the deletion of persistent storage... If a user needs the 
persistent storage to be kept, it does that through configuration of the 
persistent volume claim / persistent volume, but that's optional.
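The two uninstall paths above can be sketched as shell one-liners (hedged: the cluster-id and PVC name are placeholders, and the commands are only printed here since they need a live cluster):

```shell
#!/bin/sh
# Sketch of the two uninstall paths; all names are placeholders.
CLUSTER_ID="my-flink-cluster"
# Path 1 - keep HA data for a later reinstall: delete only the deployment;
# HA ConfigMaps and the storageDir contents are retained (per the Flink docs).
KEEP_DATA="kubectl delete deployment ${CLUSTER_ID}"
# Path 2 - full uninstall, nothing left behind: also remove the HA ConfigMaps
# (via the label selector from the Flink docs) and the persistent volume claim.
FULL_CLEANUP="kubectl delete configmaps --selector=app=${CLUSTER_ID},configmap-type=high-availability; kubectl delete pvc flink-ha-pvc"
echo "${KEEP_DATA}"
echo "${FULL_CLEANUP}"
```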


was (Author: JIRAUSER280892):
[~trohrmann] Again, we are not killing any process with our code. The use-case 
is:
1. Flink gets deployed in Kubernetes.
2. The user decides to uninstall (then, possibly, reinstall). For that, the K8S 
way is to delete the K8S custom resource which deployed Flink.
=> Flink configmaps remain (which, as you point out, is intentional).

Thanks for the doc pointer. 

> The problem is that you are using storage that is not persistent as Flink 
> would need it to be.

Now,  
[https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/ha/kubernetes_ha/#high-availability-data-clean-up]
 says:

"To keep HA data while restarting the Flink cluster, simply delete the 
deployment (via {{kubectl delete deployment}}). All the Flink 
cluster related resources will be deleted (e.g. JobManager Deployment, 
TaskManager pods, services, Flink conf ConfigMap). HA related ConfigMaps will 
be retained because they do not set the owner reference. When restarting the 
cluster, all previously running jobs will be recovered and restarted from the 
latest successful checkpoint."

I would think there are two distinct use-cases for uninstallation:

1. The user wants to uninstall, then reinstall while preserving data from the 
previous install. In this case, per Flink constraint, if persistent storage is 
enabled, the PV holding it MUST not be removed, otherwise Flink will break at 
reinstall (as reported here).
2. The user wants a full uninstall, no data left behind, including the 
persistent volume. Then he may decide to reinstall from scratch.

From your description and from the doc, it looks to me that Flink HA supports 
the first use-case well, but not so much the latter. Do I get that right?

I would think there should be a way to configure Flink HA to tell whether we 
want it to do a full cleanup at uninstallation, or not. That's because a 
typical requirement for uninstalls in enterprise env. is to have nothing left 
behind...

> Jobmanager CrashLoopBackOff in HA configuration
> ---
>
> Key: FLINK-25098
> URL: https://issues.apache.org/jira/browse/FLINK-25098
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.2, 1.13.3
> Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>Reporter: Adrian Vasiliu
>Prior

[jira] [Created] (FLINK-34575) Vulnerabilities in commons-compress 1.24.0; upgrade to 1.26.0 needed.

2024-03-04 Thread Adrian Vasiliu (Jira)
Adrian Vasiliu created FLINK-34575:
--

 Summary: Vulnerabilities in commons-compress 1.24.0; upgrade to 
1.26.0 needed.
 Key: FLINK-34575
 URL: https://issues.apache.org/jira/browse/FLINK-34575
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.18.1
Reporter: Adrian Vasiliu


Since Feb. 19, medium/high CVEs have been found for commons-compress 1.24.0:
[https://nvd.nist.gov/vuln/detail/CVE-2024-25710]
https://nvd.nist.gov/vuln/detail/CVE-2024-26308

[https://github.com/apache/flink/pull/24352] has been opened automatically on 
Feb. 21 by dependabot for bumping commons-compress to v1.26.0 which fixes the 
CVEs, but two CI checks are red on the PR.
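
For reference, the bump amounts to changing the managed dependency along these 
lines (a sketch only; Flink's actual root pom may manage the version via a 
property rather than a literal, and the exact location may differ):

```xml
<!-- Sketch of the proposed commons-compress bump -->
<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-compress</artifactId>
  <version>1.26.0</version> <!-- was 1.24.0 -->
</dependency>
```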

Flink's dependency on commons-compress has been upgraded to v1.24.0 in Oct 2023 
(https://issues.apache.org/jira/browse/FLINK-33329).
v1.24.0 is the version currently in the master branch: 
[https://github.com/apache/flink/blob/master/pom.xml#L727-L729]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34575) Vulnerabilities in commons-compress 1.24.0; upgrade to 1.26.0 needed.

2024-03-05 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823762#comment-17823762
 ] 

Adrian Vasiliu edited comment on FLINK-34575 at 3/5/24 8:18 PM:


Thanks Martijn. I couldn't find an estimated date for 1.20, would you know?
Also, no bug / security fix release planned for 1.18.x / 1.19.x, before 1.20?
I ask because of the pressure from vulnerability scanners...


was (Author: JIRAUSER280892):
Thanks Martijn. I couldn't find an estimated date for 1.20, would you know?
Also, no bug fix release planned for 1.18.x / 1.19.x, before 1.20?
I ask because of the pressure from vulnerability scanners...

> Vulnerabilities in commons-compress 1.24.0; upgrade to 1.26.0 needed.
> -
>
> Key: FLINK-34575
> URL: https://issues.apache.org/jira/browse/FLINK-34575
> Project: Flink
>  Issue Type: Technical Debt
>Affects Versions: 1.18.1
>Reporter: Adrian Vasiliu
>Priority: Major
>
> Since Feb. 19, medium/high CVEs have been found for commons-compress 1.24.0:
> [https://nvd.nist.gov/vuln/detail/CVE-2024-25710]
> https://nvd.nist.gov/vuln/detail/CVE-2024-26308
> [https://github.com/apache/flink/pull/24352] has been opened automatically on 
> Feb. 21 by dependabot for bumping commons-compress to v1.26.0 which fixes the 
> CVEs, but two CI checks are red on the PR.
> Flink's dependency on commons-compress has been upgraded to v1.24.0 in Oct 
> 2023 (https://issues.apache.org/jira/browse/FLINK-33329).
> v1.24.0 is the version currently in the master 
> branch: [https://github.com/apache/flink/blob/master/pom.xml#L727-L729]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34575) Vulnerabilities in commons-compress 1.24.0; upgrade to 1.26.0 needed.

2024-03-05 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823762#comment-17823762
 ] 

Adrian Vasiliu commented on FLINK-34575:


Thanks Martijn. I couldn't find an estimated date for 1.20, would you know?
Also, no bug fix release planned for 1.18.x / 1.19.x, before 1.20?
I ask because of the pressure from vulnerability scanners...

> Vulnerabilities in commons-compress 1.24.0; upgrade to 1.26.0 needed.
> -
>
> Key: FLINK-34575
> URL: https://issues.apache.org/jira/browse/FLINK-34575
> Project: Flink
>  Issue Type: Technical Debt
>Affects Versions: 1.18.1
>Reporter: Adrian Vasiliu
>Priority: Major
>
> Since Feb. 19, medium/high CVEs have been found for commons-compress 1.24.0:
> [https://nvd.nist.gov/vuln/detail/CVE-2024-25710]
> https://nvd.nist.gov/vuln/detail/CVE-2024-26308
> [https://github.com/apache/flink/pull/24352] has been opened automatically on 
> Feb. 21 by dependabot for bumping commons-compress to v1.26.0 which fixes the 
> CVEs, but two CI checks are red on the PR.
> Flink's dependency on commons-compress has been upgraded to v1.24.0 in Oct 
> 2023 (https://issues.apache.org/jira/browse/FLINK-33329).
> v1.24.0 is the version currently in the master 
> branch: [https://github.com/apache/flink/blob/master/pom.xml#L727-L729]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-34575) Vulnerabilities in commons-compress 1.24.0; upgrade to 1.26.0 needed.

2024-03-07 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824327#comment-17824327
 ] 

Adrian Vasiliu commented on FLINK-34575:


Thanks Martijn for getting 
[https://github.com/apache/flink-connector-hbase/pull/41] merged. I suppose at 
some point this will be complemented by 
[https://github.com/apache/flink/pull/24352].

> I don't think a security fix is necessary, since Flink isn't affected 
> directly by it.

Good to know. That said, vulnerability scanners don't know it, and in an 
enterprise context vulnerabilities create trouble and extra process even when 
they can't actually be exploited.

> Vulnerabilities in commons-compress 1.24.0; upgrade to 1.26.0 needed.
> -
>
> Key: FLINK-34575
> URL: https://issues.apache.org/jira/browse/FLINK-34575
> Project: Flink
>  Issue Type: Technical Debt
>Affects Versions: 1.18.1
>Reporter: Adrian Vasiliu
>Priority: Major
> Fix For: hbase-4.0.0
>
>
> Since Feb. 19, medium/high CVEs have been found for commons-compress 1.24.0:
> [https://nvd.nist.gov/vuln/detail/CVE-2024-25710]
> https://nvd.nist.gov/vuln/detail/CVE-2024-26308
> [https://github.com/apache/flink/pull/24352] has been opened automatically on 
> Feb. 21 by dependabot for bumping commons-compress to v1.26.0 which fixes the 
> CVEs, but two CI checks are red on the PR.
> Flink's dependency on commons-compress has been upgraded to v1.24.0 in Oct 
> 2023 (https://issues.apache.org/jira/browse/FLINK-33329).
> v1.24.0 is the version currently in the master 
> branch: [https://github.com/apache/flink/blob/master/pom.xml#L727-L729]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-23568) Plaintext Java Keystore Password Risks in the flink-conf.yaml File

2023-03-06 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696965#comment-17696965
 ] 

Adrian Vasiliu commented on FLINK-23568:


No feedback on this issue, which is critical for production-grade use of 
Flink?
AFAIK this still holds with Flink 1.16/1.17.
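
For what it's worth, a common mitigation sketch (not a Flink feature; all 
names and the placeholder convention are illustrative) is to render 
flink-conf.yaml at container start from a template, injecting the password 
from an environment variable sourced from a Kubernetes Secret, so the 
checked-in config never holds the plaintext value:

```shell
set -eu
# Normally provided by the Secret via the pod spec; default only for the demo.
KEYSTORE_PASSWORD="${KEYSTORE_PASSWORD:-s3cret}"

# Template kept in source control / ConfigMap, with a placeholder instead of
# the real password.
cat > flink-conf.yaml.tmpl <<'EOF'
security.ssl.internal.keystore-password: __KEYSTORE_PASSWORD__
EOF

# Rendered at container start, just before launching the Flink process.
sed "s|__KEYSTORE_PASSWORD__|${KEYSTORE_PASSWORD}|" flink-conf.yaml.tmpl > flink-conf.yaml
cat flink-conf.yaml
```

This keeps the secret out of the ConfigMap, though the rendered file still 
holds it in the container filesystem, so it is a mitigation, not a fix.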

> Plaintext Java Keystore Password Risks in the flink-conf.yaml File
> --
>
> Key: FLINK-23568
> URL: https://issues.apache.org/jira/browse/FLINK-23568
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.11.3
>Reporter: Hui Wang
>Priority: Major
>
> When REST SSL is enabled, the plaintext password of the Java keystore needs 
> to be configured in the flink-conf.yaml configuration of Flink, which poses 
> great security risks. It is hoped that the community can provide the 
> capability of encrypting and storing passwords in the flink-conf.yaml file.
>  
> {code:java}
> security.ssl.internal.keystore-password: keystore_password
> security.ssl.internal.key-password: key_password
> security.ssl.internal.truststore-password: truststore_password{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-31966) Flink Kubernetes operator lacks TLS support

2023-04-28 Thread Adrian Vasiliu (Jira)
Adrian Vasiliu created FLINK-31966:
--

 Summary: Flink Kubernetes operator lacks TLS support 
 Key: FLINK-31966
 URL: https://issues.apache.org/jira/browse/FLINK-31966
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.4.0
Reporter: Adrian Vasiliu


*Summary*

The Flink Kubernetes operator lacks support inside the FlinkDeployment operand 
for configuring Flink with TLS (both one-way and mutual) for the internal 
communication between jobmanagers and taskmanagers, and for the external REST 
endpoint. Although a workaround exists to configure the job and task managers, 
this breaks the operator and renders it unable to reconcile.



*Additional information*
 * The Apache Flink operator supports passing through custom flink 
configuration to be applied to job and task managers.

 * If you supply SSL-based properties, the operator can no longer speak to the 
deployed job manager. The operator is reading the flink conf and using it to 
create a connection to the job manager REST endpoint, but it uses the 
truststore file paths within flink-conf.yaml, which are unresolvable from the 
operator. This leaves the operator hanging in a pending state as it cannot 
complete a reconcile.
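
For illustration, a flink-conf fragment of the kind that triggers this (the 
paths are examples; the option keys are standard Flink SSL settings):

```yaml
# These paths exist inside the JobManager/TaskManager pods (mounted from a
# Secret), but not inside the operator pod, so the operator cannot build a
# working REST client from this configuration.
security.ssl.rest.enabled: true
security.ssl.rest.keystore: /opt/flink/tls/keystore.jks
security.ssl.rest.truststore: /opt/flink/tls/truststore.jks
```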

 

*Proposal*

Our proposal is to make changes to the operator code. A simple change exists 
that would be enough to enable anonymous SSL at the REST endpoint, but more 
invasive changes would be required to enable full mTLS throughout.

 

The simple change to enable anonymous SSL would be for the operator to parse 
flink-conf and podTemplate to identify the Kubernetes resource that contains 
the certificate from the job manager keystore and use it inside the operator’s 
trust store.

 

In the case of mutual TLS, further changes are required: the operator would 
need to generate a certificate signed by the same issuing authority as the job 
manager’s certificates and then use it in a keystore when challenged by that 
job manager. We propose that the operator becomes responsible for making 
CertificateSigningRequests to generate certificates for job manager, task 
manager and operator. The operator can then coordinate deploying the job and 
task managers with the correct flink-conf and volume mounts. This would also 
work for anonymous SSL.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-31966) Flink Kubernetes operator lacks TLS support

2023-04-28 Thread Adrian Vasiliu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Vasiliu updated FLINK-31966:
---
Description: 
*Summary*

The Flink Kubernetes operator lacks support inside the FlinkDeployment operand 
for configuring Flink with TLS (both one-way and mutual) for the internal 
communication between jobmanagers and taskmanagers, and for the external REST 
endpoint. Although a workaround exists to configure the job and task managers, 
this breaks the operator and renders it unable to reconcile.

*Additional information*
 * The Apache Flink operator supports passing through custom flink 
configuration to be applied to job and task managers.
 * If you supply SSL-based properties, the operator can no longer speak to the 
deployed job manager. The operator is reading the flink conf and using it to 
create a connection to the job manager REST endpoint, but it uses the 
truststore file paths within flink-conf.yaml, which are unresolvable from the 
operator. This leaves the operator hanging in a pending state as it cannot 
complete a reconcile.

*Proposal*

Our proposal is to make changes to the operator code. A simple change exists 
that would be enough to enable anonymous SSL at the REST endpoint, but more 
invasive changes would be required to enable full mTLS throughout.

The simple change to enable anonymous SSL would be for the operator to parse 
flink-conf and podTemplate to identify the Kubernetes resource that contains 
the certificate from the job manager keystore and use it inside the operator’s 
trust store.

In the case of mutual TLS, further changes are required: the operator would 
need to generate a certificate signed by the same issuing authority as the job 
manager’s certificates and then use it in a keystore when challenged by that 
job manager. We propose that the operator becomes responsible for making 
CertificateSigningRequests to generate certificates for job manager, task 
manager and operator. The operator can then coordinate deploying the job and 
task managers with the correct flink-conf and volume mounts. This would also 
work for anonymous SSL.

  was:
*Summary*

The Flink Kubernetes operator lacks support inside the FlinkDeployment operand 
for configuring Flink with TLS (both one-way and mutual) for the internal 
communication between jobmanagers and taskmanagers, and for the external REST 
endpoint. Although a workaround exists to configure the job and task managers, 
this breaks the operator and renders it unable to reconcile.



*Additional information*
 * The Apache Flink operator supports passing through custom flink 
configuration to be applied to job and task managers.

 * If you supply SSL-based properties, the operator can no longer speak to the 
deployed job manager. The operator is reading the flink conf and using it to 
create a connection to the job manager REST endpoint, but it uses the 
truststore file paths within flink-conf.yaml, which are unresolvable from the 
operator. This leaves the operator hanging in a pending state as it cannot 
complete a reconcile.

 

*Proposal*

Our proposal is to make changes to the operator code. A simple change exists 
that would be enough to enable anonymous SSL at the REST endpoint, but more 
invasive changes would be required to enable full mTLS throughout.

 

The simple change to enable anonymous SSL would be for the operator to parse 
flink-conf and podTemplate to identify the Kubernetes resource that contains 
the certificate from the job manager keystore and use it inside the operator’s 
trust store.

 

In the case of mutual TLS, further changes are required: the operator would 
need to generate a certificate signed by the same issuing authority as the job 
manager’s certificates and then use it in a keystore when challenged by that 
job manager. We propose that the operator becomes responsible for making 
CertificateSigningRequests to generate certificates for job manager, task 
manager and operator. The operator can then coordinate deploying the job and 
task managers with the correct flink-conf and volume mounts. This would also 
work for anonymous SSL.


> Flink Kubernetes operator lacks TLS support 
> 
>
> Key: FLINK-31966
> URL: https://issues.apache.org/jira/browse/FLINK-31966
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.4.0
>Reporter: Adrian Vasiliu
>Priority: Critical
>
> *Summary*
> The Flink Kubernetes operator lacks support inside the FlinkDeployment 
> operand for configuring Flink with TLS (both one-way and mutual) for the 
> internal communication between jobmanagers and taskmanagers, and for the 
> external REST endpoint. Although a workaround exists to configure the job and 
> task managers, this breaks the o

[jira] [Commented] (FLINK-31966) Flink Kubernetes operator lacks TLS support

2023-04-30 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718031#comment-17718031
 ] 

Adrian Vasiliu commented on FLINK-31966:


[~MartijnVisser] Referring to your changes of the issue type and priority:
* We opened the issue with type = bug, not "new feature", because the operator 
breaks when configuring Flink with TLS, and there is no indication in the 
documentation that this is not supported.
* Priority = critical was our take because not being able to secure the Flink 
deployment is perceived as critical for enterprise deployments.

> Flink Kubernetes operator lacks TLS support 
> 
>
> Key: FLINK-31966
> URL: https://issues.apache.org/jira/browse/FLINK-31966
> Project: Flink
>  Issue Type: New Feature
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.4.0
>Reporter: Adrian Vasiliu
>Priority: Major
>
> *Summary*
> The Flink Kubernetes operator lacks support inside the FlinkDeployment 
> operand for configuring Flink with TLS (both one-way and mutual) for the 
> internal communication between jobmanagers and taskmanagers, and for the 
> external REST endpoint. Although a workaround exists to configure the job and 
> task managers, this breaks the operator and renders it unable to reconcile.
> *Additional information*
>  * The Apache Flink operator supports passing through custom flink 
> configuration to be applied to job and task managers.
>  * If you supply SSL-based properties, the operator can no longer speak to 
> the deployed job manager. The operator is reading the flink conf and using it 
> to create a connection to the job manager REST endpoint, but it uses the 
> truststore file paths within flink-conf.yaml, which are unresolvable from the 
> operator. This leaves the operator hanging in a pending state as it cannot 
> complete a reconcile.
> *Proposal*
> Our proposal is to make changes to the operator code. A simple change exists 
> that would be enough to enable anonymous SSL at the REST endpoint, but more 
> invasive changes would be required to enable full mTLS throughout.
> The simple change to enable anonymous SSL would be for the operator to parse 
> flink-conf and podTemplate to identify the Kubernetes resource that contains 
> the certificate from the job manager keystore and use it inside the 
> operator’s trust store.
> In the case of mutual TLS, further changes are required: the operator would 
> need to generate a certificate signed by the same issuing authority as the 
> job manager’s certificates and then use it in a keystore when challenged by 
> that job manager. We propose that the operator becomes responsible for making 
> CertificateSigningRequests to generate certificates for job manager, task 
> manager and operator. The operator can then coordinate deploying the job and 
> task managers with the correct flink-conf and volume mounts. This would also 
> work for anonymous SSL.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31966) Flink Kubernetes operator lacks TLS support

2023-05-10 Thread Adrian Vasiliu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721224#comment-17721224
 ] 

Adrian Vasiliu commented on FLINK-31966:


[~martijnvisser] [~gyfora] Ok, fair enough. I'd just mention security 
guidelines such as 
[https://www.ncsc.gov.uk/collection/cloud/the-cloud-security-principles/principle-1-data-in-transit-protection]
 which refer to the internal communication, not only the external one: "data is 
protected in transit as it flows between internal components within the 
service".
"Would you be interested in working on it?": thanks for the proposal, it's 
possible, but not short term. I'll let you know if/when starting to work on a 
candidate implementation. By the way, for our knowledge, are there plans for 
FlinkDeployment CRD changes for some other reasons? Just to know if this would 
be the occasion to introduce additional config parameters that would make 
simpler/saner the configuration of operator's TLS.

> Flink Kubernetes operator lacks TLS support 
> 
>
> Key: FLINK-31966
> URL: https://issues.apache.org/jira/browse/FLINK-31966
> Project: Flink
>  Issue Type: New Feature
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.4.0
>Reporter: Adrian Vasiliu
>Priority: Major
>
> *Summary*
> The Flink Kubernetes operator lacks support inside the FlinkDeployment 
> operand for configuring Flink with TLS (both one-way and mutual) for the 
> internal communication between jobmanagers and taskmanagers, and for the 
> external REST endpoint. Although a workaround exists to configure the job and 
> task managers, this breaks the operator and renders it unable to reconcile.
> *Additional information*
>  * The Apache Flink operator supports passing through custom flink 
> configuration to be applied to job and task managers.
>  * If you supply SSL-based properties, the operator can no longer speak to 
> the deployed job manager. The operator is reading the flink conf and using it 
> to create a connection to the job manager REST endpoint, but it uses the 
> truststore file paths within flink-conf.yaml, which are unresolvable from the 
> operator. This leaves the operator hanging in a pending state as it cannot 
> complete a reconcile.
> *Proposal*
> Our proposal is to make changes to the operator code. A simple change exists 
> that would be enough to enable anonymous SSL at the REST endpoint, but more 
> invasive changes would be required to enable full mTLS throughout.
> The simple change to enable anonymous SSL would be for the operator to parse 
> flink-conf and podTemplate to identify the Kubernetes resource that contains 
> the certificate from the job manager keystore and use it inside the 
> operator’s trust store.
> In the case of mutual TLS, further changes are required: the operator would 
> need to generate a certificate signed by the same issuing authority as the 
> job manager’s certificates and then use it in a keystore when challenged by 
> that job manager. We propose that the operator becomes responsible for making 
> CertificateSigningRequests to generate certificates for job manager, task 
> manager and operator. The operator can then coordinate deploying the job and 
> task managers with the correct flink-conf and volume mounts. This would also 
> work for anonymous SSL.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)