[jira] [Created] (FLINK-32095) HiveDialectITCase crashed with exit code 239

2023-05-14 Thread Weijie Guo (Jira)
Weijie Guo created FLINK-32095:
--

 Summary: HiveDialectITCase crashed with exit code 239
 Key: FLINK-32095
 URL: https://issues.apache.org/jira/browse/FLINK-32095
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.17.1
Reporter: Weijie Guo


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=48957=logs=fc5181b0-e452-5c8f-68de-1097947f6483=995c650b-6573-581c-9ce6-7ad4cc038461=22740

May 13 02:10:09 [ERROR] Crashed tests:
May 13 02:10:09 [ERROR] org.apache.flink.connectors.hive.HiveDialectITCase
May 13 02:10:09 [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:532)
May 13 02:10:09 [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:479)
May 13 02:10:09 [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:322)
May 13 02:10:09 [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:266)
May 13 02:10:09 [ERROR] at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1314)
May 13 02:10:09 [ERROR] at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1159)
May 13 02:10:09 [ERROR] at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:932)
May 13 02:10:09 [ERROR] at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:132)
May 13 02:10:09 [ERROR] at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
May 13 02:10:09 [ERROR] at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
May 13 02:10:09 [ERROR] at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
May 13 02:10:09 [ERROR] at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
May 13 02:10:09 [ERROR] at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
May 13 02:10:09 [ERROR] at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
May 13 02:10:09 [ERROR] at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:120)
May 13 02:10:09 [ERROR] at 
org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:355)
May 13 02:10:09 [ERROR] at 
org.apache.maven.DefaultMaven.execute(DefaultMaven.java:155)
May 13 02:10:09 [ERROR] at 
org.apache.maven.cli.MavenCli.execute(MavenCli.java:584)
May 13 02:10:09 [ERROR] at 
org.apache.maven.cli.MavenCli.doMain(MavenCli.java:216)
May 13 02:10:09 [ERROR] at org.apache.maven.cli.MavenCli.main(MavenCli.java:160)
May 13 02:10:09 [ERROR] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
May 13 02:10:09 [ERROR] at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
May 13 02:10:09 [ERROR] at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
May 13 02:10:09 [ERROR] at java.lang.reflect.Method.invoke(Method.java:498)
May 13 02:10:09 [ERROR] at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
May 13 02:10:09 [ERROR] at 
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
May 13 02:10:09 [ERROR] at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
May 13 02:10:09 [ERROR] at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
May 13 02:10:09 [ERROR] Caused by: 
org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM 
terminated without properly saying goodbye. VM crash or System.exit called?
May 13 02:10:09 [ERROR] Command was /bin/sh -c cd 
/__w/1/s/flink-connectors/flink-connector-hive && 
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -XX:+UseG1GC -Xms256m -Xmx1536m 
-jar 
/__w/1/s/flink-connectors/flink-connector-hive/target/surefire/surefirebooter2973058874035532114.jar
 /__w/1/s/flink-connectors/flink-connector-hive/target/surefire 
2023-05-13T01-46-05_580-jvmRun3 surefire1860158651016882706tmp 
surefire_277931085391834517755tmp




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


How to pass the TLS certs to the latest version of flink-connector-pulsar

2023-05-14 Thread Bauddhik Anand
I am trying to connect my Flink application to a Pulsar topic for ingesting
data. The topic is active and I am able to ingest the data via a normal
Java application.

When I try to use the Flink application to ingest the data from the same
topic, using the latest version of flink-connector-pulsar, i.e. 4.0.0-1.17, I
cannot find anywhere in the documentation how to pass the TLS certs.

I tried the code below:


final StreamExecutionEnvironment envn =
        StreamExecutionEnvironment.getExecutionEnvironment();

Configuration config = new Configuration();

config.setString("pulsar.client.authentication", "tls");
config.setString("pulsar.client.tlsCertificateFilePath", tlsCert);
config.setString("pulsar.client.tlsKeyFilePath", tlsKey);
config.setString("pulsar.client.tlsTrustCertsFilePath", tlsTrustCert);

PulsarSource<String> pulsarSource = PulsarSource.builder()
        .setServiceUrl("serviceurl")
        .setAdminUrl("adminurl")
        .setStartCursor(StartCursor.earliest())
        .setTopics("topicname")
        .setDeserializationSchema(new SimpleStringSchema())
        .setSubscriptionName("test-sub")
        .setConfig(config)
        .build();

// This line was missing from my first snippet: the stream the map consumes.
DataStream<String> pulsarStream = envn.fromSource(
        pulsarSource, WatermarkStrategy.noWatermarks(), "Pulsar Source");

pulsarStream.map(new MapFunction<String, String>() {
    private static final long serialVersionUID = -999736771747691234L;

    @Override
    public String map(String value) throws Exception {
        return "Receiving from Pulsar : " + value;
    }
}).print();

envn.execute();


The documentation does not mention any built-in method in the PulsarSource
class for passing the TLS certs, so I tried passing the PulsarClient options
to PulsarSource as a Configuration.

This doesn't seem to work: when I try to deploy the app, the Flink job
is submitted and the JobManager throws the error below.

Caused by: sun.security.validator.ValidatorException: PKIX path
building failed:
sun.security.provider.certpath.SunCertPathBuilderException: unable to
find valid certification path to requested target
at sun.security.validator.PKIXValidator.doBuild(Unknown Source) ~[?:?]
at sun.security.validator.PKIXValidator.engineValidate(Unknown
Source) ~[?:?]
at sun.security.validator.Validator.validate(Unknown Source) ~[?:?]
at sun.security.ssl.X509TrustManagerImpl.validate(Unknown Source) ~[?:?]


Caused by: sun.security.provider.certpath.SunCertPathBuilderException:
unable to find valid certification path to requested target
at sun.security.provider.certpath.SunCertPathBuilder.build(Unknown
Source) ~[?:?]
at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(Unknown
Source) ~[?:?]
at java.security.cert.CertPathBuilder.build(Unknown Source) ~[?:?]
at sun.security.validator.PKIXValidator.doBuild(Unknown Source) ~[?:?]

I have already verified the certs path and it is correct; I am also using
the same path as a volume mount for my other apps, and they work fine.

My question is:

How can I pass the certs to the latest version of
*flink-connector-pulsar*, i.e. *4.0.0-1.17*?
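For reference, this is the shape of the TLS client configuration I believe the underlying Pulsar client expects. The key names come from the Pulsar client options; whether the Flink connector forwards them all verbatim is an assumption on my side, so treat this as a sketch, not a verified answer:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: collects the Pulsar client options that usually drive TLS.
// Key names are from the Pulsar client configuration; whether the Flink
// connector forwards them all verbatim is an assumption.
public class PulsarTlsOptionsSketch {

    public static Map<String, String> tlsOptions(
            String tlsCert, String tlsKey, String tlsTrustCert) {
        Map<String, String> options = new HashMap<>();
        // TLS transport usually also needs the pulsar+ssl:// URL scheme.
        options.put("pulsar.client.serviceUrl", "pulsar+ssl://broker-host:6651");
        options.put("pulsar.client.tlsTrustCertsFilePath", tlsTrustCert);
        // TLS *authentication* goes through Pulsar's auth-plugin mechanism.
        options.put("pulsar.client.authPluginClassName",
                "org.apache.pulsar.client.impl.auth.AuthenticationTls");
        options.put("pulsar.client.authParams",
                "tlsCertFile:" + tlsCert + ",tlsKeyFile:" + tlsKey);
        return options;
    }
}
```

Each of these entries could then be applied to the Flink Configuration with setString, as in my snippet above.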


Re:Re: Re: [DISCUSS] FLIP-305: Support atomic for CREATE TABLE AS SELECT(CTAS) statement

2023-05-14 Thread Mang Zhang
Hi Jingsong,


Thank you for your reply!
We introduced `TwoPhaseCatalogTable` for two reasons:
1. A `TwoPhaseCatalogTable` lets different data sources plug in richer 
operations. Going through the Catalog alone only allows a simple create table 
and drop table, which is not flexible enough; for example, a connector may 
need to delete a temporary directory, or use a table rename in a relational 
database to implement atomic semantics in Flink.
2. It facilitates subsequent extensions, such as support for replace table or 
extended data lake storage.
>And for `TwoPhase`, maybe `StagedXXX` like Spark is better?
Regarding naming: at first we used `StagedCatalogTable`, but after offline 
discussions with yuxia and Lincoln we felt that, since Flink already has 
TwoPhaseCommittingSink/TwoPhaseCommitSinkFunction, changing to 
`TwoPhaseCatalogTable` keeps the naming consistent.
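To make the two reasons above concrete, here is a rough, self-contained sketch of the two-phase create flow we have in mind. All names and signatures here (TwoPhaseTable, begin/commit/abort, ToyCatalog) are illustrative only, not the proposed FLIP API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of atomic CTAS via a two-phase table handle.
// Names and signatures are hypothetical, not the FLIP-305 API.
public class TwoPhaseCtasSketch {

    /** A staged table: created invisibly, published on commit, cleaned up on abort. */
    interface TwoPhaseTable {
        void begin();   // e.g. create a hidden/staging table or temporary directory
        void commit();  // e.g. rename the staging table to the target name
        void abort();   // e.g. drop the staging table, delete temporary files
    }

    /** Toy catalog that keeps visible tables in a map. */
    static class ToyCatalog {
        final Map<String, String> visibleTables = new HashMap<>();

        TwoPhaseTable stageCreate(String name) {
            return new TwoPhaseTable() {
                public void begin()  { /* nothing visible yet */ }
                public void commit() { visibleTables.put(name, "committed"); }
                public void abort()  { /* staging state discarded, nothing to undo */ }
            };
        }
    }

    /** CTAS driver: the target table only becomes visible if the job succeeds. */
    static boolean runCtas(ToyCatalog catalog, String table, Runnable job) {
        TwoPhaseTable staged = catalog.stageCreate(table);
        staged.begin();
        try {
            job.run();        // the INSERT part of CTAS
            staged.commit();  // phase 2: publish the table atomically
            return true;
        } catch (RuntimeException e) {
            staged.abort();   // connector-specific cleanup hook
            return false;
        }
    }
}
```

The point of reason 1 is exactly the body of commit()/abort(): each connector supplies its own, instead of the Catalog doing a generic drop table.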







--

Best regards,
Mang Zhang





At 2023-05-12 13:02:14, "Jingsong Li"  wrote:
>Hi Mang,
>
>Thanks for starting this FLIP.
>
>I have some doubts about the `TwoPhaseCatalogTable`. Generally, our
>Flink design places execution in the TableFactory or directly in the
>Catalog, so introducing an executable table makes me feel a bit
>strange. (Spark is this style, but Flink may not be)
>
>And for `TwoPhase`, maybe `StagedXXX` like Spark is better?
>
>Best,
>Jingsong
>
>On Wed, May 10, 2023 at 9:29 PM Mang Zhang  wrote:
>>
>> Hi Ron,
>>
>>
>> First of all, thank you for your reply!
>> After our offline communication, what you said mainly concerns the compilePlan 
>> scenario, but currently compilePlanSql does not support non-INSERT 
>> statements; otherwise it throws an exception:
>> >Unsupported SQL query! compilePlanSql() only accepts a single SQL statement 
>> >of type INSERT
>> But it's a good point that I will seriously consider.
>> Non-atomic CTAS can be supported relatively easily;
>> But atomic CTAS needs more adaptation work, so I'm going to leave it as is 
>> and follow up with a separate issue to implement CTAS support for 
>> compilePlanSql.
>>
>>
>>
>>
>>
>>
>> --
>>
>> Best regards,
>> Mang Zhang
>>
>>
>>
>>
>>
>> At 2023-04-23 17:52:07, "liu ron"  wrote:
>> >Hi, Mang
>> >
>> >I have a question about the implementation details. For the atomicity case,
>> >since the target table is not created before the JobGraph is generated, but
>> >then the target table is required to exist when optimizing plan to generate
>> >the JobGraph. So how do you solve this problem?
>> >
>> >Best,
>> >Ron
>> >
>> >yuxia  于2023年4月20日周四 09:35写道:
>> >
>> >> Share some insights about the new TwoPhaseCatalogTable proposed after
>> >> offline discussion with Mang.
>> >> The main or important reason is that the TwoPhaseCatalogTable enables
>> >> external connectors to implement their own logic for commit / abort.
>> >> In FLIP-218, for atomic CTAS, the Catalog would just drop the table
>> >> when the job fails. That is not ideal: it is too generic to work well.
>> >> For example, some connectors need to clean up temporary files in the
>> >> abort method, and only the actual connector knows the specific logic for
>> >> aborting.
>> >>
>> >> Best regards,
>> >> Yuxia
>> >>
>> >>
>> >> 发件人: "zhangmang1" 
>> >> 收件人: "dev" , "Jing Ge" 
>> >> 抄送: "ron9 liu" , "lincoln 86xy" <
>> >> lincoln.8...@gmail.com>, luoyu...@alumni.sjtu.edu.cn
>> >> 发送时间: 星期三, 2023年 4 月 19日 下午 3:13:36
>> >> 主题: Re:Re: [DISCUSS] FLIP-305: Support atomic for CREATE TABLE AS
>> >> SELECT(CTAS) statement
>> >>
>> >> hi, Jing
>> >> Thank you for your reply.
>> >> >1. It looks like you found another way to design the atomic CTAS with new
>> >> >serializable TwoPhaseCatalogTable instead of making Catalog serializable
>> >> as
>> >> >described in FLIP-218. Did I understand correctly?
>> >> Yes, when I was implementing the FLIP-218 solution, I encountered problems
>> >> with Catalog/CatalogTable serialization/deserialization; for example, after
>> >> deserialization a CatalogTable could not be converted to a Hive Table. Also,
>> >> Catalog serialization is still a heavy operation, but it may not actually
>> >> be necessary; we just need Create Table.
>> >> Therefore, the TwoPhaseCatalogTable approach is proposed, which also
>> >> facilitates the subsequent implementation of data lakes, ReplaceTable
>> >> and other functions.
>> >>
>> >> >2. I am a little bit confused about the isStreamingMode parameter of
>> >> >Catalog#twoPhaseCreateTable(...), since it is the selector argument(code
>> >> >smell) we should commonly avoid in the public interface. According to the
>> >> >FLIP,  isStreamingMode will be used by the Catalog to determine whether 
>> >> >to
>> >> >support atomic or not. With this selector argument, there will be two
>> >> >different logics built within one method and it is hard to follow without
>> >> >reading the code or the doc carefully(another concern is to keep the doc
>> >> >and code alway be consistent) i.e. sometimes there will be no difference
>> >> by
>> >> >using true/false isStreamingMode, sometimes 

Re: [NOTICE] Flink master branch now uses Maven 3.8.6

2023-05-14 Thread yuxia
Thanks Chesnay for the efforts. Happy to see we can finally use Maven 3.8.

Best regards,
Yuxia

- 原始邮件 -
发件人: "Jing Ge" 
收件人: "dev" 
发送时间: 星期六, 2023年 5 月 13日 下午 4:37:58
主题: Re: [NOTICE] Flink master branch now uses Maven 3.8.6

Great news! We can finally get rid of additional setup to use maven 3.8.
Thanks @Chesnay for your effort!

Best regards,
Jing

On Sat, May 13, 2023 at 5:12 AM David Anderson 
wrote:

> Chesnay, thank you for all your hard work on this!
>
> David
>
> On Fri, May 12, 2023 at 4:03 PM Chesnay Schepler 
> wrote:
> >
> >
> >   What happened?
> >
> > I have just merged the last commits to properly support Maven 3.3+ on
> > the Flink master branch.
> >
> > mvnw and CI have been updated to use Maven 3.8.6.
> >
> >
> >   What does this mean for me?
> >
> >   * You can now use Maven versions beyond 3.2.5 (duh).
> >   o Most versions should work, but 3.8.6 was the most tested and is
> > thus recommended.
> >   o 3.8.*5* is known to *NOT* work.
> >   * Starting from 1.18.0 you need to use Maven 3.8.6 for releases.
> >   o This may change to a later version until the release of 1.18.0.
> >   o There have been too many issues with recent Maven releases to
> > make a range acceptable.
> >   * *All dependencies that are bundled by a module must be marked as
> > optional.*
> >   o *This is verified on CI
> > <
> https://github.com/apache/flink/blob/master/tools/ci/flink-ci-tools/src/main/java/org/apache/flink/tools/ci/optional/ShadeOptionalChecker.java
> >.*
> >   o *Background info can be found in the wiki.*
> >
> >
> >   Can I continue using Maven 3.2.5?
> >
> > For now, yes, but support will eventually be removed.
> >
> >
> >   Does this affect users?
> >
> > No.
> >
> >
> > Please ping me if you run into any issues.
>


[jira] [Created] (FLINK-32094) startScheduling.BATCH performance regression since May 11th

2023-05-14 Thread Martijn Visser (Jira)
Martijn Visser created FLINK-32094:
--

 Summary: startScheduling.BATCH performance regression since May 
11th
 Key: FLINK-32094
 URL: https://issues.apache.org/jira/browse/FLINK-32094
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Reporter: Martijn Visser


http://codespeed.dak8s.net:8000/timeline/#/?exe=5=startScheduling.BATCH=on=on=off=2=200







Re: [VOTE] Apache Flink Kubernetes Operator Release 1.5.0, release candidate #2

2023-05-14 Thread Tamir Sagi
Hey Guyla,

1. The operator watches its own namespace as well. The error still happens (the 
only way to overcome this is to change kubernetes.rest-service.exposed.type to 
'ClusterIP'). It seems related to the fact that the RestClient uses the k8s 
client internally, which needs NodeList permissions, but instead of reading the 
service account it looks for a kube.config file. [1]

ClusterIP Service
https://github.com/apache/flink/blob/release-1.17.0/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/services/ClusterIPService.java#L44-L53

NodePort Service
https://github.com/apache/flink/blob/release-1.17.0/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/services/NodePortService.java#L62

2. Yes, it occurred upon deletion. The leader continued normally, while the idle 
pods constantly printed those errors (they did not crash, though).
I created a bug: https://issues.apache.org/jira/browse/FLINK-32093

3. I believe it is crucial to have all cluster configurations in the cluster's 
dashboard, particularly in production. If the operator had a UI, it could have 
filled that void.

[1] 
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#kubernetes-config-file

From: Gyula Fóra 
Sent: Sunday, May 14, 2023 4:25 PM
To: dev@flink.apache.org 
Cc: Anthony Garrard ; Hao t Chang 
Subject: Re: [VOTE] Apache Flink Kubernetes Operator Release 1.5.0, release 
candidate #2

EXTERNAL EMAIL



@Tamir:

For 1:
Do you get the same problems with 1.4.0 or is this a regression in 1.5.0?
If you set the helm chart so the operator also watches its own namespaces
like I mentioned in the jira do you still get the error?

For 2:
This error happens when you delete the FlinkDeployment? Can you open a jira
and share the logs?

Operator/autoscaler configs are not sent to the Flink application so they
won’t show up on the ui. This is intentional.

Gyula

On Sun, 14 May 2023 at 15:03, Tamir Sagi 
wrote:

> Hey Guyla , dev-team
>
> I deployed rc-2 with helm on AWS EKS with HA enabled (3 pods).
>
> The operator watches 3 namespaces.
>
> I successfully deployed an application cluster(Flink 1.17) via pod
> template. I encountered the following errors
>
>1. 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure
>executing: GET at: https://172.20.0.1/api/v1/nodes. Message:
>Forbidden!Configured service account doesn't have access. Service account
>may have been revoked. nodes is forbidden: User
>"system:serviceaccount:dev-0-flink-clusters:
>*dev-0-xsight-flink-operator-sa*" cannot list resource "nodes" in API
>group "" at the cluster scope."
>Seems like the role is correct. I comment in the following ticket:
>https://issues.apache.org/jira/browse/FLINK-32041
>In addition, I noticed that kubernetes.rest-service.exposed.type was
>on NodePort​, once I changed it to ClusterIP​ the above error
>disappeared. [1]
>
>Is there any chance it looks for kube.config file instead of reading
>the service account?
>
>2. When the cluster is deleted, the idle pods (not leaders) repeatedly
>throw the following error :
>[2023-05-14T12:00:50,388][Error] {} [i.f.k.c.i.i.c.SharedProcessor]:
>apps/v1/namespaces/dev-0-flink-shadow-clusters/deployments failed invoking
>InformerEventSource{resourceClass: Deployment} event handler: Cannot
>receive event after a delete event received
>java.lang.IllegalStateException: Cannot receive event after a delete
>event received (enclosed stacktrace)
>
> In addition, I'm not sure whether it's an issue or not, but autoscaler
> configurations (per cluster) are not shown neither in Flink web UI nor in
> the response when calling /jobmanager/config.
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#accessing-flinks-web-ui
>
> Thanks,
> Tamir
> --
> *From:* Jim Busche 
> *Sent:* Saturday, May 13, 2023 5:59 PM
> *To:* dev@flink.apache.org ; Hao t Chang <
> htch...@us.ibm.com>; Anthony Garrard 
> *Subject:* Re: [VOTE] Apache Flink Kubernetes Operator Release 1.5.0,
> release candidate #2
>
>
> Hi Guyla,
>
> I was able to deploy rc-2 with helm on a kind cluster and it was able to
> deploy the sample.  But I'm still struggling on OpenShift with rc-2.
> There's some kind of RBAC permission issue that I haven't been able to
> solve when it deploys the flinkdep or flinksessionjobs.
>
>
> oc get flinkdep
>
> NAMEJOB STATUS   LIFECYCLE STATE
>
> basic-exampleUPGRADING
>
> basic-session-deployment-only-exampleUPGRADING
>
>
>
> oc get flinksessionjobs
>
> NAME JOB STATUS   LIFECYCLE STATE
>
> basic-session-job-only-example
>
>
> oc describe flinkdep basic-example
> …
>
> Status:
>
>   Cluster Info:
>
>   

[jira] [Created] (FLINK-32093) Upon Delete deployment idle pods throw - java.lang.IllegalStateException: Cannot receive event after a delete event received

2023-05-14 Thread Tamir Sagi (Jira)
Tamir Sagi created FLINK-32093:
--

 Summary: Upon Delete deployment idle pods throw - 
java.lang.IllegalStateException: Cannot receive event after a delete event 
received
 Key: FLINK-32093
 URL: https://issues.apache.org/jira/browse/FLINK-32093
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Reporter: Tamir Sagi
 Attachments: event-error.txt

After a deployment is deleted, idle pods throw 
java.lang.IllegalStateException: Cannot receive event after a delete event 
received

HA is enabled.





Re: [VOTE] Apache Flink Kubernetes Operator Release 1.5.0, release candidate #2

2023-05-14 Thread Gyula Fóra
@Tamir:

For 1:
Do you get the same problems with 1.4.0 or is this a regression in 1.5.0?
If you set the helm chart so the operator also watches its own namespaces
like I mentioned in the jira do you still get the error?

For 2:
This error happens when you delete the FlinkDeployment? Can you open a jira
and share the logs?

Operator/autoscaler configs are not sent to the Flink application so they
won’t show up on the ui. This is intentional.

Gyula

On Sun, 14 May 2023 at 15:03, Tamir Sagi 
wrote:

> Hey Guyla , dev-team
>
> I deployed rc-2 with helm on AWS EKS with HA enabled (3 pods).
>
> The operator watches 3 namespaces.
>
> I successfully deployed an application cluster(Flink 1.17) via pod
> template. I encountered the following errors
>
>1. 
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure
>executing: GET at: https://172.20.0.1/api/v1/nodes. Message:
>Forbidden!Configured service account doesn't have access. Service account
>may have been revoked. nodes is forbidden: User
>"system:serviceaccount:dev-0-flink-clusters:
>*dev-0-xsight-flink-operator-sa*" cannot list resource "nodes" in API
>group "" at the cluster scope."
>Seems like the role is correct. I comment in the following ticket:
>https://issues.apache.org/jira/browse/FLINK-32041
>In addition, I noticed that kubernetes.rest-service.exposed.type was
>on NodePort​, once I changed it to ClusterIP​ the above error
>disappeared. [1]
>
>Is there any chance it looks for kube.config file instead of reading
>the service account?
>
>2. When the cluster is deleted, the idle pods (not leaders) repeatedly
>throw the following error :
>[2023-05-14T12:00:50,388][Error] {} [i.f.k.c.i.i.c.SharedProcessor]:
>apps/v1/namespaces/dev-0-flink-shadow-clusters/deployments failed invoking
>InformerEventSource{resourceClass: Deployment} event handler: Cannot
>receive event after a delete event received
>java.lang.IllegalStateException: Cannot receive event after a delete
>event received (enclosed stacktrace)
>
> In addition, I'm not sure whether it's an issue or not, but autoscaler
> configurations (per cluster) are not shown neither in Flink web UI nor in
> the response when calling /jobmanager/config.
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#accessing-flinks-web-ui
>
> Thanks,
> Tamir
> --
> *From:* Jim Busche 
> *Sent:* Saturday, May 13, 2023 5:59 PM
> *To:* dev@flink.apache.org ; Hao t Chang <
> htch...@us.ibm.com>; Anthony Garrard 
> *Subject:* Re: [VOTE] Apache Flink Kubernetes Operator Release 1.5.0,
> release candidate #2
>
> EXTERNAL EMAIL
>
>
>
>
> Hi Guyla,
>
> I was able to deploy rc-2 with helm on a kind cluster and it was able to
> deploy the sample.  But I'm still struggling on OpenShift with rc-2.
> There's some kind of RBAC permission issue that I haven't been able to
> solve when it deploys the flinkdep or flinksessionjobs.
>
>
> oc get flinkdep
>
> NAMEJOB STATUS   LIFECYCLE STATE
>
> basic-exampleUPGRADING
>
> basic-session-deployment-only-exampleUPGRADING
>
>
>
> oc get flinksessionjobs
>
> NAME JOB STATUS   LIFECYCLE STATE
>
> basic-session-job-only-example
>
>
> oc describe flinkdep basic-example
> …
>
> Status:
>
>   Cluster Info:
>
>   Error:
> {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException:
> Could not create Kubernetes cluster
> \"basic-example\".","throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"Could
> not create Kubernetes cluster
> \"basic-example\"."},{"type":"org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure
> executing: POST at:
> https://172.30.0.1/apis/apps/v1/namespaces/default/deployments. Message:
> Forbidden!Configured service account doesn't have access. Service account
> may have been revoked. deployments.apps \"basic-example\" is forbidden:
> cannot set blockOwnerDeletion if an ownerReference refers to a resource you
> can't set finalizers on: , ."}]}
>
>   Job Manager Deployment Status:  MISSING
>
> I haven't been able to spot why/what's different between 1.5 and 1.4
> release (which still deploys fine.)
> Hoping someone has an idea of what might be wrong.
>
> Thanks, Jim
>
>
> Confidentiality: This communication and any attachments are intended for
> the above-named persons only and may be confidential and/or legally
> privileged. Any opinions expressed in this communication are not
> necessarily those of NICE Actimize. If this communication has come to you
> in error you must take no action based on it, nor must you copy or show it
> to 

Re: [VOTE] Apache Flink Kubernetes Operator Release 1.5.0, release candidate #2

2023-05-14 Thread Tamir Sagi
Hey Guyla , dev-team

I deployed rc-2 with helm on AWS EKS with HA enabled (3 pods).

The operator watches 3 namespaces.

I successfully deployed an application cluster (Flink 1.17) via a pod template. I 
encountered the following errors:

  1.
org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure
 executing: GET at: https://172.20.0.1/api/v1/nodes. Message: 
Forbidden!Configured service account doesn't have access. Service account may 
have been revoked. nodes is forbidden: User 
"system:serviceaccount:dev-0-flink-clusters:dev-0-xsight-flink-operator-sa" 
cannot list resource "nodes" in API group "" at the cluster scope."
Seems like the role is correct. I commented in the following ticket: 
https://issues.apache.org/jira/browse/FLINK-32041
In addition, I noticed that kubernetes.rest-service.exposed.type was set to 
NodePort; once I changed it to ClusterIP the above error disappeared. [1]

Is there any chance it looks for kube.config file instead of reading the 
service account?

  2.  When the cluster is deleted, the idle pods (not leaders) repeatedly throw 
the following error :
[2023-05-14T12:00:50,388][Error] {} [i.f.k.c.i.i.c.SharedProcessor]: 
apps/v1/namespaces/dev-0-flink-shadow-clusters/deployments failed invoking 
InformerEventSource{resourceClass: Deployment} event handler: Cannot receive 
event after a delete event received
java.lang.IllegalStateException: Cannot receive event after a delete event 
received (enclosed stacktrace)

In addition, I'm not sure whether it's an issue or not, but the autoscaler 
configurations (per cluster) are shown neither in the Flink web UI nor in the 
response when calling /jobmanager/config.

[1] 
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#accessing-flinks-web-ui

Thanks,
Tamir

From: Jim Busche 
Sent: Saturday, May 13, 2023 5:59 PM
To: dev@flink.apache.org ; Hao t Chang 
; Anthony Garrard 
Subject: Re: [VOTE] Apache Flink Kubernetes Operator Release 1.5.0, release 
candidate #2


Hi Guyla,

I was able to deploy rc-2 with helm on a kind cluster and it was able to deploy 
the sample.  But I'm still struggling on OpenShift with rc-2.  There's some 
kind of RBAC permission issue that I haven't been able to solve when it deploys 
the flinkdep or flinksessionjobs.


oc get flinkdep

NAMEJOB STATUS   LIFECYCLE STATE

basic-exampleUPGRADING

basic-session-deployment-only-exampleUPGRADING



oc get flinksessionjobs

NAME JOB STATUS   LIFECYCLE STATE

basic-session-job-only-example


oc describe flinkdep basic-example
…

Status:

  Cluster Info:

  Error:  
{"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException:
 Could not create Kubernetes cluster 
\"basic-example\".","throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"Could
 not create Kubernetes cluster 
\"basic-example\"."},{"type":"org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure
 executing: POST at: 
https://172.30.0.1/apis/apps/v1/namespaces/default/deployments. Message: 
Forbidden!Configured service account doesn't have access. Service account may 
have been revoked. deployments.apps \"basic-example\" is forbidden: cannot set 
blockOwnerDeletion if an ownerReference refers to a resource you can't set 
finalizers on: , ."}]}

  Job Manager Deployment Status:  MISSING

I haven't been able to spot what's different between the 1.5 and the 1.4 
release (which still deploys fine).
Hoping someone has an idea of what might be wrong.

Thanks, Jim

java.lang.IllegalStateException: Cannot receive event after a delete event 
received
at 
io.javaoperatorsdk.operator.processing.event.ResourceState.markEventReceived(ResourceState.java:83)
at 
io.javaoperatorsdk.operator.processing.event.EventProcessor.markEventReceived(EventProcessor.java:197)
at