Re: Understanding RocksDBStateBackend in Flink on Yarn on AWS EMR

2024-03-26 Thread Yang Wang
Usually, you should use the HDFS nameservice instead of the NameNode hostname:port to avoid NN failover. And you could find the supported nameservice in the hdfs-site.xml in the key *dfs.nameservices*. Best, Yang On Fri, Mar 22, 2024 at 8:33 PM Sachin Mittal wrote: > So, when we create an EMR

Re: flink operator 高可用任务偶发性报错unable to update ConfigMapLock

2024-03-20 Thread Yang Wang
这种一般是因为APIServer那边有问题导致单次的ConfigMap renew lease annotation的操作失败,Flink默认会重试的 如果你发现因为这个SocketTimeoutException原因导致了任务Failover,可以把下面两个参数调大 high-availability.kubernetes.leader-election.lease-duration: 60s high-availability.kubernetes.leader-election.renew-deadline: 60s Best, Yang On Tue, Mar 12,

Re: Jobmanager restart after it has been requested to stop

2024-02-02 Thread Yang Wang
If you could find the "Deregistering Flink Kubernetes cluster, clusterId" in the JobManager log, then it is not the expected behavior. Having the full logs of JobManager Pod before restarted will help a lot. Best, Yang On Fri, Feb 2, 2024 at 1:26 PM Liting Liu (litiliu) via user <

Re: [DISCUSS] Hadoop 2 vs Hadoop 3 usage

2024-01-15 Thread Yang Wang
I could share some metrics about Alibaba Cloud EMR clusters. The ratio of Hadoop2 VS Hadoop3 is 1:3. Best, Yang On Thu, Dec 28, 2023 at 8:16 PM Martijn Visser wrote: > Hi all, > > I want to get some insights on how many users are still using Hadoop 2 > vs how many users are using Hadoop 3.

Re: Flink HA with Zookeeper and Docker Compose: unable to startup a working setup.

2024-01-15 Thread Yang Wang
Could you please configure the same HA configurations for TaskManager as well? It seems that the TaskManager container does not use a correct URL when contacting with ResourceManager. Best, Yang On Fri, Dec 29, 2023 at 11:13 PM Alessio Bernesco Làvore < alessio.berne...@gmail.com> wrote: >

Re: Deploying the K8S operator sample on GKE Autopilot : Association with remote system [akka.tcp://flink@basic-example.default:6123] has failed,

2024-01-15 Thread Yang Wang
Could you please directly use the JobManager Pod IP address instead of K8s service name(basic-example.default) and have a try with curl/wget? It seems that the JobManager K8s service could not be accessed. Best, Yang On Sat, Jan 13, 2024 at 1:24 AM LINZ, Arnaud wrote: > Hi, > > Some more

Re: Flink Kubernetes HA

2024-01-15 Thread Yang Wang
The fabric8 K8s client is using PATCH to replace get-and-update in v6.6.2. That's why you also need to give PATCH permission for the K8s service account. This would help to decrease the pressure of K8s APIServer. You could find more information here[1]. [1].

Re: Default Log4j properties in Native Kubernetes

2023-06-21 Thread Yang Wang
I assume you are using "*bin/flink run-application*" to submit a Flink application to K8s cluster. Then you could simply update your local log4j-console.properties, it will be shipped and mounted to JobManager/TaskManager pods via ConfigMap. Best, Yang Vladislav Keda 于2023年6月20日周二 22:15写道: >

Re: Flink实时计算平台在k8s上以Application模式启动作业如何实时同步作业状态到平台?

2023-04-11 Thread Yang Wang
可以通过JobResultStore[1]来获取任务最终的状态,flink-kubernetes-operator也是这样来获取的 [1]. https://cwiki.apache.org/confluence/display/FLINK/FLIP-194%3A+Introduce+the+JobResultStore Best, Yang Weihua Hu 于2023年3月22日周三 10:27写道: > Hi > > 我们内部最初版本是通过 cluster-id 来唯一标识一个 application,同时认为流式任务是长时间运行的,不应该主动退出。如果该 >

Re: 监控flink的prometheus经常OOM

2023-04-11 Thread Yang Wang
可以通过给Prometheus server来配置metric_relabel_configs[1]来控制采集哪些metrics [1]. https://prometheus.io/docs/prometheus/latest/configuration/configuration/#metric_relabel_configs Best, Yang casel.chen 于2023年3月22日周三 13:47写道: >

Re: "Error while retrieving the leader gateway" when using Kubernetes HA

2023-01-30 Thread Yang Wang
I assume you are using the standalone mode. Right? For the native K8s mode, the leader address should be *akka.tcp://flink@JM_POD_IP:6123/user/rpc/dispatcher_1 *when HA enabled. Best, Yang Anton Ippolitov via user 于2023年1月31日周二 00:21写道: > This is actually what I'm already doing, I'm only

Re: Apache Beam MinimalWordCount on Flink on Kubernetes using Flink Kubernetes Operator on GCP

2023-01-17 Thread Yang Wang
The "JAR file does not exist" exception comes from the JobManager side, not on the client. Please be aware that the local:// scheme in the jarURI means the path in the JobManager pod. You could use an init-container to download your user jar and mount it to the JobManager main-container. Refer to

Re: Supplying jar stored at S3 to flink to run the job in kubernetes

2023-01-16 Thread Yang Wang
Do you build your own flink-kubernetes-operator image with the flink-s3-fs plugin bundled[1]? [1]. https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.3/docs/custom-resource/overview/#flinksessionjob-spec-overview Best, Yang Weihua Hu 于2023年1月17日周二 10:47写道: > Hi, Rahul

Re: Flink Job Manager Recovery from EKS Node Terminations

2023-01-11 Thread Yang Wang
First, JobManager does not store any persistent data to local when the Kubernetes HA + S3 used. It means that you do not need to mount a PV for JobMananger deployment. Secondly, node failures or terminations should not cause the CrashLoopBackOff status. One possible reason I could imagine is a

Re: The use of zookeeper in flink

2023-01-03 Thread Yang Wang
The reason why the running jobs try to failover with zookeeper outage is that the JobManager lost leadership. Having a standby JobManager or not makes no difference. Best, Yang Matthias Pohl via user 于2023年1月2日周一 20:51写道: > And I screwed up the reply again. -.- Here's my previous response for

Re: How to get failed streaming Flink job log in Flink Native K8s mode?

2023-01-03 Thread Yang Wang
I think you might need a sidecar container or daemonset to collect the Flink logs and store into a persistent storage. You could find more information here[1]. [1]. https://www.alibabacloud.com/blog/best-practices-of-kubernetes-log-collection_596356 Best, Yang hjw 于2022年12月22日周四 23:28写道: > On

Re: Stand alone K8s HA mode with Static Tokens Used by Service Accounts

2022-11-24 Thread Yang Wang
IIUC, the fabric8 Kubernetes-client 5.5.0 should already support to reload the latest kube config if received 401 error. Refer to the following PR[1] for more information. Please share your feedback here if it still could not work. [1]. https://github.com/fabric8io/kubernetes-client/pull/2731

Re: flink作业提交运行后如何监听作业状态发生变化?

2022-11-23 Thread Yang Wang
其实可以参考Flink Kubernetes Operator里面的做法,设置execution.shutdown-on-application-finish参数为false 然后通过轮询Flink RestAPI拿到job的状态,job结束了再主动停掉Application cluster Best, Yang JasonLee <17610775...@163.com> 于2022年11月24日周四 09:59写道: > Hi > > > 可以通过 Flink 的 Metric 和 Yarn 的 Api 去获取任务的状态(任务提交到 yarn 的话) > > > Best >

Re: Optimize ApplicationDeployer API design

2022-11-23 Thread Yang Wang
Just Kindly remind, you attached images could not show normally. Given that *ApplicationDeployer* is not only used for Yarn application mode, but also native Kubernetes, I am not sure which way you are referring to return the applicationId. We already print the applicationId in the client logs.

Re: support escaping `#` in flink job spec in Flink-operator

2022-11-08 Thread Yang Wang
This is a known limit of the current Flink options parser. Refer to FLINK-15358[1] for more information. [1]. https://issues.apache.org/jira/browse/FLINK-15358 Best, Yang Gyula Fóra 于2022年11月8日周二 14:41写道: > It is also possible that this is a problem of the Flink native Kubernetes >

Re: [DISCUSS ] add --jars to support users dependencies jars.

2022-10-27 Thread Yang Wang
Thanks Jacky Lau for starting this discussion. I understand that you are trying to find a convenient way to specify dependency jars along with user jar. However, let's try to narrow down by differentiating deployment modes. # Standalone mode No matter you are using the standalone mode on virtual

Re: configMap value error when using flink-operator?

2022-10-26 Thread Yang Wang
Maybe we could change the values of *taskmanager.numberOfTaskSlots* and *parallelism.default *in flink-conf.yaml of Kubernetes operator to 1, which are aligned with the default values in Flink codebase. Best, Yang Gyula Fóra 于2022年10月26日周三 15:17写道: > Hi! > > I agree that this might be

Re: batch job 结束时, flink-k8s-operator crd 状态展示不清晰

2022-10-25 Thread Yang Wang
从1.15开始,任务结束不会主动把JobManager删除掉了。所以Kubernetes Operator就可以正常查到Job状态并且更新 Best, Yang ¥¥¥ 于2022年10月25日周二 15:58写道: > 退订 > > > > > --原始邮件-- > 发件人: > "user-zh" > > 发送时间:2022年10月25日(星期二) 下午3:33 > 收件人:"user-zh" > 主题:batch

Re: Flink Native K8S RBAC

2022-10-20 Thread Yang Wang
I have created a ticket[1] to fill the missing part in the native K8s documentation. [1]. https://issues.apache.org/jira/browse/FLINK-29705 Best, Yang Gyula Fóra 于2022年10月20日周四 13:37写道: > Hi! > > As a reference you can look at how the Flink Kubernetes Operator manages > RBAC settings: > > >

Re: Activate Flink HA without checkpoints on k8S

2022-10-19 Thread Yang Wang
Add some more information to Gyula's comment. For application mode without checkpoint, you do not need to activate the HA since it will not take any effect and the Flink job will be submitted again after the JobManager restarted. Because the job submission happens on the JobManager side. For

Re: fail to mount hadoop-config-volume when using flink-k8s-operator

2022-10-13 Thread Yang Wang
Currently, exporting the env "HADOOP_CONF_DIR" could only work for native K8s integration. The flink client will try to create the hadoop-config-volume automatically if hadoop env found. If you want to set the HADOOP_CONF_DIR in the docker image, please also make sure the specified hadoop conf

Re: native k8s部署模式下使用HA架构的HDFS集群无法正常连接

2022-09-21 Thread Yang Wang
你在Flink client端提交任务前设置一下HADOOP_CONF_DIR环境变量 然后再运行flink run-application命令 Best, Yang yanfei lei 于2022年9月22日周四 11:04写道: > Hi Tino, > 从org.apache.flink.core.fs.FileSystem.java > < >

Re: serviceAccount permissions issue for high availability in operator 1.1

2022-09-20 Thread Yang Wang
ay to change that to use standalone K8s? I haven't > seen anything about that in the docs, besides a mention that standalone > support is coming in version 1.2 of the operator. > > Thanks, > > Javier > > > On Thu, Sep 8, 2022, 22:50 Yang Wang wrote: > >> Since t

Re: serviceAccount permissions issue for high availability in operator 1.1

2022-09-08 Thread Yang Wang
Since the flink-kubernetes-operator is using native K8s integration[1] by default, you need to give the permissions of pod and deployment as well as ConfigMap. You could find more information about the RBAC here[2]. [1].

Re: [Flink 1.15.1 - Application mode native k8s Exception] - Exception occurred while acquiring lock 'ConfigMapLock

2022-09-08 Thread Yang Wang
uld be a warning then > > What about the 1st error we encountered regarding the kube/config file > exception? > > > Thank you so much, > Best, > Tamir > > -- > *From:* Yang Wang > *Sent:* Thursday, September 8, 2022 7:08 AM > *To:*

Re: [Flink Kubernetes Operator] FlinkSessionJob crd spec jarURI

2022-09-08 Thread Yang Wang
Given that the "local://" schema means the jar is available in the image/container of JobManager, so it could only be supported in the K8s application mode. If you configure the jarURI to "file://" schema for session cluster, it means that this jar file should be available in the

Re: Deploying Jobmanager on k8s as a Deployment

2022-09-07 Thread Yang Wang
For native K8s integration, the Flink ResourceManager will delete the JobManager K8s deployment as well as the HA data once the job reached a globally terminal state. However, it is indeed a problem for standalone mode since the JobManager will be restarted again even the job has finished. I

Re: [Flink 1.15.1 - Application mode native k8s Exception] - Exception occurred while acquiring lock 'ConfigMapLock

2022-09-07 Thread Yang Wang
-events-insertion-cluster-config-map" already > exists. > > Log file is enclosed. > > Thanks, > Tamir. > > -- > *From:* Yang Wang > *Sent:* Monday, September 5, 2022 3:03 PM > *To:* Tamir Sagi > *Cc:* user@flink.apache.org ; Lihi Peretz < > lihi.per...@

Re: [Flink 1.15.1 - Application mode native k8s Exception] - Exception occurred while acquiring lock 'ConfigMapLock

2022-09-05 Thread Yang Wang
Could you please check whether the "kubernetes.config.file" is configured to /opt/flink/.kube/config in the Flink configmap? It should be removed before creating the Flink configmap. Best, Yang Tamir Sagi 于2022年9月4日周日 18:08写道: > Hey All, > > We recently updated to Flink 1.15.1. We deploy

Re: How to open a Prometheus metrics port on the rest service when using the Kubernetes operator?

2022-09-05 Thread Yang Wang
I do not think we could add an additional port to the rest service since it is created by Flink internally. Actually, I do not suggest scrapping the metrics from rest service. Instead, the port in the pod needs to be used. Because the metrics might not work correctly if multiple JobManagers are

Re: [E] Re: Kubernetes operator expose UI rest service as NodePort instead of default clusterIP

2022-09-05 Thread Yang Wang
"ip" might work. Let me try that. I >>> believe there should be a reason to always override the >>> "REST_SERVICE_EXPOSED_TYPE" to "ClusterIP". >>> >>> [1] https://docs.aws.amazon.com/eks/latest/userguide/alb-ingress.html >>&

Re: Kubernetes operator expose UI rest service as NodePort instead of default clusterIP

2022-09-01 Thread Yang Wang
I am afraid the current flink-kubernetes-operator always overrides the "REST_SERVICE_EXPOSED_TYPE" to "ClusterIP". Could you please share why the ingress[1] could not meet your requirements? Compared with NodePort, I think it is a more graceful implementation. [1].

Re: 【flink native k8s】HA配置 taskmanager pod一直重启

2022-08-31 Thread Yang Wang
我猜测你是因为没有给TM设置service account,导致TM没有权限从K8s ConfigMap拿到leader,从而注册到RM、JM -Dkubernetes.taskmanager.service-account=wuzhiheng \ Best, Yang Xuyang 于2022年8月30日周二 23:22写道: > Hi, 能贴一下TM的日志吗,看Warn的日志貌似是TM一直起不来 > 在 2022-08-30 03:45:43,"Wu,Zhiheng" 写道: > >【问题描述】 > >启用HA配置之后,taskmanager

Re: Error when run test case in Windows

2022-08-22 Thread Yang Wang
It is caused by the following assert. Maybe we could *File.pathSeparator* instead of "/". *assertThat(optional.get()).isEqualTo(hadoopHome + "/conf");* Would you like to create a ticket and attach a PR for this issue? Best, Yang hjw <1010445...@qq.com> 于2022年8月21日周日 19:44写道: > When I run mvn

Re: Flink Operator Resources Requests and Limits

2022-07-27 Thread Yang Wang
We have the *kubernetes.jobmanager.cpu.limit-factor* and *kubernetes.jobmanager.memory.limit-factor* to control the limit value. The resources limit memory will be set to memory/cpu * limit-factor. Best, Yang PACE, JAMES 于2022年7月28日周四 01:26写道: > That does not seem to work. > > > > For

Re: flink-k8s-operator中webhook的作用

2022-07-27 Thread Yang Wang
Webhook主要的作用是做CR的校验,避免提交到K8s上之后才发现 例如:parallelism被错误的设置为负值,jarURI没有设置等 Best, Yang Kyle Zhang 于2022年7月27日周三 18:59写道: > Hi,all > 最近在看flink-k8s-operator[1],架构里有一个flink-webhook,请问这个container的作用是什么,如果配置 > webhook.create=false对整体功能有什么影响? > > Best regards > > [1] > >

Re: NodePort conflict for multiple HA application-mode standalone Kubernetes deploys in same namespace

2022-07-24 Thread Yang Wang
Removing the nodePort for every different Flink application is necessary so that it could pick up a random port. Moreover, I believe you also need to change some other yamls. For example, having a different name for JobManager/TaskManager yamls, update the jobmanager-service.yaml and

Re: [ANNOUNCE] Apache Flink Kubernetes Operator 1.1.0 released

2022-07-24 Thread Yang Wang
Congrats! Thanks Gyula for driving this release, and thanks to all contributors! Best, Yang Gyula Fóra 于2022年7月25日周一 10:44写道: > The Apache Flink community is very happy to announce the release of Apache > Flink Kubernetes Operator 1.1.0. > > The Flink Kubernetes Operator allows users to

Re: [ANNOUNCE] Apache Flink Kubernetes Operator 1.1.0 released

2022-07-24 Thread Yang Wang
Congrats! Thanks Gyula for driving this release, and thanks to all contributors! Best, Yang Gyula Fóra 于2022年7月25日周一 10:44写道: > The Apache Flink community is very happy to announce the release of Apache > Flink Kubernetes Operator 1.1.0. > > The Flink Kubernetes Operator allows users to

Re: flink native k8s 按照文档提交任务找不到对应的集群

2022-07-19 Thread Yang Wang
e 方式提交运行的任务,则用 FlinkDeployment,并配置好 job 部分。 会自动创建 > flink 集群,并根据 job 配置运行job。 > 这种方式不需要考虑集群创建、任务提交的步骤,本身就是一体。 > (2)对于 session 集群的创建,也是用 FlinkDeployment ,只是不需要指定 job 配置即可。 > (3)配合通过(2)方式创建的 session 集群,则可以配合 FlinkSessionJob 提交任务。 > > Yang Wang 于2022年7月12日周二 17:10写道: > &g

Re: standalone mode support in the kubernetes operator (FLIP-25)

2022-07-18 Thread Yang Wang
r? > What is the advantag doing so. > > Yang Wang 于2022年7月14日周四 10:55写道: > > > > I think the standalone mode support is expected to be done in the > version 1.2.0[1], which will be released on Oct 1 (ETA). > > > > [1]. > https://cwiki.apache.org/confluence/display/FLIN

Re: Retrying connect to server: 0.0.0.0/0.0.0.0:8030

2022-07-13 Thread Yang Wang
确认一下你是否正确设置了HADOOP_CONF_DIR环境变量 Best, Yang lishiyuan0506 于2022年7月14日周四 09:41写道: > 打扰大家一下,请问一下各位在yarn提交flink的时候,有没有遇到过Retrying connect to server: > 0.0.0.0/0.0.0.0:8030这个异常 > > > hadoop的classpath没问题,Spark和MR在Yarn上跑也没问题,就flink有这样的问题 > > > | | > lishiyuan0506 > | > | > lishiyuan0...@163.com > | >

Re: standalone mode support in the kubernetes operator (FLIP-25)

2022-07-13 Thread Yang Wang
I think the standalone mode support is expected to be done in the version 1.2.0[1], which will be released on Oct 1 (ETA). [1]. https://cwiki.apache.org/confluence/display/FLINK/Release+Schedule+and+Planning Best, Yang Javier Vegas 于2022年7月14日周四 06:25写道: > Hello! The operator docs >

Re: flink native k8s 按照文档提交任务找不到对应的集群

2022-07-12 Thread Yang Wang
方式是什么原理。 > > > > yidan zhao 于2022年7月12日周二 12:48写道: > > > > > > 我理解的 k8s 集群内是组成 k8s 的机器,是必须在 pod 内?我在k8s的node上也不可以是吧。 > > > > > > Yang Wang 于2022年7月12日周二 12:07写道: > > > > > > > > 日志里面已经说明的比较清楚了,如果用的是ClusterIP的方式,那你的Flink >

Re: flink native k8s 按照文档提交任务找不到对应的集群

2022-07-11 Thread Yang Wang
日志里面已经说明的比较清楚了,如果用的是ClusterIP的方式,那你的Flink client必须在k8s集群内才能正常提交。例如:起一个Pod,然后再pod里面执行flink run 否则你就需要NodePort或者LoadBalancer的方式了 2022-07-12 10:23:23,021 WARN org.apache.flink.kubernetes.KubernetesClusterDescriptor [] - Please note that Flink client operations(e.g. cancel, list, stop,

Re: [ANNOUNCE] Apache Flink Kubernetes Operator 1.0.1 released

2022-06-28 Thread Yang Wang
Thanks Gyula for working on the first patch release for the Flink Kubernetes Operator project. Best, Yang Gyula Fóra 于2022年6月28日周二 00:22写道: > The Apache Flink community is very happy to announce the release of Apache > Flink Kubernetes Operator 1.0.1. > > The Flink Kubernetes Operator

Re: [ANNOUNCE] Apache Flink Kubernetes Operator 1.0.1 released

2022-06-28 Thread Yang Wang
Thanks Gyula for working on the first patch release for the Flink Kubernetes Operator project. Best, Yang Gyula Fóra 于2022年6月28日周二 00:22写道: > The Apache Flink community is very happy to announce the release of Apache > Flink Kubernetes Operator 1.0.1. > > The Flink Kubernetes Operator

Re: Flink k8s Operator on AWS?

2022-06-26 Thread Yang Wang
Could you please share the JobManager logs of failed deployment? It will also help a lot if you could show the pending pod status via "kubectl describe ". Given that the current Flink Kubernetes Operator is built on top of native K8s integration[1], the Flink ResourceManager should allocate

Re: Flink operator - ignore ssl cert validation

2022-06-23 Thread Yang Wang
Do you mean the HttpArtifactFetcher could not support HTTPS? cc @Aitozi Best, Yang calvin beloy 于2022年6月22日周三 22:10写道: > Sorry typo "jarring" should be "jar url". > > Sent from Yahoo Mail on Android >

Re: Flink Operator - Support for k8s HA jobmanager

2022-06-23 Thread Yang Wang
Matyas's answer is on the point. You need to mount a shared volume for all the JobManager pods so that the uploaded jars are visible for them all. Best, Yang Őrhidi Mátyás 于2022年6月23日周四 04:34写道: > I guess the problem here is that your JM pods do not have access to a > common upload folder.

Re: HTTP 404 while creating resource with flink kubernetes operator and frabric8 client

2022-06-23 Thread Yang Wang
Do you have installed the operator along with CRD[1]? [1]. https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.0/docs/try-flink-kubernetes-operator/quick-start/#deploying-the-operator Best, Yang yu'an huang 于2022年6月23日周四 13:04写道: > Hi, > > It seems that you can't find

Re: Flink k8s Operator on AWS?

2022-06-23 Thread Yang Wang
just to get a few files into a pod container... I feel it's just a bit > much. > But again, this is my frustration with k8s, not with Flink ;-) > > Cheers, > Matt > > On Wed, Jun 22, 2022 at 5:32 AM Yang Wang wrote: > >> Matyas and Gyula have shared many great informations a

Re: Flink k8s Operator on AWS?

2022-06-21 Thread Yang Wang
Matyas and Gyula have shared many great informations about how to make the Flink Kubernetes Operator work on the EKS. One more input about how to prepare the user jars. If you are more familiar with K8s, you could use persistent volume to provide the user jars and them mount the volume to

Re: Flink Kubernetes Operator with K8S + Istio + mTLS - port definitions

2022-06-16 Thread Yang Wang
Could you please have a try with high availability enabled[1]? If HA enabled, the internal jobmanager rpc service will not be created. Instead, the TaskManager retrieves the JobManager address via HA services and connects to it via pod ip. [1].

Re: Flink k8s HA 手动删除作业deployment导致的异常

2022-06-12 Thread Yang Wang
Zhanghao的回答已经非常全面了,我再补充小点,删除Deployment保留HA ConfigMap是预期内的行为,文档里面有说明[1] 之所以这样设计有两点原因: (1.) 任务可能会被重启,但使用相同的cluster-id,并且希望从之前的checkpoint恢复 (2.) 单纯的删除ConfigMap会导致存储在DFS(e.g. HDFS、S3、OSS)上面的HA数据泄露 [1].

[ANNOUNCE] Apache Flink Kubernetes Operator 1.0.0 released

2022-06-05 Thread Yang Wang
The Apache Flink community is very happy to announce the release of Apache Flink Kubernetes Operator 1.0.0. The Flink Kubernetes Operator allows users to manage their Apache Flink applications and their lifecycle through native k8s tooling like kubectl. This is the first production ready release

[ANNOUNCE] Apache Flink Kubernetes Operator 1.0.0 released

2022-06-05 Thread Yang Wang
The Apache Flink community is very happy to announce the release of Apache Flink Kubernetes Operator 1.0.0. The Flink Kubernetes Operator allows users to manage their Apache Flink applications and their lifecycle through native k8s tooling like kubectl. This is the first production ready release

Re: Flink Kubernetes Operator v1.0 ETA

2022-06-01 Thread Yang Wang
If everything goes well, I will close the VOTE for RC4 on Friday night, which should run for more than 48 hours. And then finalize the release. Best, Yang Gyula Fóra 于2022年6月1日周三 23:30写道: > Hi Jeesmon! > > We are currently working through the release process. We are now in the > middle of

Re: multiple pipeline deployment using flink k8s operator

2022-06-01 Thread Yang Wang
The current application mode has the limitation that only one job could be submitted when HA enabled[1]. So a feasible solution is to use the session mode[2], it will be supported in the coming release-1.0.0. However, I am afraid it still could not satisfy your requirement "2 task managers (one

Re: Deployment on k8s via API

2022-05-17 Thread Yang Wang
Maybe you could have a try on the flink-kubernetes-operator[1]. It is designed for using Kubernetes CRD to manage the Flink applications. [1]. https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-0.1/ Best, Yang Devin Bost 于2022年5月18日周三 08:29写道: > Hi, > > I'm looking at

Re: Running in application mode on YARN without fat jar

2022-05-16 Thread Yang Wang
The usrlib for YARN only works for 1.15.0 and later versions. Refer to the ticket[1] for more information. [1]. https://issues.apache.org/jira/browse/FLINK-24897 Best, Yang Pavel Penkov 于2022年5月16日周一 22:59写道: > I can't manage to run an application on YARN because of classpath issues. > Flink

Re: Flink on Native K8s jobs turn in to `SUSPENDED` status unexpectedly.

2022-05-16 Thread Yang Wang
e were no more logs there. > > How can I reproduce the issue ? > > On Thu, May 12, 2022 at 10:35 AM Yang Wang wrote: > >> The SUSPENDED state is usually caused by lost leadership. Maybe you could >> find more information about leader in the JobManager and TaskManager logs.

Re: Flink on Native K8s jobs turn in to `SUSPENDED` status unexpectedly.

2022-05-11 Thread Yang Wang
The SUSPENDED state is usually caused by lost leadership. Maybe you could find more information about leader in the JobManager and TaskManager logs. Best, Yang Xiaolong Wang 于2022年5月11日周三 19:18写道: > Hello, > > Recently our Flink jobs on Native K8s encountered failing in the > `SUSPENDED`

Re: Flink Application + HA与HistoryServer的使用问题

2022-05-11 Thread Yang Wang
可以临时通过-D "$internal.pipeline.job-id="来自定义job id,但是个内部参数 你可以看下[1],了解更多讨论的信息 [1]. https://issues.apache.org/jira/browse/FLINK-19358 Best, Yang 谭家良 于2022年5月11日周三 22:17写道: > > > 我使用的Application模式:Kubernetes > 我使用的HA模式:Kubernetes HA > > > 目前Application + HA发现所有的Job

Re: Flink Kubernetes operator not having a scale subresource

2022-05-06 Thread Yang Wang
Currently, the flink-kubernetes-operator is using Flink native K8s integration[1], which means Flink ResourceManager will dynamically allocate TaskManager on demand. So the users do not need to specify the replicas of TaskManager. Just like Gyula said, one possible solution to make "kubectl

Re: Using the official flink operator and kubernetes secrets

2022-05-04 Thread Yang Wang
rvers" >> - "my.kafka.host:9093" >> - "--kafka.sasl.username" >> - "$(KAFKA_SASL_USERNAME)" >> - "--kafka.sasl.password" >> - "$(KAFKA_SASL_PASSWORD)" >> ​ >> >> It would be a great addition, simpli

Re: Using the official flink operator and kubernetes secrets

2022-05-03 Thread Yang Wang
Flink could not support environment replacement in the args. I think you could access the env via "*System.getenv()*" in the user main method. It should work since the user main method is executed in the JobManager side. Best, Yang Őrhidi Mátyás 于2022年4月28日周四 19:27写道: > Also, > > just

Re: flink operator sometimes cannot start jobmanager after upgrading

2022-05-02 Thread Yang Wang
I am afraid we do not handle the scenario that the JobManager deployment is deleted externally. Best, Yang Őrhidi Mátyás 于2022年5月2日周一 16:52写道: > I filed a Jira for tracking this issue: > https://issues.apache.org/jira/browse/FLINK-27468 > > On Mon, May 2, 2022 at 10:31 AM Őrhidi Mátyás >

Re: flink 任务对接k8s的第三方jar包管理问题

2022-04-25 Thread Yang Wang
* 使用flink run命令来提交任务到running的Session集群的话,只能是本地的jar * 也可以使用rest接口来提交,先上传到JobManager端[1],然后运行上传的jar[2] * 最后可以尝试一下flink-kubernetes-operator项目,目前Session job是支持远程jar的[3],项目还在不断完善 [1]. https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/rest_api/#jars-upload [2].

Re: how to setup working dir in Flink operator

2022-04-25 Thread Yang Wang
Using the pod template to configure the local SSD(via host-path or local PV) is the correct way. After that, either "java.io.tmpdir" or "process.taskmanager.working-dir" in CR should take effect. Maybe you need to share the complete pod yaml and logs of failed TaskManager. nit: if the

Re: FlinkSQL 对接k8s的提交问题

2022-04-25 Thread Yang Wang
目前Application模式确实不能支持已经生成好的JobGraph运行,我能想到一个work around的办法是就先写一个user jar直接把JobGraph提交到local的集群里面 就像下面这样 public class JobGraphRunner { private static final Logger LOG = LoggerFactory.getLogger(JobGraphRunner.class); public static void main(String[] args) throws Exception { final

Re: JobManager doesn't bring up new TaskManager during failure recovery

2022-04-23 Thread Yang Wang
track this issue. > > > > BRs, > > Chenyu > > > > *From: *"Zheng, Chenyu" > *Date: *Friday, April 22, 2022 at 6:26 PM > *To: *Yang Wang > *Cc: *"u...@flink.apache.org" , " > user-zh@flink.apache.org" > *Subject: *Re: JobManager d

Re: JobManager doesn't bring up new TaskManager during failure recovery

2022-04-23 Thread Yang Wang
track this issue. > > > > BRs, > > Chenyu > > > > *From: *"Zheng, Chenyu" > *Date: *Friday, April 22, 2022 at 6:26 PM > *To: *Yang Wang > *Cc: *"user@flink.apache.org" , " > user...@flink.apache.org" > *Subject: *Re: JobManager d

Re: JobManager doesn't bring up new TaskManager during failure recovery

2022-04-22 Thread Yang Wang
The root cause might be you APIServer is overloaded or not running normally. And then all the pods events of taskmanager-1-9 and taskmanager-1-10 are not delivered to the watch in FlinkResourceManager. So the two taskmanagers are not recognized by ResourceManager and then registration are

Re: JobManager doesn't bring up new TaskManager during failure recovery

2022-04-22 Thread Yang Wang
The root cause might be you APIServer is overloaded or not running normally. And then all the pods events of taskmanager-1-9 and taskmanager-1-10 are not delivered to the watch in FlinkResourceManager. So the two taskmanagers are not recognized by ResourceManager and then registration are

Re: Kubernetes killing TaskManager - Flink ignoring taskmanager.memory.process.size

2022-04-21 Thread Yang Wang
Could you please configure a bigger memory to avoid OOM and use NMTracker[1] to figure out the memory usage categories? [1]. https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html Best, Yang Dan Hill 于2022年4月21日周四 07:42写道: > Hi. > > I upgraded to Flink v1.14.4

Re: Flink Kubernetes Operator

2022-04-14 Thread Yang Wang
你这个报错主要原因还是访问外部的一些镜像源失败导致的,你可以使用一些云厂商提供的代理来解决拉镜像失败的问题 或者使用--set webhook.create=false来关闭webhook功能 Best, Yang casel.chen 于2022年4月14日周四 15:46写道: > The deployment 'cert-manager-webhook' shows > Failed to pull image "quay.io/jetstack/cert-manager-webhook:v1.7.1": rpc > error: code = Unknown desc =

Re: Enabling savepoints when deploying in Application Mode

2022-04-12 Thread Yang Wang
If you are trying to submit a job to an already-running application via "flink run", then it will not succeed. Because this is the by-design behavior. Please note that triggering a savepoint will also update the checkpoint information in HA ConfigMap, so deleting the deployment(with HA ConfigMap

Re: Official Flink operator additional class paths

2022-04-07 Thread Yang Wang
It seems that you have a typo when specifying the pipeline classpath. "file:///flink-jar/flink-connector-rabbitmq_2.12-1.14.4.jar" -> "file:///flink-jars/flink-connector-rabbitmq_2.12-1.14.4.jar" If this is not the root cause, maybe you could have a try with downloading the connector jars to

Re: The flink-kubernetes-operator vs third party flink operators

2022-04-05 Thread Yang Wang
Thanks for the interest on the flink-kubernetes-operator project. I believe you could leave a comment in the ticket FLINK-27049. If the reporter has not start working on this ticket, then you could be assigned. Best, Yang Hao t Chang 于2022年4月6日周三 06:30写道: > Hi Gyula > > > > Thanks for the

Re: flink cluster startup time

2022-03-30 Thread Yang Wang
@Gyula Fóra is trying to prepare the preview release(0.1) for flink-kubernetes-operator. It now is fully functional for application mode. You could have a try and share more feedback with the community. The release-1.0 aims for production ready. And we still miss some important pieces(e.g.

Re: "Native Kubernetes" sample in Flink documentation fails. JobManager Web Interface is wrongly generated. [Flink 1.14.4]

2022-03-29 Thread Yang Wang
ubernetes.config.file=/home/devuser/.kube/config > examples/batch/WordCount.jar > > > > Best regards, > > Burcu > > > > *From:* Yang Wang [mailto:danrtsey...@gmail.com] > *Sent:* Saturday, March 26, 2022 7:48 AM > *To:* Burcu Gul POLAT EGRI > *Cc:* user@flink.a

Re: JobManager failed to renew it's leadership (K8S HA)

2022-03-27 Thread Yang Wang
Could you please verify whether the JobManager is going through a long full GC or the Kubernetes APIServer is working well at that moment? We are using Kubernetes HA service in the production and it seems stable without your issue. Best, Yang marco andreas 于2022年3月27日周日 18:35写道: > > Hello, >

Re: Deploy a Flink session cluster natively on K8s with multi AZ

2022-03-27 Thread Yang Wang
> In the example, we can pass args in the command, is there a way to do it by using the flink-conf.yaml? Yes. All the changes in the $FLINK_HOME/conf/flink-conf.yaml at the client side will also be picked up when deploying a native K8s cluster. For your use case, I am also suggesting the

Re: flink-kubernetes-operator: Flink deployment stuck in scheduled state when increasing resource CPU above 1

2022-03-25 Thread Yang Wang
Could you please share the result of "kubectl describe pods" when getting stuck? It will be very useful to help to figure out the root cause. I guess it might be related to insufficient resources for minikube. Best, Yang Őrhidi Mátyás 于2022年3月26日周六 03:12写道: > It's worth checking the

Re: "Native Kubernetes" sample in Flink documentation fails. JobManager Web Interface is wrongly generated. [Flink 1.14.4]

2022-03-25 Thread Yang Wang
The root cause might be the LoadBalancer could not really work in your environment. We already have a ticket to track this[1] and will try to get it resolved in the next release. For now, could you please have a try by adding "-Dkubernetes.rest-service.exposed.type=NodePort" to your session and

Re: Kubernetes HA on an application cluster

2022-03-21 Thread Yang Wang
This log means the Flink internal leader elector failed to renew the leader ConfigMap to keep its leadership. It might be caused by a network issue, long fullGC or the K8s APIServer internal error. This blog[1] could help you to know how the Kubernetes HA works. [1].

Re: Re: k8s native session 问题咨询

2022-03-08 Thread Yang Wang
你用新版本试一下,看着是已经修复了 https://issues.apache.org/jira/browse/FLINK-19212 Best, Yang 崔深圳 于2022年3月9日周三 10:31写道: > > > > web ui报错:请求这个接口: /jobs/overview,时而报错, Exception on server > side:\norg.apache.flink.runtime.rpc.akka.exceptions.AkkaRpcException: > Failed to serialize the result for RPC call : >

Re: Submit job to a session cluster on Kubernetes via REST API

2022-03-06 Thread Yang Wang
If you want to use the RestClusterClient to do the job submission and lifecycle management, the implementation in the flink-kubernetes-operator[1] project may give you some insights. You could also use /jars/:jarid/run[2] to run a Flink job. It is a pure HTTP interface. [1].

Re: K8s部署Flink 作业,无法在Web UI查看TaskManger的STDOUT日志

2022-03-02 Thread Yang Wang
Standalone Flink on K8s 和 native K8s都会有你说的这个问题 主要原因是标准输出打印到的pod console了,所以通过kubectl logs可以查看stdout日志,但webUI上就没有 你可以参考这个commit[1]自己编译一个Flink binary来实现 [1]. https://github.com/wangyang0918/flink/commit/2454b6daa2978f9ea9669435f92a9f2e78de357a Best, Yang xinzhuxiansheng 于2022年3月2日周三 15:00写道:

Re: [Flink-1.14.3] Restart of pod due to duplicatejob submission

2022-02-24 Thread Yang Wang
This might be related with FLINK-21928 and seems already fixed in 1.14.0. But it will have some limitations and users need to manually clean up the HA entries. Best, Yang Parag Somani 于2022年2月24日周四 13:42写道: > Hello, > > Recently due to log4j vulnerabilities, we have upgraded to Apache Flink >

Re: No effect from --allowNonRestoredState or "execution.savepoint.ignore-unclaimed-state" in K8S application mode

2022-02-22 Thread Yang Wang
last resort but didn't > think to do it together with "--fromSavepoint". > > Thanks again! > > On Sun, Feb 20, 2022 at 9:49 PM Yang Wang wrote: > >> By design, we should support arbitrary config keys via the CLI when using >> generic CLI mode. >> &g

Re: No effect from --allowNonRestoredState or "execution.savepoint.ignore-unclaimed-state" in K8S application mode

2022-02-20 Thread Yang Wang
> The only one i'm having problems with is > "execution.savepoint.ignore-unclaimed-state". > > On Fri, Feb 18, 2022 at 3:42 PM Austin Cawley-Edwards < > austin.caw...@gmail.com> wrote: > >> Hi Andrey, >> >> It's unclear to me from the docs[1] if the flink

Re: Flink 1.14.2 - Log4j2 -Dlog4j.configurationFile is ignored and falls back to default /opt/flink/conf/log4j-console.properties

2022-01-24 Thread Yang Wang
ge prior cluster creation; all logs' files are there. > once the cluster is deployed, they are missing. (bug?) > > Best, > Tamir. > -- > *From:* Tamir Sagi > *Sent:* Friday, January 21, 2022 7:19 PM > *To:* Yang Wang > *Cc:* user@flink.apache.org > *

Re: Flink 1.14.2 - Log4j2 -Dlog4j.configurationFile is ignored and falls back to default /opt/flink/conf/log4j-console.properties

2022-01-20 Thread Yang Wang
ePathList > "$FLINK_TM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`" ${CLASS_TO_RUN} > "${ARGS[@]}" "${FLINK_ENV_JAVA_OPTS}" > > Unless there are system configurations which not supposed to be overridden > by user(And then having dedicated env variables

  1   2   3   4   5   6   7   >