Re: Is there an HA solution to run flink job with multiple source

2022-06-01 Thread Alexander Fedulov
copy the implementation of >> existing kafka source and change a little code conveniently. >> >> At 2022-06-01 22:38:39, "Bariša Obradović" wrote: >> >> Hi, >> we are running a flink job with multiple kafka sources connected to >> different ka

Re: Is there an HA solution to run flink job with multiple source

2022-06-01 Thread Jing Ge
lse yourself. But the good news is that you can copy the implementation of > existing kafka source and change a little code conveniently. > > At 2022-06-01 22:38:39, "Bariša Obradović" wrote: > > Hi, > we are running a flink job with multiple kafka sources connected to >

Re:Is there an HA solution to run flink job with multiple source

2022-06-01 Thread Xuyang
of existing kafka source and change a little code conveniently. At 2022-06-01 22:38:39, "Bariša Obradović" wrote: Hi, we are running a flink job with multiple kafka sources connected to different kafka servers. The problem we are facing is when one of the kafka's is down, the

Is there an HA solution to run flink job with multiple source

2022-06-01 Thread Bariša Obradović
Hi, we are running a flink job with multiple kafka sources connected to different kafka servers. The problem we are facing is when one of the Kafka clusters is down, the flink job starts restarting. Is there any way for flink to pause processing of the kafka which is down, and yet continue processing

Re: Flink Job Manager unable to recognize Task Manager Available slots

2022-05-24 Thread Teoh, Hong
/3a97d1d50f663027ae81efe0f0aa rmr /flink/default/leader/3a97d1d50f663027ae81efe0f0aa This should result in your JobManager recovering from the faulty job. Regards, Hong From: "s_penakalap...@yahoo.com" Date: Tuesday, 24 May 2022 at 18:40 To: User Subject: RE: [EXTERNAL]Flink Job Mana

Re: Flink Job Manager unable to recognize Task Manager Available slots

2022-05-24 Thread s_penakalap...@yahoo.com
Hi Team, Any inputs please? We are badly stuck. Regards, Sunitha On Sunday, May 22, 2022, 12:34:22 AM GMT+5:30, s_penakalap...@yahoo.com wrote: Hi All, Help please! We have a standalone Flink service installed on individual VMs and clubbed to form a cluster with HA and checkpointing in place. When

Flink Job Manager unable to recognize Task Manager Available slots

2022-05-21 Thread s_penakalap...@yahoo.com
Hi All, Help please! We have a standalone Flink service installed on individual VMs and clubbed to form a cluster with HA and checkpointing in place. When cancelling a job, the Flink cluster went down and it is unable to start up normally, as the Job Manager keeps going down with the below error:

答复: Flink Job Execution issue at Yarn

2022-05-18 Thread Geng Biao
mum-allocation-mb` and `yarn.nodemanager.resource.memory-mb`. Maybe it is better to post the JM/TM logs if any of them exist to provide more information. Best, Biao Geng From: Anitha Thankappan Date: Wednesday, May 18, 2022 8:26 PM To: user@flink.apache.org Subject: Flink Job Execution issue at Yarn Hi,
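
The YARN sizing issue in this thread comes down to one comparison: a requested container must fit under both `yarn.scheduler.maximum-allocation-mb` and what a single NodeManager offers. A toy sketch of that check (illustrative only, not Flink or YARN API):

```python
def fits_in_yarn(requested_mb: int, scheduler_max_mb: int, nodemanager_mb: int) -> bool:
    """A container request larger than the scheduler's per-container cap
    or a single NodeManager's memory will never be allocated."""
    return requested_mb <= min(scheduler_max_mb, nodemanager_mb)

# e.g. a 10 GB TaskManager process cannot be placed if the scheduler
# caps containers at 8 GB, no matter how big the node is.
ok = fits_in_yarn(10_240, 8_192, 16_384)
```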

Flink Job Execution issue at Yarn

2022-05-18 Thread Anitha Thankappan
Hi, We are using the below command to submit a flink application job to a GCP Dataproc cluster using Yarn. *flink run-application -t yarn-application .jar* Our cluster has 1 master node with 64 GB and 10 worker nodes of 32 GB. The flink configurations given are: *jobmanager.memory.process.size:

Re: flink Job is throwing dependency issue when submitted to cluster

2022-05-10 Thread Konstantin Knauf
Info | > | Date | 2022-05-07 13:21 | > | To | d...@flink.apache.org, user< > user@flink.apache.org> | > | Cc | | > | Subject | flink Job is throwing depdnecy issue when submitted to clusrer | > I have one flink job which reads files from s3 and processes them. > Currently, it is runni

flink Job is throwing dependency issue when submitted to cluster

2022-05-06 Thread Great Info
I have one flink job which reads files from s3 and processes them. Currently, it is running on flink 1.9.0. I need to upgrade my cluster to 1.13.5, so I have made the changes in my job pom and brought up the flink cluster using the 1.13.5 dist. When I submit my application I am getting the below

Cluster configuration for network-intensive Flink job

2022-03-22 Thread Vasileva, Valeriia
Hi, folks! I am running Flink Streaming job in mode=Batch on EMR. The job has following stages: 1. Read from MySQL 2. KeyBy user_id 3. Reduce by user_id 4. Async I/O enriching from Redis 5. Async I/O enriching from other Redis 6. Async I/O enriching from REST #1 7.
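
The Async I/O stages (4-6) in this pipeline are each bounded by a capacity parameter; the idea can be sketched with plain asyncio (hypothetical enrichment logic, not Flink's AsyncDataStream API):

```python
import asyncio

async def enrich(record: int, sem: asyncio.Semaphore) -> int:
    # Stand-in for a Redis/REST lookup; the semaphore mirrors the
    # 'capacity' argument that bounds in-flight requests in Flink's
    # async I/O operator.
    async with sem:
        await asyncio.sleep(0)   # pretend network round-trip
        return record * 10       # pretend enrichment result

async def enrich_all(records, capacity: int):
    sem = asyncio.Semaphore(capacity)
    # gather preserves input order, analogous to ordered async I/O
    return await asyncio.gather(*(enrich(r, sem) for r in records))

results = asyncio.run(enrich_all([1, 2, 3], capacity=2))
```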

Re: Flink job recovery after task manager failure

2022-03-03 Thread yidan zhao
dleStore{namespace='flink/default/checkpoints/523f9e48274186bb97c13e3c2213be0e'}. > > 2022-02-24 12:20:16,712 INFO > org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore > [] - All 1 checkpoints found are already downloaded. > > 2022-02-24 12:20:16,712 INFO > o

Re: Flink job recovery after task manager failure

2022-03-02 Thread Afek, Ifat (Nokia - IL/Kfar Sava)
- No master state to restore 2 Thanks, Ifat From: yidan zhao Date: Wednesday, 2 March 2022 at 4:08 To: "Afek, Ifat (Nokia - IL/Kfar Sava)" Cc: zhlonghong , "user@flink.apache.org" Subject: Re: Flink job recovery after task manager failure State backend can be set a

Re: Flink job recovery after task manager failure

2022-03-01 Thread yidan zhao
ers? Is > there another option? > > > > Thanks, > > Ifat > > > > *From: *Zhilong Hong > *Date: *Thursday, 24 February 2022 at 19:58 > *To: *"Afek, Ifat (Nokia - IL/Kfar Sava)" > *Cc: *"user@flink.apache.org" > *Subject: *Re: Flink job rec

Re: Flink job recovery after task manager failure

2022-03-01 Thread Afek, Ifat (Nokia - IL/Kfar Sava)
system be shared between the task managers and job managers? Is there another option? Thanks, Ifat From: Zhilong Hong Date: Thursday, 24 February 2022 at 19:58 To: "Afek, Ifat (Nokia - IL/Kfar Sava)" Cc: "user@flink.apache.org" Subject: Re: Flink job recovery after task

Re: Flink job recovery after task manager failure

2022-02-25 Thread Afek, Ifat (Nokia - IL/Kfar Sava)
Hi Zhilong, I will check the issues you raised. Thanks for your help, Ifat From: Zhilong Hong Date: Thursday, 24 February 2022 at 19:58 To: "Afek, Ifat (Nokia - IL/Kfar Sava)" Cc: "user@flink.apache.org" Subject: Re: Flink job recovery after task manager failure Hi, Afe

Re: Flink job recovery after task manager failure

2022-02-24 Thread Zhilong Hong
On Thu, Feb 24, 2022 at 9:04 PM Afek, Ifat (Nokia - IL/Kfar Sava) < ifat.a...@nokia.com> wrote: > Thanks Zhilong. > > > > The first launch of our job is fast, I don’t think that’s the issue. I see > in flink job manager log that there were several exceptions during the > re

Re: Flink job recovery after task manager failure

2022-02-24 Thread Afek, Ifat (Nokia - IL/Kfar Sava)
Thanks Zhilong. The first launch of our job is fast, I don’t think that’s the issue. I see in flink job manager log that there were several exceptions during the restart, and the task manager was restarted a few times until it was stabilized. You can find the log here: jobmanager-log.txt.gz

Re: Flink job recovery after task manager failure

2022-02-23 Thread Zhilong Hong
Hi, Afek! When a TaskManager is killed, JobManager will not be acknowledged until a heartbeat timeout happens. Currently, the default value of heartbeat.timeout is 50 seconds [1]. That's why it takes more than 30 seconds for Flink to trigger a failover. If you'd like to shorten the time a
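
The failover delay described here is pure bookkeeping: the JobManager declares a TaskManager lost only once no heartbeat has arrived within heartbeat.timeout (50s by default). A minimal model of that logic:

```python
class HeartbeatMonitor:
    """Toy model of why failover waits for heartbeat.timeout: the peer
    is declared lost only after the timeout elapses with no heartbeat."""
    def __init__(self, timeout_ms: int = 50_000):
        self.timeout_ms = timeout_ms
        self.last_heartbeat_ms = 0

    def report(self, now_ms: int) -> None:
        self.last_heartbeat_ms = now_ms

    def is_lost(self, now_ms: int) -> bool:
        return now_ms - self.last_heartbeat_ms > self.timeout_ms

m = HeartbeatMonitor()
m.report(1_000)  # last heartbeat at t=1s; the TM is then killed
```

Lowering the timeout shortens failover but risks declaring healthy TaskManagers dead under GC pauses or network jitter.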

Flink job recovery after task manager failure

2022-02-23 Thread Afek, Ifat (Nokia - IL/Kfar Sava)
Hi, I am trying to use Flink checkpoints solution in order to support task manager recovery. I’m running flink using beam with filesystem storage and the following parameters: checkpointingInterval=3 checkpointingMode=EXACTLY_ONCE. What I see is that if I kill a task manager pod, it takes

Flink job of multiple sink tables can't started on yarn

2022-01-24 Thread Xuekui
Hi all, I have one flink job which reads data from one kafka topic and sinks to two kafka topics using Flink SQL. The code is something like this: tableEnv.executeSql( """ create table sink_table1 ( xxx xxx ) with ( 'connector' = 'kafka', 'topic' = 'topic1' ) ""

Re: Uploading jar on multiple flink job managers

2021-12-31 Thread Timo Walther
Hi Puneet, are we talking about the `web.upload.dir` [1] ? Maybe others have a better solution for your problem, but have you thought about configuring an NFS or some other distributed file system as the JAR directory? In this case it should be available to all JobManagers. Regards, Timo

Uploading jar on multiple flink job managers

2021-12-24 Thread Puneet Duggal
Hi, So currently we are using flink 1.12 in HA mode on production. There are 3 job managers (1 leader and 2 standby). When I am uploading a jar on one of the job managers, somehow it is not reflected on other job managers. Is there any way where I can achieve a behaviour where uploading jar to

Re: Running a flink job with Yarn per-job mode from application code.

2021-12-09 Thread Caizhi Weng
Hi! For Yarn jobs, Flink web UI (which shares the same port with REST API) can be found by clicking into the "Application Manager" link in the corresponding application page. I'm not familiar with Yarn API but I guess there is some way to get this application manager link. Kamil ty

Re: Running a flink job with Yarn per-job mode from application code.

2021-12-09 Thread Kamil ty
Sorry for that. Will remember to add the mailing list in responses. The REST API approach will probably be sufficient. One more question regarding this. Is it possible to get the address/port of the rest api endpoint from the job? I see that when deploying a job to yarn the rest.address and

Re: Running a flink job with Yarn per-job mode from application code.

2021-12-08 Thread Caizhi Weng
Hi! Please make sure to always reply to the user mailing list so that everyone can see the discussion. You can't get the execution environment for an already running job but if you want to operate on that job you can try to get its JobClient instead. However this is somewhat complicated to get

Re: Running a flink job with Yarn per-job mode from application code.

2021-12-07 Thread Caizhi Weng
Hi! So you would like to submit a yarn job with Java code, not using /bin/flink run? If that is the case, you'll need to set 'execution.target' config option to 'yarn-per-job'. Set this in the configuration of ExecutionEnvironment and execute the job with Flink API as normal. Kamil ty

Running a flink job with Yarn per-job mode from application code.

2021-12-07 Thread Kamil ty
Hello all, I'm looking for a way to submit a Yarn job from another flink jobs application code. I can see that you can access a cluster and submit jobs with a RestClusterClient, but it seems a Yarn per-job mode is not supported with it. Any suggestions would be appreciated. Best Regards Kamil

Kafka producer reports errors after a flink job is cancelled

2021-11-18 Thread yu...@kiscloud.net
Hi All: We have a normally-running flink job; after cancelling it, flink's log printed some messages. How should we trace the root cause of this kind of exception? TY-APP-DATA-REAL-COMPUTATION - [2021-11-18 22:10:32.465] - INFO [RecordComputeOperator -> Sink: wi-data-sink (3/10)] org.apache.flink.runtime.taskmanager.Task - RecordComputeOperator -> Sink: wi-data-sink (3/10)

Re: Problem with Flink job and Kafka.

2021-10-20 Thread Marco Villalobos
AM +0800, Marco Villalobos < > mvillalo...@kineteque.com>, wrote: > > I have the simplest Flink job that simply deques off of a kafka topic and > writes to another kafka topic, but with headers, and manually copying the > event time into th

Re: Problem with Flink job and Kafka.

2021-10-18 Thread Qingsheng Ren
aste part of your code (on DataStream API) or SQL (on Table & SQL API). -- Best Regards, Qingsheng Ren Email: renqs...@gmail.com On Oct 19, 2021, 9:28 AM +0800, Marco Villalobos , wrote: > I have the simplest Flink job that simply deques off of a kafka topic and > writes to anoth

Problem with Flink job and Kafka.

2021-10-18 Thread Marco Villalobos
I have the simplest Flink job that simply deques off of a kafka topic and writes to another kafka topic, but with headers, and manually copying the event time into the kafka sink. It works as intended, but sometimes I am getting this error

Re: Yarn job not exit when flink job exit

2021-10-11 Thread Yangze Guo
aizhi Weng wrote: > > Hi! > > yarn-cluster is the mode for a yarn session cluster, which means the cluster > will remain even after the job is finished. If you want to finish the Flink > job as well as the yarn job, use yarn-per-job mode instead. > > Jake wrote on Saturday, October 9, 2021 at 5:53 PM

Re: Yarn job not exit when flink job exit

2021-10-11 Thread Caizhi Weng
Hi! yarn-cluster is the mode for a yarn session cluster, which means the cluster will remain even after the job is finished. If you want to finish the Flink job as well as the yarn job, use yarn-per-job mode instead. Jake wrote on Saturday, October 9, 2021 at 5:53 PM: > Hi > > When submit job in yarn-clus

Yarn job not exit when flink job exit

2021-10-09 Thread Jake
Hi When submitting a job in yarn-cluster mode, the flink job finishes but the yarn job does not exit. What should I do? Submit command: /opt/app/flink/flink-1.14.0/bin/flink run -m yarn-cluster ./flink-sql-client.jar --file dev.sql

Re: flink job : TPS drops from 400 to 30 TPS

2021-09-27 Thread JING ZHANG
Hi, About cpu cost, there are several methods: 1. Flink builtin metric: `Status.JVM.CPU.Load` [1] 2. Use `top` command on the target machine which deploys a suspect TaskManager 3. You could use flame graph to do deeper profiler of a JVM [2]. ... About RPC response, I'm not an expert on HBase, I'm
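
Option 3 (flame graphs) ultimately reduces to folding sampled stacks into counts; a toy version of that collapsed-stack step (the sample frames below are made up):

```python
from collections import Counter

def collapse(stack_samples):
    """Fold raw stack samples (outermost-first frame lists) into the
    'collapsed' semicolon-joined counts a flame-graph renderer consumes."""
    return Counter(";".join(frames) for frames in stack_samples)

counts = collapse([
    ["main", "process", "hbaseGet"],
    ["main", "process", "hbaseGet"],
    ["main", "process", "parse"],
])
# Wide bars in the rendered graph correspond to the highest counts here.
```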

Re: flink job : TPS drops from 400 to 30 TPS

2021-09-27 Thread Ragini Manjaiah
Please let me know how to check the RPC response and CPU cost. On Mon, Sep 27, 2021 at 1:19 PM JING ZHANG wrote: > Hi, > Since there is not enough information, you could first check the back > pressure status of the job [1], find the task which caused the back > pressure. > Then try to find out

Re: flink job : TPS drops from 400 to 30 TPS

2021-09-27 Thread JING ZHANG
Hi, Since there is not enough information, you could first check the back pressure status of the job [1], find the task which caused the back pressure. Then try to find out why the task processed data slowly, there are many reasons, for example the following reasons: (1) Does data skew exist,

flink job : TPS drops from 400 to 30 TPS

2021-09-27 Thread Ragini Manjaiah
Hi, I have a flink real time job which processes user records via a topic and also reads data from hbase acting as a lookup table. If the lookup table does not contain the required metadata then it queries the external db via an api. For the first 1 to 2 hours it works fine without issues; later it drops down

Re: Exact S3 Permissions to allow a flink job to use s3 for checkpointing

2021-09-24 Thread Meissner, Dylan
to allow a flink job to use s3 for checkpointing Hi, Thomas I am not an expert on s3 but I think Flink needs write/read/delete (maybe list) permission on the path (bucket). BTW, what error did you encounter? Best, Guowei On Fri, Sep 24, 2021 at 5:00 AM Thomas Wang mailto:w...@datability.io

Re: Exact S3 Permissions to allow a flink job to use s3 for checkpointing

2021-09-24 Thread Guowei Ma
Hi, Thomas I am not an expert on s3 but I think Flink needs write/read/delete (maybe list) permission on the path (bucket). BTW, what error did you encounter? Best, Guowei On Fri, Sep 24, 2021 at 5:00 AM Thomas Wang wrote: > Hi, > > I'm trying to figure out what exact s3 permissions doe

Exact S3 Permissions to allow a flink job to use s3 for checkpointing

2021-09-23 Thread Thomas Wang
Hi, I'm trying to figure out what exact s3 permissions does a flink job need to work appropriately when using s3 for checkpointing. Currently, I have the following IAM Policy, but it seems insufficient. Can anyone help me figure this out? Thanks. { Action = [ "s3:PutObject",
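
For reference, here is a sketch of what such a policy might contain, expressed as a Python dict. The action list is an assumption following Guowei Ma's reply in this thread (write/read/delete, maybe list), and the bucket name is a placeholder; it is not a verified minimal policy:

```python
# Assumed minimal statement for s3://my-bucket/checkpoints/* —
# actions and resources are illustrative, not taken from the thread.
checkpoint_policy = {
    "Effect": "Allow",
    "Action": [
        "s3:PutObject",     # write checkpoint files
        "s3:GetObject",     # read them back on restore
        "s3:DeleteObject",  # clean up subsumed checkpoints
        "s3:ListBucket",    # enumerate checkpoint directories
    ],
    "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/checkpoints/*",
    ],
}
```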

Re: How to allocate suitable resources (memory, cores) to a flink job based on data volume

2021-08-23 Thread 张锴
No, you must be mistaking me for someone else. yidan zhao wrote on Tuesday, August 24, 2021 at 12:50 PM: > Are you zhangkai30? > > 张锴 wrote on Tuesday, August 24, 2021 at 11:16 AM: > > Our data volume is about 6 TB a day, with only simple processing and no computation involved. Given sufficient resources, when starting a flink job > > from the command line, how much should we allocate, e.g. jobmanager memory, taskmanager memory, slots, number of taskmanagers, parallelism p? > > How should we think about this specifically? > > > > We have fairly little experience with data at this scale; could anyone point the way? > >

Re: How to allocate suitable resources (memory, cores) to a flink job based on data volume

2021-08-23 Thread yidan zhao
Are you zhangkai30? 张锴 wrote on Tuesday, August 24, 2021 at 11:16 AM: > Our data volume is about 6 TB a day, with only simple processing and no computation involved. Given sufficient resources, when starting a flink job > from the command line, how much should we allocate, e.g. jobmanager memory, taskmanager memory, slots, number of taskmanagers, parallelism p? > How should we think about this specifically? > > We have fairly little experience with data at this scale; could anyone point the way? >

How to allocate suitable resources (memory, cores) to a flink job based on data volume

2021-08-23 Thread 张锴
Our data volume is about 6 TB a day, with only simple processing and no computation involved. Given sufficient resources, when starting a flink job from the command line, how much should we allocate, e.g. jobmanager memory, taskmanager memory, slots, number of taskmanagers, parallelism p? How should we think about this specifically? We have fairly little experience with data at this scale; could anyone point the way?

Re: Flink job failure during yarn node termination

2021-08-04 Thread Rainie Li
Hi Nicolaus, I double checked our hdfs config again; it is set to 1 instead of 2. I will try the solution you provided. Thanks again. Best regards Rainie On Wed, Aug 4, 2021 at 10:40 AM Rainie Li wrote: > Thanks for the context Nicolaus. > We are using S3 instead of HDFS. > > Best regards >

Re: Flink job failure during yarn node termination

2021-08-04 Thread Rainie Li
Thanks for the context Nicolaus. We are using S3 instead of HDFS. Best regards Rainie On Wed, Aug 4, 2021 at 12:39 AM Nicolaus Weidner < nicolaus.weid...@ververica.com> wrote: > Hi Rainie, > > I found a similar issue on stackoverflow, though quite different > stacktrace: >

Re: Flink job failure during yarn node termination

2021-08-04 Thread Rainie Li
Thanks Till. We terminated one of the worker nodes. We enabled HA by using Zookeeper. Sure, we will try upgrade job to newer version. Best regards Rainie On Tue, Aug 3, 2021 at 11:57 PM Till Rohrmann wrote: > Hi Rainie, > > It looks to me as if Yarn is causing this problem. Which Yarn node are

Re: Flink job failure during yarn node termination

2021-08-04 Thread Till Rohrmann
Hi Rainie, It looks to me as if Yarn is causing this problem. Which Yarn node are you terminating? Have you configured your Yarn cluster to be highly available in case you are terminating the ResourceManager? Flink should retry the operation of starting a new container in case it fails. If this

Re: Calling a stateful fuction from Flink Job - DataStream Integration

2021-08-03 Thread Igal Shilman
in Flink, perhaps other members of the community might chime in :-) All the best, Igal. On Wed, Jul 28, 2021 at 12:05 PM Deniz Koçak wrote: > Hi, > > We would like to host a separate service (logically and possibly > physically) from the Flink Job which also we would like to call from

Flink job failure during yarn node termination

2021-08-03 Thread Rainie Li
Hi Flink Community, My flink application is running version 1.9 and it failed to recover (the application was running but checkpoints failed and the job stopped processing data) during hadoop yarn node termination. *Here is the job manager log error:* *2021-07-26 18:02:58,605 INFO

Calling a stateful fuction from Flink Job - DataStream Integration

2021-07-28 Thread Deniz Koçak
Hi, We would like to host a separate service (logically and possibly physically) from the Flink Job, which we would also like to call from our Flink Job. We have been considering remote functions via HTTP, which seems to be a good option on cloud. We have been considering Async I/O and statefun

Re: How would Flink job react to change of partitions in Kafka topic?

2021-06-23 Thread Thomas Wang
Jun 22, 2021 at 1:32 PM Thomas Wang wrote: > >> Hi, >> >> I'm wondering if anyone has changed the number of partitions of a source >> Kafka topic. >> >> Let's say I have a Flink job read from a Kafka topic which used to have >> 32 partitions. I

Re: How would Flink job react to change of partitions in Kafka topic?

2021-06-23 Thread Seth Wiesman
he number of partitions of a source > Kafka topic. > > Let's say I have a Flink job read from a Kafka topic which used to have 32 > partitions. If I change the number of partitions of that topic to 64, can > the Flink job still guarantee the exactly-once semantics? > > Thanks. > > Thomas >

How would Flink job react to change of partitions in Kafka topic?

2021-06-22 Thread Thomas Wang
Hi, I'm wondering if anyone has changed the number of partitions of a source Kafka topic. Let's say I have a Flink job read from a Kafka topic which used to have 32 partitions. If I change the number of partitions of that topic to 64, can the Flink job still guarantee the exactly-once semantics
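
Flink's Kafka connector maps partitions to subtasks deterministically, which is why partitions added later land on predictable readers once discovered. A simplified model of that mapping (the real assigner also derives start_index from a hash of the topic name):

```python
def assign_partition(partition: int, start_index: int, parallelism: int) -> int:
    """Simplified form of the Kafka partition-to-subtask mapping:
    each partition is pinned to exactly one subtask."""
    return (start_index + partition) % parallelism

# Going from 32 to 64 partitions: the original partitions keep their
# owners; the new ones are assigned deterministically once discovered.
old = [assign_partition(p, 0, 4) for p in range(4)]
```

Note the exactly-once caveat lives elsewhere: new partitions are only read if partition discovery is enabled, and records written to them before discovery may be skipped depending on the configured start offset.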

Re: Re: Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-17 Thread yidan zhao
Thinking it over carefully: my cluster runs in containers on intranet servers, and traffic between containers shouldn't count as going through NAT. That said, network monitoring does show plenty of connections in TIME-WAIT on many machines, around 50k+, but I don't feel that alone would cause this problem. 东东 wrote on Thursday, June 17, 2021 at 2:48 PM: > With both of these enabled, connection requests from the same source IP must carry increasing timestamps; otherwise (non-increasing) requests are treated as invalid, the packets are dropped, and the client sees intermittent connection timeouts. > > >

Re:Re: Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-17 Thread 东东
With both of these enabled, connection requests from the same source IP must carry increasing timestamps; otherwise non-increasing requests are treated as invalid and the packets are dropped, which the client experiences as intermittent connection timeouts. A single machine usually doesn't hit this, since it has one clock; it typically shows up behind NAT (the clocks of multiple hosts are rarely fully in sync). I don't know your exact architecture, so I can only suggest trying it. Finally, discuss it with your ops team: unless you are certain no connections arrive through NAT, it's best not to enable both. PS: kernel 4.1 already scrapped tcp_tw_reuse because too many people fell into this trap. On 2021-06-17 14:07:50, "yidan
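
The dropped-SYN behavior described above can be modeled as a per-source-IP monotonic-timestamp gate; a toy sketch:

```python
class PawsGate:
    """Toy model of the tcp_tw_reuse + tcp_timestamps interaction the
    thread describes: per source IP, a connection attempt whose TCP
    timestamp is not increasing is silently dropped, which clients
    experience as intermittent timeouts."""
    def __init__(self):
        self.last_ts = {}

    def accept(self, src_ip: str, ts: int) -> bool:
        if ts <= self.last_ts.get(src_ip, -1):
            return False  # non-monotonic timestamp: packet discarded
        self.last_ts[src_ip] = ts
        return True

gate = PawsGate()
```

Behind a NAT, several hosts share one source IP but have unsynchronized clocks, so their timestamps interleave non-monotonically and some connections are dropped.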

Re: Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-17 Thread yidan zhao
What's the mechanism behind this? I can't change this setting directly myself; it needs a request. 东东 wrote on Thursday, June 17, 2021 at 1:36 PM: > > > Set one of them to 0 > > > On 2021-06-17 13:11:01, "yidan zhao" wrote: > >Yes, the host machine's IP. > > > >net.ipv4.tcp_tw_reuse = 1 > >net.ipv4.tcp_timestamps = 1 > > > >东东 wrote on Thursday, June 17, 2021 at 12:52 PM: >> > >> Is 10.35.215.18 the host machine's IP? > >> > >> Check

Re:Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16 Thread 东东
Set one of them to 0. On 2021-06-17 13:11:01, "yidan zhao" wrote: >Yes, the host machine's IP. > >net.ipv4.tcp_tw_reuse = 1 >net.ipv4.tcp_timestamps = 1 > >东东 wrote on Thursday, June 17, 2021 at 12:52 PM: >> >> Is 10.35.215.18 the host machine's IP? >> >> Check the values of tcp_tw_recycle and net.ipv4.tcp_timestamps >> If all else fails, tcpdump it >> >> >> >> On 2021-06-17 12:41:58, "yidan

Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16 Thread yidan zhao
Yes, the host machine's IP. net.ipv4.tcp_tw_reuse = 1 net.ipv4.tcp_timestamps = 1 东东 wrote on Thursday, June 17, 2021 at 12:52 PM: > > Is 10.35.215.18 the host machine's IP? > > Check the values of tcp_tw_recycle and net.ipv4.tcp_timestamps > If all else fails, tcpdump it > > > > On 2021-06-17 12:41:58, "yidan zhao" wrote: > >@东东 Standalone cluster. Random times, one every now and then, no fixed pattern.

Re:Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16 Thread 东东
Is 10.35.215.18 the host machine's IP? Check the values of tcp_tw_recycle and net.ipv4.tcp_timestamps. If all else fails, tcpdump it. On 2021-06-17 12:41:58, "yidan zhao" wrote: >@东东 Standalone cluster. Random times, one every now and then, no fixed pattern. There is some correlation with CPU, memory, and network, but I can't confirm it since it isn't obvious. >I checked a few exceptions; their times lined up with network spikes, but not all of them did, so it's hard to say whether that's the cause. > >Also, one point I'm not clear on: this error is rarely reported online; the similar ones are all

Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16 Thread yidan zhao
@东东 Standalone cluster. Random times, one every now and then, no fixed pattern. There is some correlation with CPU, memory, and network, but I can't confirm it since it isn't obvious. I checked a few exceptions; their times lined up with network spikes, but not all of them did, so it's hard to say whether that's the cause. Also, one point I'm not clear on: this error is rarely reported online; similar reports are all RemoteTransportException, whose message says the taskmanager may have been lost. Mine is LocalTransportException, and I'm not sure whether these two errors have different meanings in netty. So far I can't find much material online about either exception. 东东 wrote on Thursday, June 17, 2021

Re:Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16 Thread 东东
Single-machine standalone, or Docker/K8s? When does this exception occur: periodically, or correlated with changes in CPU, memory, or even network traffic? On 2021-06-16 19:10:24, "yidan zhao" wrote: >Hi, yingjie. >If the network is not stable, which config parameter I should adjust. > >yidan zhao wrote on Wednesday, June 16, 2021 at 6:56 PM: >> >> 2: I use G1, and no full gc occurred, young gc count:

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16 Thread yidan zhao
Ok, I will try. Yingjie Cao 于2021年6月16日周三 下午8:00写道: > > Maybe you can try to increase taskmanager.network.retries, > taskmanager.network.netty.server.backlog and > taskmanager.network.netty.sendReceiveBufferSize. These options are useful for > our jobs. > > yidan zhao 于2021年6月16日周三 下午7:10写道:

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16 Thread Yingjie Cao
Maybe you can try to increase taskmanager.network.retries, taskmanager.network.netty.server.backlog and taskmanager.network.netty.sendReceiveBufferSize. These options are useful for our jobs. yidan zhao 于2021年6月16日周三 下午7:10写道: > Hi, yingjie. > If the network is not stable, which config
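
What raising taskmanager.network.retries buys can be sketched as a plain retry loop (a toy model, not Flink's netty client):

```python
def connect_with_retries(attempt_fn, retries: int) -> int:
    """Retry a failing connection attempt instead of failing the task
    immediately; attempt_fn(attempt_no) returns True on success.
    Returns the attempt number that succeeded."""
    for attempt in range(retries + 1):
        if attempt_fn(attempt):
            return attempt
    raise ConnectionError("all connection attempts failed")

# A flaky endpoint that only succeeds from the third attempt onward
# (attempts 0 and 1 fail) still connects when retries are allowed.
result = connect_with_retries(lambda a: a >= 2, retries=3)
```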

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16 Thread yidan zhao
I also searched many results on the internet. There are some related exceptions like org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException, but in my case it is org.apache.flink.runtime.io.network.netty.exception.LocalTransportException. It is different in

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16 Thread yidan zhao
Hi, yingjie. If the network is not stable, which config parameter I should adjust. yidan zhao 于2021年6月16日周三 下午6:56写道: > > 2: I use G1, and no full gc occurred, young gc count: 422, time: > 142892, so it is not bad. > 3: stream job. > 4: I will try to config taskmanager.network.retries which is

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16 Thread yidan zhao
2: I use G1, and no full gc occurred, young gc count: 422, time: 142892, so it is not bad. 3: stream job. 4: I will try to config taskmanager.network.retries, whose default is 0, and taskmanager.network.netty.client.connectTimeoutSec, whose default is 120s. 5: I checked the net fd number of the

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16 Thread Yingjie Cao
Hi yidan, 1. Is the network stable? 2. Is there any GC problem? 3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information. 4. You may try to config these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16 Thread yidan zhao
Hi, here is the text exception stack: org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection timed out (connection to '10.35.215.18/10.35.215.18:2045') at

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16 Thread Robert Metzger
Hi Yidan, it seems that the attachment did not make it through the mailing list. Can you copy-paste the text of the exception here or upload the log somewhere? On Wed, Jun 16, 2021 at 9:36 AM yidan zhao wrote: > Attachment is the exception stack from flink's web-ui. Does anyone > have also

flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16 Thread yidan zhao
The attachment is the exception stack from flink's web-ui. Has anyone else met this problem? Flink1.12 - Flink1.13.1. Standalone Cluster, 30 containers, each with 28G memory.

Re: Flink job restart when one ZK node is down

2021-06-15 Thread Yang Wang
> HI, Flink Users > > > > We use a Zk cluster of 5 node for JM HA. When we terminate one node for > maintenance, we notice lots of flink job fully restarts. The error looks > like: > > ``` > > org.apache.flink.util.FlinkException: R

Re: Flink job restart when one ZK node is down

2021-06-15 Thread yidan zhao
Yes it is expected, I have also met such problems. Lu Niu 于2021年6月15日周二 上午4:53写道: > > HI, Flink Users > > We use a Zk cluster of 5 node for JM HA. When we terminate one node for > maintenance, we notice lots of flink job fully restarts. The e

Flink job restart when one ZK node is down

2021-06-14 Thread Lu Niu
HI, Flink Users We use a Zk cluster of 5 nodes for JM HA. When we terminate one node for maintenance, we notice lots of flink jobs fully restart. The error looks like: ``` org.apache.flink.util.FlinkException: ResourceManager leader changed to new address null
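
For context, a 5-node ZooKeeper ensemble keeps quorum with one node down, so the restarts reported here stem from how this Flink version reacts to the leader-change (suspended-connection) notification rather than from lost ZooKeeper availability. The quorum arithmetic:

```python
def quorum(ensemble_size: int) -> int:
    """ZooKeeper stays available while a strict majority of nodes is up."""
    return ensemble_size // 2 + 1

def survives(ensemble_size: int, nodes_down: int) -> bool:
    return ensemble_size - nodes_down >= quorum(ensemble_size)

# A 5-node ensemble tolerates up to 2 simultaneous failures.
one_down_ok = survives(5, 1)
```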

Re: After configuration checkpoint strategy, Flink Job cannot restart when job failed

2021-06-07 Thread Chesnay Schepler
ity, I have a job which reads data from Datahub and sinks data to Elasticsearch. The Elasticsearch frequently times out, which leads to the Flink job failing and stopping, and then a manual restart is needed. After investigating checkpoint strategies, I believe checkpointing can restart the job automatically and

Re: flink job exception

2021-05-31 Thread day
history server: https://ci.apache.org/projects/flink/flink-docs-master/zh/docs/deployment/advanced/historyserver/ ---- From:

flink job exception

2021-05-30 Thread krislee
Hi all: I am a Flink beginner. Today, in the flink web UI and the backend job management page, I found many exceptions: .. 11:29:30.107 [flink-akka.actor.default-dispatcher-41] ERROR org.apache.flink.runtime.rest.handler.job.JobExceptionsHandler - Exception occurred in REST handler: Job 16c614ab0d6f5b28746c66f351fb67f8 not found ..

Re: Flink job reports an error after running for an hour

2021-05-20 Thread 田向阳
What checkpoint backend storage are you using, and which Flink version? | 田向阳 | Email: lucas_...@163.com | On 2021-05-20 17:02, cecotw wrote: 2021-05-19 19:04:19 org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy at

Flink job reports an error after running for an hour

2021-05-20 Thread cecotw
2021-05-19 19:04:19 org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116) at

Re: Flink job tasks are unevenly distributed across TaskManagers

2021-05-07 Thread wenyuan138
I tested it, and this parameter does work.

Re: Flink job tasks are unevenly distributed across TaskManagers

2021-05-07 Thread wenyuan138
Many thanks, 黄潇. The description of this parameter matches what I am seeing exactly; I will try changing it today.

Re: Flink job tasks are unevenly distributed across TaskManagers

2021-05-07 Thread Shawn Huang
From your description, this looks like Standalone deployment mode. The default scheduling is slot-based and tends to allocate tasks to slots on the same TaskManager. To make full use of all slots, one approach is to set the total number of slots in the cluster to the sum of the parallelism of all jobs, or to try setting the configuration option cluster.evenly-spread-out-slots to true. Best, Shawn Huang 张锴 wrote on Fri, May 7, 2021 at 7:50 PM: > Set a slot sharing group name for the other job; different groups do not share slots, so it will run on other slots, and slots can run flexibly on different TMs. > > allanqinjy
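Both remedies in this reply are plain config entries; a sketch of what they look like in `flink-conf.yaml` (the slot count is illustrative, not from the thread):

```yaml
# Sketch: spread slots across TaskManagers instead of packing them (Flink >= 1.10).
cluster.evenly-spread-out-slots: true
# Alternatively, size the cluster so total slots equals the sum of all jobs' parallelism:
taskmanager.numberOfTaskSlots: 4   # illustrative per-TaskManager slot count
```

With `cluster.evenly-spread-out-slots` enabled, the scheduler prefers TaskManagers with the most free slots instead of filling one TaskManager before moving to the next.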

Re: Flink job tasks are unevenly distributed across TaskManagers

2021-05-07 Thread 张锴
Set a slot sharing group name for the other job; different groups do not share slots, so it will run on other slots, and slots can run flexibly on different TMs. allanqinjy wrote on Fri, May 7, 2021 at 7:38 PM: > > https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html > The Flink configuration has a TaskManager setting for how many slots one TM provides >

Re: Flink job tasks are unevenly distributed across TaskManagers

2021-05-07 Thread allanqinjy
https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html The Flink configuration has a TaskManager setting for how many slots one TM provides. taskmanager.numberOfTaskSlots defaults to 1. Parallelism maps onto slots; we usually make the number of slots match the largest parallelism. You can check this parameter and compare it against the official documentation. | allanqinjy | allanqi...@163.com | On 2021-05-07

Re: Flink job fails to consume from Kafka: cannot obtain offsets

2021-04-23 Thread liang zhao
; >> — >> Best Regards, >> >> Qingsheng Ren >> On Apr 14, 2021, 12:13 PM +0800, Jacob <17691150...@163.com> wrote: >>> A Flink job consumes messages from a Kafka topic that exists in two Kafka clusters, cluster A and cluster B. The >>> job consumes the topic from cluster A without any problem, but fails to start when switched to cluster B. >>> >

Re: Flink job fails to consume from Kafka: cannot obtain offsets

2021-04-23 Thread wysstartgo
oker-on-aws-on-docker-etc/ > > — > Best Regards, > > Qingsheng Ren > On Apr 14, 2021, 12:13 PM +0800, Jacob <17691150...@163.com> wrote: >> A Flink job consumes messages from a Kafka topic that exists in two Kafka clusters, cluster A and cluster B. The >> job consumes the topic from cluster A without any problem, but fails to start when switched to cluster B. >

Re: Flink job fails to consume from Kafka: cannot obtain offsets

2021-04-22 Thread Qingsheng Ren
level to DEBUG or TRACE to get more information in the logs to help troubleshooting. Hope this helps! [1] https://www.confluent.io/blog/kafka-client-cannot-connect-to-broker-on-aws-on-docker-etc/ — Best Regards, Qingsheng Ren On Apr 14, 2021, 12:13 PM +0800, Jacob <17691150...@163.com> wrote: > A Flink job consumes messages from a Kafka topic that exists in two Kafka

Flink job fails to consume from Kafka: cannot obtain offsets

2021-04-13 Thread Jacob
A Flink job consumes messages from a Kafka topic that exists in two Kafka clusters, cluster A and cluster B. The job consumes the topic from cluster A without any problem, but fails to start when switched to cluster B. The Flink cluster is deployed with Docker in Standalone mode; cluster resources and slots are sufficient. The error log is as follows: java.lang.Exception: org.apache.kafka.common.errors.TimeoutException: Timeout of 6ms expired before the position for partition
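A frequent cause of this kind of position/offset timeout with Dockerized clients (and the subject of the Confluent post linked in the replies) is a broker `advertised.listeners` value that the client cannot reach. A sketch of the broker-side setting in `server.properties` (host names are placeholders, not from the thread):

```properties
# Sketch of Kafka broker server.properties (host names are placeholders).
# The broker binds its socket here...
listeners=PLAINTEXT://0.0.0.0:9092
# ...but clients reconnect to this advertised address, so it must be resolvable
# and reachable from inside the Flink containers; a wrong value lets the initial
# bootstrap succeed while subsequent offset/position lookups time out.
advertised.listeners=PLAINTEXT://kafka-b.example.com:9092
```

This would be consistent with cluster A working and cluster B failing: the two clusters can advertise different addresses even when both are reachable for bootstrap.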

Re: Flink job repeated restart failure

2021-03-26 Thread Arvid Heise
Hi Vinaya, java.io.tmpdir is already the fallback and I'm not aware of another level of fallback. Ensuring java.io.tmpdir is valid is also relevant for some third-party libraries that rely on it (e.g. FileSystems that cache local files). It's good practice to set it appropriately. On Fri, Mar
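For Flink specifically, the temporary directory can be pinned down in config rather than relying on the JVM default; a sketch (paths are placeholders):

```yaml
# Sketch: explicit temp dirs in flink-conf.yaml (paths are placeholders).
# Flink's own spill/temp files (falls back to the system temp dir if unset):
io.tmp.dirs: /data/flink/tmp
# Pass java.io.tmpdir through to the JVM for third-party libraries that read it:
env.java.opts: -Djava.io.tmpdir=/data/flink/tmp
```

Setting both keeps Flink's spill files and third-party temp files on the same known, writable volume.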
