But the good news is that you can copy the implementation of the
existing kafka source and change a little code conveniently.
At 2022-06-01 22:38:39, "Bariša Obradović" wrote:
Hi,
we are running a Flink job with multiple Kafka sources connected to
different Kafka servers.
The problem we are facing is that when one of the Kafkas is down, the Flink job
starts restarting.
Is there any way for Flink to pause processing of the Kafka which is down,
and yet continue processing
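For illustration, a minimal sketch of the setup being described, assuming the unified KafkaSource API (flink-connector-kafka, Flink 1.14+); broker addresses, topics, and the group id are placeholders:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TwoClusterJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // One source per Kafka cluster. Out of the box, if one cluster is down
        // the failing source task throws, and the default failover restarts the job.
        KafkaSource<String> sourceA = KafkaSource.<String>builder()
            .setBootstrapServers("kafka-a:9092")               // placeholder address
            .setTopics("topic-a")                              // placeholder topic
            .setGroupId("multi-cluster-job")                   // placeholder group
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();
        KafkaSource<String> sourceB = KafkaSource.<String>builder()
            .setBootstrapServers("kafka-b:9092")               // second cluster
            .setTopics("topic-b")
            .setGroupId("multi-cluster-job")
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();
        env.fromSource(sourceA, WatermarkStrategy.noWatermarks(), "cluster-a")
           .union(env.fromSource(sourceB, WatermarkStrategy.noWatermarks(), "cluster-b"))
           .print();
        env.execute("two-cluster-job");
    }
}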
rmr /flink/default/leader/3a97d1d50f663027ae81efe0f0aa
This should result in your JobManager recovering from the faulty job.
Regards,
Hong
From: "s_penakalap...@yahoo.com"
Date: Tuesday, 24 May 2022 at 18:40
To: User
Subject: RE: [EXTERNAL]Flink Job Mana
Hi Team,
Any inputs please, badly stuck.
Regards, Sunitha
On Sunday, May 22, 2022, 12:34:22 AM GMT+5:30, s_penakalap...@yahoo.com
wrote:
Hi All,
Help please!
We have standalone Flink service installed on individual VMs and clubbed to form
a cluster, with HA and checkpointing in place. When cancelling a job, the Flink cluster
went down and it's unable to start up normally, as the JobManager is continuously
going down with the below error:
mum-allocation-mb` and
`yarn.nodemanager.resource.memory-mb`. Maybe it is better to post the JM/TM
logs, if any of them exist, to provide more information.
Best,
Biao Geng
From: Anitha Thankappan
Date: Wednesday, May 18, 2022, 8:26 PM
To: user@flink.apache.org
Subject: Flink Job Execution issue at Yarn
Hi,
We are using the below command to submit a Flink application job to a GCP
Dataproc cluster using Yarn:
flink run-application -t yarn-application .jar
Our cluster has 1 master node with 64 GB and 10 worker nodes of 32 GB each.
The Flink configurations given are:
jobmanager.memory.process.size:
| Date | 2022-05-07 13:21 |
| To | d...@flink.apache.org, user <user@flink.apache.org> |
| Cc | |
| Subject | flink Job is throwing dependency issue when submitted to cluster |
I have one Flink job which reads files from S3 and processes them.
Currently, it is running on Flink 1.9.0. I need to upgrade my cluster to
1.13.5, so I have made the changes in my job pom and brought up the Flink
cluster using the 1.13.5 dist.
When I submit my application, I am getting the below
Hi, folks!
I am running a Flink streaming job in Batch mode on EMR.
The job has following stages:
1. Read from MySQL
2. KeyBy user_id
3. Reduce by user_id
4. Async I/O enriching from Redis
5. Async I/O enriching from other Redis
6. Async I/O enriching from REST #1
7.
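For orientation, a hedged skeleton of such a pipeline in the DataStream API; the tuple data, the FakeEnricher stand-in, and all timeout/capacity values are purely illustrative placeholders for the job's real MySQL source and Redis/REST stages:

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import java.util.Collections;
import java.util.concurrent.TimeUnit;

public class BatchEnrichmentSkeleton {
    // Stand-in for the Redis/REST enrichment stages (purely illustrative).
    static class FakeEnricher implements AsyncFunction<Tuple2<Long, Long>, Tuple2<Long, Long>> {
        @Override
        public void asyncInvoke(Tuple2<Long, Long> in, ResultFuture<Tuple2<Long, Long>> out) {
            out.complete(Collections.singleton(Tuple2.of(in.f0, in.f1 + 1)));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);   // "mode=Batch" on bounded input

        // 1. stand-in for the MySQL read
        DataStream<Tuple2<Long, Long>> events =
            env.fromElements(Tuple2.of(1L, 10L), Tuple2.of(1L, 5L));
        // 2./3. key by user_id and reduce
        DataStream<Tuple2<Long, Long>> reduced = events
            .keyBy(t -> t.f0)
            .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
        // 4.-6. async enrichment stages; timeout and capacity are illustrative
        DataStream<Tuple2<Long, Long>> enriched = AsyncDataStream.unorderedWait(
            reduced, new FakeEnricher(), 5, TimeUnit.SECONDS, 100);

        enriched.print();
        env.execute("batch-enrichment-skeleton");
    }
}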
dleStore{namespace='flink/default/checkpoints/523f9e48274186bb97c13e3c2213be0e'}.
> 2022-02-24 12:20:16,712 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - All 1 checkpoints found are already downloaded.
> 2022-02-24 12:20:16,712 INFO o
- No master state to restore
2
Thanks,
Ifat
From: yidan zhao
Date: Wednesday, 2 March 2022 at 4:08
To: "Afek, Ifat (Nokia - IL/Kfar Sava)"
Cc: zhlonghong , "user@flink.apache.org"
Subject: Re: Flink job recovery after task manager failure
State backend can be set a
system be
shared between the task managers and job managers? Is there another option?
Thanks,
Ifat
From: Zhilong Hong
Date: Thursday, 24 February 2022 at 19:58
To: "Afek, Ifat (Nokia - IL/Kfar Sava)"
Cc: "user@flink.apache.org"
Subject: Re: Flink job recovery after task
Hi Zhilong,
I will check the issues you raised.
Thanks for your help,
Ifat
From: Zhilong Hong
Date: Thursday, 24 February 2022 at 19:58
To: "Afek, Ifat (Nokia - IL/Kfar Sava)"
Cc: "user@flink.apache.org"
Subject: Re: Flink job recovery after task manager failure
Hi, Afe
Thanks Zhilong.
The first launch of our job is fast, so I don't think that's the issue. I see in
the Flink job manager log that there were several exceptions during the restart,
and the task manager was restarted a few times until it stabilized.
You can find the log here:
jobmanager-log.txt.gz
Hi, Afek!
When a TaskManager is killed, the JobManager will not notice until a
heartbeat timeout happens. Currently, the default value of
heartbeat.timeout is 50 seconds [1]. That's why it takes more than 30
seconds for Flink to trigger a failover. If you'd like to shorten the time
a
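For reference, a hedged flink-conf.yaml sketch of shortening the failure-detection time; heartbeat.timeout is the option named above, heartbeat.interval is its companion, and both values here are illustrative only (too-low values make the cluster sensitive to GC pauses and network jitter):

heartbeat.interval: 2500
heartbeat.timeout: 10000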
Hi,
I am trying to use Flink's checkpointing in order to support task manager
recovery.
I'm running Flink using Beam with filesystem storage and the following
parameters:
checkpointingInterval=3
checkpointingMode=EXACTLY_ONCE
What I see is that if I kill a task manager pod, it takes
Hi all,
I have one flink job which reads data from one kafka topic and sinks to two
kafka topics using Flink SQL.
The code is something like this:
tableEnv.executeSql(
    """
    create table sink_table1 (
        xxx
        xxx
    ) with (
        'connector' = 'kafka',
        'topic' = 'topic1'
    )
    ""
Hi Puneet,
are we talking about the `web.upload.dir` [1] ? Maybe others have a
better solution for your problem, but have you thought about configuring
an NFS or some other distributed file system as the JAR directory? In
this case it should be available to all JobManagers.
Regards,
Timo
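A hedged flink-conf.yaml sketch of that suggestion; the mount point is a placeholder and must be the same shared path on every JobManager:

web.upload.dir: /mnt/nfs/flink-uploads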
Hi,
So currently we are using Flink 1.12 in HA mode in production. There are 3 job
managers (1 leader and 2 standby). When I upload a jar on one of the job
managers, somehow it is not reflected on the other job managers. Is there any
way I can achieve a behaviour where uploading a jar to
Hi!
For Yarn jobs, the Flink web UI (which shares the same port with the REST API)
can be found by clicking the "ApplicationMaster" link on the
corresponding application page. I'm not familiar with the Yarn API, but I guess
there is some way to get this application master link.
Kamil ty
Sorry for that. Will remember to add the mailing list in responses.
The REST API approach will probably be sufficient. One more question
regarding this: is it possible to get the address/port of the REST API
endpoint from the job? I see that when deploying a job to yarn the
rest.address and
Hi!
Please make sure to always reply to the user mailing list so that everyone
can see the discussion.
You can't get the execution environment for an already running job but if
you want to operate on that job you can try to get its JobClient instead.
However this is somewhat complicated to get
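A minimal sketch of the JobClient route, assuming you submitted the job yourself via executeAsync (attaching to a job started elsewhere needs a cluster client instead):

// executeAsync() returns immediately with a handle to the running job
JobClient client = env.executeAsync("my-job");
client.getJobStatus().thenAccept(status -> System.out.println("status: " + status));
// client.cancel() and client.triggerSavepoint(savepointPath) are also available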
Hi!
So you would like to submit a yarn job with Java code, not using /bin/flink
run?
If that is the case, you'll need to set 'execution.target' config option to
'yarn-per-job'. Set this in the configuration of ExecutionEnvironment and
execute the job with Flink API as normal.
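A hedged sketch of that suggestion; it assumes flink-yarn and the Hadoop dependencies are on the classpath and HADOOP_CONF_DIR points at the cluster config:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class YarnPerJobSubmit {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setString("execution.target", "yarn-per-job");  // one YARN cluster per job
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.fromElements(1, 2, 3).print();                   // stand-in for the real pipeline
        env.executeAsync("submitted-from-code");             // returns a JobClient for the YARN job
    }
}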
Kamil ty
Hello all,
I'm looking for a way to submit a Yarn job from another Flink job's
application code. I can see that you can access a cluster and submit jobs
with a RestClusterClient, but it seems a Yarn per-job mode is not supported
with it.
Any suggestions would be appreciated.
Best Regards
Kamil
Hi All:
After a normally running Flink job was cancelled, the Flink log printed some messages. How can I trace the root cause of this kind of exception?
TY-APP-DATA-REAL-COMPUTATION - [2021-11-18 22:10:32.465] - INFO
[RecordComputeOperator -> Sink: wi-data-sink (3/10)]
org.apache.flink.runtime.taskmanager.Task - RecordComputeOperator ->
Sink: wi-data-sink (3/10)
paste part
of your code (on the DataStream API) or SQL (on the Table & SQL API).
--
Best Regards,
Qingsheng Ren
Email: renqs...@gmail.com
On Oct 19, 2021, 9:28 AM +0800, Marco Villalobos ,
wrote:
I have the simplest Flink job: it simply dequeues from a kafka topic and
writes to another kafka topic, but with headers, and manually copies the
event time into the kafka sink.
It works as intended, but sometimes I am getting this error
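For context, a hedged sketch of the headers/event-time copy using the newer KafkaSink serializer interface; the output topic is a placeholder and the original job may well use an older producer API:

import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical pass-through serializer: re-emits key, value, record timestamp
// (the event time) and headers of the consumed record into the sink topic.
public class PassThroughSerializer
        implements KafkaRecordSerializationSchema<ConsumerRecord<byte[], byte[]>> {
    @Override
    public ProducerRecord<byte[], byte[]> serialize(
            ConsumerRecord<byte[], byte[]> rec, KafkaSinkContext ctx, Long timestamp) {
        return new ProducerRecord<>("out-topic", null, rec.timestamp(),
                rec.key(), rec.value(), rec.headers());
    }
}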
Hi!
yarn-cluster is the mode for a yarn session cluster, which means the
cluster will remain even after the job is finished. If you want to finish
the Flink job as well as the yarn job, use yarn-per-job mode instead.
Jake wrote on Sat, Oct 9, 2021 at 5:53 PM:
Hi
When I submit a job in yarn-cluster mode, the Flink job finishes but the yarn job does not exit.
What should I do?
Submit command:
/opt/app/flink/flink-1.14.0/bin/flink run -m yarn-cluster
./flink-sql-client.jar --file dev.sql
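If per-job semantics are what you want, a hedged variant of the same submission using the -t switch from the reply above (paths are taken from your command; --detached is optional):

/opt/app/flink/flink-1.14.0/bin/flink run -t yarn-per-job --detached \
  ./flink-sql-client.jar --file dev.sql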
Hi,
About CPU cost, there are several methods:
1. The Flink builtin metric `Status.JVM.CPU.Load` [1]
2. Use the `top` command on the machine that runs a suspect
TaskManager
3. You could use a flame graph to do deeper profiling of a JVM [2].
...
About the RPC response, I'm not an expert on HBase, I'm
Please let me know how to check the RPC response and CPU cost.
On Mon, Sep 27, 2021 at 1:19 PM JING ZHANG wrote:
Hi,
Since there is not enough information, you could first check the back
pressure status of the job [1] and find the task which caused the back
pressure.
Then try to find out why that task processes data slowly; there are many
possible reasons, for example:
(1) Does data skew exist,
Hi,
I have a Flink real-time job which processes user records from a topic and
also reads data from HBase acting as a lookup table. If the lookup table
does not contain the required metadata, it queries the external DB via an API.
For the first 1 to 2 hours it works fine without issues; later it drops down
Hi, Thomas
I am not an expert on S3, but I think Flink needs write/read/delete (and maybe
list) permission on the path (bucket).
BTW, what error did you encounter?
Best,
Guowei
On Fri, Sep 24, 2021 at 5:00 AM Thomas Wang wrote:
Hi,
I'm trying to figure out what exact S3 permissions a Flink job needs to
work appropriately when using S3 for checkpointing. Currently, I have the
following IAM Policy, but it seems insufficient. Can anyone help me figure
this out? Thanks.
{
  Action = [
    "s3:PutObject",
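For comparison, a hedged sketch of a policy shape matching the write/read/delete (maybe list) answer above; the bucket name is a placeholder:

{
  Effect = "Allow"
  Action = [
    "s3:PutObject",
    "s3:GetObject",
    "s3:DeleteObject",
    "s3:ListBucket"
  ]
  Resource = [
    "arn:aws:s3:::my-checkpoint-bucket",
    "arn:aws:s3:::my-checkpoint-bucket/*"
  ]
}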
No, you've got the wrong person.
yidan zhao wrote on Tue, Aug 24, 2021 at 12:50 PM:
> Are you zhangkai30?
张锴 wrote on Tue, Aug 24, 2021 at 11:16 AM:
Our data volume is about 6 TB a day, with only simple processing and no
computation involved. Given sufficient resources, when starting a Flink job
from the command line, how much should we allocate, e.g. jobmanager memory,
taskmanager memory, slots, number of taskmanagers, parallelism p, and what is
the reasoning behind the choices?
I still have limited experience with data at this scale; could anyone offer
some guidance?
Hi Nicolaus,
I double-checked our hdfs config; it is set to 1 instead of 2.
I will try the solution you provided.
Thanks again.
Best regards
Rainie
On Wed, Aug 4, 2021 at 10:40 AM Rainie Li wrote:
Thanks for the context Nicolaus.
We are using S3 instead of HDFS.
Best regards
Rainie
On Wed, Aug 4, 2021 at 12:39 AM Nicolaus Weidner <
nicolaus.weid...@ververica.com> wrote:
> Hi Rainie,
>
> I found a similar issue on stackoverflow, though with quite a different
> stacktrace:
>
Thanks Till.
We terminated one of the worker nodes.
We enabled HA by using Zookeeper.
Sure, we will try upgrade job to newer version.
Best regards
Rainie
On Tue, Aug 3, 2021 at 11:57 PM Till Rohrmann wrote:
Hi Rainie,
It looks to me as if Yarn is causing this problem. Which Yarn node are you
terminating? Have you configured your Yarn cluster to be highly available
in case you are terminating the ResourceManager?
Flink should retry the operation of starting a new container in case it
fails. If this
in Flink, perhaps other members of the community might chime in :-)
All the best,
Igal.
On Wed, Jul 28, 2021 at 12:05 PM Deniz Koçak wrote:
Hi Flink Community,
My Flink application is running version 1.9 and it failed to recover
(the application was running but checkpointing failed and the job stopped
processing data) during a hadoop yarn node termination.
Here is the job manager log error:
2021-07-26 18:02:58,605 INFO
Hi,
We would like to host a separate service (logically and possibly
physically) from the Flink job, which we would also like to call from
our Flink job. We have been considering remote functions via HTTP,
which seems to be a good option in the cloud. We have been considering Async
I/O and statefun
Hi,
I'm wondering if anyone has changed the number of partitions of a source
Kafka topic.
Let's say I have a Flink job reading from a Kafka topic which used to have 32
partitions. If I change the number of partitions of that topic to 64, can
the Flink job still guarantee the exactly-once semantics
I've thought it over: my cluster consists of containers on intranet servers, and
traffic between containers shouldn't count as going through NAT.
That said, the network monitoring does show that many machines have quite a lot
of connections in TIME-WAIT, around 50k+, but I don't feel that alone would
cause this problem.
东东 wrote on Thu, Jun 17, 2021 at 2:48 PM:
With both of these enabled, the timestamps in connection requests from the same
source IP must be increasing; otherwise the (non-increasing) requests are treated
as invalid and the packets are dropped, which from the client's side looks like
intermittent connection timeouts.
Generally a single machine won't hit this, since there should be only one clock;
it tends to show up behind NAT (the clocks of multiple hosts are usually not
exactly in sync). I don't know your exact architecture, so I can only suggest
trying it.
Finally, discuss it with your ops team: unless you are sure no connections come
in through NAT, it's best not to enable both.
PS: kernel 4.1 already dropped tcp_tw_reuse, because too many people fell into
this trap.
At 2021-06-17 14:07:50, "yidan
What's the reasoning behind this? I can't make that change myself; it needs
approval.
东东 wrote on Thu, Jun 17, 2021 at 1:36 PM:
> Set one of the two to 0.
Yes, it's the host IP.
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_timestamps = 1
东东 wrote on Thu, Jun 17, 2021 at 12:52 PM:
> Is 10.35.215.18 the host IP?
> Check the values of tcp_tw_recycle and net.ipv4.tcp_timestamps.
> If nothing else works, try tcpdump.
@东东 It's a standalone cluster. The errors come at random times, one every now
and then, with no fixed pattern. There is some correlation with CPU, memory, and
the network, but it's not obvious enough to confirm. For a few exceptions I
checked, the times lined up with network spikes, but not for all of them, so I
can't say whether that is the cause.
Also, one point I'm not clear on: there are few reports of this error online,
and the similar ones are all RemoteTransportException, with a message saying the
taskmanager may have been lost. Mine is LocalTransportException, and I'm not
sure whether these two errors mean different things in netty. So far I can't
find much material about either exception online.
东东 wrote on Thu, Jun 17, 2021
Single-machine standalone, or Docker/K8s?
When does this exception occur: periodically, or correlated with changes in CPU,
memory, or even network traffic?
At 2021-06-16 19:10:24, "yidan zhao" wrote:
>Hi, yingjie.
>If the network is not stable, which config parameter should I adjust?
>
>yidan zhao wrote on Wed, Jun 16, 2021 at 6:56 PM:
>>
>> 2: I use G1, and no full gc occurred, young gc count:
Ok, I will try.
Yingjie Cao wrote on Wed, Jun 16, 2021 at 8:00 PM:
>
> Maybe you can try to increase taskmanager.network.retries,
> taskmanager.network.netty.server.backlog and
> taskmanager.network.netty.sendReceiveBufferSize. These options are useful for
> our jobs.
>
> yidan zhao wrote on Wed, Jun 16, 2021 at 7:10 PM:
Maybe you can try to increase taskmanager.network.retries,
taskmanager.network.netty.server.backlog and
taskmanager.network.netty.sendReceiveBufferSize. These options are useful
for our jobs.
yidan zhao wrote on Wed, Jun 16, 2021 at 7:10 PM:
> Hi, yingjie.
> If the network is not stable, which config
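A hedged flink-conf.yaml sketch of those options; the three keys are the ones named in the reply above, and the values are purely illustrative, not recommendations:

taskmanager.network.retries: 10
taskmanager.network.netty.server.backlog: 1024
taskmanager.network.netty.sendReceiveBufferSize: 4194304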
I also searched for many results on the internet. There are some related
exceptions like
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException,
but in my case it is
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException.
It is different in
Hi, yingjie.
If the network is not stable, which config parameter should I adjust?
yidan zhao wrote on Wed, Jun 16, 2021 at 6:56 PM:
>
> 2: I use G1, and no full gc occurred, young gc count: 422, time:
> 142892, so it is not bad.
> 3: stream job.
> 4: I will try to config taskmanager.network.retries which is
2: I use G1, and no full gc occurred, young gc count: 422, time:
142892, so it is not bad.
3: stream job.
4: I will try to config taskmanager.network.retries, whose default is
0, and taskmanager.network.netty.client.connectTimeoutSec, whose default
is 120s.
5: I checked the net fd number of the
Hi yidan,
1. Is the network stable?
2. Is there any GC problem?
3. Is it a batch job? If so, please use sort-shuffle, see [1] for more
information.
4. You may try to config these two options: taskmanager.network.retries,
taskmanager.network.netty.client.connectTimeoutSec. More relevant options
Hi, here is the text exception stack:
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
readAddress(..) failed: Connection timed out (connection to
'10.35.215.18/10.35.215.18:2045')
at
Hi Yidan,
it seems that the attachment did not make it through the mailing list. Can
you copy-paste the text of the exception here or upload the log somewhere?
On Wed, Jun 16, 2021 at 9:36 AM yidan zhao wrote:
> Attachment is the exception stack from flink's web-ui. Does anyone
> have also
The attachment is the exception stack from Flink's web UI. Has anyone
else met this problem?
Flink 1.12 - Flink 1.13.1. Standalone cluster, comprising 30 containers,
each with 28 GB memory.
Yes, it is expected; I have also met such problems.
Lu Niu wrote on Tue, Jun 15, 2021 at 4:53 AM:
HI, Flink Users
We use a ZK cluster of 5 nodes for JM HA. When we terminate one node for
maintenance, we notice lots of Flink jobs fully restart. The error looks
like:
```
org.apache.flink.util.FlinkException: ResourceManager leader changed to new
address null
ity,
I have a job which reads data from Datahub and sinks data to
Elasticsearch. Elasticsearch frequently times out, which leads to the
Flink job failing and stopping; then a manual restart is needed.
After investigating the checkpoint strategy, I believe checkpointing can
restart the job automatically and
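A hedged sketch of wiring that up in the DataStream API; the intervals and retry count are illustrative only:

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.concurrent.TimeUnit;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000L);   // checkpoint every 60 s so state can be restored
// retry up to 3 times with a 10 s delay instead of stopping the job on failure
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));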
History server?
https://ci.apache.org/projects/flink/flink-docs-master/zh/docs/deployment/advanced/historyserver/
Hi all:
I am a Flink beginner. Today, in the Flink web UI and the backend job
management page, I found many exceptions:
...
11:29:30.107 [flink-akka.actor.default-dispatcher-41] ERROR
org.apache.flink.runtime.rest.handler.job.JobExceptionsHandler -
Exception occurred in REST handler: Job 16c614ab0d6f5b28746c66f351fb67f8
not found
...
What checkpoint backend storage are you using, and which Flink version?
田向阳
Email: lucas_...@163.com
On 2021-05-20 17:02, cecotw wrote:
2021-05-19 19:04:19
org.apache.flink.runtime.JobException: Recovery is suppressed by
NoRestartBackoffTimeStrategy
at
2021-05-19 19:04:19
org.apache.flink.runtime.JobException: Recovery is suppressed by
NoRestartBackoffTimeStrategy
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
at
I tested it; this parameter indeed works.
Many thanks, 黄潇.
The description of this parameter looks exactly consistent with what I'm
seeing; I'll change it and try it today.
From your description, this should be the Standalone deployment mode.
The default scheduling is slot-based and tends to allocate into the slots of
the same TaskManager.
To make full use of all slots, one way is to set the total number of slots in
the cluster to the sum of the parallelisms of all jobs, or try setting the
config option cluster.evenly-spread-out-slots to true.
Best,
Shawn Huang
张锴 wrote on Fri, May 7, 2021 at 7:50 PM:
Set a different slot sharing group name for the other job; different groups do
not share slots, so the job will be scheduled into other slots, and slots can
run flexibly on different TMs.
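A hedged DataStream API sketch of that suggestion; the mapper and group name are placeholders:

// Operators in a different slot sharing group will not share slots with the
// default group, so they end up in other slots (possibly on other TMs).
stream.map(v -> v).slotSharingGroup("job-b-group");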
allanqinjy wrote on Fri, May 7, 2021 at 7:38 PM:
https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html
The Flink configuration does have TaskManager settings, including how many
slots one TM provides: taskmanager.numberOfTaskSlots, which defaults to 1.
Parallelism maps onto slots; usually we set the number of slots equal to the
largest parallelism. Have a look at this parameter and check it against the
official docs.
allanqinjy
allanqi...@163.com
On May 7, 2021
level configured to DEBUG or
TRACE, to get more information from the logs to help with troubleshooting.
Hope this helps!
[1]
https://www.confluent.io/blog/kafka-client-cannot-connect-to-broker-on-aws-on-docker-etc/
—
Best Regards,
Qingsheng Ren
On Apr 14, 2021 at 12:13 PM +0800, Jacob <17691150...@163.com> wrote:
There is a Flink job consuming messages from a Kafka topic; the topic exists on
two Kafka clusters, cluster A and cluster B. The Flink job consumes the topic on
cluster A without any problem, but fails to start when switched to consuming
from cluster B.
The Flink cluster is deployed with Docker, in Standalone mode. Cluster resources
and slots are sufficient. The error log is as follows:
java.lang.Exception: org.apache.kafka.common.errors.TimeoutException:
Timeout of 6ms expired before the position for partition
Hi Vinaya,
java.io.tmpdir is already the fallback and I'm not aware of another level
of fallback.
Ensuring java.io.tmpdir is valid is also relevant for some third-party
libraries that rely on it (e.g. FileSystems that cache local files). It's
good practice to set it appropriately.
On Fri, Mar