subject:"\?\?\?\?\?\?flink job exception"

Re: Re: Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

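2021-06-17, posted by yidan zhao

I thought about it carefully. My cluster runs in containers on intranet servers, so traffic between containers should not count as going through NAT. That said, the network monitoring does show that many machines have a lot of connections in the TIME_WAIT state, around 50,000 or more, but I don't think that alone would cause this problem.

On Thu, Jun 17, 2021 at 2:48 PM, 东东 wrote:
> If both of these are enabled, the timestamps in connection requests from the same source IP must be increasing. A request whose timestamp is not increasing is treated as invalid and its packets are dropped, which the client experiences as intermittent connection timeouts.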
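The TIME_WAIT figure mentioned above can be checked per machine without a monitoring system. The following is a minimal sketch, assuming a Linux host where `/proc/net/tcp` is readable; the function name `count_tcp_states` is made up for illustration. Note that inside a container the file reflects only that container's network namespace.

```python
from collections import Counter
from pathlib import Path

# Hex state codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text: str) -> Counter:
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or malformed rows
        code = fields[3].upper()  # fourth column is the socket state
        counts[TCP_STATES.get(code, code)] += 1
    return counts


if __name__ == "__main__":
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        p = Path(path)
        if p.exists():
            print(path, dict(count_tcp_states(p.read_text())))
```

Running it on each host or container gives a TIME_WAIT count that can be compared with the timestamps of the exceptions.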

Re:Re: Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

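2021-06-17, posted by 东东

With both of these enabled, the timestamps in connection requests from the same source IP must be increasing. A request whose timestamp is not increasing is treated as invalid and its packets are dropped, which the client experiences as intermittent connection timeouts. Normally a single machine does not hit this, because there is only one clock. It tends to show up behind NAT, because the clocks of multiple hosts are usually not perfectly in sync. I don't know your exact architecture, though, so all I can say is give it a try. Finally, talk to your ops team: unless you are sure that no connections arrive through NAT, it is best not to enable both.

PS: kernel 4.1 has already dropped tcp_tw_reuse, because too many people fell into this trap.

At 2021-06-17 14:07:50, "yidan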
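The check suggested in this thread can be scripted. The following is a minimal sketch, assuming a Linux host that exposes these settings under `/proc/sys/net/ipv4`; the function names `read_sysctls` and `check` are made up for illustration. It assumes the combination to flag is `tcp_tw_recycle` together with `tcp_timestamps`, the pair asked about earlier in the thread. `tcp_tw_reuse` is read for information only, and a setting whose file is missing (for example on a kernel that no longer has it) is reported as `None`.

```python
from pathlib import Path
from typing import Dict, List, Optional

SYSCTL_DIR = Path("/proc/sys/net/ipv4")
KEYS = ("tcp_timestamps", "tcp_tw_recycle", "tcp_tw_reuse")


def read_sysctls(base: Path = SYSCTL_DIR) -> Dict[str, Optional[int]]:
    """Read each setting as an int; None if the file is absent or unreadable."""
    values: Dict[str, Optional[int]] = {}
    for key in KEYS:
        try:
            values[key] = int((base / key).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            values[key] = None
    return values


def check(values: Dict[str, Optional[int]]) -> List[str]:
    """Return a warning when both tcp_tw_recycle and tcp_timestamps are non-zero."""
    warnings: List[str] = []
    if values.get("tcp_timestamps") and values.get("tcp_tw_recycle"):
        warnings.append(
            "tcp_tw_recycle and tcp_timestamps are both enabled; "
            "connections arriving through NAT may be dropped"
        )
    return warnings


if __name__ == "__main__":
    current = read_sysctls()
    print(current)
    for message in check(current):
        print("WARNING:", message)
```

The script only reads values. Changing them is left to whoever administers the hosts.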

Re: Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

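2021-06-17, posted by yidan zhao

What is the principle behind this? I can't make this change directly; I have to file a request for it.

On Thu, Jun 17, 2021 at 1:36 PM, 东东 wrote:
> Set one of them to 0.
>
> At 2021-06-17 13:11:01, "yidan zhao" wrote:
> > Yes, it is the host IP.
> >
> > net.ipv4.tcp_tw_reuse = 1
> > net.ipv4.tcp_timestamps = 1
> >
> > On Thu, Jun 17, 2021 at 12:52 PM, 东东 wrote:
> > > Is 10.35.215.18 the host IP?
> > >
> > > Take a look at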

Re:Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by 东东

Set one of them to 0.

At 2021-06-17 13:11:01, "yidan zhao" wrote:
> Yes, it is the host IP.
>
> net.ipv4.tcp_tw_reuse = 1
> net.ipv4.tcp_timestamps = 1
>
> On Thu, Jun 17, 2021 at 12:52 PM, 东东 wrote:
> > Is 10.35.215.18 the host IP?
> >
> > Check what tcp_tw_recycle and net.ipv4.tcp_timestamps are set to.
> > If nothing else works, use tcpdump.
> >
> > At 2021-06-17 12:41:58, "yidan

Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

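2021-06-16, posted by yidan zhao

Yes, it is the host IP.

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_timestamps = 1

On Thu, Jun 17, 2021 at 12:52 PM, 东东 wrote:
> Is 10.35.215.18 the host IP?
>
> Check what tcp_tw_recycle and net.ipv4.tcp_timestamps are set to.
> If nothing else works, use tcpdump.
>
> At 2021-06-17 12:41:58, "yidan zhao" wrote:
> > @东东 It is a standalone cluster. The exceptions occur at random times, one every so often, with no fixed pattern.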

Re:Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

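2021-06-16, posted by 东东

Is 10.35.215.18 the host IP?

Check what tcp_tw_recycle and net.ipv4.tcp_timestamps are set to.
If nothing else works, use tcpdump.

At 2021-06-17 12:41:58, "yidan zhao" wrote:
> @东东 It is a standalone cluster. The exceptions occur at random times, one every so often, with no fixed pattern. There is some correlation with CPU, memory, and network, but I can't confirm it because it is not very pronounced.
> I investigated several of the exceptions, and their timing matched network spikes, but not all of them match, so it is hard to say whether that is the cause.
>
> Also, there is one point I am not clear about: this error is rarely reported online, and the similar ones are all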

Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

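2021-06-16, posted by yidan zhao

@东东 It is a standalone cluster. The exceptions occur at random times, one every so often, with no fixed pattern. There is some correlation with CPU, memory, and network, but I can't confirm it because it is not very pronounced. I investigated several of the exceptions, and their timing matched network spikes, but not all of them match, so it is hard to say whether that is the cause.

Also, there is one point I am not clear about: this error is rarely reported online, and the similar ones are all RemoteTransportException, with a message saying the TaskManager may have been lost or something like that. Mine is LocalTransportException, and I am not sure whether these two errors mean different things in netty. So far I can't find much information online about either exception.

On Thu, Jun 17, 2021, 东东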

Re:Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by 东东

Single-machine standalone, or Docker/K8s? As for when this exception occurs: is it periodic, or is it related to changes in CPU, memory, or even network traffic?

At 2021-06-16 19:10:24, "yidan zhao" wrote:
> Hi, yingjie.
> If the network is not stable, which config parameter should I adjust?
>
> On Wed, Jun 16, 2021 at 6:56 PM, yidan zhao wrote:
> > 2: I use G1, and no full gc occurred, young gc count:

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by yidan zhao

Ok, I will try.

On Wed, Jun 16, 2021 at 8:00 PM, Yingjie Cao wrote:
> Maybe you can try to increase taskmanager.network.retries,
> taskmanager.network.netty.server.backlog and
> taskmanager.network.netty.sendReceiveBufferSize. These options are useful for
> our jobs.
>
> On Wed, Jun 16, 2021 at 7:10 PM, yidan zhao wrote:

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by Yingjie Cao

Maybe you can try to increase taskmanager.network.retries, taskmanager.network.netty.server.backlog and taskmanager.network.netty.sendReceiveBufferSize. These options are useful for our jobs.

On Wed, Jun 16, 2021 at 7:10 PM, yidan zhao wrote:
> Hi, yingjie.
> If the network is not stable, which config
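To try the three options named above, they need to end up in the cluster's `flink-conf.yaml`. The following is a minimal sketch, assuming the flat `key: value` layout of that file; the helper name `apply_overrides` and the numeric values are placeholders for illustration, not recommendations from this thread.

```python
from typing import Dict


def apply_overrides(conf_text: str, overrides: Dict[str, object]) -> str:
    """Replace existing 'key: value' lines and append keys that are missing."""
    remaining = dict(overrides)
    out = []
    for line in conf_text.splitlines():
        key = line.split(":", 1)[0].strip()
        if not line.lstrip().startswith("#") and key in remaining:
            out.append(f"{key}: {remaining.pop(key)}")  # overwrite in place
        else:
            out.append(line)  # keep comments, blanks and other keys as-is
    out.extend(f"{k}: {v}" for k, v in remaining.items())  # append new keys
    return "\n".join(out) + "\n"


# Placeholder values only; suitable numbers depend on the cluster.
EXAMPLE_OVERRIDES = {
    "taskmanager.network.retries": 3,
    "taskmanager.network.netty.server.backlog": 1024,
    "taskmanager.network.netty.sendReceiveBufferSize": 4194304,
}

if __name__ == "__main__":
    print(apply_overrides("taskmanager.network.retries: 0\n", EXAMPLE_OVERRIDES), end="")
```

The function works on text rather than on the file itself, so the result can be reviewed before it is written back and the cluster restarted.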

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by yidan zhao

I also searched the internet and found many results. There are some related exceptions, such as org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException, but in my case it is org.apache.flink.runtime.io.network.netty.exception.LocalTransportException. It is different in

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by yidan zhao

Hi, yingjie.
If the network is not stable, which config parameter should I adjust?

On Wed, Jun 16, 2021 at 6:56 PM, yidan zhao wrote:
> 2: I use G1, and no full gc occurred, young gc count: 422, time:
> 142892, so it is not bad.
> 3: stream job.
> 4: I will try to config taskmanager.network.retries which is

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by yidan zhao

2: I use G1, and no full GC occurred. Young GC count: 422, time: 142892, so it is not bad.
3: Stream job.
4: I will try to configure taskmanager.network.retries, whose default is 0, and taskmanager.network.netty.client.connectTimeoutSec, whose default is 120s.
5: I checked the net fd number of the

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by Yingjie Cao

Hi yidan,
1. Is the network stable?
2. Is there any GC problem?
3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information.
4. You may try to config these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by yidan zhao

Hi, here is the text exception stack: org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection timed out (connection to '10.35.215.18/10.35.215.18:2045') at

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by Robert Metzger

Hi Yidan, it seems that the attachment did not make it through the mailing list. Can you copy-paste the text of the exception here or upload the log somewhere?

On Wed, Jun 16, 2021 at 9:36 AM yidan zhao wrote:
> Attachment is the exception stack from flink's web-ui. Does anyone
> have also

flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by yidan zhao

Attached is the exception stack from Flink's web UI. Has anyone else met this problem? Flink 1.12 - Flink 1.13.1. Standalone cluster of 30 containers, each with 28G of memory.

Re: flink job exception

2021-05-31, posted by day

History server? https://ci.apache.org/projects/flink/flink-docs-master/zh/docs/deployment/advanced/historyserver/

flink job exception

2021-05-30, posted by krislee

Hi all, I am a Flink beginner. Today I found many exceptions in the Flink web UI and on the backend job management page: .. 11:29:30.107 [flink-akka.actor.default-dispatcher-41] ERROR org.apache.flink.runtime.rest.handler.job.JobExceptionsHandler - Exception occurred in REST handler: Job 16c614ab0d6f5b28746c66f351fb67f8 not found ..

