Re: Re: Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

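2021-06-17, posted by yidan zhao
I thought about this carefully: my cluster consists of containers on intranet servers, so traffic between containers should not count as going through NAT. That said, the network monitoring does show that many machines have a large number of connections in the TIME-WAIT state, around 50,000+, but I don't think that alone would cause this problem.

东东 wrote on Thu, Jun 17, 2021 at 2:48 PM:
> If both of these are enabled, the timestamps in connection requests from the same source IP must be increasing; otherwise (non-increasing) the connection requests are treated as invalid and the packets are dropped, which the client experiences as intermittent connection timeouts.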
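To put a number on the TIME-WAIT observation from inside a container, one option is to count socket states in /proc/net/tcp. This is only a minimal sketch, assuming a Linux host; the function names are hypothetical, and it covers IPv4 sockets in the current network namespace only. The state code 06 is the TIME_WAIT value used in the `st` column of that file.

```python
from collections import Counter
from pathlib import Path

# Hex state code for TIME_WAIT in the "st" column of /proc/net/tcp.
TIME_WAIT = "06"


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per hex state code in /proc/net/tcp-formatted text."""
    counts = Counter()
    lines = proc_net_tcp_text.splitlines()
    for line in lines[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) > 3:
            counts[fields[3]] += 1  # fields: sl, local, remote, st, ...
    return counts


def time_wait_count(path="/proc/net/tcp"):
    """Return the number of IPv4 sockets in TIME_WAIT (Linux only)."""
    return count_tcp_states(Path(path).read_text())[TIME_WAIT]
```

Calling `time_wait_count()` on each TaskManager container at the time of an exception would show whether the TIME-WAIT count actually spikes together with the timeouts.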

Re:Re: Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

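2021-06-17, posted by 东东
If both of these are enabled, the timestamps in connection requests from the same source IP must be increasing; otherwise (non-increasing) the connection requests are treated as invalid and the packets are dropped, which the client experiences as intermittent connection timeouts.

Generally a single machine does not have this problem, because there is only one clock. It tends to show up behind NAT, because the clocks of multiple hosts are usually not exactly in sync. I don't know your exact architecture, though, so all I can say is give it a try.

Finally, you could discuss this with your ops team: unless you are sure no connections arrive through NAT, it is best not to enable both.

PS: kernel 4.12 removed tcp_tw_recycle entirely, because too many people fell into this trap.

On 2021-06-17 14:07:50, "yidan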
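The per-source-IP timestamp rule described in this message can be illustrated with a toy model. This is a simplified sketch, not kernel code: the class and function names are hypothetical, and the real check involves time windows that are ignored here. It only shows why two hosts sharing one NAT address, with slightly different clocks, see some of their connection requests dropped.

```python
class PerSourceTimestampFilter:
    """Toy model: remember the last TCP timestamp seen per source IP and
    reject connection requests whose timestamp does not increase."""

    def __init__(self):
        self.last_seen = {}  # source IP -> last accepted timestamp value

    def accept(self, source_ip, tsval):
        last = self.last_seen.get(source_ip)
        if last is not None and tsval <= last:
            return False  # treated as stale; the request is silently dropped
        self.last_seen[source_ip] = tsval
        return True


def simulate(requests):
    """Run (source_ip, tsval) pairs through one filter; return accept flags."""
    flt = PerSourceTimestampFilter()
    return [flt.accept(ip, ts) for ip, ts in requests]


# Two hosts behind the same NAT address (a documentation-range IP, assumed).
# Host B's clock is behind host A's, so its request looks "non-increasing".
nat_ip = "203.0.113.7"
result = simulate([(nat_ip, 1000), (nat_ip, 900), (nat_ip, 1010)])
print(result)  # [True, False, True]
```

The rejected request gets no reply at all, which is why the client sees a timeout rather than a refused connection.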

Re: Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

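2021-06-17, posted by yidan zhao
What is the reasoning behind this? I can't make this change directly myself; I need to file a request for it.

东东 wrote on Thu, Jun 17, 2021 at 1:36 PM:
> Set one of them to 0.
>
> On 2021-06-17 13:11:01, "yidan zhao" wrote:
> >Yes, it is the host machine's IP.
> >
> >net.ipv4.tcp_tw_reuse = 1
> >net.ipv4.tcp_timestamps = 1
> >
> >东东 wrote on Thu, Jun 17, 2021 at 12:52 PM:
> >> Is 10.35.215.18 the host machine's IP?
> >>
> >> Take a look at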

Re:Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by 东东
Set one of them to 0.

On 2021-06-17 13:11:01, "yidan zhao" wrote:
>Yes, it is the host machine's IP.
>
>net.ipv4.tcp_tw_reuse = 1
>net.ipv4.tcp_timestamps = 1
>
>东东 wrote on Thu, Jun 17, 2021 at 12:52 PM:
>> Is 10.35.215.18 the host machine's IP?
>>
>> Check what tcp_tw_recycle and net.ipv4.tcp_timestamps are set to.
>> If all else fails, use tcpdump.
>>
>> On 2021-06-17 12:41:58, "yidan

Re: Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

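2021-06-16, posted by yidan zhao
Yes, it is the host machine's IP.

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_timestamps = 1

东东 wrote on Thu, Jun 17, 2021 at 12:52 PM:
> Is 10.35.215.18 the host machine's IP?
>
> Check what tcp_tw_recycle and net.ipv4.tcp_timestamps are set to.
> If all else fails, use tcpdump.
>
> On 2021-06-17 12:41:58, "yidan zhao" wrote:
> >@东东 It is a standalone cluster. The exceptions occur at random times, one every so often, with no fixed pattern.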

Re:Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

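2021-06-16, posted by 东东
Is 10.35.215.18 the host machine's IP?

Check what tcp_tw_recycle and net.ipv4.tcp_timestamps are set to.
If all else fails, use tcpdump.

On 2021-06-17 12:41:58, "yidan zhao" wrote:
>@东东 It is a standalone cluster. The exceptions occur at random times, one every so often, with no fixed pattern. There seems to be some correlation with CPU, memory, and network, but I can't confirm it because it isn't very clear.
>I looked into several of the exceptions: some coincided with network spikes, but not all of them, so it's hard to say whether that is the cause.
>
>Also, one thing I'm unclear about: this error is rarely reported online, and similar reports are all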
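The kernel settings asked about here can be read from /proc/sys without extra tools. The following is a minimal sketch, assuming a Linux host; `read_sysctl` and `tcp_settings` are hypothetical helper names. A setting the running kernel does not expose is reported as `None`, which is what newer kernels return for tcp_tw_recycle.

```python
from pathlib import Path

# The three settings mentioned in this thread.
TCP_SETTINGS = (
    "net.ipv4.tcp_tw_reuse",
    "net.ipv4.tcp_tw_recycle",
    "net.ipv4.tcp_timestamps",
)


def read_sysctl(name, root="/proc/sys"):
    """Return the value of a dotted sysctl name as a string,
    or None if the kernel does not expose that setting."""
    path = Path(root, *name.split("."))
    try:
        return path.read_text().strip()
    except FileNotFoundError:
        return None


def tcp_settings(root="/proc/sys"):
    """Collect the TIME-WAIT and timestamp related settings in one dict."""
    return {name: read_sysctl(name, root) for name in TCP_SETTINGS}
```

Running `tcp_settings()` on both the host and inside a container is worthwhile, since some of these values are tracked per network namespace.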

Re: Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

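2021-06-16, posted by yidan zhao
@东东 It is a standalone cluster. The exceptions occur at random times, one every so often, with no fixed pattern. There seems to be some correlation with CPU, memory, and network, but I can't confirm it because it isn't very clear.
I looked into several of the exceptions: some coincided with network spikes, but not all of them, so it's hard to say whether that is the cause.

Also, one thing I'm unclear about: this error is rarely reported online, and similar reports are all RemoteTransportException, with a message saying the TaskManager may have been lost. Mine is LocalTransportException, and I'm not sure whether these two errors mean different things in netty. So far I can't find much material online about either exception.

东东 wrote on Thu, Jun 17, 2021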

Re:Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by 东东
Is this single-machine standalone, or Docker/K8s?
As for when the exception occurs: is it periodic, or does it correlate with changes in CPU, memory, or even network traffic?

On 2021-06-16 19:10:24, "yidan zhao" wrote:
>Hi, yingjie.
>If the network is not stable, which config parameter I should adjust.
>
>yidan zhao wrote on Wed, Jun 16, 2021 at 6:56 PM:
>> 2: I use G1, and no full gc occurred, young gc count:

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by yidan zhao
Ok, I will try.

Yingjie Cao wrote on Wed, Jun 16, 2021 at 8:00 PM:
> Maybe you can try to increase taskmanager.network.retries,
> taskmanager.network.netty.server.backlog and
> taskmanager.network.netty.sendReceiveBufferSize. These options are useful for
> our jobs.
>
> yidan zhao wrote on Wed, Jun 16, 2021 at 7:10 PM:

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by Yingjie Cao
Maybe you can try to increase taskmanager.network.retries, taskmanager.network.netty.server.backlog and taskmanager.network.netty.sendReceiveBufferSize. These options are useful for our jobs.

yidan zhao wrote on Wed, Jun 16, 2021 at 7:10 PM:
> Hi, yingjie.
> If the network is not stable, which config
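As a rough illustration of trying these options, the sketch below renders flink-conf.yaml lines for the network options named in this thread. The helper name `render_flink_conf` is hypothetical, and the numbers in the usage example are placeholders chosen for illustration, not recommendations from the thread; suitable values depend on the cluster.

```python
# Option names as they appear in this thread.
KNOWN_OPTIONS = (
    "taskmanager.network.retries",
    "taskmanager.network.netty.server.backlog",
    "taskmanager.network.netty.sendReceiveBufferSize",
    "taskmanager.network.netty.client.connectTimeoutSec",
)


def render_flink_conf(overrides):
    """Render 'key: value' lines for flink-conf.yaml.

    Rejects any key that is not one of the options named in the thread,
    so a typo in an option name fails loudly instead of being ignored.
    """
    unknown = sorted(set(overrides) - set(KNOWN_OPTIONS))
    if unknown:
        raise ValueError("unexpected option(s): " + ", ".join(unknown))
    return "\n".join(
        f"{key}: {overrides[key]}" for key in KNOWN_OPTIONS if key in overrides
    )


# Placeholder values for illustration only.
print(render_flink_conf({
    "taskmanager.network.retries": 3,
    "taskmanager.network.netty.server.backlog": 1024,
}))
```

Changing one option at a time and watching the exception rate makes it easier to tell which setting, if any, has an effect.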

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by yidan zhao
I also searched the internet and found many results. There are some related exceptions such as org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException, but in my case it is org.apache.flink.runtime.io.network.netty.exception.LocalTransportException. It is different in

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by yidan zhao
Hi, yingjie.
If the network is not stable, which config parameter should I adjust?

yidan zhao wrote on Wed, Jun 16, 2021 at 6:56 PM:
> 2: I use G1, and no full gc occurred, young gc count: 422, time:
> 142892, so it is not bad.
> 3: stream job.
> 4: I will try to config taskmanager.network.retries which is

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by yidan zhao
2: I use G1, and no full GC occurred. Young GC count: 422, time: 142892, so it is not bad.
3: Stream job.
4: I will try configuring taskmanager.network.retries, which defaults to 0, and taskmanager.network.netty.client.connectTimeoutSec, which defaults to 120s.
5: I checked the net fd number of the

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by Yingjie Cao
Hi yidan,
1. Is the network stable?
2. Is there any GC problem?
3. Is it a batch job? If so, please use sort-shuffle, see [1] for more information.
4. You may try to config these two options: taskmanager.network.retries, taskmanager.network.netty.client.connectTimeoutSec. More relevant options

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by yidan zhao
Hi, here is the exception stack as text:

org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection timed out (connection to '10.35.215.18/10.35.215.18:2045')
    at

Re: flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by Robert Metzger
Hi Yidan,
it seems that the attachment did not make it through the mailing list. Can you copy-paste the text of the exception here or upload the log somewhere?

On Wed, Jun 16, 2021 at 9:36 AM yidan zhao wrote:
> Attachment is the exception stack from flink's web-ui. Does anyone
> have also

flink job exception analysis (netty related, readAddress failed. connection timed out)

2021-06-16, posted by yidan zhao
Attached is the exception stack from Flink's web UI. Has anyone else run into this problem? It occurs on Flink 1.12 through Flink 1.13.1, on a standalone cluster of 30 containers with 28 GB of memory each.

Re: flink job exception

2021-05-31, posted by day
history server? https://ci.apache.org/projects/flink/flink-docs-master/zh/docs/deployment/advanced/historyserver/

flink job exception

2021-05-30, posted by krislee
Hi all,
I am a Flink beginner. Today I found many exceptions in the Flink web UI and on the backend job management page:
..
11:29:30.107 [flink-akka.actor.default-dispatcher-41] ERROR org.apache.flink.runtime.rest.handler.job.JobExceptionsHandler - Exception occurred in REST handler: Job 16c614ab0d6f5b28746c66f351fb67f8 not found
..