Re: RocksDB reads at 100% CPU: how to tune?

2022-03-17, by yue ma
Hi,
I think two things are worth checking here.
First, look at the task's throughput at that moment. If the QPS is very high, for example because the job is catching up from the earliest offset, then 100% CPU is expected.
Second, you can add some in-memory caching logic in your code, similar to mini-batch, to reduce how often you hit state; that may relieve part of the problem.
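A minimal, framework-free sketch of that caching idea. A plain HashSet stands in for the RocksDB MapState, and every name here is illustrative rather than Flink API: repeat keys are answered from a small heap-resident LRU set, so only cold keys pay for the expensive contains() lookup.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/**
 * Sketch: a small on-heap LRU set in front of keyed state, so hot keys
 * never reach the (expensive) state lookup. SlowStore names are
 * illustrative; in the real job the HashSet would be a RocksDB MapState.
 */
public class SeenKeyCache {
    private final Set<String> slowStore = new HashSet<>(); // stands in for MapState
    private int slowLookups = 0;                           // counts "RocksDB" hits
    private final Map<String, Boolean> lru;

    public SeenKeyCache(int capacity) {
        // Access-ordered LinkedHashMap doubling as a tiny LRU set.
        this.lru = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true if the key was already seen; records it either way. */
    public boolean seenBefore(String key) {
        if (lru.get(key) != null) {       // get() refreshes LRU order
            return true;                  // answered from the heap, no state access
        }
        slowLookups++;
        boolean inStore = slowStore.contains(key);
        slowStore.add(key);               // stands in for mapState.put(key, ...)
        lru.put(key, Boolean.TRUE);
        return inStore;
    }

    public int slowLookups() {
        return slowLookups;
    }
}
```

In a real KeyedProcessFunction the cache has to be sized against the heap and is lost on failover, which is fine here because the authoritative copy stays in state; the cache only cuts down the number of native RocksDB get() calls.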

deng xuezhao wrote on Fri, 18 Mar 2022 at 11:19:

> Unsubscribe
>
> (Peihui He's original question of 11:18 quoted in full; see the original post below.)


Re: RocksDB reads at 100% CPU: how to tune?

2022-03-17, by Jiangang Liu
If the state is fairly small, consider using the filesystem (heap) state backend directly; these per-record RocksDB operations are quite expensive.
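Switching backends is a configuration change. A minimal sketch of the relevant flink-conf.yaml keys, assuming Flink 1.13+ (where the heap backend is named hashmap); the checkpoint path is illustrative:

```yaml
# Keep working state on the TaskManager heap instead of in RocksDB;
# snapshots still go to a durable checkpoint directory.
state.backend: hashmap
state.checkpoints.dir: hdfs:///flink/checkpoints   # illustrative path
```

This only fits when the whole state comfortably fits in the TaskManager heap; the ~100 MB checkpoints mentioned below suggest it might.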

deng xuezhao wrote on Fri, 18 Mar 2022 at 11:19:

> Unsubscribe
>
> (Peihui He's original question of 11:18 quoted in full; see the original post below.)


Re: RocksDB reads at 100% CPU: how to tune?

2022-03-17, by deng xuezhao
Unsubscribe



Peihui He wrote on 18 Mar 2022 at 11:18 AM:

(Original question quoted in full; see the original post below.)


RocksDB reads at 100% CPU: how to tune?

2022-03-17, by Peihui He
Hi, all

As the subject says: a Flink job uses RocksDB as its state backend. The job logic is roughly: for each incoming record, first check whether the record's key is already in a MapState, then write that key into the MapState.

The problem: after the job has been running for a while, the thread doing the existence check sits at 100% CPU. Stack trace:

"process (6/18)#0" Id=80 RUNNABLE (in native)
at org.rocksdb.RocksDB.get(Native Method)
at org.rocksdb.RocksDB.get(RocksDB.java:2084)
at org.apache.flink.contrib.streaming.state.RocksDBMapState.contains(RocksDBMapState.java:173)
at org.apache.flink.runtime.state.UserFacingMapState.contains(UserFacingMapState.java:72)
at com.huanju.security.soc.internal.hs.bigdata.FileScanToTiDB$$anon$12.processElement(FileScanToTiDB.scala:156)
at com.huanju.security.soc.internal.hs.bigdata.FileScanToTiDB$$anon$12.processElement(FileScanToTiDB.scala:145)
at org.apache.flink.streaming.api.operators.KeyedProcessOperator.processElement(KeyedProcessOperator.java:83)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:233)
at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:496)
at org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$624/715942770.runDefaultAction(Unknown Source)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:203)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:809)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:761)
at org.apache.flink.runtime.taskmanager.Task$$Lambda$773/520411616.run(Unknown Source)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
at java.lang.Thread.run(Thread.java:748)

Yet the checkpoint data is only about 100 MB.

So what performance bottleneck is RocksDB hitting here, and how should it be tuned?
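Since the hot call is contains() on keys that are often absent, two Flink configuration keys are commonly tried for exactly this pattern. The key names below are Flink's RocksDB-related options; the values are illustrative starting points to benchmark, not recommendations:

```yaml
# Bloom filters let RocksDB skip SST files that cannot contain a key,
# which mainly speeds up negative lookups such as contains().
state.backend.rocksdb.use-bloom-filter: true
# Give RocksDB's block cache and memtables a larger share of managed
# memory, so a ~100 MB working state can stay resident.
taskmanager.memory.managed.fraction: 0.6
```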


Re: How to find the current leader of a Flink cluster

2022-03-17, by yu'an huang
In what scenario do you need the leader of the ResourceManager and the other roles? You can open the logs from the Flink UI, and the current leader shows up in the logs; I am not sure whether that covers your use case.


> On 17 Mar 2022, at 10:38 AM, yidan zhao  wrote:
> 
> As the subject says: can the UI show the leader among cluster roles such as the ResourceManager, and the leader for each job?
> Reading it from ZooKeeper works, but it is clumsy: the stored data has to be decoded programmatically.



Re: How to troubleshoot errors from Flink logs

2022-03-17, by 胡伟华
Hi, yidan

From this log alone we can only see that the network connection between m1-sys-rpm109-7c88b.m1.baidu.com:2086
and 10.35.115.170:2045 failed. Things to check:
1. whether the TM behind 10.35.115.170:2045 exited, causing the downstream connection to time out;
2. if not, the likelier cause is genuine network jitter; host-level monitoring should show whether there was packet loss or elevated latency.

> yidan zhao wrote on 17 Mar 2022 at 6:13 PM:
>
> (yidan zhao's reply, 吴Janick's suggestion, and the original question quoted in full; all are reproduced below.)


Re: How to troubleshoot errors from Flink logs

2022-03-17, by yidan zhao
No: I have 120 machines, and I need to know which machine hit the exception first. It is not a matter of locating one machine and then reading its earliest error log.
I have found errors on host1, but those errors came from a failure on some other machine, which made the job fail and cancel its tasks, so host1 naturally has "switched from" log lines as well.
It is not the problem machine, though.

吴Janick wrote on Thu, 17 Mar 2022 at 16:33:

> (吴Janick's suggestion and the original question quoted in full; both are reproduced below.)


Re: How to troubleshoot errors from Flink logs

2022-03-17, by 吴Janick
You can search all the logs for the keyword "switched from RUNNING" and find the earliest stack trace where an exception was thrown.


> yidan zhao wrote on 17 Mar 2022 at 4:15 PM:
>
> (Original question quoted in full; reproduced below.)



How to troubleshoot errors from Flink logs

2022-03-17, by yidan zhao
As the subject says; let me give an example.
I have a job that fails frequently; take one occurrence. First, the exception history in the UI shows:

Time:      2022-03-17 15:09:44
Exception: org.apache.flink.runtime.io.network.netty.exception.LocalTransportException
Name:      bal_ft_baiduid_subid_window_reduce(300s, nulls, 0s) (1/30) - execution #7
Location:  m1-sys-rpm109-7c88b.m1.baidu.com:2086

org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
readAddress(..) failed: Connection timed out (connection to '10.35.115.170/10.35.115.170:2045')
at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:201)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:728)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:818)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out


From the Location I found the machine, logged in, and confirmed the error is indeed there. To find the cause I followed the message
Connection timed out (connection to '10.35.115.170/10.35.115.170:2045'),
located the machine behind that IP, and around the error time found log lines like:
INFO  2022-03-17 15:09:44,526 org.apache.flink.runtime.taskmanager.Task - bal_ft_baiduid_Dr12Join -> xxx_TimestampWatermarkAssigner(-90s) (17/30)#7 (f4944df682b1642e0911b2aa39a1279b) switched from RUNNING to CANCELING.
The other logs there are similar: all about canceling.

So whether I start from the UI error or follow it one hop into the logs, all I see is the cancel. What triggered the cancel in the first place?
Clearly some error caused the cancel and then the automatic recovery.

What I need to know is: with 120 containers in one job, how do I trace which container was the first to hit an error and set off this whole chain?

If any machine A has a network fault, every machine connected to A may see a timeout, but I need to know which machine broke first, i.e. which one actually has the problem. How do people generally solve this?
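The triage the thread converges on (search every container's logs for the earliest failure transition) can be sketched as a small routine. Everything below is illustrative: the point is to filter for RUNNING to FAILED transitions, which mark the root failure, rather than the RUNNING to CANCELING lines that every healthy host prints during cleanup, and then take the earliest timestamp.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch: across log lines gathered from every container, find the host
 * with the earliest "switched from RUNNING to FAILED" transition. The
 * FAILED transition is the root failure; CANCELING lines elsewhere are
 * follow-on cleanup. Line format and host labels are illustrative.
 */
public class FirstFailureFinder {

    // Flink log lines carry a "yyyy-MM-dd HH:mm:ss,SSS" timestamp, which
    // sorts chronologically when compared as a plain string.
    private static final Pattern TS =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}");

    /** hostLines: (hostname, raw log line) pairs pooled from all containers. */
    public static Optional<Map.Entry<String, String>> firstFailure(
            List<Map.Entry<String, String>> hostLines) {
        return hostLines.stream()
                .filter(e -> e.getValue().contains("switched from RUNNING to FAILED"))
                .min(Comparator.comparing(
                        (Map.Entry<String, String> e) -> timestamp(e.getValue())));
    }

    private static String timestamp(String line) {
        Matcher m = TS.matcher(line);
        return m.find() ? m.group() : "9999"; // lines without a stamp sort last
    }
}
```

In practice the same filter works as a log-search query; the code only makes the ordering rule explicit: ignore CANCELING noise, compare timestamps, report the host of the earliest FAILED line.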