Could someone explain the differences between these parameters? If the network in my environment is poor, which parameter (or parameters) should I be tuning? So far I have only raised akka.ask.timeout. Looking at the configuration, there are a lot of timeout-related parameters:
akka.ask.timeout
akka.lookup.timeout
akka.retry-gate-closed-for
akka.tcp.timeout
akka.startup-timeout
heartbeat.interval
heartbeat.timeout
high-availability.zookeeper.client.connection-timeout
high-availability.zookeeper.client.session-timeout
taskmanager.network.request-backoff.max
...

yidan zhao <hinobl...@gmail.com> wrote on Wed, Mar 10, 2021 at 1:13 PM:

> Today I compared the effect of G1 against Copy + MarkSweepCompact,
> running the same job for the same length of time. G1 spent more total
> time in GC but ran far more collections, each of which was shorter:
> over 1h15min, G1 did 1100+ GCs at roughly 1s each, while the latter
> did 205 GCs at roughly 1.9s each.
>
> yidan zhao <hinobl...@gmail.com> wrote on Tue, Mar 9, 2021 at 7:30 PM:
>
>> One more point, about the GC collector: is it fine to simply use G1
>> without thinking further? I had always been on G1 and only switched
>> away by accident recently when I edited the opts; changing the
>> collector was never the intention.
>>
>> yidan zhao <hinobl...@gmail.com> wrote on Tue, Mar 9, 2021 at 7:26 PM:
>>
>>> I had a look. CPU does spike, but that is basically expected, since
>>> my job runs in 5-minute waves, so there is a spike roughly every 5
>>> minutes. I then checked GC through the Flink web UI and found that
>>> some clusters really do have a full-GC problem: full GCs average
>>> about 10-20s. That is only an average, so I cannot tell whether some
>>> individual collections run even longer. Either way, 10-20s is
>>> clearly long, and I will work on improving it.
>>>
>>> (1) I am not sure whether this is directly related to the problem,
>>> though: is a 20s pause enough to trigger it?
>>> (2) Also, could people share their memory setups: how many TMs, how
>>> much memory per TM, and roughly what data volume the job handles?
>>> My cluster has 5 TMs with 100 GB each, ingesting about 100k QPS at
>>> the entry, though a large share is filtered out, so downstream
>>> traffic is much smaller. Checkpoints reach about 3-4 GB.
>>>
>>> Michael Ran <greemqq...@163.com> wrote on Tue, Mar 9, 2021 at 4:27 PM:
>>>
>>>> What did the load look like at the time? Was it ever too high, and
>>>> for what reason? Also monitor the network and disks.
>>>>
>>>> On 2021-03-09 14:57:43, "yidan zhao" <hinobl...@gmail.com> wrote:
>>>>
>>>>> Also, what settings would people recommend? I will probably just
>>>>> default to G1, though I am not sure whether G1 also needs
>>>>> fine-tuning. My memory settings are fairly large (I have both
>>>>> 50 GB and 100 GB TaskManagers); with heaps that big, is there
>>>>> anything special I should configure?
>>>>> Or is it worth splitting them up, say 10 GB heaps with a larger
>>>>> number of TaskManagers?
>>>>>
>>>>> yidan zhao <hinobl...@gmail.com> wrote on Tue, Mar 9, 2021 at 2:56 PM:
>>>>>
>>>>>> OK, I will look into it. I also noticed today that many of my
>>>>>> clusters use different GC collectors. So far I have seen three;
>>>>>> the default is G1. Where flink conf sets env.java.opts:
>>>>>> "-XX:-OmitStackTraceInFastThrow", two other collectors appear:
>>>>>> "Parallel GC with 83 threads" on some clusters, and "Mark Sweep
>>>>>> Compact GC" on others. Does Flink adjust this dynamically based
>>>>>> on memory size?
>>>>>>
>>>>>> I think I roughly understand why G1 is not used: setting
>>>>>> java.opts probably overrides the default flags rather than
>>>>>> appending to them. All I actually wanted was to set
>>>>>> -XX:-OmitStackTraceInFastThrow.
>>>>>>
>>>>>> 杨杰 <471419...@qq.com> wrote on Mon, Mar 8, 2021 at 3:09 PM:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Worth checking the GC situation; frequent full GC can also
>>>>>>> cause these symptoms.
>>>>>>>
>>>>>>> Best,
>>>>>>> jjiey
>>>>>>>
>>>>>>> On Mar 8, 2021, at 14:37, yidan zhao <hinobl...@gmail.com> wrote:
>>>>>>>
>>>>>>>> As the subject says, one of my jobs keeps hitting this
>>>>>>>> exception and then restarting. Today, 1h after the job
>>>>>>>> started, I checked the checkpoint page in the web UI: the
>>>>>>>> "restored" count had already reached 8, and the Exceptions
>>>>>>>> page shows this error, so I suspect most of the restores were
>>>>>>>> caused by it. Apart from that, the job also restarts due to
>>>>>>>> the error 'Job leader for job id
>>>>>>>> eb5d2893c4c6f4034995b9c8e180f01e lost leadership'.
>>>>>>>>
>>>>>>>> Below is an error log from just now (environment: Flink 1.12,
>>>>>>>> standalone cluster, 5 JMs + 5 TMs, JMs and TMs co-located on
>>>>>>>> the same machines):
>>>>>>>>
>>>>>>>> 2021-03-08 14:31:40
>>>>>>>> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Error at remote task manager '10.35.185.38/10.35.185.38:2016'.
>>>>>>>>     at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.decodeMsg(CreditBasedPartitionRequestClientHandler.java:294)
>>>>>>>>     at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelRead(CreditBasedPartitionRequestClientHandler.java:183)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>>>>>>>>     at org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelRead(NettyMessageClientDecoderDelegate.java:115)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>>>>> Caused by: org.apache.flink.runtime.io.network.partition.ProducerFailedException: org.apache.flink.util.FlinkException: JobManager responsible for eb5d2893c4c6f4034995b9c8e180f01e lost the leadership.
>>>>>>>>     at org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.writeAndFlushNextMessageIfPossible(PartitionRequestQueue.java:221)
>>>>>>>>     at org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.enqueueAvailableReader(PartitionRequestQueue.java:108)
>>>>>>>>     at org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.userEventTriggered(PartitionRequestQueue.java:170)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:346)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:332)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:324)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:117)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.userEventTriggered(ByteToMessageDecoder.java:365)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:346)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:332)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:324)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.userEventTriggered(DefaultChannelPipeline.java:1428)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:346)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:332)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireUserEventTriggered(DefaultChannelPipeline.java:913)
>>>>>>>>     at org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.lambda$notifyReaderNonEmpty$0(PartitionRequestQueue.java:87)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>>>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:387)
>>>>>>>>     ... 3 more
>>>>>>>> Caused by: org.apache.flink.util.FlinkException: JobManager responsible for eb5d2893c4c6f4034995b9c8e180f01e lost the leadership.
>>>>>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1422)
>>>>>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1300(TaskExecutor.java:174)
>>>>>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$2(TaskExecutor.java:1856)
>>>>>>>>     at java.util.Optional.ifPresent(Optional.java:159)
>>>>>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$3(TaskExecutor.java:1855)
>>>>>>>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:404)
>>>>>>>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:197)
>>>>>>>>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
>>>>>>>>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>>>>>>>>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>>>>>>>>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>>>>>>>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>>>>>>>>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>>>>>>>>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>>>>>>>>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>>>>>>>>     at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>>>>>>>>     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>>>>>>>>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>>>>>>>>     at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>>>>>>>>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>>>>>>>>     at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>>>>>>>>     at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>>>>>>>>     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>>>>     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>>>>>     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>>>>     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>>>>> Caused by: java.lang.Exception: Job leader for job id eb5d2893c4c6f4034995b9c8e180f01e lost leadership.
>>>>>>>>     ... 24 more
>>>>>>>>
>>>>>>>> (1) The ZooKeeper timeout is set to 60s; even with network
>>>>>>>> trouble, I doubt 60s would not be enough for ZK.
>>>>>>>> (2) akka.ask.timeout: 60s
>>>>>>>> taskmanager.network.request-backoff.max: 60000
>>>>>>>> The akka parameter was also raised to 60s earlier.
>>>>>>>>
>>>>>>>> Given the above, I would appreciate any ideas from the community.
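[Editor's note] The timeout options listed at the top of the thread all live in flink-conf.yaml but guard different failure paths, which is why raising akka.ask.timeout alone may not help. A minimal sketch follows; the values shown are illustrative (roughly the Flink 1.12-era defaults as I recall them) and should be checked against your own version, not taken as tuning advice:

```yaml
# flink-conf.yaml (sketch; values are illustrative, not recommendations)

# RPC request/response timeout between JobManager and TaskManager.
akka.ask.timeout: 60 s

# Liveness detection between JM and TM: a pause (e.g. a long full GC)
# exceeding heartbeat.timeout gets the peer declared dead.
heartbeat.interval: 10000      # ms
heartbeat.timeout: 50000       # ms

# ZooKeeper HA has its own, independent timeouts: a JobManager pause
# longer than the session timeout costs it its leadership
# ("lost leadership" in the stack trace above).
high-availability.zookeeper.client.session-timeout: 60000     # ms
high-availability.zookeeper.client.connection-timeout: 15000  # ms

# Upper bound on the backoff while a TM retries partition requests.
taskmanager.network.request-backoff.max: 60000  # ms
```

Given the 10-20s full GCs reported in the thread, the heartbeat and ZooKeeper session timeouts are the ones a long pause could plausibly breach; the remaining akka.* lookup/startup/gate options mainly matter at (re)connection time.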
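[Editor's note] On the env.java.opts observation: the thread's own hypothesis is that setting this option replaces the JVM flags Flink would otherwise pass, letting the collector silently fall back to the JVM's ergonomics default. If that hypothesis holds, the defensive fix is to name the collector explicitly whenever env.java.opts is set at all. A sketch, assuming G1 is the collector you want:

```yaml
# flink-conf.yaml: pin the collector explicitly so that adding an
# unrelated flag cannot silently change which GC is used.
env.java.opts: "-XX:+UseG1GC -XX:-OmitStackTraceInFastThrow"
```

This would also explain the three collectors observed across clusters: with no explicit flag, JDK 8 ergonomics typically picks Parallel GC on server-class machines and the serial collector (reported as "Mark Sweep Compact GC") otherwise, which matches the web UI names quoted above.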