Re: Can a single Flink history server support multiple Flink application clusters?

2021-08-19, by Chenyu Zheng
The History Server's REST API also uses the job ID to distinguish jobs:

  *   /config
  *   /jobs/overview
  *   /jobs/<jobid>
  *   /jobs/<jobid>/vertices
  *   /jobs/<jobid>/config
  *   /jobs/<jobid>/exceptions
  *   /jobs/<jobid>/accumulators
  *   /jobs/<jobid>/vertices/<vertexid>
  *   /jobs/<jobid>/vertices/<vertexid>/subtasktimes
  *   /jobs/<jobid>/vertices/<vertexid>/taskmanagers
  *   /jobs/<jobid>/vertices/<vertexid>/accumulators
  *   /jobs/<jobid>/vertices/<vertexid>/subtasks/accumulators
  *   /jobs/<jobid>/vertices/<vertexid>/subtasks/<subtaskindex>
  *   /jobs/<jobid>/vertices/<vertexid>/subtasks/<subtaskindex>/attempts/<attempt>
  *   /jobs/<jobid>/vertices/<vertexid>/subtasks/<subtaskindex>/attempts/<attempt>/accumulators
  *   /jobs/<jobid>/plan
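The practical consequence for multiple application clusters: archived jobs are keyed solely by job ID, so two clusters that happen to produce the same job ID clobber each other. A toy illustration of that failure mode (plain Python, not Flink code; the one-file-per-job layout is a simplification):

```python
import tempfile
from pathlib import Path

def archive_job(archive_dir: Path, job_id: str, payload: str) -> Path:
    """Simplified model of a completed-job archive: one entry per job ID."""
    target = archive_dir / job_id
    # A second cluster writing the same job ID overwrites silently.
    target.write_text(payload)
    return target

archive = Path(tempfile.mkdtemp())

# Cluster A and cluster B both archive a job with the same ID
# (application clusters without HA all default to the all-zero job ID).
archive_job(archive, "00000000000000000000000000000000", "cluster-A job stats")
archive_job(archive, "00000000000000000000000000000000", "cluster-B job stats")

# Only cluster B's archive survives; the history server would serve B's data for A's job.
print((archive / "00000000000000000000000000000000").read_text())  # cluster-B job stats
```

This is why the thread recommends either distinct archive directories per cluster or ensuring unique job IDs across clusters.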


From: Chenyu Zheng 
Reply-To: "user-zh@flink.apache.org" 
Date: Friday, August 20, 2021 at 11:43 AM
To: "user-zh@flink.apache.org" 
Subject: Can a single Flink history server support multiple Flink application clusters?

Hello,

We are currently running jobs on Kubernetes in Flink application mode, and would now like to deploy a history server to make debugging easier. According to the documentation, however, the Flink HistoryServer seems to support only different jobs within a single cluster; with multiple clusters, identical job IDs will lead to errors.

For multiple application clusters, what is the recommended way to use the history server?

Thanks


Can a single Flink history server support multiple Flink application clusters?

2021-08-19, by Chenyu Zheng
Hello,

We are currently running jobs on Kubernetes in Flink application mode, and would now like to deploy a history server to make debugging easier. According to the documentation, however, the Flink HistoryServer seems to support only different jobs within a single cluster; with multiple clusters, identical job IDs will lead to errors.

For multiple application clusters, what is the recommended way to use the history server?

Thanks


How do I build a Flink Docker image from source?

2021-08-19, by Chenyu Zheng
Hi,

I have recently made some modifications to the source code. How do I build a Docker image from this source? That would help me with further testing.

Thanks
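One common approach (a sketch, not from this thread: the paths and image tags are assumptions for a 1.12.5-era tree) is to build the Flink distribution with Maven and then layer the resulting `flink-dist` output onto the matching official base image:

```shell
# Build the Flink distribution from the modified source tree (skip tests for speed).
mvn clean install -DskipTests

# The distribution lands under flink-dist/target/flink-<version>-bin/flink-<version>/.
# A minimal Dockerfile that overlays it onto the official image of the same version:
cat > Dockerfile <<'EOF'
FROM flink:1.12.5-scala_2.11-java8
COPY flink-dist/target/flink-1.12.5-bin/flink-1.12.5/ /opt/flink/
EOF

# my-registry is a placeholder for your own registry.
docker build -t my-registry/flink:1.12.5-custom .
```

Alternatively, the apache/flink-docker repository contains the official Dockerfiles, which can be pointed at a locally built distribution tarball instead of the released one.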


Re: Flink 1.12.5: The heartbeat of JobManager/TaskManager with id xxx timed out

2021-08-10, by Chenyu Zheng
$OrElse.applyOrElse(PartialFunction.scala:171) 
[flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517) 
[flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) 
[flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) 
[flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.actor.ActorCell.invoke(ActorCell.scala:561) 
[flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) 
[flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.dispatch.Mailbox.run(Mailbox.scala:225) 
[flink-dist_2.11-1.12.5.jar:1.12.5]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235) 
[flink-dist_2.11-1.12.5.jar:1.12.5]
at 
akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
[flink-dist_2.11-1.12.5.jar:1.12.5]
at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
[flink-dist_2.11-1.12.5.jar:1.12.5]
at 
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
[flink-dist_2.11-1.12.5.jar:1.12.5]
at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 
[flink-dist_2.11-1.12.5.jar:1.12.5]
Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager 
with id 1da1bb0693814dd8cc2549e4f5cd368a timed out.
... 27 more

On 2021/8/10, 7:13 PM, "Chenyu Zheng"  wrote:

Hi developers,


I am trying to deploy a Flink cluster on Kubernetes, but when I raise the parallelism to a fairly large value (128), I frequently hit various JobManager/TaskManager timeout errors, after which my job is automatically cancelled.

I am certain this is not a network problem, because:

  *   The problem never occurred at parallelism 32/64, but at 128 it shows up on every run
  *   Our Flink is deployed in a production Kubernetes cluster, and no other containers have reported network issues
  *   Raising heartbeat.timeout (to 300s) works around the problem
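For reference, the heartbeat.timeout workaround described above as a flink-conf.yaml fragment (note that the option is specified in milliseconds, so 300 s is 300000; the 50000 ms default is from the Flink documentation):

```yaml
# 300 s, up from the 50 s (50000 ms) default
heartbeat.timeout: 300000
```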

My Flink environment:
·Flink 1.12.5 with Java 8, Scala 2.11
·Jobmanager Start command: $JAVA_HOME/bin/java -classpath 
$FLINK_CLASSPATH -Xmx15703474176 -Xms15703474176 -XX:MaxMetaspaceSize=268435456 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintPromotionFailure -XX:+PrintGCCause -XX:+PrintHeapAtGC 
-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 
-Dlog.file=/opt/flink/log/jobmanager.log 
-Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties 
-Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties 
org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint 
-D jobmanager.memory.off-heap.size=134217728b -D 
jobmanager.memory.jvm-overhead.min=1073741824b -D 
jobmanager.memory.jvm-metaspace.size=268435456b -D 
jobmanager.memory.heap.size=15703474176b -D 
jobmanager.memory.jvm-overhead.max=1073741824b
·Taskmanager Start command: $JAVA_HOME/bin/java -classpath 
$FLINK_CLASSPATH -Xmx1664299798 -Xms1664299798 
-XX:MaxDirectMemorySize=493921243 -XX:MaxMetaspaceSize=268435456 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintPromotionFailure -XX:+PrintGCCause -XX:+PrintHeapAtGC 
-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 
-Dlog.file=/opt/flink/log/taskmanager.log 
-Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties 
-Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties 
org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner -D 
taskmanager.memory.framework.off-heap.size=134217728b -D 
taskmanager.memory.network.max=359703515b -D 
taskmanager.memory.network.min=359703515b -D 
taskmanager.memory.framework.heap.size=134217728b -D 
taskmanager.memory.managed.size=1438814063b -D taskmanager.cpu.cores=1.0 -D 
taskmanager.memory.task.heap.size=1530082070b -D 
taskmanager.memory.task.off-heap.size=0b -D 
taskmanager.memory.jvm-metaspace.size=268435456b -D 
taskmanager.memory.jvm-overhead.max=429496736b -D 
taskmanager.memory.jvm-overhead.min=429496736b --configDir /opt/flink/conf 
-Djobmanager.rpc.address='10.50.132.154' 
-Dpipeline.classpaths='file:usrlib/flink-playground-clickcountjob-print.jar' 
-Djobmanager.memory.off-heap.size='134217728b' 
-Dweb.tmpdir='/tmp/flink-web-07190d10-c6ea-4b1a-9eee-b2d0b2711a76' 
-Drest.address='10.50.132.154' 
-Djobmanager.memory.jvm-overhead.max='1073741824b' 
-Djobmanager.memory.jvm-overhead.min='1073741824b' 
-Dtaskmanager.resource-id='stream-367f634e41349f7195961cdb0c6c-taskmanager-1-17'
 -Dexecution.target='embedded' 
-Dpipeline.jars='file:/opt/flink/usrlib/flink-playground-clickcountjob-print.jar'
 -Djobmanager.memory.jvm-metaspace.size='268435456b' 
-Djobmanager.memory.heap.size='15703474176b'
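As a sanity check on the flags above (my own arithmetic, not part of the original mail): the TaskManager JVM flags are derived from Flink's fine-grained memory options, where -Xmx covers framework heap plus task heap, -XX:MaxDirectMemorySize covers framework/task off-heap plus network buffers, and everything together gives the configured process size:

```python
# Fine-grained TaskManager memory settings copied from the start command above (bytes).
tm = {
    "framework.heap":     134217728,
    "task.heap":          1530082070,
    "framework.off-heap": 134217728,
    "task.off-heap":      0,
    "network":            359703515,
    "managed":            1438814063,   # allocated off-heap, but not via direct buffers
    "jvm-metaspace":      268435456,
    "jvm-overhead":       429496736,
}

# -Xmx/-Xms cover only the framework and task heap.
heap = tm["framework.heap"] + tm["task.heap"]
assert heap == 1664299798  # matches -Xmx1664299798

# -XX:MaxDirectMemorySize covers framework off-heap, task off-heap and network buffers.
direct = tm["framework.off-heap"] + tm["task.off-heap"] + tm["network"]
assert direct == 493921243  # matches -XX:MaxDirectMemorySize=493921243

# All components together give the total process size: exactly 4 GiB here.
total = sum(tm.values())
print(total, total == 4 * 1024**3)  # 4294967296 True
```

So each TaskManager in this setup has a 4 GiB process budget, of which only about 1.5 GiB is task heap.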

Are these timeouts expected behavior? What should I do to track down their root cause?

Thanks!

Chenyu



user-zh@flink.apache.org

2021-08-10, by Chenyu Zheng
Hi developers,

I am trying to deploy a Flink cluster on Kubernetes, but when I raise the parallelism to a fairly large value (128), I frequently hit various JobManager/TaskManager timeout errors, after which my job is automatically cancelled.

I am certain this is not a network problem, because:

  *   The problem never occurred at parallelism 32/64, but at 128 it shows up on every run
  *   Our Flink is deployed in a production Kubernetes cluster, and no other containers have reported network issues
  *   Raising heartbeat.timeout (to 300s) works around the problem

My Flink environment:
·Flink 1.12.5 with Java 8, Scala 2.11
·Jobmanager Start command: $JAVA_HOME/bin/java -classpath 
$FLINK_CLASSPATH -Xmx15703474176 -Xms15703474176 -XX:MaxMetaspaceSize=268435456 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintPromotionFailure -XX:+PrintGCCause -XX:+PrintHeapAtGC 
-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 
-Dlog.file=/opt/flink/log/jobmanager.log 
-Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties 
-Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties 
org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint 
-D jobmanager.memory.off-heap.size=134217728b -D 
jobmanager.memory.jvm-overhead.min=1073741824b -D 
jobmanager.memory.jvm-metaspace.size=268435456b -D 
jobmanager.memory.heap.size=15703474176b -D 
jobmanager.memory.jvm-overhead.max=1073741824b
·Taskmanager Start command: $JAVA_HOME/bin/java -classpath 
$FLINK_CLASSPATH -Xmx1664299798 -Xms1664299798 
-XX:MaxDirectMemorySize=493921243 -XX:MaxMetaspaceSize=268435456 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
-XX:+PrintPromotionFailure -XX:+PrintGCCause -XX:+PrintHeapAtGC 
-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 
-Dlog.file=/opt/flink/log/taskmanager.log 
-Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties 
-Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties 
org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner -D 
taskmanager.memory.framework.off-heap.size=134217728b -D 
taskmanager.memory.network.max=359703515b -D 
taskmanager.memory.network.min=359703515b -D 
taskmanager.memory.framework.heap.size=134217728b -D 
taskmanager.memory.managed.size=1438814063b -D taskmanager.cpu.cores=1.0 -D 
taskmanager.memory.task.heap.size=1530082070b -D 
taskmanager.memory.task.off-heap.size=0b -D 
taskmanager.memory.jvm-metaspace.size=268435456b -D 
taskmanager.memory.jvm-overhead.max=429496736b -D 
taskmanager.memory.jvm-overhead.min=429496736b --configDir /opt/flink/conf 
-Djobmanager.rpc.address='10.50.132.154' 
-Dpipeline.classpaths='file:usrlib/flink-playground-clickcountjob-print.jar' 
-Djobmanager.memory.off-heap.size='134217728b' 
-Dweb.tmpdir='/tmp/flink-web-07190d10-c6ea-4b1a-9eee-b2d0b2711a76' 
-Drest.address='10.50.132.154' 
-Djobmanager.memory.jvm-overhead.max='1073741824b' 
-Djobmanager.memory.jvm-overhead.min='1073741824b' 
-Dtaskmanager.resource-id='stream-367f634e41349f7195961cdb0c6c-taskmanager-1-17'
 -Dexecution.target='embedded' 
-Dpipeline.jars='file:/opt/flink/usrlib/flink-playground-clickcountjob-print.jar'
 -Djobmanager.memory.jvm-metaspace.size='268435456b' 
-Djobmanager.memory.heap.size='15703474176b'

Are these timeouts expected behavior? What should I do to track down their root cause?

Thanks!

Chenyu


Re: Several Flink 1.12.2 timeout issues

2021-08-04, by Chenyu Zheng
The timeouts appear only after all TaskManager containers have started successfully, so this cannot be a scheduling-level problem.
At the network level we are on the production network, which already carries a large amount of production traffic, and no other containers have reported similar problems.

I still suspect that some Flink configuration is behind this. Does Flink expose any relevant metrics or logs I could consult?
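One starting point for the logging question (my suggestion, not from the thread): Flink's heartbeat services live under the org.apache.flink.runtime.heartbeat package and its RPC layer under org.apache.flink.runtime.rpc.akka, so a log4j2 properties fragment like the following raises their verbosity:

```properties
logger.heartbeat.name = org.apache.flink.runtime.heartbeat
logger.heartbeat.level = DEBUG
logger.rpc.name = org.apache.flink.runtime.rpc.akka
logger.rpc.level = DEBUG
```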

On 2021/8/4, 11:50 AM, "东东"  wrote:



You can look into this at two levels:
1. Scheduling. In native application mode the JM container starts first and then talks to the Kubernetes API to bring up the TMs. Check the Kubernetes events and logs to see whether any step of that flow is a bottleneck, such as image pulls or TM container startup.

2. Network. If scheduling is fine and all containers start promptly and normally, check whether there is a bottleneck at the network level; if necessary, capture traffic with tcpdump.







On 2021-08-03 14:02:53, "Chenyu Zheng"  wrote:

Hi developers,

I am trying to deploy Flink 1.12.2 on Kubernetes using the native application deployment mode. In testing, however, various timeouts occur frequently once the job parallelism is increased. According to our monitoring, neither the JM nor the TM containers come anywhere near the CPU and memory that Kubernetes has allocated to them.

After raising akka.ask.timeout to 1 minute and heartbeat.timeout to 2 minutes, the timeouts were alleviated.

My questions: with a large parallelism (say 128), are these akka and heartbeat timeouts normal? If not, how should I troubleshoot their root cause? And what are the downsides of simply raising the timeouts of the various components?

The attachment contains the JobManager log for the akka timeout; I will post the TaskManager heartbeat-timeout log later.

Thanks!




Re: Several Flink 1.12.2 timeout issues

2021-08-03, by Chenyu Zheng
It is because the upstream event source has a fairly high rate; we need to raise the parallelism to match it.

Thanks!

On 2021/8/3, 2:41 PM, "Ye Chen"  wrote:

Hi,
May I ask why you set the parallelism to 128? That value seems rather large; what was the reasoning behind it?






On 2021-08-03 14:02:53, "Chenyu Zheng"  wrote:

Hi developers,

I am trying to deploy Flink 1.12.2 on Kubernetes using the native application deployment mode. In testing, however, various timeouts occur frequently once the job parallelism is increased. According to our monitoring, neither the JM nor the TM containers come anywhere near the CPU and memory that Kubernetes has allocated to them.

After raising akka.ask.timeout to 1 minute and heartbeat.timeout to 2 minutes, the timeouts were alleviated.

My questions: with a large parallelism (say 128), are these akka and heartbeat timeouts normal? If not, how should I troubleshoot their root cause? And what are the downsides of simply raising the timeouts of the various components?

The attachment contains the JobManager log for the akka timeout; I will post the TaskManager heartbeat-timeout log later.

Thanks!




Re: Several Flink 1.12.2 timeout issues

2021-08-03, by Chenyu Zheng
) 
~[flink-dist_2.11-1.12.2.jar:1.12.2]
... 4 more

From: Chenyu Zheng 
Reply-To: "user-zh@flink.apache.org" 
Date: Tuesday, August 3, 2021 at 2:04 PM
To: "user-zh@flink.apache.org" 
Subject: Several Flink 1.12.2 timeout issues

Hi developers,

I am trying to deploy Flink 1.12.2 on Kubernetes using the native application deployment mode. In testing, however, various timeouts occur frequently once the job parallelism is increased. According to our monitoring, neither the JM nor the TM containers come anywhere near the CPU and memory that Kubernetes has allocated to them.

After raising akka.ask.timeout to 1 minute and heartbeat.timeout to 2 minutes, the timeouts were alleviated.

My questions: with a large parallelism (say 128), are these akka and heartbeat timeouts normal? If not, how should I troubleshoot their root cause? And what are the downsides of simply raising the timeouts of the various components?

The attachment contains the JobManager log for the akka timeout; I will post the TaskManager heartbeat-timeout log later.

Thanks!



Re: Several Flink 1.12.2 timeout issues

2021-08-03, by Chenyu Zheng
]
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
~[?:1.8.0_282]
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:1.8.0_282]
at java.lang.reflect.Method.invoke(Method.java:498) 
~[?:1.8.0_282]
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:349)
 ~[flink-dist_2.11-1.12.2.jar:1.12.2]
at 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:219)
 ~[flink-dist_2.11-1.12.2.jar:1.12.2]
at 
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114) 
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at 
org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:242)
 ~[flink-dist_2.11-1.12.2.jar:1.12.2]
... 10 more
Caused by: java.util.concurrent.TimeoutException: Invocation of public default 
java.util.concurrent.CompletableFuture 
org.apache.flink.runtime.webmonitor.RestfulGateway.requestJobStatus(org.apache.flink.api.common.JobID,org.apache.flink.api.common.time.Time)
 timed out.
at 
org.apache.flink.runtime.rpc.akka.$Proxy36.requestJobStatus(Unknown Source) 
~[?:1.12.2]
at 
org.apache.flink.client.deployment.application.JobStatusPollingUtils.lambda$getJobResult$0(JobStatusPollingUtils.java:57)
 ~[flink-dist_2.11-1.12.2.jar:1.12.2]
at 
org.apache.flink.client.deployment.application.JobStatusPollingUtils.pollJobResultAsync(JobStatusPollingUtils.java:87)
 ~[flink-dist_2.11-1.12.2.jar:1.12.2]
at 
org.apache.flink.client.deployment.application.JobStatusPollingUtils.lambda$null$3(JobStatusPollingUtils.java:107)
 ~[flink-dist_2.11-1.12.2.jar:1.12.2]
... 9 more
Caused by: akka.pattern.AskTimeoutException: Ask timed out on 
[Actor[akka://flink/user/rpc/dispatcher_1#1531007562]] after [6 ms]. 
Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A 
typical reason for `AskTimeoutException` is that the recipient actor didn't 
send a reply.
at 
akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) 
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at 
akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) 
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648) 
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205) 
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
 ~[flink-dist_2.11-1.12.2.jar:1.12.2]
at 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) 
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) 
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at 
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
 ~[flink-dist_2.11-1.12.2.jar:1.12.2]
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
 ~[flink-dist_2.11-1.12.2.jar:1.12.2]
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
 ~[flink-dist_2.11-1.12.2.jar:1.12.2]
at 
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
 ~[flink-dist_2.11-1.12.2.jar:1.12.2]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_282]

From: Chenyu Zheng 
Reply-To: "user-zh@flink.apache.org" 
Date: Tuesday, August 3, 2021 at 2:04 PM
To: "user-zh@flink.apache.org" 
Subject: Several Flink 1.12.2 timeout issues

Hi developers,

I am trying to deploy Flink 1.12.2 on Kubernetes using the native application deployment mode. In testing, however, various timeouts occur frequently once the job parallelism is increased. According to our monitoring, neither the JM nor the TM containers come anywhere near the CPU and memory that Kubernetes has allocated to them.

After raising akka.ask.timeout to 1 minute and heartbeat.timeout to 2 minutes, the timeouts were alleviated.

My questions: with a large parallelism (say 128), are these akka and heartbeat timeouts normal? If not, how should I troubleshoot their root cause? And what are the downsides of simply raising the timeouts of the various components?

The attachment contains the JobManager log for the akka timeout; I will post the TaskManager heartbeat-timeout log later.

Thanks!



Re: Several Flink 1.12.2 timeout issues

2021-08-03, by Chenyu Zheng
) 
~[flink-dist_2.11-1.12.2.jar:1.12.2]
... 4 more

From: Chenyu Zheng 
Reply-To: "user-zh@flink.apache.org" 
Date: Tuesday, August 3, 2021 at 2:04 PM
To: "user-zh@flink.apache.org" 
Subject: Several Flink 1.12.2 timeout issues

Hi developers,

I am trying to deploy Flink 1.12.2 on Kubernetes using the native application deployment mode. In testing, however, various timeouts occur frequently once the job parallelism is increased. According to our monitoring, neither the JM nor the TM containers come anywhere near the CPU and memory that Kubernetes has allocated to them.

After raising akka.ask.timeout to 1 minute and heartbeat.timeout to 2 minutes, the timeouts were alleviated.

My questions: with a large parallelism (say 128), are these akka and heartbeat timeouts normal? If not, how should I troubleshoot their root cause? And what are the downsides of simply raising the timeouts of the various components?

The attachment contains the JobManager log for the akka timeout; I will post the TaskManager heartbeat-timeout log later.

Thanks!



Several Flink 1.12.2 timeout issues

2021-08-03, by Chenyu Zheng
Hi developers,

I am trying to deploy Flink 1.12.2 on Kubernetes using the native application deployment mode. In testing, however, various timeouts occur frequently once the job parallelism is increased. According to our monitoring, neither the JM nor the TM containers come anywhere near the CPU and memory that Kubernetes has allocated to them.

After raising akka.ask.timeout to 1 minute and heartbeat.timeout to 2 minutes, the timeouts were alleviated.

My questions: with a large parallelism (say 128), are these akka and heartbeat timeouts normal? If not, how should I troubleshoot their root cause? And what are the downsides of simply raising the timeouts of the various components?

The attachment contains the JobManager log for the akka timeout; I will post the TaskManager heartbeat-timeout log later.

Thanks!



Flink v1.12.2 Kubernetes session mode fails to mount log4j.properties from the ConfigMap

2021-06-19, by Chenyu Zheng
Hi developers,
I have recently been trying to start Flink in Kubernetes session mode, but found that the log4j.properties from the ConfigMap is not mounted. Is this a bug? Is there a way to work around it and mount log4j.properties dynamically?
My YAML:
apiVersion: v1
data:
  flink-conf.yaml: |-
    taskmanager.numberOfTaskSlots: 1
    blob.server.port: 6124
    kubernetes.rest-service.exposed.type: ClusterIP
    kubernetes.jobmanager.cpu: 1.00
    high-availability.storageDir: s3p://hulu-caposv2-flink-s3-bucket/session-cluster-test/ha-backup/
    queryable-state.proxy.ports: 6125
    kubernetes.service-account: stream-app
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    jobmanager.memory.process.size: 1024m
    taskmanager.memory.process.size: 1024m
    kubernetes.taskmanager.annotations: cluster-autoscaler.kubernetes.io/safe-to-evict:false
    kubernetes.namespace: test123
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 5
    kubernetes.taskmanager.cpu: 1.00
    state.backend: filesystem
    parallelism.default: 4
    kubernetes.container.image: cubox.prod.hulu.com/proxy/flink:1.12.2-scala_2.11-java8-stdout7
    kubernetes.taskmanager.labels: capos_id:session-cluster-test,stream-component:jobmanager
    state.checkpoints.dir: s3p://hulu-caposv2-flink-s3-bucket/session-cluster-test/checkpoints/
    kubernetes.cluster-id: session-cluster-test
    kubernetes.jobmanager.annotations: cluster-autoscaler.kubernetes.io/safe-to-evict:false
    state.savepoints.dir: s3p://hulu-caposv2-flink-s3-bucket/session-cluster-test/savepoints/
    restart-strategy.fixed-delay.delay: 15s
    taskmanager.rpc.port: 6122
    jobmanager.rpc.address: session-cluster-test-flink-jobmanager
    kubernetes.jobmanager.labels: capos_id:session-cluster-test,stream-component:jobmanager
    jobmanager.rpc.port: 6123
  log4j.properties: |-
    logger.kafka.name = org.apache.kafka
    logger.hadoop.level = INFO
    appender.rolling.type = RollingFile
    appender.rolling.filePattern = ${sys:log.file}.%i
    appender.rolling.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
    logger.netty.name = org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline
    rootLogger = INFO, rolling
    logger.akka.name = akka
    appender.rolling.strategy.type = DefaultRolloverStrategy
    logger.akka.level = INFO
    appender.rolling.append = false
    logger.hadoop.name = org.apache.hadoop
    appender.rolling.fileName = ${sys:log.file}
    appender.rolling.policies.type = Policies
    rootLogger.appenderRef.rolling.ref = RollingFileAppender
    logger.kafka.level = INFO
    appender.rolling.name = RollingFileAppender
    appender.rolling.layout.type = PatternLayout
    appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
    appender.rolling.policies.size.size = 100MB
    appender.rolling.strategy.max = 10
    logger.netty.level = OFF
    logger.zookeeper.name = org.apache.zookeeper
    logger.zookeeper.level = INFO
kind: ConfigMap
metadata:
  labels:
    app: session-cluster-test
    capos_id: session-cluster-test
  name: session-cluster-test-flink-config
  namespace: test123

---

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    capos_id: session-cluster-test
  name: session-cluster-test-flink-startup
  namespace: test123
spec:
  backoffLimit: 6
  completions: 1
  parallelism: 1
  template:
    metadata:
      annotations:
        caposv2.prod.hulu.com/streamAppSavepointId: "0"
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
      creationTimestamp: null
      labels:
        capos_id: session-cluster-test
        stream-component: start-up
    spec:
      containers:
      - command:
        - ./bin/kubernetes-session.sh
        - -Dkubernetes.cluster-id=session-cluster-test
        image: cubox.prod.hulu.com/proxy/flink:1.12.2-scala_2.11-java8-stdout7
        imagePullPolicy: IfNotPresent
        name: flink-startup
        resources: {}
        securityContext:
          runAsUser: 
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /opt/flink/conf
          name: flink-config-volume
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: stream-app
      serviceAccountName: stream-app
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: flink-conf.yaml
            path: flink-conf.yaml
          - key: log4j.properties
            path: log4j.properties
          name: session-cluster-test-flink-config
        name: flink-config-volume
  ttlSecondsAfterFinished: 86400

The volume mount of the JobManager container that gets started has no log4j.properties:
volumes:
- configMap:
    defaultMode: 420
    items:
    - key: flink-conf.yaml
      path: flink-conf.yaml
    name: flink-config-session-cluster-test
  name: flink-config-volume

And the conf directory is indeed missing the log configuration.
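For comparison, this is the shape of the volume one would expect on the JobManager pod (a sketch only: in session mode Flink itself generates the flink-config ConfigMap and this mount, so this is what should appear, not something to apply by hand):

```yaml
volumes:
- configMap:
    defaultMode: 420
    items:
    - key: flink-conf.yaml
      path: flink-conf.yaml
    - key: log4j.properties          # the item that is missing above
      path: log4j.properties
    name: flink-config-session-cluster-test
  name: flink-config-volume
```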