Unsubscribe
---- Original message ----
| From | love_h1...@126.com |
| Date | 2024-07-11 16:10 |
| To | user-zh@flink.apache.org |
| Cc | |
| Subject | Flink in HA mode: client job submission fails after restarting the ZK cluster |

Symptoms:
Flink 1.11.6, Standalone HA mode. The ZK cluster was restarted in a rolling fashion. Multiple jobs were then submitted with the flink run command from one node of the Flink cluster. Some of the submissions failed with the following exception:

[Flink-DispatcherRestEndpoint-thread-2] - [WARN ] - [org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.createRpcInvocationMessage(line:290)] - Could not create remote rpc invocation message. Failing rpc invocation because...
java.io.IOException: The rpc invocation size 12532388 exceeds the maximum akka framesize.

Log information:
The JobManager log on node A of the cluster shows that it was granted leadership:

17:19:45,433 - [flink-akka.actor.default-dispatcher-22] - [INFO ] - [org.apache.flink.runtime.resourcemanager.ResourceManager.tryAcceptLeadership(line:1118)] - ResourceManager akka.tcp://flink@10.10.160.57:46746/user/rpc/resourcemanager_0 was granted leadership with fencing token ad84d46e902e0cf6da92179447af4e00
17:19:45,434 - [main-EventThread] - [INFO ] - [org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.grantLeadership(line:931)] - http://XXX:XXX was granted leadership with leaderSessionID=f60df688-372d-416b-a965-989a59b37feb
17:19:45,437 - [flink-akka.actor.default-dispatcher-22] - [INFO ] - [org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl.start(line:287)] - Starting the SlotManager.
17:19:45,480 - [main-EventThread] - [INFO ] - [org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.startInternal(line:97)] - Start SessionDispatcherLeaderProcess.XXX
17:19:45,489 - [cluster-io-thread-1] - [INFO ] - [org.apache.flink.runtime.rpc.akka.AkkaRpcService.startServer(line:232)] - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/rpc/dispatcher_1 .
17:19:45,495 - [flink-akka.actor.default-dispatcher-23] - [INFO ] - [org.apache.flink.runtime.resourcemanager.ResourceManager.registerTaskExecutorInternal(line:891)] - Registering TaskManager with ResourceID XXXXXX (akka.tcp://flink@X:XX/user/rpc/taskmanager_0) at ResourceManager

Two nodes in the Flink cluster (A and B) received job submission requests. The logs of both nodes contain the following entry:

[flink-akka.actor.default-dispatcher-33] - [INFO ] - [org.apache.flink.runtime.jobmaster.JobMaster.connectToResourceManager(line:1107)] - Connecting to ResourceManager akka.tcp://flink@X.X.X.X:46746/user/rpc/resourcemanager_0(ad84d46e902e0cf6da92179447af4e00)

The logs of 4 JobManager nodes in the cluster contain a Start SessionDispatcherLeaderProcess entry, and in almost every case it is followed by a Stopping SessionDispatcherLeaderProcess entry. Nodes A and B, however, have no Stopping SessionDispatcherLeaderProcess entry:

[main-EventThread] - [INFO ] - [org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.startInternal(line:97)] - Start SessionDispatcherLeaderProcess.
[Curator-ConnectionStateManager-0] - [INFO ] - [org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.closeInternal(line:134)] - Stopping SessionDispatcherLeaderProcess.
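
The Start/Stopping comparison described above could be checked across the JobManager logs with a small script. What follows is only a rough sketch: the function name, the node labels C and D, and the shortened sample log lines are all made up for illustration, and it assumes each node's log lines are already loaded in chronological order. It is not part of Flink.

```python
# Rough sketch: find JobManager nodes whose latest
# "Start SessionDispatcherLeaderProcess" entry has no later
# "Stopping SessionDispatcherLeaderProcess" entry.
# The function name and the sample data are hypothetical.

START = "Start SessionDispatcherLeaderProcess"
STOP = "Stopping SessionDispatcherLeaderProcess"


def nodes_with_open_leader_process(logs_by_node):
    """Return the sorted node names whose last Start is not followed by a Stopping."""
    open_nodes = []
    for node, lines in logs_by_node.items():
        running = False
        for line in lines:  # lines are assumed to be in chronological order
            if STOP in line:
                running = False
            elif START in line:
                running = True
        if running:
            open_nodes.append(node)
    return sorted(open_nodes)


# Shortened, made-up log lines that mirror the pattern reported in the message.
start_line = "[main-EventThread] - [INFO ] - Start SessionDispatcherLeaderProcess."
stop_line = "[Curator-ConnectionStateManager-0] - [INFO ] - Stopping SessionDispatcherLeaderProcess."

sample_logs = {
    "A": [start_line],
    "B": [start_line],
    "C": [start_line, stop_line],
    "D": [start_line, stop_line],
}

print(nodes_with_open_leader_process(sample_logs))  # prints ['A', 'B']
```

If more than one node is returned, as with A and B here, more than one JobManager appears to consider its dispatcher leader process still running, which matches the observation reported above.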