Unsubscribe


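---- Original message ----
| From | love_h1...@126.com |
| Date | 2024-07-11 16:10 |
| To | user-zh@flink.apache.org |
| Cc | |
| Subject | Flink in HA mode: client job submission fails after restarting the ZK cluster |
Symptoms:
Flink 1.11.6, Standalone HA mode. The ZK cluster was rolling-restarted, and then multiple jobs were submitted with the flink run command from one node of the Flink cluster.
Some of the job submissions failed with the following exception: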
[Flink-DispatcherRestEndpoint-thread-2] - [WARN ] - 
[org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.createRpcInvocationMessage(line:290)]
 - Could not create remote rpc invocation message. Failing rpc invocation 
because...
java.io.IOException: The rpc invocation size 12532388 exceeds the maximum akka 
framesize.
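
As a rough check of the numbers in the exception above, the sketch below compares the reported invocation size with an assumed frame size limit. The limit used here is the default value of `akka.framesize` (10485760b); the message does not say what value the cluster actually configures, so treat it as an assumption. The helper name `exceeds_framesize` is hypothetical and only for illustration.

```python
# Size reported in the exception above, in bytes.
invocation_size = 12532388

# ASSUMPTION: the cluster uses the default akka.framesize of 10485760b (10 MiB).
# The configured value is not stated in the original message.
DEFAULT_FRAMESIZE = 10 * 1024 * 1024


def exceeds_framesize(size_bytes, framesize_bytes=DEFAULT_FRAMESIZE):
    """Return True if an RPC payload of size_bytes is larger than the frame size limit."""
    return size_bytes > framesize_bytes


# How many bytes the payload is over the assumed limit.
overshoot = invocation_size - DEFAULT_FRAMESIZE

print(exceeds_framesize(invocation_size), overshoot)
```

Under that assumption the payload is about 2 MB over the limit, which would be consistent with the invocation being sent as a remote RPC message rather than handled locally.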


Log information:
The JobManager log on node A of the cluster shows that it was granted leadership:
17:19:45,433 - [flink-akka.actor.default-dispatcher-22] - [INFO ] - 
[org.apache.flink.runtime.resourcemanager.ResourceManager.tryAcceptLeadership(line:1118)]
 - ResourceManager 
akka.tcp://flink@10.10.160.57:46746/user/rpc/resourcemanager_0 was granted 
leadership with fencing token ad84d46e902e0cf6da92179447af4e00
17:19:45,434 - [main-EventThread] - [INFO ] - 
[org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.grantLeadership(line:931)]
 - http://XXX:XXX was granted leadership with 
leaderSessionID=f60df688-372d-416b-a965-989a59b37feb
17:19:45,437 - [flink-akka.actor.default-dispatcher-22] - [INFO ] - 
[org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl.start(line:287)]
 - Starting the SlotManager.
17:19:45,480 - [main-EventThread] - [INFO ] - 
[org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.startInternal(line:97)]
 - Start SessionDispatcherLeaderProcess.XXX
17:19:45,489 - [cluster-io-thread-1] - [INFO ] - 
[org.apache.flink.runtime.rpc.akka.AkkaRpcService.startServer(line:232)] - 
Starting RPC endpoint for 
org.apache.flink.runtime.dispatcher.StandaloneDispatcher at 
akka://flink/user/rpc/dispatcher_1 .
17:19:45,495 - [flink-akka.actor.default-dispatcher-23] - [INFO ] - 
[org.apache.flink.runtime.resourcemanager.ResourceManager.registerTaskExecutorInternal(line:891)]
 - Registering TaskManager with ResourceID XXXXXX 
(akka.tcp://flink@X:XX/user/rpc/taskmanager_0) at ResourceManager

Two nodes in the Flink cluster (A and B) received job submission requests, and the logs on both nodes contain the following:
[flink-akka.actor.default-dispatcher-33] - [INFO ] - 
[org.apache.flink.runtime.jobmaster.JobMaster.connectToResourceManager(line:1107)]
 - Connecting to ResourceManager 
akka.tcp://flink@X.X.X.X:46746/user/rpc/resourcemanager_0(ad84d46e902e0cf6da92179447af4e00)
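The logs of 4 JobManager nodes in the cluster contain a "Start SessionDispatcherLeaderProcess" entry. On almost all of them it is followed by a "Stopping SessionDispatcherLeaderProcess" entry, but nodes A and B have no "Stopping SessionDispatcherLeaderProcess" entry.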
[main-EventThread] - [INFO ] - 
[org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.startInternal(line:97)]
 - Start SessionDispatcherLeaderProcess.
[Curator-ConnectionStateManager-0] - [INFO ] - 
[org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.closeInternal(line:134)]
 - Stopping SessionDispatcherLeaderProcess.



