Re:flink 使用yarn部署,报错:Maximum Memory: 8192 Requested: 10240MB. Please check the 'yarn.scheduler.maximum-allocation-mb' and the 'yarn.nodemanager.resource.memory-mb' configuration values
超过了 yarn 容器 配置吧 At 2021-03-20 10:57:23, "william" <712677...@qq.com> wrote: >org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't >deploy Yarn session cluster >at >org.apache.flink.yarn.YarnClusterDescriptor.deploySessionCluster(YarnClusterDescriptor.java:425) >at >org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:606) >at >org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$4(FlinkYarnSessionCli.java:860) >at java.security.AccessController.doPrivileged(Native Method) >at javax.security.auth.Subject.doAs(Subject.java:422) >at >org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) >at >org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) >at >org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:860) >Caused by: >org.apache.flink.yarn.YarnClusterDescriptor$YarnDeploymentException: The >cluster does not have the requested resources for the TaskManagers >available! >Maximum Memory: 8192 Requested: 10240MB. Please check the >'yarn.scheduler.maximum-allocation-mb' and the >'yarn.nodemanager.resource.memory-mb' configuration values > > > > >-- >Sent from: http://apache-flink.147419.n8.nabble.com/
flink1.12 Standalone模式发送python脚本任务报错: java.lang.ClassNotFoundException: org.apache.flink.connector.jdbc.table.JdbcRowDataInputFormat
flink1.12.2 部署standalone集群模式,任务是pyflink实现链接Mysql数据库完成计算任务: 1. 已在 /user/local/flink-1.12.2/lib目录下,添加相关依赖: mysql-connector-java-8.0.12.jar, flink-connector-jdbc_2.11-1.12.0.jar, flink-table-api-java-1.12.0.jar 2.发送任务命令: bin/flink run -py ../test.py -p 8 3.附报错信息如下;在线等路过部署过的大佬,指点一下~ 谢谢! Traceback (most recent call last): File "../test.py", line 57, in env.execute('Test') File "/usr/local/env/flink1.12_py3_env/flink-1.12.2/opt/python/pyflink.zip/pyflink/table/table_environment.py", line 1276, in execute File "/usr/local/env/flink1.12_py3_env/flink-1.12.2/opt/python/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__ File "/usr/local/env/flink1.12_py3_env/flink-1.12.2/opt/python/pyflink.zip/pyflink/util/exceptions.py", line 147, in deco File "/usr/local/env/flink1.12_py3_env/flink-1.12.2/opt/python/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o5.execute. : org.apache.flink.util.FlinkException: Failed to execute job 'Pyflink1.12_Query_Time_Test'. at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1918) at org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:135) at org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76) at org.apache.flink.table.planner.delegation.ExecutorBase.execute(ExecutorBase.java:50) at org.apache.flink.table.api.internal.TableEnvironmentImpl.execute(TableEnvironmentImpl.java:1277) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.flink.api.python.shaded.py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at org.apache.flink.api.python.shaded.py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at org.apache.flink.api.python.shaded.py4j.Gateway.invoke(Gateway.java:282) at org.apache.flink.api.python.shaded.py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at org.apache.flink.api.python.shaded.py4j.commands.CallCommand.execute(CallCommand.java:79) at org.apache.flink.api.python.shaded.py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.RuntimeException: org.apache.flink.runtime.client.JobInitializationException: Could not instantiate JobManager. at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:316) at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75) at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:457) at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) Caused by: org.apache.flink.runtime.client.JobInitializationException: Could not instantiate JobManager. at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:494) at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.flink.runtime.client.JobExecutionException: Cannot initialize task 'Source: TableSourceScan(table=[[default_catalog, default_database, TP_GL_DAY, project=[DAY_ID]]], fields=[DAY_ID])': Loading the input/output formats failed: at org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:239) at org.apache.flink.runtime.scheduler.SchedulerBase.createExecutionGraph(SchedulerBase.java:322) at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:276) at org.apache.flink.runtime.scheduler.SchedulerBase.(SchedulerBase.java:249) at org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:133) at org.apache.flink.runtime
Re: Flink 1.12.0 隔几个小时Checkpoint就会失败
从日志看 checkpoint 超时了,可以尝试看一下是哪个算子的哪个并发没有做完 checkpoint,可以看看这篇文章[1] 能否帮助你 [1] https://www.infoq.cn/article/g8ylv3i2akmmzgccz8ku Best, Congxian Frost Wong 于2021年3月18日周四 下午12:28写道: > 哦哦,我看到了有个 > > setTolerableCheckpointFailureNumber > > 之前不知道有这个方法,倒是可以试一下,不过我就是不太理解为什么会失败,也没有任何报错 > > 发件人: yidan zhao > 发送时间: 2021年3月18日 3:47 > 收件人: user-zh > 主题: Re: Flink 1.12.0 隔几个小时Checkpoint就会失败 > > 设置下检查点失败不影响任务呀,你这貌似还导致任务重启了? > > Frost Wong 于2021年3月18日周四 上午10:38写道: > > > Hi 大家好 > > > > 我用的Flink on yarn模式运行的一个任务,每隔几个小时就会出现一次错误 > > > > 2021-03-18 08:52:37,019 INFO > > org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - > Completed > > checkpoint 661818 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5 (562357 bytes > in > > 4699 ms). > > 2021-03-18 08:52:37,637 INFO > > org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - > > Triggering checkpoint 661819 (type=CHECKPOINT) @ 1616028757520 for job > > 4fa72fc414f53e5ee062f9fbd5a2f4d5. > > 2021-03-18 08:52:42,956 INFO > > org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - > Completed > > checkpoint 661819 for job 4fa72fc414f53e5ee062f9fbd5a2f4d5 (2233389 bytes > > in 4939 ms). > > 2021-03-18 08:52:43,528 INFO > > org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - > > Triggering checkpoint 661820 (type=CHECKPOINT) @ 1616028763457 for job > > 4fa72fc414f53e5ee062f9fbd5a2f4d5. > > 2021-03-18 09:12:43,528 INFO > > org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - > > Checkpoint 661820 of job 4fa72fc414f53e5ee062f9fbd5a2f4d5 expired before > > completing. > > 2021-03-18 09:12:43,615 INFO > > org.apache.flink.runtime.jobmaster.JobMaster [] - Trying > to > > recover from a global failure. > > org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint > tolerable > > failure threshold. > > at > > > org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:90) > > ~[flink-dist_2.12-1.12.0.jar:1.12.0] > > at > > > org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:65) > > ~[flink-dist_2.12-1.12.0.jar:1.12.0] > > at > > > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1760) > > ~[flink-dist_2.12-1.12.0.jar:1.12.0] > > at > > > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1733) > > ~[flink-dist_2.12-1.12.0.jar:1.12.0] > > at > > > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:93) > > ~[flink-dist_2.12-1.12.0.jar:1.12.0] > > at > > > org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1870) > > ~[flink-dist_2.12-1.12.0.jar:1.12.0] > > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > > ~[?:1.8.0_231] > > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ~[?:1.8.0_231] > > at > > > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > > ~[?:1.8.0_231] > > at > > > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > > ~[?:1.8.0_231] > > at > > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > ~[?:1.8.0_231] > > at > > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > ~[?:1.8.0_231] > > at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_231] > > 2021-03-18 09:12:43,618 INFO > > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job > > csmonitor_comment_strategy (4fa72fc414f53e5ee062f9fbd5a2f4d5) switched > from > > state RUNNING to RESTARTING. > > 2021-03-18 09:12:43,619 INFO > > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat > Map > > (43/256) (18dec1f23b95f741f5266594621971d5) switched from RUNNING to > > CANCELING. > > 2021-03-18 09:12:43,622 INFO > > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat > Map > > (44/256) (3f2ec60b2f3042ceea6e1d660c78d3d7) switched from RUNNING to > > CANCELING. > > 2021-03-18 09:12:43,622 INFO > > org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Flat > Map > > (45/256) (66d411c2266ab025b69196dfec30d888) switched from RUNNING to > > CANCELING. > > 然后就自己恢复了。用的是Unaligned > > > Checkpoint,rocksdb存储后端,在这个错误前后也没有什么其他报错信息。从Checkpoint的metrics看,总是剩最后一个无法完成,调整过parallelism也无法解决问题。 > > > > 谢谢大家! > > >
flink1.11.2集群出现了3种连接拒绝,导致任务失败
你好: 最近建立一个3台机子的flink集群,版本是 zk-3.6.2 + hadoop-3.3.0 + flink-1.11.2。3台机制是在同一个物理机上建立的虚拟机,应该来说不会出现网络波动导致的网络拒绝,但是为什么一直会出现网络拒绝 项目在运行一段时间以后,短则几个小时,长则3到5天,任务就会挂掉,一共出现了一下3种异常,全是网络连接方法的,请帮忙看看,是不是flink网络配置方面有问题。 1. 集群之间通信连接拒绝: 2021-03-03 08:50:42,851 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(ProcessingTimeSessionWindows(9), ProcessingTimeTrigger, FlightTrackAggregate, FlightTrackSectorResult) -> Sink: Unnamed (4/4) (3097c00c09b475b35c23782a3b4a8eaa) switched from RUNNING to FAILED on org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@4df09503. org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection reset by peer (connection to '10.100.1.222/10.100.1.222:43156') at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:173) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:297) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:276) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:268) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1388) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:297) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:276) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:918) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:730) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:820) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:424) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:326) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_152] Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer 2. 连接到ZK的请求失败, 2021-03-02 20:27:13,487 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Unable to read additional data from server sessionid 0x30018710a580007, likely server has closed socket, closing socket connection and attempting reconnect 2021-03-02 20:27:13,588 INFO org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager [] - State change: SUSPENDED 2021-03-02 20:27:13,590 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService [] - Connection to ZooKeeper suspended. The contender LeaderContender: DefaultDispatcherRunner no longer participates in the leader election. 2021-03-02 20:27:13,591 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper. 2021-03-02 20:27:13,591 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService [] - Connection to ZooKeeper suspended. The contender LeaderContender: StandaloneResourceManager no longer participates in the leader election. 2021-03-02 20:27:13,591 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService [] - Connection to ZooKeeper suspended. The contender http://flink-02:8081 no longer participates in the leader election. 2021-03-02 20:27:13,591 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper. 2021-03-02 20:27:13,830 WARN org.apache.flink.shade
flink1.11.2集群出现了3种连接拒绝,导致任务失败
你好: 最近建立一个3台机子的flink集群,版本是 zk-3.6.2 + hadoop-3.3.0 + flink-1.11.2。3台机制是在同一个物理机上建立的虚拟机,应该来说不会出现网络波动导致的网络拒绝,但是为什么一直会出现网络拒绝 项目在运行一段时间以后,短则几个小时,长则3到5天,任务就会挂掉,一共出现了一下3种异常,全是网络连接方法的,请帮忙看看,是不是flink网络配置方面有问题。 1. 集群之间通信连接拒绝: 2021-03-03 08:50:42,851 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(ProcessingTimeSessionWindows(9), ProcessingTimeTrigger, FlightTrackAggregate, FlightTrackSectorResult) -> Sink: Unnamed (4/4) (3097c00c09b475b35c23782a3b4a8eaa) switched from RUNNING to FAILED on org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@4df09503. org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection reset by peer (connection to '10.100.1.222/10.100.1.222:43156') at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:173) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:297) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:276) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:268) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1388) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:297) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:276) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:918) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.handleReadException(AbstractEpollStreamChannel.java:730) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:820) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:424) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:326) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.11-1.11.2.jar:1.11.2] at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_152] Caused by: org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer 2. 连接到ZK的请求失败, 2021-03-02 20:27:13,487 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Unable to read additional data from server sessionid 0x30018710a580007, likely server has closed socket, closing socket connection and attempting reconnect 2021-03-02 20:27:13,588 INFO org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager [] - State change: SUSPENDED 2021-03-02 20:27:13,590 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService [] - Connection to ZooKeeper suspended. The contender LeaderContender: DefaultDispatcherRunner no longer participates in the leader election. 2021-03-02 20:27:13,591 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper. 2021-03-02 20:27:13,591 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService [] - Connection to ZooKeeper suspended. The contender LeaderContender: StandaloneResourceManager no longer participates in the leader election. 2021-03-02 20:27:13,591 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService [] - Connection to ZooKeeper suspended. The contender http://flink-02:8081 no longer participates in the leader election. 2021-03-02 20:27:13,591 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper. 2021-03-02 20:27:13,830 WARN org.apache.flink.shade
Flink job manager HA 是否可以像 Hadoop Name Node 一样手动重启?
Flink job manager HA 是否可以像 Hadoop Name Node 一样手动重启,同时保证集群正常运行? 我发现 job manager 占用内存似乎总是在缓慢不断增长,Hadoop Name Node 也有这个问题,我通过隔一段时间轮动重启Hadoop Name Node 解决这个问题,在HA模式下Flink job manager 是否可以轮动重启? -- Sent from: http://apache-flink.147419.n8.nabble.com/
Re: flink 1.11.2 使用rocksdb时出现org.apache.flink.util.SerializedThrowable错误
分享下原因呗。 hdxg1101300...@163.com 于2021年3月21日周日 上午12:36写道: > 知道原因了 > > > > hdxg1101300...@163.com > > 发件人: hdxg1101300...@163.com > 发送时间: 2021-03-20 22:07 > 收件人: user-zh > 主题: flink 1.11.2 使用rocksdb时出现org.apache.flink.util.SerializedThrowable错误 > 你好: > 最近升级flink版本从flink 1.10.2 升级到flink.1.11.2;主要是考虑日志太大查看不方便的原因; > 代码没有变动只是从1.10.2.编译为1.11.2 ,集群客户端版本升级到1.11.2;任务提交到yarn 使用per job方式; > 之前时一个taskmanager一个slot,现在使用一个taskmanager 2个slot;程序运行一段时间(1个小时左右)后就会出现 > Caused by: org.apache.flink.util.SerializedThrowable > > org.apache.flink.runtime.checkpoint.CheckpointException: Could not > complete snapshot 53 for operator Sink: 发送短信 (5/8). Failure reason: > Checkpoint was declined. > at > org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:215) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:156) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:314) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:614) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:540) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:507) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:266) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$8(StreamTask.java:921) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:911) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:879) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.io.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:113) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.io.CheckpointBarrierAligner.processBarrier(CheckpointBarrierAligner.java:198) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:93) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:158) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:67) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:351) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxStep(MailboxProcessor.java:191) > [flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:181) > [flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:566) > [flink-dist_2.11-1.11.2.jar:1.11.2] > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:536) > [flink-dist_2.11-1.11.2.jar:1.11.2] > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721) > [flink-dist_2.11-1.11.2.jar:1.11.2] > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546) > [flink-dist_2.11-1.11.2.jar:1.11.2] > at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181] > Caused by: org.apache.flink.util.SerializedThrowable > at com.com.functions.Transaction.firstPhase(Transaction.java:193) > ~[dc_cbssbroadband-1.0.4.3.1-jar-with-dependencies.jar:?] > at com.com.functions.TransactionData.flush(TransactionData.java:37) > ~[dc_cbssbroadband-1.0.4.3.1-jar-with-dependencies.jar:?] > at com.com.utils.TwoPhaseHttpSink.preCommit(TwoPhaseHttpSink.java:105) > ~[dc_cbssbroadband-1.0.4.3.1-jar-with-dependencies.jar:?] > at com.com.utils.TwoPhaseHttpSink.preCommit(TwoPhaseHttpSink.java:39) > ~[dc_cbssbroadband-1.0.4.3.1-jar-with-dependencies.jar:?] > at > org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.snapshotState(TwoPhaseCommitSinkFunction.java:321) > ~[flink-dist_2.11-1.11.2.jar:1.11.2] >
Re: flink 使用yarn部署,报错:Maximum Memory: 8192 Requested: 10240MB. Please check the 'yarn.scheduler.maximum-allocation-mb' and the 'yarn.nodemanager.resource.memory-mb' configuration values
报错信息里已经说明了:你的 Yarn 集群配置允许的最大 container 是 2g,而你的 flink 配置的 TM 大小是 10g。 Thank you~ Xintong Song On Sat, Mar 20, 2021 at 7:52 PM william <712677...@qq.com> wrote: > org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't > deploy Yarn session cluster > at > > org.apache.flink.yarn.YarnClusterDescriptor.deploySessionCluster(YarnClusterDescriptor.java:425) > at > > org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:606) > at > > org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$4(FlinkYarnSessionCli.java:860) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754) > at > > org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at > > org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:860) > Caused by: > org.apache.flink.yarn.YarnClusterDescriptor$YarnDeploymentException: The > cluster does not have the requested resources for the TaskManagers > available! > Maximum Memory: 8192 Requested: 10240MB. Please check the > 'yarn.scheduler.maximum-allocation-mb' and the > 'yarn.nodemanager.resource.memory-mb' configuration values > > > > > -- > Sent from: http://apache-flink.147419.n8.nabble.com/ >
flink cdc 在做全量同步时Lock wait timeout
2021-03-22 09:33:19.554 [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Flink Streaming Job (e0f59fb577ea3d275451163cf5cc479b) switched from state RUNNING to FAILING. org.apache.flink.runtime.JobException: Recovery is suppressed by FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=1, backoffTimeMS=1) at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116) ~[flink-dist_2.11-1.11.0.jar:1.11.0] at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78) ~[flink-dist_2.11-1.11.0.jar:1.11.0] at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192) ~[flink-dist_2.11-1.11.0.jar:1.11.0] at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185) ~[flink-dist_2.11-1.11.0.jar:1.11.0] at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179) ~[flink-dist_2.11-1.11.0.jar:1.11.0] at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503) ~[flink-dist_2.11-1.11.0.jar:1.11.0] at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:392) ~[flink-dist_2.11-1.11.0.jar:1.11.0] at sun.reflect.GeneratedMethodAccessor64.invoke(Unknown Source) ~[?:?] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_232] at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_232] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:284) ~[flink-dist_2.11-1.11.0.jar:1.11.0] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199) ~[flink-dist_2.11-1.11.0.jar:1.11.0] at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.0.jar:1.11.0] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.0.jar:1.11.0] at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.0.jar:1.11.0] at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.0.jar:1.11.0] at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.0.jar:1.11.0] at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.0.jar:1.11.0] at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.0.jar:1.11.0] at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.0.jar:1.11.0] at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.0.jar:1.11.0] at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.0.jar:1.11.0] at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.0.jar:1.11.0] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.0.jar:1.11.0] at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.0.jar:1.11.0] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.0.jar:1.11.0] at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.0.jar:1.11.0] at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.0.jar:1.11.0] at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.0.jar:1.11.0] at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.0.jar:1.11.0] at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.0.jar:1.11.0] at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.0.jar:1.11.0] Caused by: org.apache.kafka.connect.errors.ConnectException: Lock wait timeout exceeded; try restarting transaction Error code: 1205; SQLSTATE: 40001. at io.debezium.connector.mysql.AbstractReader.wrap(AbstractReader.java:230) ~[1616376681494-flink-connector-mysql-cdc.jar:?] at io.debezium.connector.mysql.AbstractReader.failed(AbstractReader.java:207) ~[1616376681494-flink-connector-mysql-cdc.jar:?] at io.debezium.connector.mysql.SnapshotReader.execute(SnapshotReader.java:831) ~[1616376681494-flink-connector-mysql-cdc.jar:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_232] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo