看了下代码,这个问题有可能的原因是: 1. flink是先创建chk目录,然后再打 Triggering checkpoint 的 log 的,所以有概率是目录创建了,但是log没输出trigger 2. 作业失败,和触发下一个cp,这是两个异步线程,所以有可能是先执行了创建25548目录的操作然后作业再失败,然后trigger 25548还没输出就退了。
版本1.14.5之后代码已经把上述1行为改了,先打log再创建目录,就不会有这样奇怪的问题了。 On Thu, Jan 11, 2024 at 3:03 PM 吴先生 <15951914...@163.com> wrote: > TM日志: > 2023-12-31 18:50:11.180 [flink-akka.actor.default-dispatcher-26] INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor - Un-registering task > and sending final execution state CANCELED to JobManager for task > ChargeRangeBroadcastFunction -> Timestamps/Watermarks (4/6)#0 > e960208bbd95b1b219bafe4887b48392. > 2023-12-31 18:50:11.232 [Flink Netty Server (288) Thread 0] ERROR > o.a.flink.runtime.io.network.netty.PartitionRequestQueue - Encountered > error while consuming partitions > java.nio.channels.ClosedChannelException: null > at > org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:606) > at > org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.close(DefaultChannelPipeline.java:1352) > at > org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:622) > at > org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:606) > at > org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:472) > at > org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.close(DefaultChannelPipeline.java:957) > at > org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.close(AbstractChannel.java:232) > at org.apache.flink.runtime.io > .network.netty.PartitionRequestQueue.close(PartitionRequestQueue.java:134) > at org.apache.flink.runtime.io > .network.netty.PartitionRequestServerHandler.channelRead0(PartitionRequestServerHandler.java:160) > at org.apache.flink.runtime.io > .network.netty.PartitionRequestServerHandler.channelRead0(PartitionRequestServerHandler.java:47) > at > org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) > at > org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324) > at > org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296) > at > org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) > at > org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) > at > org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) > at > org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) > at java.lang.Thread.run(Thread.java:748) > > > JM日志,没有25548的触发记录: > 2023-12-31 18:39:10.664 [jobmanager-future-thread-20] INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 25546 for job d12f3c6e836f56fb23d96e31737ff0b3 (411347921 bytes > in 50128 ms). > 2023-12-31 18:40:10.681 [Checkpoint Timer] INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering > checkpoint 25547 (type=CHECKPOINT) @ 1704019210665 for job > d12f3c6e836f56fb23d96e31737ff0b3. > 2023-12-31 18:50:10.681 [Checkpoint Timer] INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint > 25547 of job d12f3c6e836f56fb23d96e31737ff0b3 expired before completing. > 2023-12-31 18:50:10.698 [flink-akka.actor.default-dispatcher-3] INFO > org.apache.flink.runtime.jobmaster.JobMaster - Trying to recover from a > global failure. > org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable > failure threshold. > at > org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:90) > at > org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:65) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1760) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1733) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:93) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1870) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > > > > > checkpoing路径下有: > 25546:正常 > 25547:无 > 25548:有,路径下为空 > > > > > 任务人为从25548恢复时失败,抛出异常找不到_metadate文件 > > > | | > 吴先生 > | > | > 15951914...@163.com > | > ---- 回复的原邮件 ---- > | 发件人 | Xuyang<xyzhong...@163.com> | > | 发送日期 | 2024年1月11日 14:55 | > | 收件人 | <user-zh@flink.apache.org> | > | 主题 | Re:回复: flink-checkpoint 问题 | > Hi, 你的图挂了,可以用图床处理一下,或者直接贴log。 > > > > > -- > > Best! > Xuyang > > > > > 在 2024-01-11 13:40:43,"吴先生" <15951914...@163.com> 写道: > > JM中chk失败时间点日志,没有25548的触发记录: > > > 自动recovery失败: > > > TM日志: > > > checkpoint文件路径,25548里面空的: > > > | | > 吴先生 > | > | > 15951914...@163.com > | > ---- 回复的原邮件 ---- > | 发件人 | Zakelly Lan<zakelly....@gmail.com> | > | 发送日期 | 2024年1月10日 18:20 | > | 收件人 | <user-zh@flink.apache.org> | > | 主题 | Re: flink-checkpoint 问题 | > 你好, > 方便的话贴一下jobmanager的log吧,应该有一些线索 > > > On Wed, Jan 10, 2024 at 5:55 PM 吴先生 <15951914...@163.com> wrote: > > Flink版本: 1.12 > checkpoint配置:hdfs > > > 现象:作业由于一些因素第N个checkpoint失败,导致任务重试,任务重试失败,hdfs中不存在第N个chk路径,但是为什么会出现一个第N+1的chk路径,且这个路径下是空的 > > >