Hi,

Could you confirm whether the last modification times of these HA files are identical or staggered?

Also, have you tried restoring from a specific chk- directory yet? Can the job be recovered that way?

Best,
Guojun

On Fri, Mar 10, 2023 at 11:56 AM guanyq <dlgua...@163.com> wrote:
> The Flink HA path is /tmp/flink/ha/
> The Flink checkpoint path is /tmp/flink/checkpoint
>
> At this point I am not sure whether it is the HA files that are corrupted or
> all of the checkpoints; that still needs to be verified by reproducing the
> scenario.
>
> Recovery is attempted from the 10 checkpoints; the log shows:
>
> 2023-03-07 18:37:43,703 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
> 2023-03-07 18:37:43,730 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 10 checkpoints in ZooKeeper.
> 2023-03-07 18:37:43,731 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 10 checkpoints from storage.
> 2023-03-07 18:37:43,731 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 7079.
> 2023-03-07 18:37:43,837 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 7080.
> 2023-03-07 18:37:43,868 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 7081.
> 2023-03-07 18:37:43,896 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 7082.
> 2023-03-07 18:37:43,906 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 7083.
> 2023-03-07 18:37:43,928 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 7084.
> 2023-03-07 18:37:43,936 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 7085.
> 2023-03-07 18:37:43,947 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 7086.
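As a side note on comparing the modification times: a minimal sketch, assuming the HA path and application ID quoted in this thread (adjust to your cluster), is simply to list the directory on HDFS:

```shell
# List the completedCheckpoint files under the HA path together with
# their last modification times, newest last.
# The application ID is taken from the logs in this thread.
hdfs dfs -ls -t -r /tmp/flink/ha/application_1678102326043_0007/
```

If all files carry the same timestamp despite the 5-second checkpoint interval, that would itself point at something unusual on the HDFS side.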
>
> The detailed log is below. I have omitted the repeated part that follows;
> the only difference between the attempts is that each one tries to start
> from a different
> /tmp/flink/ha/application_1678102326043_0007/completedCheckpointxxxxxx file:
>
> 2023-03-07 18:37:43,621 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl - Starting the SlotManager.
> 2023-03-07 18:37:43,621 INFO  org.apache.flink.runtime.jobmaster.JobMaster - Successfully ran initialization on master in 0 ms.
> 2023-03-07 18:37:43,660 INFO  org.apache.flink.runtime.util.ZooKeeperUtils - Initialized ZooKeeperCompletedCheckpointStore in '/checkpoints/3844b96b002601d932e66233dd46899c'.
> 2023-03-07 18:37:43,680 INFO  org.apache.flink.runtime.jobmaster.JobMaster - Using application-defined state backend: File State Backend (checkpoints: 'hdfs:/tmp/flink/checkpoint', savepoints: 'null', asynchronous: UNDEFINED, fileStateThreshold: -1)
> 2023-03-07 18:37:43,680 INFO  org.apache.flink.runtime.jobmaster.JobMaster - Configuring application-defined state backend with job/cluster config
> 2023-03-07 18:37:43,703 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
> 2023-03-07 18:37:43,730 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 10 checkpoints in ZooKeeper.
> 2023-03-07 18:37:43,731 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 10 checkpoints from storage.
> 2023-03-07 18:37:43,731 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 7079.
> 2023-03-07 18:37:43,837 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 7080.
> 2023-03-07 18:37:43,868 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 7081.
> 2023-03-07 18:37:43,896 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 7082.
> 2023-03-07 18:37:43,906 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 7083.
> 2023-03-07 18:37:43,928 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 7084.
> 2023-03-07 18:37:43,936 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 7085.
> 2023-03-07 18:37:43,947 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 7086.
> 2023-03-07 18:37:43,979 WARN  org.apache.hadoop.hdfs.BlockReaderFactory - I/O error constructing remote block reader.
> java.io.IOException: Got error, status message opReadBlock
> BP-1003103929-192.168.200.11-1668473836936:blk_1301252639_227512278
> received exception org.apache.hadoop.hdfs.server.datanode.CorruptMetaHeaderException:
> The meta file length 0 is less than the expected length 7, for
> OP_READ_BLOCK, self=/192.168.200.23:45534, remote=/192.168.200.21:9866,
> for file /tmp/flink/ha/application_1678102326043_0007/completedCheckpoint58755403e33a,
> for pool BP-1003103929-192.168.200.11-1668473836936 block 1301252639_227512278
>     at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:142)
>     at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:456)
>     at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:424)
>     at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:818)
>     at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:697)
>     at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:355)
>     at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:665)
>     at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:874)
>     at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:926)
>     at java.io.DataInputStream.read(DataInputStream.java:149)
>     at org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:94)
>     at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2663)
>     at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2679)
>     at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3156)
>     at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
>     at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
>     at org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.<init>(InstantiationUtil.java:69)
>     at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:572)
>     at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:555)
>     at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:58)
>     at org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore.retrieveCompletedCheckpoint(ZooKeeperCompletedCheckpointStore.java:339)
>     at org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore.recover(ZooKeeperCompletedCheckpointStore.java:175)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedState(CheckpointCoordinator.java:1070)
>     at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:234)
>     at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:216)
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:120)
>     at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:105)
>     at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:278)
>     at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:266)
>     at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
>     at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
>     at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:146)
>     at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:84)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:381)
>     at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
>     at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
>     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> On 2023-03-10 11:00:58, "Weihua Hu" <huweihua....@gmail.com> wrote:
> > Hi,
> >
> > In general, an abnormal power-down of the YARN cluster alone should not
> > affect already-completed historical checkpoints (only the most recent
> > checkpoint may have been written to HDFS abnormally).
> >
> > Do you have more detailed JobManager logs? It would help to first confirm
> > how many completedCheckpoints Flink retrieved during recovery, and which
> > cp it ultimately tried to restore from.
> >
> > You can also try what Yanfei suggested: restore from a historical cp,
> > specified as a savepoint.
> >
> > Best,
> > Weihua
> >
> > On Fri, Mar 10, 2023 at 10:38 AM guanyq <dlgua...@163.com> wrote:
> >
> >> Incremental checkpointing is not enabled.
> >> The conclusion that the files are corrupted comes from the startup log:
> >> the job tried to start from the 10 checkpoints, and every attempt failed
> >> because of the corrupted block below.
> >> The error log is:
> >>
> >> java.io.IOException: Got error, status message opReadBlock
> >> BP-1003103929-192.168.200.11-1668473836936:blk_1301252639_227512278
> >> received exception org.apache.hadoop.hdfs.server.datanode.CorruptMetaHeaderException:
> >> The meta file length 0 is less than the expected length 7, for
> >> OP_READ_BLOCK, self=/ip:45534, remote=/ip:9866, for file
> >> /tmp/flink/ha/application_1678102326043_0007/completedCheckpoint58755403e33a,
> >> for pool BP-1003103929-192.168.200.11-1668473836936 block 1301252639_227512278
> >>     at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:142)
> >>     at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:456)
> >>     at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:424)
> >>
> >> On 2023-03-10 10:26:11, "Yanfei Lei" <fredia...@gmail.com> wrote:
> >> > Hi, you can try looking under the configured checkpoint dir for the
> >> > historical chk-x directories; if a historical chk-x is still there,
> >> > you can try manually specifying that chk when restarting the job [1].
> >> >
> >> >> The Flink job keeps 10 checkpoints at a 5-second interval; after a
> >> >> sudden power outage, why are all of the checkpoints corrupted?
> >> >
> >> > Is incremental checkpointing enabled for this job? With incremental
> >> > checkpoints enabled, if a file belonging to an earlier checkpoint is
> >> > corrupted, every later checkpoint that builds on it incrementally is
> >> > affected as well.
> >> >
> >> >> As for the checkpoint persistence mechanism, this should be related
> >> >> to HDFS writes: the Flink checkpoint succeeds, yet HDFS does not
> >> >> persist it to disk.
> >> >
> >> > Do you mean that you observe no files under the checkpoint dir?
> >> >
> >> > [1] https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/ops/state/savepoints/#resuming-from-savepoints
> >> >
> >> > guanyq <dlgua...@163.com> wrote on Fri, Mar 10, 2023 at 08:58:
> >> >>
> >> >> I am also considering using savepoints to deal with the abnormal
> >> >> power-down problem, but one thing still puzzles me:
> >> >> the job keeps 10 checkpoints at a 5-second interval; after the sudden
> >> >> power outage, why are all 10 of them corrupted?
> >> >> It is strange enough that I wonder whether none of the 10 checkpoints
> >> >> were ever flushed to disk.
> >> >> So I would like to understand the checkpoint persistence mechanism.
> >> >> It should be related to how HDFS writes data: the Flink checkpoint
> >> >> reports success, yet HDFS has not persisted the data to disk.
> >> >>
> >> >> On 2023-03-10 08:47:11, "Shammon FY" <zjur...@gmail.com> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > When recovery fails, I think it is hard for the Flink job itself to
> >> >> > determine that the failure was caused by something like corrupted
> >> >> > checkpoint file blocks. If you have taken a savepoint for this job,
> >> >> > you can try restoring the job from that savepoint.
> >> >> >
> >> >> > Best,
> >> >> > Shammon
> >> >> >
> >> >> > On Thu, Mar 9, 2023 at 10:06 PM guanyq <dlgua...@163.com> wrote:
> >> >> >
> >> >> >> Background:
> >> >> >> 1. Flink is configured for high availability.
> >> >> >> 2. Flink is configured to retain 10 checkpoints.
> >> >> >> 3. The YARN cluster is configured to recover jobs.
> >> >> >> Question:
> >> >> >> After the YARN cluster loses power and is restarted, when the Flink
> >> >> >> job is recovered, if the most recent checkpoint has blocks corrupted
> >> >> >> by the outage, will Flink try to start from one of the other
> >> >> >> checkpoints?
> >> >
> >> > --
> >> > Best,
> >> > Yanfei
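A follow-up note on verifying how widespread the corruption actually is: HDFS itself can report which files have corrupt or missing blocks, which would settle whether only the HA metadata files are damaged or the checkpoint data as well. A minimal sketch, assuming the paths quoted in this thread:

```shell
# Report file/block health under the Flink HA path (paths from this thread).
hdfs fsck /tmp/flink/ha/application_1678102326043_0007 -files -blocks

# List only the files that have corrupt blocks under the checkpoint path.
hdfs fsck /tmp/flink/checkpoint -list-corruptfileblocks
```

If fsck reports the checkpoint files as healthy and only the completedCheckpoint HA files as corrupt, restoring from a retained chk-x directory (as suggested above) should still be possible.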
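For reference, resuming from a retained checkpoint as described in [1] looks roughly like the following. This is a sketch under assumptions: the checkpoint directory and job ID are inferred from the logs in this thread (the ZooKeeper path '/checkpoints/3844b96b002601d932e66233dd46899c' suggests the job ID, and retained checkpoints normally live under <checkpoint-dir>/<job-id>/chk-<n>), and the jar name is a placeholder:

```shell
# Resume a job from a specific retained checkpoint; chk-<n> is the directory
# containing the _metadata file. Job ID and checkpoint number are taken from
# this thread; my-job.jar is a placeholder for the actual application jar.
flink run -d \
  -s hdfs:///tmp/flink/checkpoint/3844b96b002601d932e66233dd46899c/chk-7086 \
  my-job.jar
```

If chk-7086 turns out to be corrupted as well, the same command can be retried with an older chk-<n> directory that fsck reports as healthy.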