回复:Flink 1.12.0 隔几个小时Checkpoint就会失败
唉,这个问题着实让人头大,我现在还没找到原因。你这边确定了跟我说一声哈😊 | | 田向阳 | | 邮箱:lucas_...@163.com | 签名由 网易邮箱大师 定制 在2021年04月22日 20:56,张锴 写道: 你好,我也遇到了这个问题,你的checkpoint是怎么配置的,可以参考一下吗 Haihang Jing 于2021年3月23日周二 下午8:04写道: > 你好,问题定位到了吗? > 我也遇到了相同的问题,感觉和checkpoint interval有关 > 我有两个相同的作业(checkpoint interval > 设置的是3分钟),一个运行在flink1.9,一个运行在flink1.12,1.9的作业稳定运行,1.12的运行5小时就会checkpoint > 制作失败,抛异常 org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint > tolerable failure threshold. > 当我把checkpoint interval调大到10分钟后,1.12的作业也可以稳定运行,所以我怀疑和制作间隔有关。 > > 看到过一个issuse,了解到flink1.10后对于checkpoint机制进行调整,接收端在barrier对齐时不会缓存单个barrier到达后的数据,意味着发送方必须在barrier对齐后等待credit > feedback来传输数据,因此发送方会产生一定的冷启动,影响到延迟和网络吞吐量。但是不确定是不是一定和这个相关,以及如何定位影响。 > > > > -- > Sent from: http://apache-flink.147419.n8.nabble.com/
回复:flink 1.12.1 savepoint 重启问题
我也遇到了,不知道啥原因,这个也是偶尔发生,是真的难定位问题 | | 田向阳 | | 邮箱:lucas_...@163.com | 签名由 网易邮箱大师 定制 在2021年04月28日 17:00,chaos 写道: 你好,因为业务逻辑变化,需要对线上在跑的flink任务重启,采用savapoint方式。 操作如下: # 查看yarn-application-id yarn application -list # 查看flink任务id flink list -t yarn-per-job -Dyarn.application.id=application_1617064018715_0097 # 带有保存点的停止任务 flink stop -t yarn-per-job -p hdfs:///user/chaos/flink/0097_savepoint/ -Dyarn.application.id=application_1617064018715_0097 d7c1e0231f0e72cbd709baa9a0ba6415 Savepoint completed. Path: hdfs://nameservice1/user/chaos/flink/0097_savepoint/savepoint-d7c1e0-96f5b07d18be # 从保存点中重启任务 flink run -t yarn-per-job -s hdfs://nameservice1/user/chaos/flink/0097_savepoint/savepoint-d7c1e0-96f5b07d18be -d -c xxx xxx.jar 遇到的问题: Service temporarily unavailable due to an ongoing leader election. Please refresh 还请解惑,提前谢过! -- Sent from: http://apache-flink.147419.n8.nabble.com/
回复:wd: flink1.12.2 CLI连接hive出现异常
重新启动一个yarn session 集群。 | | 田向阳 | | 邮箱:lucas_...@163.com | 签名由 网易邮箱大师 定制 在2021年04月29日 10:05,张锴 写道: 官网有说吗,你在哪里找到的呢 guoyb <861277...@qq.com> 于2021年4月28日周三 上午10:56写道: > 我的也有这种问题,没解决,kerberos认证的hive导致的。 > > > > ---原始邮件--- > 发件人: "张锴" 发送时间: 2021年4月28日(周三) 上午10:41 > 收件人: "user-zh" 主题: Fwd: flink1.12.2 CLI连接hive出现异常 > > > -- Forwarded message - > 发件人: 张锴 Date: 2021年4月27日周二 下午1:59 > Subject: flink1.12.2 CLI连接hive出现异常 > To: > > *使用flink1.12.2 CLI连接hive时,可以查看到catalog,也能看到指定的catalog下的tables,但是执行select > 语句时就出现异常。* > [ERROR] Could not execute SQL statement. Reason: > org.apache.hadoop.ipc.RemoteException: Application with id > 'application_1605840182730_29292' doesn't exist in RM. Please check that > the job submission was suc > at > > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382) > at > > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java: > at > > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561) > at > > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) > > *使用yarn logs -applicationId application_1605840182730_29292 > 查看日志时,并没有给出具体的错误,以下是打出的日志。日志几乎看不出啥问题。* > INFO client.RMProxy: Connecting to ResourceManager at > hadoop01.xxx.xxx.xxx/xx.xx.x.xx:8050 > Unable to get ApplicationState. Attempting to fetch logs directly from the > filesystem. > Can not find the appOwner. Please specify the correct appOwner > Could not locate application logs for application_1605840182730_29292 > > 这个如何排查呢,有遇到类似的问题的小伙伴吗
回复:flink job 运行一小时会出现报错
用的checkpoint后端存储是啥,flink哪个版本的 | | 田向阳 | | 邮箱:lucas_...@163.com | 签名由 网易邮箱大师 定制 在2021年05月20日 17:02,cecotw 写道: 2021-05-19 19:04:19 org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116) at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78) at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192) at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185) at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179) at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:590) at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:384) at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:286) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:201) at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at akka.actor.Actor.aroundReceive(Actor.scala:517) at akka.actor.Actor.aroundReceive$(Actor.scala:515) at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) at akka.actor.ActorCell.invoke(ActorCell.scala:561) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) at akka.dispatch.Mailbox.run(Mailbox.scala:225) at akka.dispatch.Mailbox.exec(Mailbox.scala:235) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.IllegalStateException: zip file closed at java.util.zip.ZipFile.ensureOpen(ZipFile.java:686) at java.util.zip.ZipFile.getEntry(ZipFile.java:315) at java.util.jar.JarFile.getEntry(JarFile.java:240) at sun.net.www.protocol.jar.URLJarFile.getEntry(URLJarFile.java:128) at java.util.jar.JarFile.getJarEntry(JarFile.java:223) at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1054) at sun.misc.URLClassPath.getResource(URLClassPath.java:249) at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:363) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:362) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at org.apache.flink.util.FlinkUserCodeClassLoader.loadClassWithoutExceptionHandling(FlinkUserCodeClassLoader.java:61) at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:65) at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsAsync(ConsumerCoordinator.java:865) at org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.doAutoCommitOffsetsAsync(ConsumerCoordinator.java:973) at org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.maybeAutoCommitOffsetsAsync(ConsumerCoordinator.java:964) at org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:488) at org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded
回复:flink job 运行一小时会出现报错
用的checkpoint后端存储是啥,flink哪个版本的 | | 田向阳 | | 邮箱:lucas_...@163.com | 签名由 网易邮箱大师 定制 在2021年05月20日 17:02,cecotw 写道: 2021-05-19 19:04:19 org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116) at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78) at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192) at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185) at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179) at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:590) at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:384) at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:286) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:201) at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at akka.actor.Actor.aroundReceive(Actor.scala:517) at akka.actor.Actor.aroundReceive$(Actor.scala:515) at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) at akka.actor.ActorCell.invoke(ActorCell.scala:561) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) at akka.dispatch.Mailbox.run(Mailbox.scala:225) at akka.dispatch.Mailbox.exec(Mailbox.scala:235) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.IllegalStateException: zip file closed at java.util.zip.ZipFile.ensureOpen(ZipFile.java:686) at java.util.zip.ZipFile.getEntry(ZipFile.java:315) at java.util.jar.JarFile.getEntry(JarFile.java:240) at sun.net.www.protocol.jar.URLJarFile.getEntry(URLJarFile.java:128) at java.util.jar.JarFile.getJarEntry(JarFile.java:223) at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1054) at sun.misc.URLClassPath.getResource(URLClassPath.java:249) at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:363) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:362) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at org.apache.flink.util.FlinkUserCodeClassLoader.loadClassWithoutExceptionHandling(FlinkUserCodeClassLoader.java:61) at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:65) at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsAsync(ConsumerCoordinator.java:865) at org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.doAutoCommitOffsetsAsync(ConsumerCoordinator.java:973) at org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.maybeAutoCommitOffsetsAsync(ConsumerCoordinator.java:964) at org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:488) at org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded