回复:flink job 运行一小时会出现报错

2021-05-20 Thread
用的checkpoint后端存储是啥,flink哪个版本的


| |
田向阳
|
|
邮箱:lucas_...@163.com
|

签名由 网易邮箱大师 定制

在2021年05月20日 17:02,cecotw 写道:
2021-05-19 19:04:19
org.apache.flink.runtime.JobException: Recovery is suppressed by
NoRestartBackoffTimeStrategy
 at
org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
 at
org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
 at
org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
 at
org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185)
 at
org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179)
 at
org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:590)
 at
org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:384)
 at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:286)
 at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:201)
 at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
 at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
 at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
 at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
 at akka.actor.Actor.aroundReceive(Actor.scala:517)
 at akka.actor.Actor.aroundReceive$(Actor.scala:515)
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.IllegalStateException: zip file closed
 at java.util.zip.ZipFile.ensureOpen(ZipFile.java:686)
 at java.util.zip.ZipFile.getEntry(ZipFile.java:315)
 at java.util.jar.JarFile.getEntry(JarFile.java:240)
 at sun.net.www.protocol.jar.URLJarFile.getEntry(URLJarFile.java:128)
 at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
 at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1054)
 at sun.misc.URLClassPath.getResource(URLClassPath.java:249)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at
org.apache.flink.util.FlinkUserCodeClassLoader.loadClassWithoutExceptionHandling(FlinkUserCodeClassLoader.java:61)
 at
org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:65)
 at
org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at
org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsAsync(ConsumerCoordinator.java:865)
 at
org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.doAutoCommitOffsetsAsync(ConsumerCoordinator.java:973)
 at
org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.maybeAutoCommitOffsetsAsync(ConsumerCoordinator.java:964)
 at
org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:488)
 at
org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded

回复:flink job 运行一小时会出现报错

2021-05-20 Thread
用的checkpoint后端存储是啥,flink哪个版本的


| |
田向阳
|
|
邮箱:lucas_...@163.com
|

签名由 网易邮箱大师 定制

在2021年05月20日 17:02,cecotw 写道:
2021-05-19 19:04:19
org.apache.flink.runtime.JobException: Recovery is suppressed by
NoRestartBackoffTimeStrategy
 at
org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
 at
org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
 at
org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
 at
org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185)
 at
org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179)
 at
org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:590)
 at
org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:384)
 at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:286)
 at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:201)
 at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
 at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
 at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
 at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
 at akka.actor.Actor.aroundReceive(Actor.scala:517)
 at akka.actor.Actor.aroundReceive$(Actor.scala:515)
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.IllegalStateException: zip file closed
 at java.util.zip.ZipFile.ensureOpen(ZipFile.java:686)
 at java.util.zip.ZipFile.getEntry(ZipFile.java:315)
 at java.util.jar.JarFile.getEntry(JarFile.java:240)
 at sun.net.www.protocol.jar.URLJarFile.getEntry(URLJarFile.java:128)
 at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
 at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1054)
 at sun.misc.URLClassPath.getResource(URLClassPath.java:249)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at
org.apache.flink.util.FlinkUserCodeClassLoader.loadClassWithoutExceptionHandling(FlinkUserCodeClassLoader.java:61)
 at
org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:65)
 at
org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at
org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsAsync(ConsumerCoordinator.java:865)
 at
org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.doAutoCommitOffsetsAsync(ConsumerCoordinator.java:973)
 at
org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.maybeAutoCommitOffsetsAsync(ConsumerCoordinator.java:964)
 at
org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:488)
 at
org.apache.flink.kafka.shaded.org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded

回复:wd: flink1.12.2 CLI连接hive出现异常

2021-04-28 Thread
重新启动一个yarn session 集群。


| |
田向阳
|
|
邮箱:lucas_...@163.com
|

签名由 网易邮箱大师 定制

在2021年04月29日 10:05,张锴 写道:
官网有说吗,你在哪里找到的呢

guoyb <861277...@qq.com> 于2021年4月28日周三 上午10:56写道:

> 我的也有这种问题,没解决,kerberos认证的hive导致的。
>
>
>
> ---原始邮件---
> 发件人: "张锴" 发送时间: 2021年4月28日(周三) 上午10:41
> 收件人: "user-zh" 主题: Fwd: flink1.12.2 CLI连接hive出现异常
>
>
> -- Forwarded message -
> 发件人: 张锴  Date: 2021年4月27日周二 下午1:59
> Subject: flink1.12.2 CLI连接hive出现异常
> To: 
>
> *使用flink1.12.2 CLI连接hive时,可以查看到catalog,也能看到指定的catalog下的tables,但是执行select
> 语句时就出现异常。*
> [ERROR] Could not execute SQL statement. Reason:
> org.apache.hadoop.ipc.RemoteException: Application with id
> 'application_1605840182730_29292' doesn't exist in RM. Please check that
> the job submission was suc
> at
>
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:382)
> at
>
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:
> at
>
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:561)
> at
>
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>
> *使用yarn logs -applicationId application_1605840182730_29292
> 查看日志时,并没有给出具体的错误,以下是打出的日志。日志几乎看不出啥问题。*
> INFO client.RMProxy: Connecting to ResourceManager at
> hadoop01.xxx.xxx.xxx/xx.xx.x.xx:8050
> Unable to get ApplicationState. Attempting to fetch logs directly from the
> filesystem.
> Can not find the appOwner. Please specify the correct appOwner
> Could not locate application logs for application_1605840182730_29292
>
> 这个如何排查呢,有遇到类似的问题的小伙伴吗


回复:flink 1.12.1 savepoint 重启问题

2021-04-28 Thread
我也遇到了,不知道啥原因,这个也是偶尔发生,是真的难定位问题


| |
田向阳
|
|
邮箱:lucas_...@163.com
|

签名由 网易邮箱大师 定制

在2021年04月28日 17:00,chaos 写道:
你好,因为业务逻辑变化,需要对线上在跑的flink任务重启,采用savapoint方式。

操作如下:
# 查看yarn-application-id
yarn application -list
# 查看flink任务id
flink list -t yarn-per-job
-Dyarn.application.id=application_1617064018715_0097
# 带有保存点的停止任务
flink stop -t yarn-per-job -p hdfs:///user/chaos/flink/0097_savepoint/
-Dyarn.application.id=application_1617064018715_0097
d7c1e0231f0e72cbd709baa9a0ba6415
Savepoint completed. Path:
hdfs://nameservice1/user/chaos/flink/0097_savepoint/savepoint-d7c1e0-96f5b07d18be
# 从保存点中重启任务
flink run -t yarn-per-job -s
hdfs://nameservice1/user/chaos/flink/0097_savepoint/savepoint-d7c1e0-96f5b07d18be
-d -c xxx xxx.jar

遇到的问题:
Service temporarily unavailable due to an ongoing leader election. Please
refresh

还请解惑,提前谢过!



--
Sent from: http://apache-flink.147419.n8.nabble.com/


回复:Flink 1.12.0 隔几个小时Checkpoint就会失败

2021-04-22 Thread
唉,这个问题着实让人头大,我现在还没找到原因。你这边确定了跟我说一声哈


| |
田向阳
|
|
邮箱:lucas_...@163.com
|

签名由 网易邮箱大师 定制

在2021年04月22日 20:56,张锴 写道:
你好,我也遇到了这个问题,你的checkpoint是怎么配置的,可以参考一下吗

Haihang Jing  于2021年3月23日周二 下午8:04写道:

> 你好,问题定位到了吗?
> 我也遇到了相同的问题,感觉和checkpoint interval有关
> 我有两个相同的作业(checkpoint interval
> 设置的是3分钟),一个运行在flink1.9,一个运行在flink1.12,1.9的作业稳定运行,1.12的运行5小时就会checkpoint
> 制作失败,抛异常 org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint
> tolerable failure threshold.
> 当我把checkpoint interval调大到10分钟后,1.12的作业也可以稳定运行,所以我怀疑和制作间隔有关。
>
> 看到过一个issuse,了解到flink1.10后对于checkpoint机制进行调整,接收端在barrier对齐时不会缓存单个barrier到达后的数据,意味着发送方必须在barrier对齐后等待credit
> feedback来传输数据,因此发送方会产生一定的冷启动,影响到延迟和网络吞吐量。但是不确定是不是一定和这个相关,以及如何定位影响。
>
>
>
> --
> Sent from: http://apache-flink.147419.n8.nabble.com/