Re: Question about cluster stability
That solved my problem. Thank you very much!

On Mon, May 13, 2019 at 9:48 AM, liu_mingzhang wrote:
> Hi, to fix this you need to put the matching version of javax.ws.rs-api-2.0.jar into $FLINK_HOME/lib.
>
> On May 12, 2019 at 11:05, naisili Yuan wrote:
> OK, thanks for the reply.
> I'd like to ask whether standalone cluster mode is suitable for production. My cluster still feels unstable: with 10 jobs it cannot run for 24 hours without problems, mostly lost heartbeats or removed slots.
> I then tried deploying Flink on YARN and hit another problem, which I could not fix after a long time of trying, so I hope to get some help. Running bin/yarn-session.sh -jm 1024m -tm 4096m -s 8 fails with:
>
> 2019-05-12 11:02:39,056 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, 192.168.199.244
> 2019-05-12 11:02:39,057 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
> 2019-05-12 11:02:39,057 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
> 2019-05-12 11:02:39,057 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 8gb
> 2019-05-12 11:02:39,057 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 8
> 2019-05-12 11:02:39,057 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
> 2019-05-12 11:02:39,058 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.data.port, 25630
> 2019-05-12 11:02:39,058 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: tasknamager.rpc.port, 20603-20606
> 2019-05-12 11:02:39,058 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 20666
> 2019-05-12 11:02:39,058 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: resourcemanager.rpc.port, 20667
> 2019-05-12 11:02:39,059 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: yarn.appmaster.rpc.address, 192.168.199.100
> 2019-05-12 11:02:39,059 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: yarn.appmaster.rpc.port, 8032
> 2019-05-12 11:02:39,338 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2019-05-12 11:02:39,396 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to root (auth:SIMPLE)
> 2019-05-12 11:02:39,492 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli - Error while running the Flink Yarn session.
> java.lang.NoClassDefFoundError: javax/ws/rs/ext/MessageBodyReader
>     at java.lang.ClassLoader.defineClass1(Native Method)
>     at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>     at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>     at java.lang.ClassLoader.defineClass1(Native Method)
>     at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>     at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>     at java.lang.ClassLoader.defineClass1(Native Method)
>     at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>     at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>     at
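
Editor's sketch for the fix quoted above: before and after copying the jar into $FLINK_HOME/lib, it can help to confirm whether javax.ws.rs.ext.MessageBodyReader is actually visible on a given classpath. The class name ClasspathCheck and the method locate are hypothetical helpers written for illustration; they are not part of Flink or Hadoop. This is a minimal sketch that assumes you run it with the same classpath the YARN session client uses (for example, the jars under $FLINK_HOME/lib).

```java
import java.security.CodeSource;

public class ClasspathCheck {
    /**
     * Returns where the named class was loaded from, or null if it cannot be
     * loaded from this program's classpath. Hypothetical helper for illustration.
     */
    public static String locate(String className) {
        try {
            // initialize=false: only resolve the class, do not run static initializers
            Class<?> c = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // JDK core classes have no code source; report that explicitly
            return src == null ? "<bootstrap>" : src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the NoClassDefFoundError from this thread
        String name = args.length > 0 ? args[0] : "javax.ws.rs.ext.MessageBodyReader";
        String where = locate(name);
        System.out.println(where == null
                ? name + " NOT found on classpath"
                : name + " loaded from " + where);
    }
}
```

Running it with something like java -cp "$FLINK_HOME/lib/*:." ClasspathCheck should report the jar that provides the class once the fix is in place. The exact classpath used by yarn-session.sh may include more entries than this, so treat the result as a hint only.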
Re: Question about cluster stability
OK, thanks for the reply.

I'd like to ask whether standalone cluster mode is suitable for production. My cluster still feels unstable: with 10 jobs it cannot run for 24 hours without problems, mostly lost heartbeats or removed slots.

I then tried deploying Flink on YARN and hit another problem, which I could not fix after a long time of trying, so I hope to get some help. Running bin/yarn-session.sh -jm 1024m -tm 4096m -s 8 fails with:

2019-05-12 11:02:39,056 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, 192.168.199.244
2019-05-12 11:02:39,057 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2019-05-12 11:02:39,057 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
2019-05-12 11:02:39,057 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 8gb
2019-05-12 11:02:39,057 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 8
2019-05-12 11:02:39,057 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2019-05-12 11:02:39,058 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.data.port, 25630
2019-05-12 11:02:39,058 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: tasknamager.rpc.port, 20603-20606
2019-05-12 11:02:39,058 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 20666
2019-05-12 11:02:39,058 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: resourcemanager.rpc.port, 20667
2019-05-12 11:02:39,059 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: yarn.appmaster.rpc.address, 192.168.199.100
2019-05-12 11:02:39,059 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: yarn.appmaster.rpc.port, 8032
2019-05-12 11:02:39,338 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-05-12 11:02:39,396 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to root (auth:SIMPLE)
2019-05-12 11:02:39,492 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli - Error while running the Flink Yarn session.
java.lang.NoClassDefFoundError: javax/ws/rs/ext/MessageBodyReader
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.hadoop.yarn.util.timeline.TimelineUtils.<clinit>(TimelineUtils.java:50)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:179)
    at
Re: Question about cluster stability
For heartbeat timeouts, first check the memory usage of the AM and the TMs, and look in the GC log for long GC pauses.

--
From: naisili Yuan
Send Time: 2019 May 10 (Fri.) 09:34
To: user-zh
Subject: Question about cluster stability

My cluster is configured with in-memory checkpoints and automatic restart, but jobs often restart on their own after running overnight. The log gives the cause of the restart as follows:

org.apache.flink.util.FlinkException: The assigned slot f6b9b4065386152879a01dfc7d396f42_1 was removed.
    at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:893)
    at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:863)
    at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1058)
    at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)
    at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:825)
    at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1139)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Or:

java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 7675b5849deb7da116ad946eed0f74b6 timed out.
    at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1631)
    at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:339)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:154)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Is there a recommended reference Flink configuration that addresses this kind of problem? I deploy in standalone mode. Thanks in advance!
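
Editor's sketch for the advice at the top of this reply (check the GC log for long pauses): one rough way to do that is to scan the TaskManager GC log for pauses whose wall-clock time exceeds a threshold. The class GcPauseScan and the method longPauses are hypothetical names for illustration. The sketch assumes a JDK 8 style log produced with -XX:+PrintGCDetails, where each collection ends in a fragment like "real=0.01 secs"; other JVM versions or flags produce different formats that this pattern will not match. The 10-second default threshold is arbitrary; a pause approaching Flink's heartbeat.timeout setting is the kind of event that could plausibly explain a missed heartbeat.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcPauseScan {
    // Matches the wall-clock part of a JDK 8 "[Times: user=... sys=..., real=1.23 secs]" entry
    private static final Pattern REAL = Pattern.compile("real=(\\d+(?:\\.\\d+)?) secs");

    /** Returns the log lines whose "real" pause time is at least thresholdSecs. */
    public static List<String> longPauses(List<String> lines, double thresholdSecs) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            Matcher m = REAL.matcher(line);
            if (m.find() && Double.parseDouble(m.group(1)) >= thresholdSecs) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: GcPauseScan <gc-log-file> [threshold-seconds]");
            return;
        }
        // args[0]: path to a GC log file; args[1]: optional threshold in seconds
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        double threshold = args.length > 1 ? Double.parseDouble(args[1]) : 10.0;
        longPauses(lines, threshold).forEach(System.out::println);
    }
}
```

If this prints lines with timestamps close to the heartbeat timeouts in the job log, memory pressure on the TaskManager is a likely suspect; if it prints nothing, the cause may lie elsewhere (network, host load), and this check alone cannot rule those out.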