One more thing: you configured

    high-availability.cluster-id: /cluster-test

but it should be:

    high-availability.cluster-id: cluster-test

I don't think this is a major issue, but in case it helps, you can check. Can you also check one more thing: is checkpointing happening or not? Were you able to see the chk-* folders under the checkpoint directory?
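To answer the chk-* question quickly: a small script like the one below can scan a per-job checkpoint directory for completed checkpoints. This is only a sketch; the directory layout assumes the `file:///flink/checkpoints` setting quoted later in this thread, and the job-id path component in the example is illustrative.

```python
import os

def completed_checkpoints(job_checkpoint_dir):
    """List completed checkpoint folders (chk-<n>) for one job.

    Flink creates one chk-<n> directory per retained completed
    checkpoint under <state.checkpoints.dir>/<job-id>/.
    Returns the folder names sorted by checkpoint number.
    """
    chks = [d for d in os.listdir(job_checkpoint_dir)
            if d.startswith("chk-") and d[len("chk-"):].isdigit()]
    return sorted(chks, key=lambda d: int(d[len("chk-"):]))

# Example (the job-id below is hypothetical):
# completed_checkpoints("/flink/checkpoints/639170a9d710bacfd113ca66b2aacefa")
```

An empty result means no checkpoint has ever completed for that job, which would point at the checkpointing setup rather than HA recovery.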
Regards
Bhaskar

On Thu, Nov 28, 2019 at 5:00 PM 曾祥才 <xcz200...@qq.com> wrote:

> hi,
> Is there any difference (for me, using NAS is more convenient to test
> currently)? From the docs it seems HDFS, S3, NFS etc. will all be fine.

------------------ Original Message ------------------
From: "vino yang" <yanghua1...@gmail.com>
Date: Thursday, Nov 28, 2019, 7:17 PM
To: "曾祥才" <xcz200...@qq.com>
Cc: "Vijay Bhaskar" <bhaskar.eba...@gmail.com>; "User-Flink" <user@flink.apache.org>
Subject: Re: JobGraphs not cleaned up in HA mode

> Hi,
> Why do you not use HDFS directly?
>
> Best,
> Vino

曾祥才 <xcz200...@qq.com> wrote on Thu, Nov 28, 2019 at 6:48 PM:

> anyone have the same problem? please help, thanks

------------------ Original Message ------------------
From: "曾祥才" <xcz200...@qq.com>
Date: Thursday, Nov 28, 2019, 2:46 PM
To: "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
Cc: "User-Flink" <user@flink.apache.org>
Subject: Re: JobGraphs not cleaned up in HA mode

> the config (/flink is the NAS directory):
>
> jobmanager.rpc.address: flink-jobmanager
> taskmanager.numberOfTaskSlots: 16
> web.upload.dir: /flink/webUpload
> blob.server.port: 6124
> jobmanager.rpc.port: 6123
> taskmanager.rpc.port: 6122
> jobmanager.heap.size: 1024m
> taskmanager.heap.size: 1024m
> high-availability: zookeeper
> high-availability.cluster-id: /cluster-test
> high-availability.storageDir: /flink/ha
> high-availability.zookeeper.quorum: ****:2181
> high-availability.jobmanager.port: 6123
> high-availability.zookeeper.path.root: /flink/risk-insight
> high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
> state.backend: filesystem
> state.checkpoints.dir: file:///flink/checkpoints
> state.savepoints.dir: file:///flink/savepoints
> state.checkpoints.num-retained: 2
> jobmanager.execution.failover-strategy: region
> jobmanager.archive.fs.dir: file:///flink/archive/history

------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
Date: Thursday, Nov 28, 2019, 3:12 PM
To: "曾祥才" <xcz200...@qq.com>
Cc: "User-Flink" <user@flink.apache.org>
Subject: Re: JobGraphs not cleaned up in HA mode

> Can you share the flink configuration once?
>
> Regards
> Bhaskar

On Thu, Nov 28, 2019 at 12:09 PM 曾祥才 <xcz200...@qq.com> wrote:

> if i clean the zookeeper data, it runs fine. but the next time the
> jobmanager fails and is redeployed, the error occurs again

------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
Date: Thursday, Nov 28, 2019, 3:05 PM
To: "曾祥才" <xcz200...@qq.com>
Subject: Re: JobGraphs not cleaned up in HA mode

> Again it could not find the state store file: "Caused by:
> java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199"
> Check why it's unable to find it.
> The better thing is: clean up the zookeeper state, check your
> configurations, correct them, and restart the cluster.
> Otherwise it always picks up the corrupted state from zookeeper and it
> will never restart.
>
> Regards
> Bhaskar

On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <xcz200...@qq.com> wrote:

> i've made a mistake (the earlier log was from another cluster). the full
> exception log is:
>
> INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering all persisted jobs.
> 2019-11-28 02:33:12,726 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl - Starting the SlotManager.
> 2019-11-28 02:33:12,743 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from ZooKeeper.
> 2019-11-28 02:33:12,744 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
> org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb.
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>     at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594)
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
>     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>     at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>     ... 7 more
> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888)
>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)
>     ... 9 more
> Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory)
>     at java.io.FileInputStream.open0(Native Method)
>     at java.io.FileInputStream.open(FileInputStream.java:195)

------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
Date: Thursday, Nov 28, 2019, 2:46 PM
To: "曾祥才" <xcz200...@qq.com>
Cc: "User-Flink" <user@flink.apache.org>
Subject: Re: JobGraphs not cleaned up in HA mode

> Is it filesystem or hadoop? If it's NAS then why the exception "Caused
> by: org.apache.hadoop.hdfs.BlockMissingException:"?
> It seems you configured a hadoop state store but are giving a NAS mount.
>
> Regards
> Bhaskar

On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <xcz200...@qq.com> wrote:

> /flink/checkpoints is an external persistent store (a NAS directory
> mounted on the job manager)

------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
Date: Thursday, Nov 28, 2019, 2:29 PM
To: "曾祥才" <xcz200...@qq.com>
Cc: "user" <user@flink.apache.org>
Subject: Re: JobGraphs not cleaned up in HA mode

> The following are the mandatory conditions to run in HA:
>
> a) You should have a persistent common external store for the jobmanager
>    and task managers to write the state to.
> b) You should have a persistent external store for zookeeper to store
>    the JobGraph.
>
> Zookeeper is referring to the path
> /flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph,
> but the jobmanager is unable to find it.
> It seems /flink/checkpoints is not the external persistent store.
>
> Regards
> Bhaskar

On Thu, Nov 28, 2019 at 10:43 AM seuzxc <xcz200...@qq.com> wrote:

> hi, I've the same problem with flink 1.9.1, any solution to fix it?
> when k8s redeploys the jobmanager, the error looks like this (it seems zk
> does not remove the submitted job info, but the jobmanager removed the
> file):
>
> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
>     ... 9 more
> Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
>     at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
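For anyone hitting the same problem: pulling together the fixes suggested in this thread, the HA-related part of flink-conf.yaml would look roughly like the sketch below. This is an illustrative example, not a verified fix; the zookeeper host is a placeholder, and the paths are the ones quoted in the thread. The key points are the cluster-id without the leading slash (as suggested above) and explicit file:// schemes so the shared NAS mount is resolved by the local file system rather than Hadoop.

```yaml
high-availability: zookeeper
high-availability.cluster-id: cluster-test        # no leading slash
high-availability.storageDir: file:///flink/ha    # explicit scheme
high-availability.zookeeper.quorum: zk-host:2181  # placeholder host
high-availability.zookeeper.path.root: /flink/risk-insight
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints
```

Note that the NAS must be mounted at the same path on every jobmanager and taskmanager pod; otherwise a restarted jobmanager cannot resolve paths such as /flink/ha/submittedJobGraph..., which is consistent with the FileNotFoundException in the logs above.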