One more thing: you configured

    high-availability.cluster-id: /cluster-test

but it should be:

    high-availability.cluster-id: cluster-test

I don't think this is a major issue, but in case it helps, you can check. Can you also check one more thing: is checkpointing happening or not? Were you able to see the chk-* folders under the checkpoint directory?
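To answer the chk-* question quickly: a small script like the one below can scan a per-job checkpoint directory for completed checkpoints. This is only a sketch; the directory layout assumes the `file:///flink/checkpoints` setting quoted later in this thread, and the job-id path component in the example is illustrative.

```python
import os

def completed_checkpoints(job_checkpoint_dir):
    """List completed checkpoint folders (chk-<n>) for one job.

    Flink creates one chk-<n> directory per retained completed
    checkpoint under <state.checkpoints.dir>/<job-id>/.
    Returns the folder names sorted by checkpoint number.
    """
    chks = [d for d in os.listdir(job_checkpoint_dir)
            if d.startswith("chk-") and d[len("chk-"):].isdigit()]
    return sorted(chks, key=lambda d: int(d[len("chk-"):]))

# Example (the job-id below is hypothetical):
# completed_checkpoints("/flink/checkpoints/639170a9d710bacfd113ca66b2aacefa")
```

An empty result means no checkpoint has ever completed for that job, which would point at the checkpointing setup rather than HA recovery.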
Regards
Bhaskar

On Thu, Nov 28, 2019 at 5:00 PM 曾祥才 <xcz200...@qq.com> wrote:

> hi,
> Is there any difference (for me, using NAS is more convenient to test
> currently)? From the docs it seems HDFS, S3, NFS etc. will all be fine.

------------------ Original Message ------------------
From: "vino yang" <yanghua1...@gmail.com>
Date: Thursday, Nov 28, 2019, 7:17 PM
To: "曾祥才" <xcz200...@qq.com>
Cc: "Vijay Bhaskar" <bhaskar.eba...@gmail.com>; "User-Flink" <user@flink.apache.org>
Subject: Re: JobGraphs not cleaned up in HA mode

> Hi,
> Why do you not use HDFS directly?
>
> Best,
> Vino

曾祥才 <xcz200...@qq.com> wrote on Thu, Nov 28, 2019 at 6:48 PM:

> anyone have the same problem? please help, thanks

------------------ Original Message ------------------
From: "曾祥才" <xcz200...@qq.com>
Date: Thursday, Nov 28, 2019, 2:46 PM
To: "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
Cc: "User-Flink" <user@flink.apache.org>
Subject: Re: JobGraphs not cleaned up in HA mode

> the config (/flink is the NAS directory):
>
> jobmanager.rpc.address: flink-jobmanager
> taskmanager.numberOfTaskSlots: 16
> web.upload.dir: /flink/webUpload
> blob.server.port: 6124
> jobmanager.rpc.port: 6123
> taskmanager.rpc.port: 6122
> jobmanager.heap.size: 1024m
> taskmanager.heap.size: 1024m
> high-availability: zookeeper
> high-availability.cluster-id: /cluster-test
> high-availability.storageDir: /flink/ha
> high-availability.zookeeper.quorum: ****:2181
> high-availability.jobmanager.port: 6123
> high-availability.zookeeper.path.root: /flink/risk-insight
> high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
> state.backend: filesystem
> state.checkpoints.dir: file:///flink/checkpoints
> state.savepoints.dir: file:///flink/savepoints
> state.checkpoints.num-retained: 2
> jobmanager.execution.failover-strategy: region
> jobmanager.archive.fs.dir: file:///flink/archive/history

------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
Date: Thursday, Nov 28, 2019, 3:12 PM
To: "曾祥才" <xcz200...@qq.com>
Cc: "User-Flink" <user@flink.apache.org>
Subject: Re: JobGraphs not cleaned up in HA mode

> Can you share the flink configuration once?
>
> Regards
> Bhaskar

On Thu, Nov 28, 2019 at 12:09 PM 曾祥才 <xcz200...@qq.com> wrote:

> if i clean the zookeeper data, it runs fine. but the next time the
> jobmanager fails and is redeployed, the error occurs again

------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
Date: Thursday, Nov 28, 2019, 3:05 PM
To: "曾祥才" <xcz200...@qq.com>
Subject: Re: JobGraphs not cleaned up in HA mode

> Again it could not find the state store file: "Caused by:
> java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199"
> Check why it's unable to find it.
> The better thing is: clean up the zookeeper state, check your
> configurations, correct them, and restart the cluster.
> Otherwise it always picks up the corrupted state from zookeeper and it
> will never restart.
>
> Regards
> Bhaskar

On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <xcz200...@qq.com> wrote:

> i've made a mistake (the earlier log was from another cluster). the full
> exception log is:
>
> INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering all persisted jobs.
> 2019-11-28 02:33:12,726 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl - Starting the SlotManager.
> 2019-11-28 02:33:12,743 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from ZooKeeper.
> 2019-11-28 02:33:12,744 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
> org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb.
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>     at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594)
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
>     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>     at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>     ... 7 more
> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888)
>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)
>     ... 9 more
> Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory)
>     at java.io.FileInputStream.open0(Native Method)
>     at java.io.FileInputStream.open(FileInputStream.java:195)

------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
Date: Thursday, Nov 28, 2019, 2:46 PM
To: "曾祥才" <xcz200...@qq.com>
Cc: "User-Flink" <user@flink.apache.org>
Subject: Re: JobGraphs not cleaned up in HA mode

> Is it filesystem or hadoop? If it's NAS then why the exception "Caused
> by: org.apache.hadoop.hdfs.BlockMissingException:"?
> It seems you configured a hadoop state store but are giving a NAS mount.
>
> Regards
> Bhaskar

On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <xcz200...@qq.com> wrote:

> /flink/checkpoints is an external persistent store (a NAS directory
> mounted on the job manager)

------------------ Original Message ------------------
From: "Vijay Bhaskar" <bhaskar.eba...@gmail.com>
Date: Thursday, Nov 28, 2019, 2:29 PM
To: "曾祥才" <xcz200...@qq.com>
Cc: "user" <user@flink.apache.org>
Subject: Re: JobGraphs not cleaned up in HA mode

> The following are the mandatory conditions to run in HA:
>
> a) You should have a persistent common external store for the jobmanager
>    and task managers to write the state to.
> b) You should have a persistent external store for zookeeper to store
>    the JobGraph.
>
> Zookeeper is referring to the path
> /flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph,
> but the jobmanager is unable to find it.
> It seems /flink/checkpoints is not the external persistent store.
>
> Regards
> Bhaskar

On Thu, Nov 28, 2019 at 10:43 AM seuzxc <xcz200...@qq.com> wrote:

> hi, I've the same problem with flink 1.9.1, any solution to fix it?
> when k8s redeploys the jobmanager, the error looks like this (it seems zk
> does not remove the submitted job info, but the jobmanager removed the
> file):
>
> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
>     ... 9 more
> Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
>     at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
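For anyone hitting the same problem: pulling together the fixes suggested in this thread, the HA-related part of flink-conf.yaml would look roughly like the sketch below. This is an illustrative example, not a verified fix; the zookeeper host is a placeholder, and the paths are the ones quoted in the thread. The key points are the cluster-id without the leading slash (as suggested above) and explicit file:// schemes so the shared NAS mount is resolved by the local file system rather than Hadoop.

```yaml
high-availability: zookeeper
high-availability.cluster-id: cluster-test        # no leading slash
high-availability.storageDir: file:///flink/ha    # explicit scheme
high-availability.zookeeper.quorum: zk-host:2181  # placeholder host
high-availability.zookeeper.path.root: /flink/risk-insight
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints
```

Note that the NAS must be mounted at the same path on every jobmanager and taskmanager pod; otherwise a restarted jobmanager cannot resolve paths such as /flink/ha/submittedJobGraph..., which is consistent with the FileNotFoundException in the logs above.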