Re: JobGraphs not cleaned up in HA mode

2019-11-28 Thread 曾祥才
The chk-* directory is not found. I think it is missing because the jobmanager removes it automatically, but why is it still in ZooKeeper? ---- From: "Vijay Bhaskar"
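
One way to see what ZooKeeper is still holding for the cluster is the ZooKeeper CLI. A minimal sketch, assuming the default HA paths (high-availability.zookeeper.path.root: /flink, high-availability.cluster-id: default) and a placeholder quorum host:

    # inspect the HA znodes; adjust paths to your configuration
    bin/zkCli.sh -server <zk-host>:2181
    ls /flink/default/jobgraphs      # job graphs the dispatcher will try to recover
    ls /flink/default/checkpoints    # checkpoint handles kept per job id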

Re: JobGraphs not cleaned up in HA mode

2019-11-28 Thread Vijay Bhaskar
-- Original Message -- > *From:* "vino yang"; > *Sent:* Thursday, November 28, 2019, 7:17 PM > *To:* "曾祥才"; > *Cc:* "Vijay Bhaskar";"User-Flink"< > user@flink.apache.org>; > *Subject:* Re: JobGraphs not cleaned up in HA mode > > Hi, > > Why

Re: JobGraphs not cleaned up in HA mode

2019-11-28 Thread 曾祥才
Hi, is there any difference? For me, using NAS is more convenient to test with currently. From the docs it seems HDFS, S3, NFS, etc. will all be fine. ---- From: "vino yang"

Re: JobGraphs not cleaned up in HA mode

2019-11-28 Thread vino yang
Bhaskar"; > *抄送:* "User-Flink"; > *主题:* 回复: JobGraphs not cleaned up in HA mode > > the config (/flink is the NASdirectory ): > > jobmanager.rpc.address: flink-jobmanager > taskmanager.numberOfTaskSlots: 16 > web.upload.dir: /flink/webUpload > blob.server.port:

Re: JobGraphs not cleaned up in HA mode

2019-11-28 Thread 曾祥才
Does anyone have the same problem? Please help, thanks.

Re: JobGraphs not cleaned up in HA mode

2019-11-27 Thread 曾祥才
the config (/flink is the NAS directory):

    jobmanager.rpc.address: flink-jobmanager
    taskmanager.numberOfTaskSlots: 16
    web.upload.dir: /flink/webUpload
    blob.server.port: 6124
    jobmanager.rpc.port: 6123
    taskmanager.rpc.port: 6122
    jobmanager.heap.size: 1024m
    taskmanager.heap.size: 1024m

Re: JobGraphs not cleaned up in HA mode

2019-11-27 Thread Vijay Bhaskar
-- > *From:* "Vijay Bhaskar"; > *Sent:* Thursday, November 28, 2019, 3:05 PM > *To:* "曾祥才"; > *Subject:* Re: JobGraphs not cleaned up in HA mode > > Again it could not find the state store file: "Caused by: > java.io.FileNotFoundException: /flink/ha/submittedJob

Re: JobGraphs not cleaned up in HA mode

2019-11-27 Thread 曾祥才
If I clean the ZooKeeper data, it runs fine. But the next time the jobmanager fails and is redeployed, the error occurs again. ---- From: "Vijay Bhaskar"
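
The manual cleanup described here can be done per job instead of wiping all of ZooKeeper. A sketch, assuming the default HA paths; the job id is a placeholder:

    # ZooKeeper 3.4.x uses rmr; newer releases use deleteall
    bin/zkCli.sh -server <zk-host>:2181
    rmr /flink/default/jobgraphs/<job-id>
    rmr /flink/default/checkpoints/<job-id>

Note that this discards the recovery metadata for that job, so it is only a workaround for graphs that should already be gone.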

Re: JobGraphs not cleaned up in HA mode

2019-11-27 Thread Vijay Bhaskar
is an external persistent store (a NAS directory mounted > to the job manager) > > > > > -- Original Message -- > *From:* "Vijay Bhaskar"; > *Sent:* Thursday, November 28, 2019, 2:29 PM > *To:* "曾祥才"; > *Cc:* "user"; > *Subject:* Re: JobGraph

Re: JobGraphs not cleaned up in HA mode

2019-11-27 Thread 曾祥才
/flink/checkpoints is an external persistent store (a NAS directory mounted to the job manager). ---- From: "Vijay Bhaskar"

Re: JobGraphs not cleaned up in HA mode

2019-11-27 Thread Vijay Bhaskar
The following are the mandatory conditions to run in HA: a) You should have a persistent common external store that the jobmanager and task managers can write their state to. b) You should have a persistent external store for ZooKeeper to store the JobGraph. ZooKeeper is referring to the path:
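
A minimal flink-conf.yaml sketch of the two conditions above (the key names are the standard Flink HA options; the quorum and paths are placeholders):

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
    # (b) persistent store for the metadata ZooKeeper points to (job graphs, checkpoint handles)
    high-availability.storageDir: file:///flink/ha
    # (a) common persistent store for checkpoint state written by jobmanager and task managers
    state.checkpoints.dir: file:///flink/checkpoints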

Re: JobGraphs not cleaned up in HA mode

2019-11-27 Thread seuzxc
Hi, I have the same problem with Flink 1.9.1; is there any solution to fix it? When k8s redeploys the jobmanager, the error looks like this (it seems ZooKeeper does not remove the submitted job info, but the jobmanager removes the file): Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from

Re: JobGraphs not cleaned up in HA mode

2018-08-29 Thread Till Rohrmann
Hi Encho, thanks for sending the first part of the logs. What I would actually be interested in are the complete logs because somewhere in the jobmanager-2 logs there must be a log statement saying that the respective dispatcher gained leadership. I would like to see why this happens but for this

Re: JobGraphs not cleaned up in HA mode

2018-08-29 Thread Encho Mishinev
Hi Till, I will use the approach with a k8s deployment and HA mode with a single job manager. Nonetheless, here are the logs I just produced by repeating the aforementioned experiment, hope they help in debugging: *- Starting Jobmanager-1:* Starting Job Manager sed: cannot rename

Re: JobGraphs not cleaned up in HA mode

2018-08-29 Thread Till Rohrmann
Hi Encho, it sounds strange that the standby JobManager tries to recover a submitted job graph. This should only happen if it has been granted leadership. Thus, it seems as if the standby JobManager thinks that it is also the leader. Could you maybe share the logs of the two
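
If it helps while debugging cases like this, the leader latch can be inspected in ZooKeeper. A sketch, assuming the default latch path under the HA root:

    ls /flink/default/leaderlatch
    # each live JobManager contends with an ephemeral latch node; a standby
    # acting as leader suggests a stale session or split leadership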

Re: JobGraphs not cleaned up in HA mode

2018-08-29 Thread Encho Mishinev
Hello, Since two job managers don't seem to be working for me, I was thinking of just using a single job manager in Kubernetes in HA mode, with a deployment ensuring its restart whenever it fails. Is this approach viable? The High-Availability page mentions that you use only one job manager in an
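
For reference, the approach described relies on Kubernetes recreating the single JobManager pod while the HA metadata in ZooKeeper survives the restart. A minimal Deployment sketch; the image tag, names, and omitted config/volume mounts are placeholders, not a full manifest:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: flink-jobmanager
    spec:
      replicas: 1            # single JobManager; k8s recreates the pod on failure
      selector:
        matchLabels: {app: flink, component: jobmanager}
      template:
        metadata:
          labels: {app: flink, component: jobmanager}
        spec:
          containers:
          - name: jobmanager
            image: flink:1.5.3      # version used in this thread
            args: ["jobmanager"]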

Re: JobGraphs not cleaned up in HA mode

2018-08-29 Thread Encho Mishinev
Hi, Unfortunately the thing I described does indeed happen every time. As mentioned in the first email, I am running on Kubernetes so certain things could be different compared to just a standalone cluster. Any ideas for workarounds are welcome, as this problem basically prevents me from using

Re: JobGraphs not cleaned up in HA mode

2018-08-28 Thread vino yang
Hi Encho, From your description, I feel that there are additional bugs. About your description: *- Start both job managers* *- Start a batch job in JobManager 1 and let it finish* *The jobgraphs in both Zookeeper and HDFS remained.* Does that necessarily happen every time? In the Standalone

Re: JobGraphs not cleaned up in HA mode

2018-08-28 Thread Encho Mishinev
Hello Till, I spent a few more hours testing and looking at the logs, and it seems like there's a more general problem here. While the two job managers are active, neither of them can properly delete jobgraphs. The problem I described above comes from the fact that Kubernetes gets JobManager 1

Re: JobGraphs not cleaned up in HA mode

2018-08-28 Thread Till Rohrmann
Hi Encho, thanks a lot for reporting this issue. The problem arises whenever the old leader maintains the connection to ZooKeeper. If this is the case, then ephemeral nodes which we create to protect against faulty delete operations are not removed and consequently the new leader is not able to
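
The ephemeral lock nodes mentioned here are visible with the ZooKeeper CLI: a non-zero ephemeralOwner in the stat output means some JobManager session still holds the lock. A sketch; the child lock node name is hypothetical and the session id is illustrative:

    stat /flink/default/jobgraphs/<job-id>/<lock-node>
    # ephemeralOwner = 0x1000a3f2b470001  <- session of the (old) leader; the
    # jobgraph cannot be deleted until that session expires or disconnects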

Re: JobGraphs not cleaned up in HA mode

2018-08-28 Thread vino yang
Hi Encho, A temporary solution: you can determine whether cleanup has happened by monitoring the specific JobID under ZooKeeper's "/jobgraph" path. Another solution is to modify the source code, crudely changing the cleanup mode to a synchronous form, but the path Flink uses in ZooKeeper needs to
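
A small sketch of that monitoring idea (paths assume the defaults; the job id is a placeholder):

    # the node disappears once the cleanup has actually happened
    bin/zkCli.sh -server <zk-host>:2181 get /flink/default/jobgraphs/<job-id>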

Re: JobGraphs not cleaned up in HA mode

2018-08-27 Thread Encho Mishinev
Thank you very much for the info! Will keep track of the progress. In the meantime is there any viable workaround? It seems like HA doesn't really work due to this bug. On Tue, Aug 28, 2018 at 4:52 AM vino yang wrote: > About some implementation mechanisms. > Flink uses Zookeeper to store

Re: JobGraphs not cleaned up in HA mode

2018-08-27 Thread vino yang
Hi Encho, This is a problem already known to the Flink community; you can track its progress through FLINK-10011 [1], and currently Till is fixing this issue. [1]: https://issues.apache.org/jira/browse/FLINK-10011 Thanks, vino. Encho Mishinev wrote on Mon, Aug 27, 2018 at 10:13 PM: > I am running Flink

JobGraphs not cleaned up in HA mode

2018-08-27 Thread Encho Mishinev
I am running Flink 1.5.3 with two job managers and two task managers in Kubernetes along with HDFS and Zookeeper in high-availability mode. My problem occurs after the following actions: - Upload a .jar file to jobmanager-1 - Run a streaming job from the jar on jobmanager-1 - Wait for 1 or 2