The chk-* directory is not found. I think it is missing because the jobmanager
removes it automatically, but why is it still in ZooKeeper?
----
From: "Vijay Bhaskar"
-- Original Message --
> *From:* "vino yang";
> *Sent:* Thursday, November 28, 2019, 7:17 PM
> *To:* "曾祥才";
> *Cc:* "Vijay Bhaskar"; "User-Flink"<
> user@flink.apache.org>;
> *Subject:* Re: JobGraphs not cleaned up in HA mode
>
> Hi,
>
> Why
Hi, is there any difference? For me, using NAS is currently more convenient
for testing.
From the docs it seems HDFS, S3, NFS, etc. will all be fine.
----
From: "vino yang"
> *To:* "Vijay Bhaskar";
> *Cc:* "User-Flink";
> *Subject:* Re: JobGraphs not cleaned up in HA mode
>
> the config (/flink is the NAS directory):
>
> jobmanager.rpc.address: flink-jobmanager
> taskmanager.numberOfTaskSlots: 16
> web.upload.dir: /flink/webUpload
> blob.server.port:
Does anyone have the same problem? Please help, thanks.
----
From: "曾祥才"
the config (/flink is the NAS directory):
jobmanager.rpc.address: flink-jobmanager
taskmanager.numberOfTaskSlots: 16
web.upload.dir: /flink/webUpload
blob.server.port: 6124
jobmanager.rpc.port: 6123
taskmanager.rpc.port: 6122
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
--
> *From:* "Vijay Bhaskar";
> *Sent:* Thursday, November 28, 2019, 3:05 PM
> *To:* "曾祥才";
> *Subject:* Re: JobGraphs not cleaned up in HA mode
>
> Again it could not find the state store file: "Caused by:
> java.io.FileNotFoundException: /flink/ha/submittedJob
If I clean the ZooKeeper data, it runs fine. But the next time the
jobmanager fails and is redeployed, the error occurs again.
----
From: "Vijay Bhaskar"
is an external persistent store (a NAS directory mounted
> to the job manager)
>
>
>
>
> -- Original Message --
> *From:* "Vijay Bhaskar";
> *Sent:* Thursday, November 28, 2019, 2:29 PM
> *To:* "曾祥才";
> *Cc:* "user";
> *Subject:* Re: JobGraph
/flink/checkpoints is an external persistent store (a NAS directory
mounted to the job manager)
----
From: "Vijay Bhaskar"
The following are the mandatory conditions to run in HA:
a) You should have a persistent common external store that the jobmanager and task
managers can write their state to
b) You should have a persistent external store for ZooKeeper to store the
JobGraph.
Zookeeper is referring path:
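The two persistent stores above map onto Flink configuration keys. A minimal sketch of the HA section of flink-conf.yaml — the quorum and paths below are placeholders, not values from this thread:

```yaml
# Illustrative HA keys (placeholder values)
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
# Condition (a): a store reachable by all jobmanagers and taskmanagers
# (HDFS, S3, or an NFS/NAS mount)
high-availability.storageDir: /flink/ha
high-availability.cluster-id: /default
# Checkpoint data also needs a shared, persistent location
state.checkpoints.dir: /flink/checkpoints
```

ZooKeeper itself only stores pointers (and the JobGraph handles); the bulk of the state lives in `storageDir`, which is why both stores must survive jobmanager restarts.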
Hi, I have the same problem with Flink 1.9.1. Is there any solution to fix it?
When k8s redeploys the jobmanager, the error looks like the following (it seems
ZooKeeper does not remove the submitted job info, but the jobmanager removes the file):
Caused by: org.apache.flink.util.FlinkException: Could not retrieve
submitted JobGraph from
Hi Encho,
thanks for sending the first part of the logs. What I would actually be
interested in are the complete logs because somewhere in the jobmanager-2
logs there must be a log statement saying that the respective dispatcher
gained leadership. I would like to see why this happens but for this
Hi Till,
I will use the approach with a k8s deployment and HA mode with a single job
manager. Nonetheless, here are the logs I just produced by repeating the
aforementioned experiment, hope they help in debugging:
*- Starting Jobmanager-1:*
Starting Job Manager
sed: cannot rename
Hi Encho,
it sounds strange that the standby JobManager tries to recover a submitted
job graph. This should only happen if it has been granted leadership. Thus,
it seems as if the standby JobManager thinks that it is also the leader.
Could you maybe share the logs of the two
Hello,
Since two job managers don't seem to be working for me, I was thinking of
just using a single job manager in Kubernetes in HA mode with a deployment
ensuring its restart whenever it fails. Is this approach viable? The
High-Availability page mentions that you use only one job manager in an
Hi,
Unfortunately the thing I described does indeed happen every time. As
mentioned in the first email, I am running on Kubernetes so certain things
could be different compared to just a standalone cluster.
Any ideas for workarounds are welcome, as this problem basically prevents
me from using
Hi Encho,
From your description, I suspect there are additional bugs.
About your description:
*- Start both job managers*
*- Start a batch job in JobManager 1 and let it finish*
*The jobgraphs in both Zookeeper and HDFS remained.*
Is it necessarily happening every time?
In the Standalone
Hello Till,
I spent a few more hours testing and looking at the logs, and it seems like
there's a more general problem here. While the two job managers are active
neither of them can properly delete jobgraphs. The above problem I
described comes from the fact that Kubernetes gets JobManager 1
Hi Encho,
thanks a lot for reporting this issue. The problem arises whenever the old
leader maintains the connection to ZooKeeper. If this is the case, then
ephemeral nodes which we create to protect against faulty delete operations
are not removed and consequently the new leader is not able to
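The guard described above can be modeled in a few lines. This is a toy sketch, not Flink's actual code: each jobgraph node carries ephemeral lock children created by connected leaders, and a delete only succeeds once no lock children remain, which is exactly why a still-connected old leader blocks the new leader's cleanup.

```python
# Toy model of the ephemeral-lock scheme (illustrative, not Flink's code).
class ToyZk:
    def __init__(self):
        self.nodes = {}  # jobgraph node -> set of leaders holding ephemeral locks

    def add_jobgraph(self, job_id, leader):
        # A leader that recovers/owns the job creates an ephemeral lock child.
        self.nodes.setdefault(job_id, set()).add(leader)

    def release_locks(self, leader):
        # Ephemeral nodes vanish when the creator's ZooKeeper session closes.
        for locks in self.nodes.values():
            locks.discard(leader)

    def try_delete(self, job_id):
        # Delete is refused while any ephemeral lock child still exists.
        if self.nodes.get(job_id):
            return False
        self.nodes.pop(job_id, None)
        return True

zk = ToyZk()
zk.add_jobgraph("job-1", leader="jobmanager-1")
assert not zk.try_delete("job-1")   # old leader's session still open: blocked
zk.release_locks("jobmanager-1")    # session expires, ephemeral lock disappears
assert zk.try_delete("job-1")       # new leader can now clean up
```

The failure mode in this thread corresponds to `release_locks` never happening because the old jobmanager keeps its ZooKeeper connection alive.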
Hi Encho,
A temporary workaround is to determine whether cleanup happened by
monitoring the specific JobID under ZooKeeper's "/jobgraph" path.
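One way to do that monitoring is with the ZooKeeper CLI. The exact path depends on your `high-availability.zookeeper.path.root` and cluster-id; `/flink/default/jobgraphs` is the usual default layout, so adjust to your setup:

```shell
# List jobgraph entries still registered in ZooKeeper
# (path assumes the default root /flink and cluster-id "default")
echo "ls /flink/default/jobgraphs" | ./zkCli.sh -server zk-host:2181
```

If a JobID lingers there after the job finished and the corresponding file is gone from the HA storage directory, you are hitting the cleanup problem discussed in this thread.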
Another solution is to modify the source code, forcibly changing the cleanup to
the synchronous form, but Flink's operation on ZooKeeper's path needs to
Thank you very much for the info! Will keep track of the progress.
In the meantime is there any viable workaround? It seems like HA doesn't
really work due to this bug.
On Tue, Aug 28, 2018 at 4:52 AM vino yang wrote:
> About some implementation mechanisms.
> Flink uses Zookeeper to store
Hi Encho,
This is a problem already known to the Flink community; you can track its
progress through FLINK-10011 [1]. Till is currently fixing this issue.
[1]: https://issues.apache.org/jira/browse/FLINK-10011
Thanks, vino.
Encho Mishinev wrote on Monday, August 27, 2018, 10:13 PM:
> I am running Flink
I am running Flink 1.5.3 with two job managers and two task managers in
Kubernetes along with HDFS and Zookeeper in high-availability mode.
My problem occurs after the following actions:
- Upload a .jar file to jobmanager-1
- Run a streaming job from the jar on jobmanager-1
- Wait for 1 or 2