Hi Sunitha,
Without more information about your setup, I would assume you are trying to
return the JobManager (and HA setup) to a stable state. A couple of assumptions
and questions:
* Since your job is cancelled, I assume the current job's HA state is not
important, so we can delete the checkpoint pointer and data.
* Are there other jobs running on the same cluster whose HA state you want to
salvage?
I can think of the following options:
1. If there are no other jobs running on the same cluster and the HA state is
not important, the easiest way is to completely replace your Zookeeper
instances. (This starts the JobManager afresh, but the HA state of any other
jobs running on the same cluster will be lost.)
2. Manually clear the Zookeeper HA state for the problematic job only. This
keeps the HA state of other jobs running on the same cluster.
To perform option 2, see below:
Zookeeper stores "active" jobs in a znode hierarchy, shown below (you can think
of it as a pseudo file system). I am using the job id from the logs you pasted.
* /flink/default/running_job_registry/3a97d1d50f663027ae81efe0f0aa
This has the status of the job (e.g. RUNNING)
* /flink/default/leader/resource_manager_lock
This has the information about which JM holds the ResourceManager (the
component responsible for registering the task slots in the cluster).
There are other znodes as well, which are all interesting (e.g.
/flink/default/checkpoints, /flink/default/checkpoint-counter), but I’ve
highlighted the relevant ones.
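For reference, the /flink/default prefix in those paths comes from the HA
configuration. A minimal flink-conf.yaml sketch, with placeholder host names
(double check the exact key names against your Flink version):
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
    high-availability.zookeeper.path.root: /flink    # first path component
    high-availability.cluster-id: /default           # second path component
    high-availability.storageDir: hdfs:///flink/ha/  # job graph/checkpoint metadata
Zookeeper mostly stores pointers; the actual job graph and checkpoint metadata
live under high-availability.storageDir.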
To clear this, you can simply log onto your zookeeper nodes and delete the
znodes. The JobManager will repopulate them when the job starts up.
1. Log onto your zookeeper nodes (e.g. exec into your zookeeper container)
2. Run the zookeeper CLI. This usually comes prepackaged with zookeeper, and
you can simply run the bundled script bin/zkCli.sh.
Explore the pseudo file system with ls or get (e.g. ls /flink/default ); see
the example session after the commands below.
3. Delete the znodes associated with your job
rmr /flink/default/running_job_registry/3a97d1d50f663027ae81efe0f0aa
rmr /flink/default/jobgraphs/3a97d1d50f663027ae81efe0f0aa
rmr /flink/default/checkpoints/3a97d1d50f663027ae81efe0f0aa
rmr /flink/default/checkpoint-counter/3a97d1d50f663027ae81efe0f0aa
rmr /flink/default/leaderlatch/3a97d1d50f663027ae81efe0f0aa
rmr /flink/default/leader/3a97d1d50f663027ae81efe0f0aa
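As a rough sketch of such a session (the host name is a placeholder, and the
paths assume the default /flink root and /default cluster-id shown above),
first connect with
    bin/zkCli.sh -server zk-1:2181
then, at the zkCli prompt, inspect before deleting anything:
    ls /flink/default
    ls /flink/default/jobgraphs
    get /flink/default/running_job_registry/3a97d1d50f663027ae81efe0f0aa
Note that on Zookeeper 3.5+ the rmr command is deprecated, so the delete
commands become e.g.:
    deleteall /flink/default/jobgraphs/3a97d1d50f663027ae81efe0f0aa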
This should result in your JobManager recovering from the faulty job.
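Once the znodes are gone, restart the JobManager process(es) as you normally
would for your deployment. For a standalone cluster that is usually something
along the lines of (adjust to your own scripts or service manager):
    bin/jobmanager.sh stop
    bin/jobmanager.sh start
and then resubmit the cancelled job manually if you still need it.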
Regards,
Hong
From: "s_penakalap...@yahoo.com"
Date: Tuesday, 24 May 2022 at 18:40
To: User
Subject: RE: [EXTERNAL]Flink Job Manager unable to recognize Task Manager
Available slots
Hi Team,
Any inputs please, I am badly stuck.
Regards,
Sunitha
On Sunday, May 22, 2022, 12:34:22 AM GMT+5:30, s_penakalap...@yahoo.com
wrote:
Hi All,
Help please!
We have a standalone Flink service installed on individual VMs, clubbed
together to form a cluster, with HA and checkpointing in place. When cancelling
a job, the Flink cluster went down and is unable to start up normally, as the
Job Manager keeps going down with the below error:
2022-05-21 14:33:09,314 ERROR
org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Fatal error
occurred in the cluster entrypoint.
java.util.concurrent.CompletionException:
org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id
3a97d1d50f663027ae81efe0f0aa.
Each attempt to restart the cluster failed with the same error, so the whole
cluster became unrecoverable and is not operating. Please help with the below
points:
1> In which Flink/zookeeper folder are the job recovery details stored, and how
can we clear all old job instances so that the Flink cluster does not try to
recover them and instead starts fresh, letting us manually resubmit all jobs?
2> Since the cluster is HA, we have 2 Job Managers; even though one JM is going
down, Flink starts up, but the available slots show as 0 (the task managers are
up but not displayed in the web UI).
Regards
Sunitha.