Re: Flink Job Manager unable to recognize Task Manager Available slots

2022-05-24 Thread Teoh, Hong
Hi Sunitha,

Without more information about your setup, I would assume you are trying to 
restore the JobManager (and HA setup) to a stable state. A couple of assumptions and questions:

  *   Since your job is cancelled, I would assume that the current job’s HA 
state is not important, so we can delete the checkpoint pointer and data.
  *   Are there other jobs running on the same cluster whose HA state you want 
to salvage?

I can think of the following options:

  1.  If there are no other jobs running on the same cluster and the HA state 
is not important, the easiest way is to completely replace your ZooKeeper 
instances. (This will start the JobManager afresh, but the HA state of all 
other jobs running on the same cluster will be lost.)
  2.  Manually clear the ZooKeeper HA state for the problematic job. This will 
keep the HA state of other jobs running on the same cluster.

To perform option 2, see below.
ZooKeeper stores “Active” jobs in a znode hierarchy as shown below (you can 
imagine this as a pseudo file system). I am assuming the job id is the one you 
pasted in your logs.


  *   /flink/default/running_job_registry/3a97d1d50f663027ae81efe0f0aa
This has the status of the job (e.g. RUNNING)

  *   /flink/default/leader/resource_manager_lock
This has the information about which JM is running the ResourceManager (the 
component responsible for registering the task slots in the cluster).

There are other znodes as well, which are all interesting (e.g. 
/flink/default/checkpoints, /flink/default/checkpoint-counter), but I’ve 
highlighted the relevant ones.
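
As a side note, the /flink/default prefix in these paths comes from the cluster’s 
HA configuration. A minimal sketch of the relevant flink-conf.yaml entries, 
assuming the default root and cluster-id values (the quorum hosts and storage 
path below are placeholders, not taken from your setup):

high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
high-availability.zookeeper.path.root: /flink    # first path segment (default)
high-availability.cluster-id: /default           # second path segment (default for standalone)
high-availability.storageDir: hdfs:///flink/ha/  # job graphs and checkpoint metadata live here; ZooKeeper only holds pointers

If you have changed path.root or cluster-id, adjust the znode paths accordingly.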

To clear this, you can simply log onto your ZooKeeper nodes and delete the 
znodes. The JobManager will repopulate them when the job starts up.

  1.  Log onto your ZooKeeper nodes (e.g. exec into your ZooKeeper container).
  2.  Run the ZooKeeper CLI. This usually comes prepackaged with ZooKeeper, 
and you can simply run the bundled script bin/zkCli.sh.

Explore the pseudo file system with ls or get (e.g. ls /flink/default).

  3.  Delete the znodes associated with your job:

rmr /flink/default/running_job_registry/3a97d1d50f663027ae81efe0f0aa
rmr /flink/default/jobgraphs/3a97d1d50f663027ae81efe0f0aa
rmr /flink/default/checkpoints/3a97d1d50f663027ae81efe0f0aa
rmr /flink/default/checkpoint-counter/3a97d1d50f663027ae81efe0f0aa
rmr /flink/default/leaderlatch/3a97d1d50f663027ae81efe0f0aa
rmr /flink/default/leader/3a97d1d50f663027ae81efe0f0aa
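
You can double-check each path with ls or get before deleting it. Also, if your 
ZooKeeper’s zkCli no longer accepts rmr (it is deprecated in newer ZooKeeper 
releases in favour of deleteall), deleteall takes the same path argument. A 
rough example, reusing the job id from your logs:

ls /flink/default/jobgraphs
get /flink/default/running_job_registry/3a97d1d50f663027ae81efe0f0aa
deleteall /flink/default/running_job_registry/3a97d1d50f663027ae81efe0f0aa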

This should result in your JobManager recovering from the faulty job.
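
Once the JobManager comes back cleanly, one way to sanity-check that the 
TaskManagers and their slots are registering again is via the Flink CLI or the 
REST API (host and port below are placeholders, assuming the default REST port 8081):

bin/flink list -a
# should list jobs without the recovery error

curl http://<jobmanager-host>:8081/overview
# the JSON response includes "taskmanagers", "slots-total" and "slots-available",
# which should match the number of TaskManagers and slots you expect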

Regards,
Hong






From: "s_penakalap...@yahoo.com" 
Date: Tuesday, 24 May 2022 at 18:40
To: User 
Subject: RE: [EXTERNAL]Flink Job Manager unable to recognize Task Manager 
Available slots


Hi Team,

Any inputs please? We are badly stuck.

Regards,
Sunitha

On Sunday, May 22, 2022, 12:34:22 AM GMT+5:30, s_penakalap...@yahoo.com wrote:


Hi All,

Help please!

We have a standalone Flink service installed on individual VMs, clubbed together 
to form a cluster with HA and checkpointing in place. When cancelling a job, the 
Flink cluster went down and is unable to start up normally, as the JobManager 
keeps going down with the below error:

2022-05-21 14:33:09,314 ERROR 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Fatal error 
occurred in the cluster entrypoint.
java.util.concurrent.CompletionException: 
org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 
3a97d1d50f663027ae81efe0f0aa.

Each attempt to restart the cluster failed with the same error, so the whole 
cluster became unrecoverable and is not operating. Please help on the below points:
1> In which Flink/ZooKeeper folder are the job recovery details stored, and how 
can we clear all old job instances so that the Flink cluster will not try to 
recover them and will start fresh for us to manually submit all jobs?

2> Since the cluster is HA, we have 2 JobManagers. Even though one JM is going 
down, Flink starts, but the available slots show up as 0 (the TaskManagers are 
up but not displayed in the web UI).

Regards
Sunitha.


