Bhupendra Yadav created FLINK-32631:
---------------------------------------
Summary: FlinkSessionJob stuck in Created/Reconciling state
because of No Job found error in JobManager
Key: FLINK-32631
URL: https://issues.apache.org/jira/browse/FLINK-32631
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: 1.16.0
Environment: Local
Reporter: Bhupendra Yadav
{*}Background{*}: We are using FlinkSessionJob for submitting jobs to a session
cluster.
{*}Bug{*}: We frequently encounter a problem where the job gets stuck in the
CREATED/RECONCILING state. The Flink operator logs show the error
{_}Job could not be found{_}. Full trace [here|https://ideone.com/NuAyEK].
# When a Flink session job is submitted, the Flink operator submits the job to
the Flink Cluster.
# If the Flink JobManager (JM) restarts for some reason, the job may no
longer exist in the JM.
# Upon reconciliation, the Flink operator queries the JM's REST API for the
job using its jobID, but it receives a 404 error, indicating that the job is
not found.
# The operator then logs the error, and the job remains stuck in that state
indefinitely.
# Attempting to restart or suspend the job using the operator's provided
mechanisms also fails because the operator keeps calling the REST API and
receiving the same 404 error.
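The sequence above can be sketched in a few lines of Python. This is only an illustration of the reported behavior, not the operator's actual code: {{observe_job}}, {{reconcile_forever}}, {{JobNotFound}} and the injected {{http_get}} callable are hypothetical names. The only details taken from the report are that the operator looks the job up by ID over the JM's REST API ({{GET /jobs/<jobid>}}) and gets a 404 back.

```python
from typing import Callable, List, Tuple

# Hypothetical HTTP getter: takes a URL path, returns (status_code, body).
HttpGet = Callable[[str], Tuple[int, str]]


class JobNotFound(Exception):
    """Raised when the JobManager does not know the requested job ID."""


def observe_job(http_get: HttpGet, job_id: str) -> str:
    """Look the job up on the JobManager, as done on each reconciliation."""
    status, body = http_get(f"/jobs/{job_id}")
    if status == 404:
        # The JM no longer has the job (e.g. after a restart).
        raise JobNotFound(f"Job {job_id} could not be found")
    return body


def reconcile_forever(http_get: HttpGet, job_id: str, rounds: int) -> List[str]:
    """Simulate the stuck loop: each round hits the same 404 and only logs it."""
    log = []
    for _ in range(rounds):
        try:
            observe_job(http_get, job_id)
            log.append("OK")
        except JobNotFound as exc:
            # The error is logged, but nothing else changes,
            # so the next round repeats the same lookup and fails the same way.
            log.append(f"ERROR: {exc}")
    return log
```

Calling {{reconcile_forever}} with a getter that always answers 404 yields the same error line on every round, which matches what we see in the operator logs.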
{*}Expected Behavior{*}: Ideally, when the Flink operator reconciles a job and
finds that it no longer exists in the Flink Cluster, it should handle the
situation gracefully. Instead of getting stuck and logging errors indefinitely,
the operator should mark the job as failed or deleted, or set an appropriate
status for it.
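A minimal sketch of the behavior we would expect, again in Python and under stated assumptions: {{reconcile_once}} and the injected {{http_get}} callable are hypothetical, the {{FAILED}} status is just one of the options suggested above, and the {{state}} field read from the response body is an assumption about the JSON returned by {{GET /jobs/<jobid>}}.

```python
import json
from typing import Callable, Tuple

# Hypothetical HTTP getter: takes a URL path, returns (status_code, body).
HttpGet = Callable[[str], Tuple[int, str]]


def reconcile_once(http_get: HttpGet, job_id: str, current_status: str) -> str:
    """Return the next status to record for the session job."""
    status, body = http_get(f"/jobs/{job_id}")
    if status == 404:
        # The job is gone from the cluster: settle on a terminal status
        # instead of raising and retrying the same lookup forever.
        return "FAILED"
    if status == 200:
        # Assumed response shape: a JSON object with a "state" field.
        return json.loads(body).get("state", current_status)
    # Any other response is treated as transient: keep the status and retry.
    return current_status
```

With this shape, a 404 moves the job out of RECONCILING in a single pass, so a later restart or suspend request no longer depends on a lookup that can never succeed.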
--
This message was sent by Atlassian Jira
(v8.20.10#820010)