Fritz Budiyanto created FLINK-17853:
---------------------------------------
Summary: JobGraph is not getting deleted after Job cancelation
Key: FLINK-17853
URL: https://issues.apache.org/jira/browse/FLINK-17853
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.9.2
Environment: Flink 1.9.2
Zookeeper from AWS MSK
Reporter: Fritz Budiyanto
Attachments: flinkissue.txt
I have seen this issue several times: the JobGraph is not cleaned up properly
after job deletion. The job is deleted with the "flink stop" command. As a
result, the JobGraph node lingers in ZK, and when the Flink cluster is
restarted, it attempts an HA restoration from a non-existent checkpoint,
which prevents the Flink cluster from coming up.
2020-05-19 19:56:21,471 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
- Un-registering task and sending final execution state FINISHED to JobManager
for task Source: kafkaConsumer[update_server] ->
(DetectedUpdateMessageConverter -> Sink: update_server.detected_updates,
DrivenCoordinatesMessageConverter -> Sink: update_server.driven_coordinates)
588902a8096f49845b09fa1f595d6065.
2020-05-19 19:56:21,622 INFO
org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable - Free slot
TaskSlot(index:0, state:ACTIVE, resource profile:
ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647,
directMemoryInMB=2147483647, nativeMemoryInMB=2147483647,
networkMemoryInMB=2147483647, managedMemoryInMB=642}, allocationId:
29f6a5f83c832486f2d7ebe5c779fa32, jobId: 86a028b3f7aada8ffe59859ca71d6385).
2020-05-19 19:56:21,622 INFO
org.apache.flink.runtime.taskexecutor.JobLeaderService - Remove job
86a028b3f7aada8ffe59859ca71d6385 from job leader monitoring.
2020-05-19 19:56:21,622 INFO
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService -
Stopping ZooKeeperLeaderRetrievalService
/leader/86a028b3f7aada8ffe59859ca71d6385/job_manager_lock.
2020-05-19 19:56:21,623 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
- Close JobManager connection for job 86a028b3f7aada8ffe59859ca71d6385.
2020-05-19 19:56:21,624 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor
- Close JobManager connection for job 86a028b3f7aada8ffe59859ca71d6385.
2020-05-19 19:56:21,624 INFO
org.apache.flink.runtime.taskexecutor.JobLeaderService - Cannot reconnect to
job 86a028b3f7aada8ffe59859ca71d6385 because it is not registered.
...
Zookeeper CLI:
ls /flink/cluster_update/jobgraphs
[86a028b3f7aada8ffe59859ca71d6385]
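To make the symptom easier to check, here is a minimal sketch that compares the children of the jobgraphs node with the jobs that are actually running. The function name find_stale_jobgraphs and the empty running-jobs list are assumptions for illustration, not part of Flink; the inputs would have to be collected separately, for example from the ZooKeeper CLI output above and from "flink list".

```python
def find_stale_jobgraphs(zk_jobgraph_children, running_job_ids):
    """Return job IDs that have a JobGraph node in ZooKeeper but no running job."""
    # Normalize IDs so that case or stray whitespace does not cause false positives.
    running = {job_id.strip().lower() for job_id in running_job_ids}
    return sorted(
        child
        for child in zk_jobgraph_children
        if child.strip().lower() not in running
    )


# Children of /flink/cluster_update/jobgraphs, as listed by the ZooKeeper CLI above.
zk_children = ["86a028b3f7aada8ffe59859ca71d6385"]
# Assumed input: no jobs are running after "flink stop" completed.
running_jobs = []

stale = find_stale_jobgraphs(zk_children, running_jobs)
print(stale)
```

Any ID reported here is a JobGraph node that the cluster would try to recover on restart even though the job itself is gone.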
Attached are the Flink logs, in reverse order.