[ 
https://issues.apache.org/jira/browse/FLINK-17853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119634#comment-17119634
 ] 

Fritz Budiyanto commented on FLINK-17853:
-----------------------------------------

Thanks. We will migrate to 1.10. Feel free to close this ticket. I'll re-open 
if it is still happening in 1.10.

> JobGraph is not getting deleted after Job cancelation
> -----------------------------------------------------
>
>                 Key: FLINK-17853
>                 URL: https://issues.apache.org/jira/browse/FLINK-17853
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.2
>         Environment: Flink 1.9.2
> Zookeeper from AWS MSK
>            Reporter: Fritz Budiyanto
>            Priority: Major
>         Attachments: flinkissue.txt
>
>
> I have seen this issue several times: JobGraphs are not cleaned up 
> properly after job deletion. Jobs are deleted using the "flink stop" 
> command. As a result, the JobGraph node lingers in ZooKeeper, and when the 
> Flink cluster is restarted, it attempts HA recovery from a non-existent 
> checkpoint, which prevents the Flink cluster from coming up.
> 2020-05-19 19:56:21,471 INFO 
> org.apache.flink.runtime.taskexecutor.TaskExecutor - Un-registering task and 
> sending final execution state FINISHED to JobManager for task Source: 
> kafkaConsumer[update_server] -> (DetectedUpdateMessageConverter -> Sink: 
> update_server.detected_updates, DrivenCoordinatesMessageConverter -> Sink: 
> update_server.driven_coordinates) 588902a8096f49845b09fa1f595d6065.
> 2020-05-19 19:56:21,622 INFO 
> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable - Free slot 
> TaskSlot(index:0, state:ACTIVE, resource profile: 
> ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, 
> directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, 
> networkMemoryInMB=2147483647, managedMemoryInMB=642}, allocationId: 
> 29f6a5f83c832486f2d7ebe5c779fa32, jobId: 86a028b3f7aada8ffe59859ca71d6385).
> 2020-05-19 19:56:21,622 INFO 
> org.apache.flink.runtime.taskexecutor.JobLeaderService - Remove job 
> 86a028b3f7aada8ffe59859ca71d6385 from job leader monitoring.
> 2020-05-19 19:56:21,622 INFO 
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - 
> Stopping ZooKeeperLeaderRetrievalService 
> /leader/86a028b3f7aada8ffe59859ca71d6385/job_manager_lock.
> 2020-05-19 19:56:21,623 INFO 
> org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager 
> connection for job 86a028b3f7aada8ffe59859ca71d6385.
> 2020-05-19 19:56:21,624 INFO 
> org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager 
> connection for job 86a028b3f7aada8ffe59859ca71d6385.
> 2020-05-19 19:56:21,624 INFO 
> org.apache.flink.runtime.taskexecutor.JobLeaderService - Cannot reconnect to 
> job 86a028b3f7aada8ffe59859ca71d6385 because it is not registered.
> ...
> Zookeeper CLI:
> ls /flink/cluster_update/jobgraphs
> [86a028b3f7aada8ffe59859ca71d6385]
>  
> Attached are the Flink logs, in reverse order.
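
As a rough illustration of the check described in the report (not part of the original issue), the Python sketch below compares the children of the jobgraphs znode against the IDs of jobs still known to be running and reports the ones left behind. The function name and both inputs are hypothetical: in practice the znode list could be copied from the ZooKeeper CLI output quoted above, and the running job IDs gathered separately, for example from "flink list".

```python
def find_lingering_jobgraphs(zk_children, running_job_ids):
    """Return job IDs that have a jobgraph znode but no running job.

    Both arguments are plain iterables of job ID strings; this helper
    does not talk to ZooKeeper or Flink itself (assumption: the caller
    collects the IDs by other means).
    """
    running = {job_id.strip().lower() for job_id in running_job_ids}
    normalized = (child.strip().lower() for child in zk_children)
    return sorted(c for c in normalized if c and c not in running)


# Job ID taken from the ZooKeeper listing in the report; the job had
# already been stopped, so no running job IDs are supplied.
zk_children = ["86a028b3f7aada8ffe59859ca71d6385"]
running_job_ids = []

lingering = find_lingering_jobgraphs(zk_children, running_job_ids)
print(lingering)
```

Any ID returned here is only a candidate for investigation. Whether such a node can be removed safely depends on the state of the cluster, which the report does not establish.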



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
