john created FLINK-25832:
----------------------------

             Summary: When the TaskManager is closed, its associated slot is 
not set to the released state.
                 Key: FLINK-25832
                 URL: https://issues.apache.org/jira/browse/FLINK-25832
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Task
    Affects Versions: 1.14.3, 1.14.2
            Reporter: john
         Attachments: image-2022-01-27-10-55-14-758.png, 
image-2022-01-27-10-55-59-119.png, image-2022-01-27-10-57-26-223.png

I deployed a standalone Flink cluster on Kubernetes with
scheduler-mode=reactive enabled. When a TaskManager is shut down, I
explicitly call the closeTaskManagerConnection method of ResourceManager.
When AdaptiveScheduler then restarts the job, it calls the cancel method of
Execution, but that method does not check whether the associated slot is
still alive. Because the TaskManager that owns the slot has already been
closed, the call ends in an RpcTimeout.

I then changed the cancel method of Execution to check whether the slot is
alive before cancelling, but repeating the steps above still triggers the
RpcTimeout. The underlying problem is that calling closeTaskManagerConnection
on the ResourceManager does not change the state of the slots allocated from
that TaskManager. I think this is a bug. We should also optimize the behavior
of cancel so that it completes faster in this situation.
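
To make the expected behavior concrete, here is a minimal sketch. It uses
hypothetical stand-in classes (Slot, TaskManagerConnection, Execution) that I
wrote for illustration; they are not Flink's real classes, and the release of
slots on close is the behavior I am proposing, not what Flink does today.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; none of these are Flink classes.
class SlotCancelSketch {

    /** Stand-in for an allocated slot with an alive/released flag. */
    static final class Slot {
        private boolean alive = true;

        boolean isAlive() { return alive; }

        void release() { alive = false; }
    }

    /** Stand-in for a TaskManager connection that tracks its slots. */
    static final class TaskManagerConnection {
        private final List<Slot> slots = new ArrayList<>();

        Slot allocateSlot() {
            Slot slot = new Slot();
            slots.add(slot);
            return slot;
        }

        /** Proposed behavior: closing the connection releases every slot. */
        void close() {
            for (Slot slot : slots) {
                slot.release();
            }
        }
    }

    /** Stand-in for an execution attempt bound to a single slot. */
    static final class Execution {
        private final Slot slot;
        private String state = "RUNNING";
        private int cancelRpcCount = 0;

        Execution(Slot slot) { this.slot = slot; }

        void cancel() {
            if (slot.isAlive()) {
                // The real system would send a cancel RPC here and wait.
                cancelRpcCount++;
                state = "CANCELING";
            } else {
                // The slot is released, so there is nobody to contact.
                state = "CANCELED";
            }
        }

        String state() { return state; }

        int cancelRpcCount() { return cancelRpcCount; }
    }

    public static void main(String[] args) {
        TaskManagerConnection tm = new TaskManagerConnection();
        Execution execution = new Execution(tm.allocateSlot());

        tm.close();         // TaskManager goes away, slots become released
        execution.cancel(); // should finish locally without an RPC

        System.out.println(execution.state()
                + ", cancel RPCs sent: " + execution.cancelRpcCount());
    }
}
```

The alive check in cancel only helps if close() really marks the slots as
released, which is the part that appears to be missing in my setup.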

!image-2022-01-27-10-55-59-119.png!

!image-2022-01-27-10-57-26-223.png!

!image-2022-01-27-10-55-14-758.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
