xingbe created FLINK-31652:
------------------------------

             Summary: Flink should handle the delete event if the pod was deleted while pending
                 Key: FLINK-31652
                 URL: https://issues.apache.org/jira/browse/FLINK-31652
             Project: Flink
          Issue Type: Bug
          Components: Deployment / Kubernetes
    Affects Versions: 1.16.1, 1.17.0
            Reporter: xingbe


I found that in a Kubernetes deployment, if a TaskManager pod is deleted while 
it is in the 'Pending' phase, the Flink job gets stuck and keeps waiting for 
the pod to be scheduled. The issue can be reproduced by running 'kubectl delete 
pod' against the pod while it is still pending.
 
The root cause is that the pod status is not updated in time in this case, so 
KubernetesResourceManagerDriver does not detect that the pod has terminated. 
I verified this by logging the pod status in KubernetesPod#isTerminated(), 
as follows.
{code:java}
public boolean isTerminated() {
    log.info("pod status: " + getInternalResource().getStatus());
    if (getInternalResource().getStatus() != null) {
        final boolean podFailed =
                PodPhase.Failed.name().equals(getInternalResource().getStatus().getPhase());
        final boolean containersFailed =
                getInternalResource().getStatus().getContainerStatuses().stream()
                        .anyMatch(
                                e ->
                                        e.getState() != null
                                                && e.getState().getTerminated() != null);
        return containersFailed || podFailed;
    }
    return false;
}
{code}
In this case, the function returns false because both `containersFailed` and 
`podFailed` are false: the logged status below shows phase=Pending and an empty 
containerStatuses list.
{code:java}
PodStatus(conditions=[PodCondition(lastProbeTime=null, 
lastTransitionTime=2023-03-28T12:35:10Z, reason=Unschedulable, status=False, 
type=PodScheduled, additionalProperties={})], containerStatuses=[], 
ephemeralContainerStatuses=[], hostIP=null, initContainerStatuses=[], 
message=null, nominatedNodeName=null, phase=Pending, podIP=null, podIPs=[], 
qosClass=Guaranteed, reason=null, startTime=null, additionalProperties={}){code}
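
To make the failure mode concrete, here is a minimal, self-contained sketch of the same decision logic. It does not use the real Flink or fabric8 classes: `PendingPodDemo`, `Status`, and `shouldTreatAsTerminated` are hypothetical names introduced only for illustration. The last method shows one possible direction (treating the DELETED watch event itself as a termination signal); it is an assumption about how this could be handled, not the actual fix.

```java
import java.util.Collections;
import java.util.List;

public class PendingPodDemo {

    /** Hypothetical stand-in for the pod status; holds only the fields the check reads. */
    static final class Status {
        final String phase;
        // One flag per container status: true if that container has a terminated state.
        final List<Boolean> containerTerminated;

        Status(String phase, List<Boolean> containerTerminated) {
            this.phase = phase;
            this.containerTerminated = containerTerminated;
        }
    }

    /** Mirrors the decision logic of the isTerminated() snippet in this report. */
    static boolean isTerminated(Status status) {
        if (status == null) {
            return false;
        }
        boolean podFailed = "Failed".equals(status.phase);
        boolean containersFailed = status.containerTerminated.stream().anyMatch(t -> t);
        return containersFailed || podFailed;
    }

    /** Hypothetical alternative: a DELETED event counts as termination regardless of status. */
    static boolean shouldTreatAsTerminated(Status status, boolean deletedEvent) {
        return deletedEvent || isTerminated(status);
    }

    public static void main(String[] args) {
        // Same shape as the logged status: phase=Pending, containerStatuses=[].
        Status pending = new Status("Pending", Collections.emptyList());
        System.out.println(isTerminated(pending));                  // false
        System.out.println(shouldTreatAsTerminated(pending, true)); // true
    }
}
```

With the status-only check, a pod deleted while pending never looks terminated, which matches the stuck job described above. Taking the event type into account sidesteps the stale status.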
 
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
