[ https://issues.apache.org/jira/browse/AIRFLOW-5589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140280#comment-17140280 ]
Daniel Cooper commented on AIRFLOW-5589:
----------------------------------------

Hey [~dimberman], thanks for getting the PR for this in. I saw you tagged the PR as in 1.10.11, so I assigned this to you and set the fix version so it isn't missed in the release notes.

> KubernetesPodOperator: Duplicate pods created on worker restart
> ---------------------------------------------------------------
>
>                 Key: AIRFLOW-5589
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5589
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: worker
>    Affects Versions: 1.10.4, 1.10.5
>            Reporter: Daniel Cooper
>            Assignee: Daniel Imberman
>            Priority: Major
>             Fix For: 1.10.11
>
>
> The KubernetesPodOperator holds state inside its execute function that
> monitors the running pod. If a worker restarts for any reason (pod death,
> pod shuffle, upgrade, etc.), this state is lost.
>
> At that point the scheduler notices (after the maximum heartbeat interval)
> that the task is now a 'zombie' (no longer monitored) and reschedules it.
>
> The new worker has no knowledge of the existing running pod and so creates
> a duplicate pod. In extreme cases this can lead to many duplicate pods for
> the same task running at once.
>
> I believe this is the problem Nicholas Brenwald (King) described having
> when running the KubernetesPodOperator on Google Composer (at the September
> meetup at King).
>
> My fix is to add enough labels to uniquely identify a running pod as
> belonging to a given task instance (dag_id, task_id, run_id). We then do a
> namespaced list of pods from Kubernetes with a label selector and monitor
> the existing pod if it exists; otherwise we create a new one as normal.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
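The deduplication idea described in the issue can be sketched roughly as below. This is a minimal illustration of the label-then-adopt pattern using plain dicts to stand in for Kubernetes pod objects; the label keys and helper names are assumptions for illustration, not Airflow's actual implementation.

```python
# Sketch: label each pod with its task-instance identity, then look up
# by those labels before creating a new pod, so a restarted worker
# adopts the existing pod instead of duplicating it.
# Pods are modeled as plain dicts here (name + labels), not real k8s objects.

def task_instance_labels(dag_id: str, task_id: str, run_id: str) -> dict:
    """Labels that uniquely identify the pod's originating task instance."""
    return {"dag_id": dag_id, "task_id": task_id, "run_id": run_id}

def find_existing_pod(pods: list, labels: dict):
    """Return the first pod whose labels match all task-instance labels,
    simulating a namespaced pod list with a label selector."""
    for pod in pods:
        if all(pod.get("labels", {}).get(k) == v for k, v in labels.items()):
            return pod
    return None

def monitor_or_create(pods: list, dag_id: str, task_id: str, run_id: str) -> dict:
    """Adopt an already-running pod if one matches; otherwise create one."""
    labels = task_instance_labels(dag_id, task_id, run_id)
    existing = find_existing_pod(pods, labels)
    if existing is not None:
        return existing  # worker-restart case: monitor, don't duplicate
    new_pod = {"name": f"{task_id}-pod", "labels": labels}
    pods.append(new_pod)
    return new_pod
```

In a real operator the lookup would be a namespaced `list` call against the Kubernetes API with a label selector built from the same three keys, so the "create" branch only runs when no pod for that task instance is alive.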