[ https://issues.apache.org/jira/browse/AIRFLOW-5589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140280#comment-17140280 ]

Daniel Cooper commented on AIRFLOW-5589:
----------------------------------------

Hey [~dimberman], thanks for getting the PR for this in. I saw you tagged the 
PR as in 1.10.11, so I assigned this to you and set the fix version so it 
isn't missed in the release notes.

> KubernetesPodOperator: Duplicate pods created on worker restart
> ---------------------------------------------------------------
>
>                 Key: AIRFLOW-5589
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5589
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: worker
>    Affects Versions: 1.10.4, 1.10.5
>            Reporter: Daniel Cooper
>            Assignee: Daniel Imberman
>            Priority: Major
>             Fix For: 1.10.11
>
>
> K8sPodOperator holds state within its execute function that monitors the 
> running pod. If a worker restarts for any reason (pod death, pod shuffle, 
> upgrade, etc.), that state is lost.
> At that point the scheduler notices (after the max heartbeat interval) that 
> the task is a 'zombie' (no longer monitored) and reschedules it.
> The new worker has no knowledge of the existing running pod, so it creates 
> a duplicate. In extreme cases this can lead to many duplicate pods for the 
> same task running concurrently.
> I believe this is the problem Nicholas Brenwald (King) described when 
> running the k8s pod operator on Google Composer (at the September meetup at 
> King).
> My fix is to add enough labels to uniquely identify a running pod as 
> belonging to a given task instance (dag_id, task_id, run_id). We then do a 
> namespaced list of pods from k8s with a label selector and monitor the 
> existing pod if one exists; otherwise we create a new one as normal.
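The label-selector reconciliation described above could be sketched roughly as follows. All names here are illustrative, not the actual operator code: `v1` stands in for a `kubernetes.client.CoreV1Api` instance, and the real change lives inside KubernetesPodOperator.

```python
def task_instance_labels(dag_id, task_id, run_id):
    # Enough labels to uniquely identify the pod's owning task instance.
    return {"dag_id": dag_id, "task_id": task_id, "run_id": run_id}


def build_label_selector(labels):
    # Kubernetes label-selector syntax: "key1=val1,key2=val2".
    return ",".join(f"{k}={v}" for k, v in sorted(labels.items()))


def find_or_create_pod(v1, namespace, pod_spec, dag_id, task_id, run_id):
    labels = task_instance_labels(dag_id, task_id, run_id)
    selector = build_label_selector(labels)
    # Namespaced list with a label selector: any match is a pod a previous
    # worker already launched for this exact task instance.
    existing = v1.list_namespaced_pod(namespace, label_selector=selector).items
    if existing:
        return existing[0]                   # adopt and monitor the old pod
    pod_spec.metadata.labels.update(labels)  # tag the new pod for later lookup
    return v1.create_namespaced_pod(namespace, pod_spec)
```

On a worker restart, the rescheduled task lists pods with the same selector, finds the still-running pod from the previous attempt, and resumes monitoring it instead of launching a duplicate.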



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
