pankajkoti commented on code in PR #35825:
URL: https://github.com/apache/airflow/pull/35825#discussion_r1404218105


##########
docs/apache-airflow/core-concepts/tasks.rst:
##########
@@ -243,9 +243,103 @@ Zombie/Undead Tasks
 
 No system runs perfectly, and task instances are expected to die once in a while. Airflow detects two kinds of task/process mismatch:
 
-* *Zombie tasks* are tasks that are supposed to be running but suddenly died (e.g. their process was killed, or the machine died). Airflow will find these periodically, clean them up, and either fail or retry the task depending on its settings.
+* *Zombie tasks* are ``TaskInstances`` stuck in a ``running`` state despite their associated jobs being inactive
+  (e.g. their process didn't send a recent heartbeat as it got killed, or the machine died). Airflow will find these
+  periodically, clean them up, and either fail or retry the task depending on its settings.
+
+* *Undead tasks* are tasks that are *not* supposed to be running but are, often caused when you manually edit Task
+  Instances via the UI. Airflow will find them periodically and terminate them.
+
+
+Below is the code snippet from the Airflow scheduler that runs periodically to detect zombie/undead tasks.
+
+.. exampleinclude:: /../../airflow/jobs/scheduler_job_runner.py
+    :language: python
+    :start-after: [START find_zombies]
+    :end-before: [END find_zombies]
+
+
+The criteria used in the above snippet to detect zombie tasks are as follows:
+
+1. **Task Instance State**
+
+    Only task instances in the ``RUNNING`` state are considered potential zombies.
+
+.. code-block:: python
+
+  .where(TI.state == TaskInstanceState.RUNNING)
+
+2. **Job State and Heartbeat Check**
+
+    A task instance qualifies if its associated job is not in the ``RUNNING`` state or if the job's latest heartbeat is
+    earlier than the calculated time threshold (``limit_dttm``). The heartbeat is the mechanism by which a task or job
+    signals that it is still alive and running.
+
+.. code-block:: python
+
+  .where(
+    or_(
+        Job.state != JobState.RUNNING,
+        Job.latest_heartbeat < limit_dttm,
+    )
+  )
+
+3. **Job Type**
+
+    The job associated with the task must be of type ``LocalTaskJob``.
+
+.. code-block:: python
+
+  .where(Job.job_type == "LocalTaskJob")
+
+4. **Queued by Job ID**
+
+    Only tasks queued by the scheduler job that is running this check (``self.job``) are considered.
+
+.. code-block:: python
+
+  .where(TI.queued_by_job_id == self.job.id)
+
+These conditions collectively help identify running tasks that may be zombies based on their state, associated job
+state, heartbeat status, job type, and the specific job that queued them. If a task meets these criteria, it is
+considered a potential zombie, and further actions, such as logging and sending a callback request, are taken.
+
+Reproducing zombie tasks locally

Review Comment:
   yes, this was @vatsrahul1001's smart way of reproducing :) 
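
For readers skimming the thread, here is a rough, standalone sketch of how the four `.where(...)` criteria quoted in the diff combine into a single predicate. It does not use Airflow or SQLAlchemy: `JobRow`, `TaskInstanceRow` and `is_potential_zombie` are hypothetical names made up for illustration, states are plain strings, and the 300-second threshold used to derive `limit_dttm` is an assumed value, not something stated in the diff.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class JobRow:
    """Hypothetical stand-in for the job row joined to a task instance."""
    state: str
    job_type: str
    latest_heartbeat: datetime


@dataclass
class TaskInstanceRow:
    """Hypothetical stand-in for a task instance row."""
    state: str
    queued_by_job_id: int
    job: JobRow


def is_potential_zombie(ti: TaskInstanceRow, scheduler_job_id: int,
                        limit_dttm: datetime) -> bool:
    """Apply the four criteria from the quoted snippet to one row."""
    return (
        # 1. Only running task instances are candidates.
        ti.state == "running"
        # 2. The job is no longer running, or its heartbeat is stale.
        and (ti.job.state != "running" or ti.job.latest_heartbeat < limit_dttm)
        # 3. The job must be a LocalTaskJob.
        and ti.job.job_type == "LocalTaskJob"
        # 4. The task was queued by the scheduler job doing the check.
        and ti.queued_by_job_id == scheduler_job_id
    )


# Fixed "now" so the example is deterministic.
now = datetime(2023, 11, 24, 12, 0, tzinfo=timezone.utc)
# Assumed threshold of 300 seconds; the real value comes from configuration.
limit_dttm = now - timedelta(seconds=300)

healthy = TaskInstanceRow(
    "running", 1, JobRow("running", "LocalTaskJob", now - timedelta(seconds=10)))
stale = TaskInstanceRow(
    "running", 1, JobRow("running", "LocalTaskJob", now - timedelta(minutes=10)))
dead_job = TaskInstanceRow(
    "running", 1, JobRow("failed", "LocalTaskJob", now - timedelta(seconds=10)))
other_scheduler = TaskInstanceRow(
    "running", 2, JobRow("running", "LocalTaskJob", now - timedelta(minutes=10)))
```

With these rows, `stale` and `dead_job` match the predicate for scheduler job `1`, while `healthy` (recent heartbeat) and `other_scheduler` (queued by a different job) do not.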


