ashb commented on code in PR #66412:
URL: https://github.com/apache/airflow/pull/66412#discussion_r3187878965
##########
airflow-core/src/airflow/jobs/triggerer_job_runner.py:
##########
@@ -638,6 +647,16 @@ def heartbeat(self):
"TriggerRunnerSupervisor.heartbeat() requires a Job; "
"subclasses without a metadata-DB Job must override this method."
)
+ elapsed = time.monotonic() - self._last_runner_comms
+ if self.runner_health_check_threshold > 0 and elapsed > self.runner_health_check_threshold:
+ log.error(
+ "TriggerRunner subprocess event loop appears deadlocked: no communication received "
+ "for %.1fs (threshold: %ds). Skipping heartbeat so the triggerer appears unhealthy "
+ "to the scheduler and its triggers are reassigned.",
+ elapsed,
+ self.runner_health_check_threshold,
+ )
+ return
Review Comment:
By returning here, we skip updating the heartbeat in the DB, which also
means the pod check from the Helm chart (`airflow jobs check --job-type
TriggererJob --local`) will pick this up and, after the configured timeout,
mark the pod as unhealthy too.
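
To make the mechanism concrete, here is a minimal standalone sketch of the idea, not the Airflow implementation. `HeartbeatGate`, `record_comms`, `is_alive`, and `grace_s` are hypothetical names for illustration, and `latest_heartbeat` only stands in for the heartbeat timestamp stored in the DB.

```python
import time


class HeartbeatGate:
    """Hypothetical stand-in for the supervisor's health check; not Airflow code."""

    def __init__(self, threshold_s: float, clock=time.monotonic):
        self.threshold_s = threshold_s
        self._clock = clock
        # Time of the last message received from the runner subprocess.
        self.last_comms = clock()
        # Stands in for the heartbeat timestamp persisted in the DB.
        self.latest_heartbeat = None

    def record_comms(self) -> None:
        """Call whenever the runner subprocess sends anything."""
        self.last_comms = self._clock()

    def heartbeat(self) -> bool:
        """Update the heartbeat unless the runner has been silent too long."""
        elapsed = self._clock() - self.last_comms
        if self.threshold_s > 0 and elapsed > self.threshold_s:
            # Early return: the stored heartbeat is left to go stale.
            return False
        self.latest_heartbeat = self._clock()
        return True

    def is_alive(self, grace_s: float) -> bool:
        """What an external liveness check would conclude from the heartbeat."""
        if self.latest_heartbeat is None:
            return False
        return self._clock() - self.latest_heartbeat <= grace_s
```

Under these assumptions, once `heartbeat()` starts returning early, any checker that compares the stored timestamp against a timeout reports the process as dead after that timeout elapses, with no extra signalling needed.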