David Tan created AIRFLOW-6354: ---------------------------------- Summary: Intermittent warnings: "Processor for some/dag.py exited with return code -9" Key: AIRFLOW-6354 URL: https://issues.apache.org/jira/browse/AIRFLOW-6354 Project: Apache Airflow Issue Type: Bug Components: DAG Affects Versions: 1.10.5 Reporter: David Tan
Hi there, Thank you for your work on Airflow. We've observed some intermittent errors with {{DagFileProcessorManager}} and we would like to request your assistance in understanding the error. *Context* We've provisioned an instance on Airflow on a Kubernetes cluster using the bitnami airflow helm chart. We get the following warning every few hours (and sometimes every few minutes - we didn't find any pattern to this). {code:java} # most times, we see these healthy info logs [2019-12-20 11:49:27,956] {dag_processing.py:892} INFO - Searching for files in /opt/bitnami/airflow/dags [2019-12-20 11:49:28,895] {dag_processing.py:895} INFO - There are 2 files in /opt/bitnami/airflow/dags [2019-12-20 11:54:29,749] {dag_processing.py:892} INFO - Searching for files in /opt/bitnami/airflow/dags [2019-12-20 11:54:31,798] {dag_processing.py:895} INFO - There are 2 files in /opt/bitnami/airflow/dags ... # but at random intervals, we see this error logs saying that a processor for a particular dag has timed out (after 30 seconds), and processor exits with an error code -9. [2019-12-20 11:56:04,814] {dag_processing.py:1229} INFO - Processor for /opt/bitnami/airflow/dags/multi_producer_dags.py with PID 19537 started at 2019-12-20T11:55:34.728282+00:00 has timed out, killing it. [2019-12-20 11:56:04,822] {dag_processing.py:1143} WARNING - Processor for /opt/bitnami/airflow/dags/multi_producer_dags.py exited with return code -9. {code} We've traced through the code path in airflow/utils/dag_processing.py which issues both the healthy log events and unhealthy log events: {code:java} # healthy info logs DagFileProcessorManager#start() -> self._refresh_dag_dir() -> log.info("Searching for files...") and log.info("There are N files in ...") # timeout error logs DagFileProcessorManager#start() -> self.heartbeat() -> self.collect_results() -> _kill_timed_out_processors() -> log.error("Processor for [/some/dag.py] with [pid] started at [time] has timed out, killing it") and processor.kill() {code} Basically, when the processor manager loop runs every 5 minutes, {{_kill_timed_out_processors() }}is called without any problems. But sometimes processor durations takes longer than 30 seconds. *Could you please help us understand if this intermittent timeouts in processor is something that is to be expected? If it is, we will ignore this warning log in our alerts.* -- This message was sent by Atlassian Jira (v8.3.4#803005)