shaurya-sood opened a new issue, #28201: URL: https://github.com/apache/airflow/issues/28201
### Apache Airflow version

Other Airflow 2 version (please specify below): 2.4.3

### What happened

- Tasks get `SIGTERM` once a huge DAG is triggered (a DAG with 100+ parallel tasks) and go into `UP_FOR_RETRY`/`FAILED` after retrying.
- The `scheduler_heartbeat` metric drops very low (0-0.05) during the same window.
- CPU utilization of the metadata database spikes to 100%.

### Airflow Logs

```
[2022-12-07, 15:37:49 UTC] {local_task_job.py:223} WARNING - State of this instance has been externally set to up_for_retry. Terminating instance.
[2022-12-07, 15:37:49 UTC] {process_utils.py:133} INFO - Sending Signals.SIGTERM to group 89412. PIDs of all processes in the group: [89412]
[2022-12-07, 15:37:49 UTC] {process_utils.py:84} INFO - Sending the signal Signals.SIGTERM to group 89412
[2022-12-07, 15:37:49 UTC] {taskinstance.py:1562} ERROR - Received SIGTERM. Terminating subprocesses.
```

### Message on the UI

`The scheduler does not appear to be running. Last heartbeat was received 30 seconds ago. The DAGs list may not update, and new tasks will not be scheduled.`

### Meta database CPU Utilization

![Screenshot 2022-12-07 at 19 18 20](https://user-images.githubusercontent.com/19922777/206263962-78a31497-a5ba-4b51-909e-644ea973f870.png)

### What you think should happen instead

Tasks should execute successfully without receiving any SIGTERM signal.

### How to reproduce

_No response_

### Operating System

Linux

### Versions of Apache Airflow Providers

_No response_

### Deployment

Official Apache Airflow Helm Chart

### Deployment details

- Apache Airflow version: 2.4.3
- Executor: Celery
- Airflow metadata database: Postgres (`db.r6g.large` RDS instance)

Config:

```
config:
  core:
    dag_discovery_safe_mode: false
    hostname_callable: airflow.utils.net.get_host_ip_address
    parallelism: 300
    max_active_tasks_per_dag: 30
    dagbag_import_timeout: 90
    killed_task_cleanup_time: 604800
    min_serialized_dag_update_interval: 300
  celery:
    sync_parallelism: 1
    worker_concurrency: 10
  scheduler:
    dag_dir_list_interval: 300
    min_file_process_interval: 300
    parsing_processes: 2
    schedule_after_task_execution: false
    job_heartbeat_sec: 20
```

### Anything else
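No reproduction steps were provided, but a DAG of the shape described above (100+ independent tasks, triggered manually) would look roughly like the following sketch. The `dag_id`, task names, and sleep bodies are illustrative assumptions, not the reporter's actual code:

```python
# Hypothetical reproduction sketch only: dag_id, task names, and the sleep
# body are assumptions. With the config above, max_active_tasks_per_dag: 30
# caps how many of these 120 tasks run at once.
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="many_parallel_tasks",
    start_date=datetime(2022, 12, 1),
    schedule=None,  # triggered manually, as described above
    catchup=False,
) as dag:
    for i in range(120):  # "100+ parallel tasks", no inter-task dependencies
        PythonOperator(
            task_id=f"task_{i}",
            python_callable=lambda: time.sleep(60),
        )
```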
Worker logs from around the failures, newest entries first. The first block shows the remote-logging side: a 403 from S3 while trying to verify the previous task log:

```
[2022-12-06 14:35:46,653: INFO/ForkPoolWorker-7] Task airflow.executors.celery_executor.execute_command[b5552af6-76cf-4d55-a300-ba0351bf7b45] succeeded in 42.28011583001353s: None
[2022-12-06 14:35:46,560: INFO/ForkPoolWorker-7] Using connection ID 'S3_default' for task execution.
[2022-12-06 14:35:46,506: ERROR/ForkPoolWorker-7] Could not verify previous log to append
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/log/s3_task_handler.py", line 167, in s3_write
    if append and self.s3_log_exists(remote_log_location):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/log/s3_task_handler.py", line 133, in s3_log_exists
    return self.hook.check_for_key(remote_log_location)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 64, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 92, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 479, in check_for_key
    obj = self.head_object(key, bucket_name)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 64, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 92, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 466, in head_object
    raise e
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 461, in head_object
    return self.get_conn().head_object(Bucket=bucket_name, Key=key)
  File "/home/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 515, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 934, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
[2022-12-06 14:35:46,191: INFO/ForkPoolWorker-7] AWS Connection (conn_id='S3_default', conn_type='s3') credentials retrieved from extra.
[2022-12-06 14:35:46,190: WARNING/ForkPoolWorker-7] /home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/utils/connection_wrapper.py:8: DeprecationWarning: AWS Connection (conn_id='S3_default', conn_type='s3') has connection type 's3', which has been replaced by connection type 'aws'. Please update your connection to have `conn_type='aws'`.
[2022-12-06 14:35:46,188: INFO/ForkPoolWorker-7] Using connection ID 'S3_default' for task execution.
[2022-12-06 14:35:46,130: INFO/ForkPoolWorker-7] Using connection ID 'S3_default' for task execution.
```
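The `Could not verify previous log to append` error above is the S3 task handler failing a `HeadObject` call before appending task logs; it is log-shipping noise rather than the task failure itself. A quick standalone way to exercise the same code path is sketched below; the bucket, key, and connection ID are placeholders, not values from this report:

```python
# Standalone check of the same HeadObject path the S3 log handler uses.
# Bucket, key, and conn_id are placeholders; substitute your own.
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

hook = S3Hook(aws_conn_id="S3_default")
try:
    exists = hook.check_for_key(key="path/to/task.log", bucket_name="my-log-bucket")
    print(f"HeadObject allowed; key exists: {exists}")
except Exception as exc:  # a 403 surfaces here as botocore.exceptions.ClientError
    print(f"HeadObject failed: {exc}")
```

A 403 from this call usually means the remote-logging credentials lack read permission on the log bucket, which would explain the repeated `Could not verify previous log to append` entries.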
The next block, about half a minute earlier, shows the Celery command failure for task `9f2ecb02`, another copy of the remote-logging 403, and a database stats line reporting zero throughput:

```
2022-12-06 14:35:43.058 UTC [1] LOG stats: 0 xacts/s, 0 queries/s, in 0 B/s, out 0 B/s, xact 0 us, query 0 us, wait 0 us
[2022-12-06 14:35:12,120: ERROR/ForkPoolWorker-7] Task airflow.executors.celery_executor.execute_command[9f2ecb02-09c0-40cf-bb6d-f1bef4abb879] raised unexpected: AirflowException('Celery command failed on host: with celery_task_id 9f2ecb02-09c0-40cf-bb6d-f1bef4abb879')
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/app/trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/celery/app/trace.py", line 734, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/celery_executor.py", line 96, in execute_command
    _execute_in_fork(command_to_exec, celery_task_id)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/celery_executor.py", line 111, in _execute_in_fork
    raise AirflowException(msg)
airflow.exceptions.AirflowException: Celery command failed on host: with celery_task_id 9f2ecb02-09c0-40cf-bb6d-f1bef4abb879
[2022-12-06 14:35:12,057: INFO/ForkPoolWorker-7] Using connection ID 'S3_default' for task execution.
[2022-12-06 14:35:12,025: ERROR/ForkPoolWorker-7] Could not verify previous log to append
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/log/s3_task_handler.py", line 167, in s3_write
    if append and self.s3_log_exists(remote_log_location):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/log/s3_task_handler.py", line 133, in s3_log_exists
    return self.hook.check_for_key(remote_log_location)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 64, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 92, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 479, in check_for_key
    obj = self.head_object(key, bucket_name)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 64, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 92, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 466, in head_object
    raise e
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 461, in head_object
    return self.get_conn().head_object(Bucket=bucket_name, Key=key)
  File "/home/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 515, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 934, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
[2022-12-06 14:35:11,774: INFO/ForkPoolWorker-7] AWS Connection (conn_id='S3_default', conn_type='s3') credentials retrieved from extra.
```
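The oldest entries, in the final block below, hold what looks like the root error: the local task job's heartbeat callback found that the hostname recorded on the task instance no longer matched the worker's current hostname, and killed the task. A simplified paraphrase of that check (not the verbatim 2.4.3 source; see `local_task_job.py:189` in the traceback below):

```python
# Simplified paraphrase of the check behind "Hostname of job runner does
# not match"; not the verbatim Airflow 2.4.3 implementation.
from airflow.exceptions import AirflowException
from airflow.utils.net import get_hostname  # resolved via [core] hostname_callable

def heartbeat_callback_sketch(ti):
    # With hostname_callable = airflow.utils.net.get_host_ip_address (as in
    # the config above), both sides of this comparison are IP addresses, so
    # any IP change between task start and heartbeat trips the exception.
    current_hostname = get_hostname()
    if ti.hostname != current_hostname:
        raise AirflowException("Hostname of job runner does not match")
```

Given the config sets `hostname_callable: airflow.utils.net.get_host_ip_address`, this comparison appears to be IP-based in this deployment, which may be relevant to why it fires under heavy load.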
The oldest entries, with the hostname-mismatch traceback:

```
[2022-12-06 14:35:11,773: WARNING/ForkPoolWorker-7] /home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/utils/connection_wrapper.py:8: DeprecationWarning: AWS Connection (conn_id='S3_default', conn_type='s3') has connection type 's3', which has been replaced by connection type 'aws'. Please update your connection to have `conn_type='aws'`.
[2022-12-06 14:35:11,771: INFO/ForkPoolWorker-7] Using connection ID 'S3_default' for task execution.
[2022-12-06 14:35:11,745: INFO/ForkPoolWorker-7] Using connection ID 'S3_default' for task execution.
[2022-12-06 14:35:11,715: ERROR/ForkPoolWorker-7] [9f2ecb02-09c0-40cf-bb6d-f1bef4abb879] Failed to execute task Hostname of job runner does not match.
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/celery_executor.py", line 130, in _execute_in_fork
    args.func(args)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 52, in command
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/cli.py", line 103, in wrapper
    return f(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/commands/task_command.py", line 382, in task_run
    _run_task_by_selected_method(args, dag, ti)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/commands/task_command.py", line 189, in _run_task_by_selected_method
    _run_task_by_local_task_job(args, ti)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/commands/task_command.py", line 247, in _run_task_by_local_task_job
    run_job.run()
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/jobs/base_job.py", line 247, in run
    self._execute()
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/jobs/local_task_job.py", line 135, in _execute
    self.heartbeat()
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/jobs/base_job.py", line 228, in heartbeat
    self.heartbeat_callback(session=session)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/session.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/jobs/local_task_job.py", line 189, in heartbeat_callback
    raise AirflowException("Hostname of job runner does not match")
airflow.exceptions.AirflowException: Hostname of job runner does not match
```

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)