argibbs opened a new issue, #34339: URL: https://github.com/apache/airflow/issues/34339
### Apache Airflow version

2.7.1

### What happened

When running a DAG, a task's logs show that it ran successfully and completed without error, yet the task is marked as failed. See the (slightly redacted, sorry!) image for an example.

As you can see, the log ends with the messages:

```
[2023-09-13, 12:26:30 BST] {subprocess.py:97} INFO - Command exited with return code 0
[2023-09-13, 12:26:31 BST] {taskinstance.py:1398} INFO - Marking task as SUCCESS. dag_id=STOXX600_Index_1.ANC_110.AEE_110.QTR1.Realtime, task_id=Shared-Data.split-adjustments-combined, execution_date=20230913T090500, start_date=20230913T112612, end_date=20230913T112631
[2023-09-13, 12:26:31 BST] {local_task_job_runner.py:228} INFO - Task exited with return code 0
```

And yet the task has been marked as failed. 😱 Even more "interesting" (a.k.a. worrying), it was marked as failed on its *first* attempt, even though the task is configured with `retries=2`. The retry setting has been ignored.

My version history has been:

1. 2.3.3
2. 2.7.0
3. 2.7.1

(I was naughty and did not keep up with releases.) The issue is very definitely present on 2.7.0 and 2.7.1; I never saw it on 2.3.3.

### What you think should happen instead

Stating the obvious, but:

1. The task should not have been marked as failed.
2. Even if it had been seen as failed, it should have retried.

### How to reproduce

I have no idea (yet) how to reproduce this problem outside of our existing environments. Things I am looking at / have ruled out:

1. The problem exists in both our dev and UAT environments. I have not upgraded prod, for obvious reasons.
2. Memory: the workers and schedulers have access to the full memory of the box. It is shared with other processes, but even the smallest box has 5 GB free at all times (including during the failure window).
3. We currently run 3 schedulers; I am trying to run with a single scheduler to see whether that helps (I am wondering if the schedulers are somehow conflicting).

### Operating System

CentOS 7, running a Docker image of 2.7.1

### Versions of Apache Airflow Providers

These should be consistent with the constraints set by 2.7.1:

```
apache-airflow-providers-celery==3.3.3
apache-airflow-providers-common-sql==1.7.1
apache-airflow-providers-docker==3.7.4
apache-airflow-providers-ftp==3.5.1
apache-airflow-providers-http==4.5.1
apache-airflow-providers-imap==3.3.1
apache-airflow-providers-microsoft-mssql==3.4.2
apache-airflow-providers-odbc==4.0.0
apache-airflow-providers-postgres==5.6.0
apache-airflow-providers-redis==3.3.1
apache-airflow-providers-slack==8.0.0
apache-airflow-providers-sqlite==3.4.3
```

### Deployment

Docker-Compose

### Deployment details

We build the Airflow image ourselves (we have been building it since 1.10, and it is less hassle to bump the version number than to switch to the official image). It uses Python 3.10 as a base. Our deployment runs Postgres 13.1 + bundled Celery + Redis 6.0.9 + workers, all on-prem. We have historically run 3 HA schedulers, but as noted above I am trying a single scheduler to see whether that improves things. The other (possibly related) change is that on 2.3.3 we had `max_tis_per_query = 512`, while on 2.7.x we use `max_tis_per_query = 16`, as per the 2.7.0 release notes. We also have the following config overrides, unchanged since 2.3.3:

```
- AIRFLOW__CORE__PARALLELISM=1000
- AIRFLOW__SCHEDULER__PARSING_PROCESSES=3
- AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=120
- AIRFLOW__SCHEDULER__MAX_FILE_PROCESS_INTERVAL=300
```

### Anything else

Our install has ~200 DAGs. Most are small, with only a handful having >10 tasks per run. The problem happens fairly frequently: we run ~6 similar-looking DAGs 4 times a day, and I see failures across multiple DAGs.
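To keep track of how often this happens, I filter task-instance records for the exact symptom described above: state `failed` on try 1 even though retries were configured. This is a minimal sketch; the field names mirror columns of Airflow's `task_instance` table, but the records here are illustrative dicts rather than rows pulled from a real metadata DB:

```python
# Hypothetical helper to flag task instances showing the reported symptom:
# marked "failed" on their first attempt despite having retries configured.
# Field names echo task_instance columns; the sample data is made up.

def suspicious_failures(task_instances):
    """Return instances that failed on try 1 despite max_tries > 0."""
    return [
        ti for ti in task_instances
        if ti["state"] == "failed"
        and ti["try_number"] == 1
        and ti["max_tries"] > 0
    ]

records = [
    {"task_id": "Shared-Data.split-adjustments-combined",
     "state": "failed", "try_number": 1, "max_tries": 2},
    {"task_id": "Notebooks.notebook",
     "state": "success", "try_number": 1, "max_tries": 2},
]

print([ti["task_id"] for ti in suspicious_failures(records)])
# → ['Shared-Data.split-adjustments-combined']
```

In practice I run the equivalent filter as a query against the metadata DB, but the logic is the same.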
The behaviour sounds quite similar to #33155. I have also noticed tasks which don't appear to have any logs at all, yet I can trace the task in Flower (Celery) and see that the worker ran it successfully; it is even marked as success in Flower. There are simply no logs in the Airflow GUI, and the task has been marked as failed. For example, here are worker logs for a task that is marked as failed in the GUI, with no log file, but which clearly ran OK:

```
[2023-09-13 09:40:12,903: INFO/MainProcess] Task airflow.providers.celery.executors.celery_executor_utils.execute_command[723d73ae-323f-421b-8350-44a0c08c82d3] received
[2023-09-13 09:40:12,981: INFO/ForkPoolWorker-7] [723d73ae-323f-421b-8350-44a0c08c82d3] Executing command in Celery: ['airflow', 'tasks', 'run', '__redacted_dag_name__', 'Notebooks.notebook', 'scheduled__2023-09-13T07:05:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/pipeline_estimates.py']
[2023-09-13 09:41:45,673: INFO/ForkPoolWorker-7] Task airflow.providers.celery.executors.celery_executor_utils.execute_command[723d73ae-323f-421b-8350-44a0c08c82d3] succeeded in 92.75607865909114s: None
```

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected]. For queries about this service, please contact Infrastructure at: [email protected].
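Addendum: to cross-reference tasks between Flower and the Airflow UI, I pull the Celery task UUID and the `airflow tasks run` arguments out of worker log lines like the "Executing command in Celery" line quoted above. This is a rough sketch; the regex is tuned to the log format shown in this report and may need adjusting for other versions:

```python
import re

# Sample worker log line, taken from the excerpt above.
LINE = ("[2023-09-13 09:40:12,981: INFO/ForkPoolWorker-7] "
        "[723d73ae-323f-421b-8350-44a0c08c82d3] Executing command in Celery: "
        "['airflow', 'tasks', 'run', '__redacted_dag_name__', "
        "'Notebooks.notebook', 'scheduled__2023-09-13T07:05:00+00:00', "
        "'--local', '--subdir', 'DAGS_FOLDER/pipeline_estimates.py']")

# Capture the Celery task UUID plus the dag_id / task_id / run_id arguments
# of the "airflow tasks run" command. Format-specific; adjust as needed.
PATTERN = re.compile(
    r"\[(?P<uuid>[0-9a-f-]{36})\] Executing command in Celery: "
    r"\['airflow', 'tasks', 'run', '(?P<dag_id>[^']+)', "
    r"'(?P<task_id>[^']+)', '(?P<run_id>[^']+)'"
)

m = PATTERN.search(LINE)
if m:
    print(m.group("uuid"), m.group("dag_id"),
          m.group("task_id"), m.group("run_id"))
```

With the UUID in hand I can find the task in Flower, and with the dag_id/task_id/run_id I can find the (failed, log-less) task instance in the Airflow UI.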
