argibbs opened a new issue, #34339: URL: https://github.com/apache/airflow/issues/34339
### Apache Airflow version

2.7.1

### What happened

When running a DAG, a task's logs show that it ran successfully and completed without error, yet the task is marked as failed. See the (slightly redacted, sorry!) image for an example.

As you can see, the log ends with the messages:

```
[2023-09-13, 12:26:30 BST] {subprocess.py:97} INFO - Command exited with return code 0
[2023-09-13, 12:26:31 BST] {taskinstance.py:1398} INFO - Marking task as SUCCESS. dag_id=STOXX600_Index_1.ANC_110.AEE_110.QTR1.Realtime, task_id=Shared-Data.split-adjustments-combined, execution_date=20230913T090500, start_date=20230913T112612, end_date=20230913T112631
[2023-09-13, 12:26:31 BST] {local_task_job_runner.py:228} INFO - Task exited with return code 0
```

And yet the task has been marked as failed. 😱 Even more "interesting" (a.k.a. worrying), it was marked as failed on its *first* attempt, even though the task is configured with `retries=2`. The retry setting has been ignored.

My version history has been:

1. 2.3.3
2. 2.7.0
3. 2.7.1

(I was naughty and did not keep up with releases.) The issue is very definitely present on 2.7.0 and 2.7.1; I never saw it on 2.3.3.

### What you think should happen instead

Stating the obvious, but:

1. The task should not have been marked as failed.
2. Even if it had been seen as failed, it should have retried.

### How to reproduce

I have no idea (yet) how to reproduce this problem outside of our existing environments. Things I am looking at / have ruled out:

1. The problem exists in both our dev and UAT environments. I have not upgraded prod, for obvious reasons.
2. Memory: the workers and schedulers have access to the full memory of the box. It is shared with other processes, but even the smallest box has 5 GB free at all times (including during the failure window).
3. We currently run 3 schedulers; I am trying to run with a single scheduler to see whether that helps (I am wondering if the schedulers are somehow conflicting).

### Operating System

CentOS 7, running a Docker image of 2.7.1

### Versions of Apache Airflow Providers

These should be consistent with the constraints set by 2.7.1:

```
apache-airflow-providers-celery==3.3.3
apache-airflow-providers-common-sql==1.7.1
apache-airflow-providers-docker==3.7.4
apache-airflow-providers-ftp==3.5.1
apache-airflow-providers-http==4.5.1
apache-airflow-providers-imap==3.3.1
apache-airflow-providers-microsoft-mssql==3.4.2
apache-airflow-providers-odbc==4.0.0
apache-airflow-providers-postgres==5.6.0
apache-airflow-providers-redis==3.3.1
apache-airflow-providers-slack==8.0.0
apache-airflow-providers-sqlite==3.4.3
```

### Deployment

Docker-Compose

### Deployment details

We build the Airflow image ourselves (we have been building it since 1.10, and it is less hassle to bump the version number than to switch to the official image). It uses Python 3.10 as a base. Our deployment runs Postgres 13.1 + bundled Celery + Redis 6.0.9 + workers, all on-prem. We have historically run 3 HA schedulers, but as noted above I am trying a single scheduler to see whether that improves things. The other (possibly related) change is that on 2.3.3 we had `max_tis_per_query = 512`, while on 2.7.x we use `max_tis_per_query = 16`, as per the 2.7.0 release notes. We also have the following config overrides, unchanged since 2.3.3:

```
- AIRFLOW__CORE__PARALLELISM=1000
- AIRFLOW__SCHEDULER__PARSING_PROCESSES=3
- AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=120
- AIRFLOW__SCHEDULER__MAX_FILE_PROCESS_INTERVAL=300
```

### Anything else

Our install has ~200 DAGs. Most are small, with only a handful having >10 tasks per run. The problem happens fairly frequently: we run ~6 similar-looking DAGs 4 times a day, and I see failures across multiple DAGs.
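To keep track of how often this happens, I filter task-instance records for the exact symptom described above: state `failed` on try 1 even though retries were configured. This is a minimal sketch; the field names mirror columns of Airflow's `task_instance` table, but the records here are illustrative dicts rather than rows pulled from a real metadata DB:

```python
# Hypothetical helper to flag task instances showing the reported symptom:
# marked "failed" on their first attempt despite having retries configured.
# Field names echo task_instance columns; the sample data is made up.

def suspicious_failures(task_instances):
    """Return instances that failed on try 1 despite max_tries > 0."""
    return [
        ti for ti in task_instances
        if ti["state"] == "failed"
        and ti["try_number"] == 1
        and ti["max_tries"] > 0
    ]

records = [
    {"task_id": "Shared-Data.split-adjustments-combined",
     "state": "failed", "try_number": 1, "max_tries": 2},
    {"task_id": "Notebooks.notebook",
     "state": "success", "try_number": 1, "max_tries": 2},
]

print([ti["task_id"] for ti in suspicious_failures(records)])
# → ['Shared-Data.split-adjustments-combined']
```

In practice I run the equivalent filter as a query against the metadata DB, but the logic is the same.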
The behaviour sounds quite similar to #33155. I have also noticed tasks which don't appear to have any logs at all, yet I can trace the task in Flower (Celery) and see that the worker ran it successfully; it is even marked as success in Flower. There are simply no logs in the Airflow GUI, and the task has been marked as failed. For example, here are worker logs for a task that is marked as failed in the GUI, with no log file, but which clearly ran OK:

```
[2023-09-13 09:40:12,903: INFO/MainProcess] Task airflow.providers.celery.executors.celery_executor_utils.execute_command[723d73ae-323f-421b-8350-44a0c08c82d3] received
[2023-09-13 09:40:12,981: INFO/ForkPoolWorker-7] [723d73ae-323f-421b-8350-44a0c08c82d3] Executing command in Celery: ['airflow', 'tasks', 'run', '__redacted_dag_name__', 'Notebooks.notebook', 'scheduled__2023-09-13T07:05:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/pipeline_estimates.py']
[2023-09-13 09:41:45,673: INFO/ForkPoolWorker-7] Task airflow.providers.celery.executors.celery_executor_utils.execute_command[723d73ae-323f-421b-8350-44a0c08c82d3] succeeded in 92.75607865909114s: None
```

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected]. For queries about this service, please contact Infrastructure at: [email protected].
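Addendum: to cross-reference tasks between Flower and the Airflow UI, I pull the Celery task UUID and the `airflow tasks run` arguments out of worker log lines like the "Executing command in Celery" line quoted above. This is a rough sketch; the regex is tuned to the log format shown in this report and may need adjusting for other versions:

```python
import re

# Sample worker log line, taken from the excerpt above.
LINE = ("[2023-09-13 09:40:12,981: INFO/ForkPoolWorker-7] "
        "[723d73ae-323f-421b-8350-44a0c08c82d3] Executing command in Celery: "
        "['airflow', 'tasks', 'run', '__redacted_dag_name__', "
        "'Notebooks.notebook', 'scheduled__2023-09-13T07:05:00+00:00', "
        "'--local', '--subdir', 'DAGS_FOLDER/pipeline_estimates.py']")

# Capture the Celery task UUID plus the dag_id / task_id / run_id arguments
# of the "airflow tasks run" command. Format-specific; adjust as needed.
PATTERN = re.compile(
    r"\[(?P<uuid>[0-9a-f-]{36})\] Executing command in Celery: "
    r"\['airflow', 'tasks', 'run', '(?P<dag_id>[^']+)', "
    r"'(?P<task_id>[^']+)', '(?P<run_id>[^']+)'"
)

m = PATTERN.search(LINE)
if m:
    print(m.group("uuid"), m.group("dag_id"),
          m.group("task_id"), m.group("run_id"))
```

With the UUID in hand I can find the task in Flower, and with the dag_id/task_id/run_id I can find the (failed, log-less) task instance in the Airflow UI.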
