lopezvit opened a new issue, #38186:
URL: https://github.com/apache/airflow/issues/38186

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### If "Other Airflow 2 version" selected, which one?
   
   2.6.3
   
   ### What happened?
   
   The problem is that, quite often (but not always!), a task that Airflow (I guess) detects as a zombie is not retried, and I don't understand why; to me this is clearly a bug.
   
   The task is memory intensive, and I suspect that is the underlying problem. I have increased the worker memory from 4 GB to 6.5 GB and it hasn't failed since.
   
   But this doesn't look like a very sustainable solution, since memory is expensive, and because when the task is retried it always succeeds (probably because the worker is no longer under as much pressure), as other executions of the same task show (see the _Anything else?_ section).
   
   I have gone through the documentation, the troubleshooting guide, and the known issues; the only related issue was #37041, but it is hard to tell, since Composer uses the Celery executor.
   
   What is the business impact you are facing?
   
   Tasks that fail force a human to retry them manually (they are currently left in the failed state to allow better troubleshooting of the issue). Increasing memory seems an expensive solution, as the issue is not in our code but in the infrastructure.
   
   ### What you think should happen instead?
   
   Well, since this happens at moments of high demand, simply retrying the task, **as it should**, would solve the problem without any human intervention.
   
   ### How to reproduce
   
   We have a fairly memory-intensive (around 200 MB) task that needs to run about 7 times per DAG run, as it fetches the past 7 days' worth of data; it can take that many days for the data to become golden.
   When all these tasks run in parallel (possibly alongside tasks from other DAGs), they use all the memory on the VM, which causes the task to be killed.
   
   This is in any case a rare occurrence: the DAG is scheduled twice an hour, 16 hours a day, and it only happened 18 times over a 3-day period.
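   A minimal sketch of the DAG shape described above (the DAG id, schedule, and callable are hypothetical placeholders, not our production code; the real callable is the ~200 MB fetch):
   
   ```python
   from datetime import datetime, timedelta
   
   from airflow import DAG
   from airflow.operators.python import PythonOperator
   
   def fetch_day(day_offset: int) -> None:
       # Placeholder for the real, memory-intensive (~200 MB) fetch of one
       # past day's worth of data.
       ...
   
   with DAG(
       dag_id="orders_integration_example",  # hypothetical name
       schedule="18,48 * * * *",             # twice an hour, as described
       start_date=datetime(2024, 1, 1),
       catchup=False,
   ) as dag:
       # 7 parallel tasks, one per past day of late-arriving ("golden") data.
       for offset in range(1, 8):
           PythonOperator(
               task_id=f"read_Orders_endpoint_minus-{offset}",
               python_callable=fetch_day,
               op_kwargs={"day_offset": offset},
               retries=1,
               retry_delay=timedelta(minutes=5),
           )
   ```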
   
   ### Operating System
   
   composer-2.5.2-airflow-2.6.3
   
   ### Versions of Apache Airflow Providers
   
   Taken directly from the [documentation](https://cloud.google.com/composer/docs/concepts/versioning/composer-versions#images):
   absl-py==2.0.0
   agate==1.6.3
   aiodebug==2.3.0
   aiofiles==23.2.1
   aiohttp==3.8.6
   aiosignal==1.3.1
   alembic==1.11.1
   amqp==5.1.1
   anyio==3.7.1
   apache-airflow==2.6.3+composer
   apache-airflow-providers-apache-beam==5.3.0
   apache-airflow-providers-cncf-kubernetes==7.9.0
   apache-airflow-providers-common-sql==1.8.0
   apache-airflow-providers-dbt-cloud==3.4.0
   apache-airflow-providers-ftp==3.6.1
   apache-airflow-providers-google==10.11.1
   apache-airflow-providers-hashicorp==3.5.0
   apache-airflow-providers-http==4.6.0
   apache-airflow-providers-imap==3.4.0
   apache-airflow-providers-mysql==5.2.0
   apache-airflow-providers-postgres==5.8.0
   apache-airflow-providers-sendgrid==3.3.0
   apache-airflow-providers-sqlite==3.5.0
   apache-airflow-providers-ssh==3.8.1
   apache-beam==2.51.0
   apispec==5.2.2
   appdirs==1.4.4
   argcomplete==3.1.1
   asgiref==3.7.2
   astunparse==1.6.3
   async-timeout==4.0.2
   attrs==23.1.0
   Babel==2.12.1
   backoff==2.2.1
   backports.zoneinfo==0.2.1
   bcrypt==4.0.1
   billiard==4.1.0
   blinker==1.6.2
   cachecontrol==0.13.1
   cachelib==0.9.0
   cachetools==5.3.1
   cattrs==23.1.2
   celery==5.3.1
   certifi==2023.7.22
   cffi==1.15.1
   chardet==5.2.0
   charset-normalizer==3.1.0
   click==8.1.3
   click-didyoumean==0.3.0
   click-plugins==1.1.1
   click-repl==0.3.0
   clickclick==20.10.2
   cloudpickle==2.2.1
   colorama==0.4.6
   colorlog==4.8.0
   ConfigUpdater==3.1.1
   connexion==2.14.2
   crcmod==1.7
   cron-descriptor==1.4.0
   croniter==1.4.1
   cryptography==41.0.5
   db-dtypes==1.1.1
   dbt-bigquery==1.5.4
   dbt-core==1.5.4
   dbt-extractor==0.4.1
   decorator==5.1.1
   Deprecated==1.2.14
   diff-cover==8.0.0
   dill==0.3.1.1
   distlib==0.3.6
   dnspython==2.3.0
   docopt==0.6.2
   docutils==0.20.1
   email-validator==1.3.1
   exceptiongroup==1.1.2
   fastavro==1.9.0
   fasteners==0.19
   filelock==3.12.2
   firebase-admin==6.2.0
   Flask==2.2.5
   Flask-AppBuilder==4.3.1
   Flask-Babel==2.0.0
   Flask-Bcrypt==1.0.1
   Flask-Caching==2.0.2
   Flask-JWT-Extended==4.5.2
   Flask-Limiter==3.3.1
   Flask-Login==0.6.2
   flask-session==0.5.0
   Flask-SQLAlchemy==2.5.1
   Flask-WTF==1.1.1
   flatbuffers==23.5.26
   flower==2.0.0
   frozenlist==1.3.3
   fsspec==2023.10.0
   future==0.18.3
   gast==0.4.0
   gcloud-aio-auth==4.2.3
   gcloud-aio-bigquery==7.0.0
   gcloud-aio-storage==9.0.0
   gcsfs==2023.10.0
   google-ads==22.1.0
   google-api-core==2.14.0
   google-api-python-client==2.107.0
   google-apitools==0.5.32
   google-auth==2.23.4
   google-auth-httplib2==0.1.1
   google-auth-oauthlib==1.0.0
   google-cloud-access-context-manager==0.1.16
   google-cloud-aiplatform==1.36.2
   google-cloud-appengine-logging==1.3.2
   google-cloud-asset==3.20.0
   google-cloud-audit-log==0.2.5
   google-cloud-automl==2.11.3
   google-cloud-batch==0.17.3
   google-cloud-bigquery==3.13.0
   google-cloud-bigquery-datatransfer==3.12.1
   google-cloud-bigquery-storage==2.22.0
   google-cloud-bigtable==2.21.0
   google-cloud-build==3.21.0
   google-cloud-common==1.2.0
   google-cloud-compute==1.14.1
   google-cloud-container==2.33.0
   google-cloud-core==2.3.3
   google-cloud-datacatalog==3.16.0
   google-cloud-datacatalog-lineage==0.3.1
   google-cloud-datacatalog-lineage-producer-client==0.1.0
   google-cloud-dataflow-client==0.8.5
   google-cloud-dataform==0.5.4
   google-cloud-dataplex==1.8.1
   google-cloud-dataproc==5.7.0
   google-cloud-dataproc-metastore==1.13.0
   google-cloud-datastore==2.18.0
   google-cloud-dlp==3.13.0
   google-cloud-documentai==2.20.2
   google-cloud-filestore==1.6.2
   google-cloud-firestore==2.13.1
   google-cloud-kms==2.19.2
   google-cloud-language==2.11.1
   google-cloud-logging==3.8.0
   google-cloud-memcache==1.7.3
   google-cloud-monitoring==2.16.0
   google-cloud-orchestration-airflow==1.9.2
   google-cloud-org-policy==1.8.3
   google-cloud-os-config==1.15.3
   google-cloud-os-login==2.11.0
   google-cloud-pubsub==2.18.4
   google-cloud-pubsublite==0.6.1
   google-cloud-redis==2.13.2
   google-cloud-resource-manager==1.10.4
   google-cloud-run==0.10.0
   google-cloud-secret-manager==2.16.4
   google-cloud-spanner==3.40.1
   google-cloud-speech==2.22.0
   google-cloud-storage==2.13.0
   google-cloud-storage-transfer==1.9.2
   google-cloud-tasks==2.14.2
   google-cloud-texttospeech==2.14.2
   google-cloud-translate==3.12.1
   google-cloud-videointelligence==2.11.4
   google-cloud-vision==3.4.5
   google-cloud-workflows==1.12.1
   google-crc32c==1.5.0
   google-pasta==0.2.0
   google-re2==1.1
   google-resumable-media==2.6.0
   googleapis-common-protos==1.60.0
   graphviz==0.20.1
   greenlet==2.0.2
   grpc-google-iam-v1==0.12.7
   grpcio==1.59.2
   grpcio-gcp==0.2.2
   grpcio-status==1.59.2
   gunicorn==20.1.0
   h11==0.14.0
   h5py==3.10.0
   hdfs==2.7.3
   hologram==0.0.16
   httpcore==0.17.3
   httplib2==0.22.0
   httpx==0.24.1
   humanize==4.7.0
   hvac==2.0.0
   idna==3.4
   importlib-metadata==4.13.0
   importlib-resources==5.12.0
   inflection==0.5.1
   iniconfig==2.0.0
   isodate==0.6.1
   itsdangerous==2.1.2
   jaraco.classes==3.3.0
   jeepney==0.8.0
   Jinja2==3.1.2
   Js2Py==0.74
   json-merge-patch==0.2
   jsonschema==4.18.6
   jsonschema-specifications==2023.7.1
   keras==2.13.1
   keyring==24.3.0
   keyrings.google-artifactregistry-auth==1.1.2
   kombu==5.3.1
   kubernetes==23.6.0
   kubernetes-asyncio==24.2.3
   lazy-object-proxy==1.9.0
   leather==0.3.4
   libclang==16.0.6
   limits==3.5.0
   linkify-it-py==2.0.2
   lockfile==0.12.2
   Logbook==1.5.3
   looker-sdk==23.16.0
   Mako==1.2.4
   Markdown==3.4.3
   markdown-it-py==3.0.0
   MarkupSafe==2.1.3
   marshmallow==3.19.0
   marshmallow-enum==1.5.1
   marshmallow-oneofschema==3.0.1
   marshmallow-sqlalchemy==0.26.1
   mashumaro==3.6
   mdit-py-plugins==0.4.0
   mdurl==0.1.2
   minimal-snowplow-tracker==0.0.2
   more-itertools==10.1.0
   msgpack==1.0.5
   multidict==6.0.4
   mysqlclient==2.2.0
   networkx==2.8.8
   numpy==1.24.3
   oauth2client==4.1.3
   oauthlib==3.2.2
   objsize==0.6.1
   opt-einsum==3.3.0
   ordered-set==4.1.0
   orjson==3.9.10
   overrides==6.5.0
   packaging==23.1
   pandas==2.0.3
   pandas-gbq==0.19.2
   paramiko==3.3.1
   parsedatetime==2.4
   pathspec==0.9.0
   pendulum==2.1.2
   pip==20.2.4
   pipdeptree==2.13.1
   pkgutil-resolve-name==1.3.10
   platformdirs==3.8.1
   pluggy==1.2.0
   prison==0.2.1
   prometheus-client==0.17.0
   prompt-toolkit==3.0.39
   proto-plus==1.22.3
   protobuf==4.24.4
   psutil==5.9.5
   psycopg2-binary==2.9.9
   pyarrow==11.0.0
   pyasn1==0.5.0
   pyasn1-modules==0.3.0
   pycparser==2.21
   pydantic==1.10.12
   pydata-google-auth==1.8.2
   pydot==1.4.2
   Pygments==2.16.1
   pyjsparser==2.7.1
   PyJWT==2.7.0
   pymongo==4.6.0
   PyNaCl==1.5.0
   pyOpenSSL==23.3.0
   pyparsing==3.1.1
   pytest==7.4.3
   python-daemon==3.0.1
   python-dateutil==2.8.2
   python-http-client==3.3.7
   python-nvd3==0.15.0
   python-slugify==8.0.1
   pytimeparse==1.1.8
   pytz==2023.3
   pytzdata==2020.1
   PyYAML==6.0
   redis==3.5.3
   referencing==0.30.2
   regex==2023.10.3
   requests==2.31.0
   requests-oauthlib==1.3.1
   requests-toolbelt==1.0.0
   rfc3339-validator==0.1.4
   rich==13.4.2
   rich-argparse==1.2.0
   rpds-py==0.10.0
   rsa==4.9
   SecretStorage==3.3.3
   sendgrid==6.10.0
   setproctitle==1.3.2
   setuptools==66.1.1
   shapely==2.0.2
   six==1.16.0
   sniffio==1.3.0
   SQLAlchemy==1.4.49
   sqlalchemy-bigquery==1.8.0
   SQLAlchemy-JSONField==1.0.1.post0
   sqlalchemy-spanner==1.6.2
   SQLAlchemy-Utils==0.41.1
   sqlfluff==2.3.3
   sqllineage==1.4.8
   sqlparse==0.4.4
   sshtunnel==0.4.0
   starkbank-ecdsa==2.2.0
   statsd==4.0.1
   tabulate==0.9.0
   tblib==2.0.0
   tenacity==8.2.2
   tensorboard==2.13.0
   tensorboard-data-server==0.7.2
   tensorflow==2.13.1
   tensorflow-estimator==2.13.0
   tensorflow-io-gcs-filesystem==0.34.0
   termcolor==2.3.0
   text-unidecode==1.3
   toml==0.10.2
   tomli==2.0.1
   tornado==6.3.2
   tqdm==4.66.1
   typing-extensions==4.5.0
   tzdata==2023.3
   tzlocal==5.2
   uc-micro-py==1.0.2
   unicodecsv==0.14.1
   uritemplate==4.1.1
   urllib3==1.26.18
   vine==5.0.0
   virtualenv==20.23.1
   wcwidth==0.2.6
   websocket-client==1.6.1
   Werkzeug==2.2.3
   wheel==0.41.3
   wrapt==1.15.0
   WTForms==3.0.1
   yarl==1.9.2
   zipp==3.15.0
   zstandard==0.22.0
   
   ### Deployment
   
   Google Cloud Composer
   
   ### Deployment details
   
   ### Version
   `composer-2.5.2-airflow-2.6.3`
   
   ### Airflow Configuration Overrides
   
   - scheduler
     - scheduler_heartbeat_sec: 45
     - dag_dir_list_interval: 10
     - scheduler_zombie_task_threshold: 500
     - catchup_by_default: False
   - core
     - max_active_tasks_per_dag: 25
     - max_active_runs_per_dag: 5
     - dagbag_import_timeout: 60
     - dag_concurrency: 25
     - dags_are_paused_at_creation: True
   - webserver
     - dag_orientation: TB
     - instance_name: <REDACTED>
     - navbar_color: #009DE0
   - secrets
     - backend: 
airflow.providers.google.cloud.secrets.secret_manager.CloudSecretManagerBackend
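   For reference, Airflow reads overrides like the ones above from environment variables of the form `AIRFLOW__{SECTION}__{KEY}` (this is how Composer applies them). A small sketch of that mapping (the helper name is mine, not an Airflow API):
   
   ```python
   def airflow_env_var(section: str, key: str) -> str:
       """Build the environment-variable name Airflow reads for a config
       override, per the AIRFLOW__{SECTION}__{KEY} convention."""
       return f"AIRFLOW__{section.upper()}__{key.upper()}"
   
   # e.g. the zombie threshold override listed above:
   airflow_env_var("scheduler", "scheduler_zombie_task_threshold")
   # → "AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD"
   ```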
   
   ### Environment Configuration:
   
   - Resources
     - Workloads configuration
       - Scheduler: One scheduler with 0.5 vCPU, 2 GB memory, 1 GB storage
       - Triggerer: Disabled
       - Web server: 0.5 vCPU, 2 GB memory, 1 GB storage
       - Worker: Auto-scaling between 1 and 5 workers, with 1 vCPU, 4 GB 
memory, 1 GB storage each
     - Core infrastructure
       - Environment size: Small
       
   ### Pypi Packages
   Name | Version
   ------ | --------
   apache-airflow-providers-http | ==4.7.0
   Django | ==3.1.14
   djangorestframework | ==3.13.1
   
   
   ### Anything else?
   
   ```
   *** Reading remote log from 
gs://<REDACTED>/logs/dag_id=<REDACTED>_integration/run_id=scheduled__2024-02-18T16:48:00+00:00/task_id=read_Orders_endpoint_minus-6/attempt=1.log.
   [2024-02-18, 19:19:14 EET] {taskinstance.py:1104} INFO - Dependencies all 
met for dep_context=non-requeueable deps ti=<TaskInstance: 
<REDACTED>_integration.read_Orders_endpoint_minus-6 
scheduled__2024-02-18T16:48:00+00:00 [queued]>
   [2024-02-18, 19:19:14 EET] {taskinstance.py:1104} INFO - Dependencies all 
met for dep_context=requeueable deps ti=<TaskInstance: 
<REDACTED>_integration.read_Orders_endpoint_minus-6 
scheduled__2024-02-18T16:48:00+00:00 [queued]>
   [2024-02-18, 19:19:14 EET] {taskinstance.py:1309} INFO - Starting attempt 1 
of 2
   [2024-02-18, 19:19:15 EET] {taskinstance.py:1328} INFO - Executing 
<Task(PythonOperator): read_Orders_endpoint_minus-6> on 2024-02-18 
16:48:00+00:00
   [2024-02-18, 19:19:15 EET] {standard_task_runner.py:57} INFO - Started 
process 30752 to run task
   [2024-02-18, 19:19:15 EET] {standard_task_runner.py:84} INFO - Running: 
['airflow', 'tasks', 'run', '<REDACTED>_integration', 
'read_Orders_endpoint_minus-6', 'scheduled__2024-02-18T16:48:00+00:00', 
'--job-id', '114249', '--raw', '--subdir', 
'DAGS_FOLDER/<REDACTED>/<REDACTED>_dag.py', '--cfg-path', '/tmp/tmpcv7_sbw4']
   [2024-02-18, 19:19:15 EET] {standard_task_runner.py:85} INFO - Job 114249: 
Subtask read_Orders_endpoint_minus-6
   [2024-02-18, 19:19:17 EET] {task_command.py:414} INFO - Running 
<TaskInstance: <REDACTED>_integration.read_Orders_endpoint_minus-6 
scheduled__2024-02-18T16:48:00+00:00 [running]> on host airflow-worker-w5f47
   [2024-02-18, 19:19:19 EET] {taskinstance.py:1547} INFO - Exporting env vars: 
AIRFLOW_CTX_DAG_EMAIL='airf...@airflow.com' AIRFLOW_CTX_DAG_OWNER='<REDACTED>' 
AIRFLOW_CTX_DAG_ID='<REDACTED>_integration' 
AIRFLOW_CTX_TASK_ID='read_Orders_endpoint_minus-6' 
AIRFLOW_CTX_EXECUTION_DATE='2024-02-18T16:48:00+00:00' 
AIRFLOW_CTX_TRY_NUMBER='1' 
AIRFLOW_CTX_DAG_RUN_ID='scheduled__2024-02-18T16:48:00+00:00'
   [2024-02-18, 19:19:20 EET] {base.py:73} INFO - Using connection ID 
'<REDACTED>' for task execution.
   [2024-02-18, 19:19:29 EET] {logging_mixin.py:150} INFO - Fetching Orders on 
2024-02-12
   [2024-02-18, 19:19:43 EET] {local_task_job_runner.py:225} INFO - Task exited 
with return code Negsignal.SIGKILL
   [2024-02-18, 19:19:44 EET] {taskinstance.py:2656} INFO - 0 downstream tasks 
scheduled from follow-on schedule check
   ```
   ### Example of a previous, correct execution (it did retry):
   ```
   *** Reading remote log from 
gs://<REDACTED>/logs/dag_id=<REDACTED>_integration/run_id=scheduled__2024-02-18T16:18:00+00:00/task_id=read_Orders_endpoint_minus-6/attempt=2.log.
   [2024-02-18, 18:54:54 EET] {taskinstance.py:1104} INFO - Dependencies all 
met for dep_context=non-requeueable deps ti=
   [2024-02-18, 18:54:54 EET] {taskinstance.py:1104} INFO - Dependencies all 
met for dep_context=requeueable deps ti=
   [2024-02-18, 18:54:54 EET] {taskinstance.py:1309} INFO - Starting attempt 2 
of 2
   [2024-02-18, 18:54:54 EET] {taskinstance.py:1328} INFO - Executing  on 
2024-02-18 16:18:00+00:00
   [2024-02-18, 18:54:54 EET] {standard_task_runner.py:57} INFO - Started 
process 30071 to run task
   [2024-02-18, 18:54:54 EET] {standard_task_runner.py:84} INFO - Running: 
['airflow', 'tasks', 'run', '<REDACTED>_integration', 
'read_Orders_endpoint_minus-6', 'scheduled__2024-02-18T16:18:00+00:00', 
'--job-id', '114236', '--raw', '--subdir', 
'DAGS_FOLDER/<REDACTED>/<REDACTED>_dag.py', '--cfg-path', '/tmp/tmpqa1q7bn5']
   [2024-02-18, 18:54:54 EET] {standard_task_runner.py:85} INFO - Job 114236: 
Subtask read_Orders_endpoint_minus-6
   [2024-02-18, 18:54:54 EET] {task_command.py:414} INFO - Running  on host 
airflow-worker-w5f47
   [2024-02-18, 18:54:55 EET] {taskinstance.py:1547} INFO - Exporting env vars: 
AIRFLOW_CTX_DAG_EMAIL='airf...@airflow.com' AIRFLOW_CTX_DAG_OWNER='<REDACTED>' 
AIRFLOW_CTX_DAG_ID='<REDACTED>_integration' 
AIRFLOW_CTX_TASK_ID='read_Orders_endpoint_minus-6' 
AIRFLOW_CTX_EXECUTION_DATE='2024-02-18T16:18:00+00:00' 
AIRFLOW_CTX_TRY_NUMBER='2' 
AIRFLOW_CTX_DAG_RUN_ID='scheduled__2024-02-18T16:18:00+00:00'
   [2024-02-18, 18:54:55 EET] {base.py:73} INFO - Using connection ID 
'<REDACTED>' for task execution.
   [2024-02-18, 18:54:56 EET] {logging_mixin.py:150} INFO - Fetching Orders on 
2024-02-12
   [2024-02-18, 18:54:57 EET] {sql_to_gcs.py:161} INFO - Executing query
   [2024-02-18, 18:54:57 EET] {sql_to_gcs.py:180} INFO - Writing local data 
files
   [2024-02-18, 18:54:57 EET] {sql_to_gcs.py:185} INFO - Uploading chunk file 
#0 to GCS.
   [2024-02-18, 18:54:58 EET] {base.py:73} INFO - Using connection ID 
'google_cloud_default' for task execution.
   [2024-02-18, 18:54:58 EET] {credentials_provider.py:353} INFO - Getting 
connection using `google.auth.default()` since no explicit credentials are 
provided.
   [2024-02-18, 18:54:58 EET] {gcs.py:562} INFO - File /tmp/tmpo0xnp84a 
uploaded to Orders/2024/02/12/Orders_20240212.json in 
<REDACTED>_datafiles_datalake-207612 bucket
   [2024-02-18, 18:54:58 EET] {sql_to_gcs.py:188} INFO - Removing local file
   [2024-02-18, 18:54:58 EET] {python.py:183} INFO - Done. Returned value was: 
None
   [2024-02-18, 18:54:58 EET] {taskinstance.py:1346} INFO - Marking task as 
SUCCESS. dag_id=<REDACTED>_integration, task_id=read_Orders_endpoint_minus-6, 
execution_date=20240218T161800, start_date=20240218T165454, 
end_date=20240218T165458
   [2024-02-18, 18:54:58 EET] {local_task_job_runner.py:225} INFO - Task exited 
with return code 0
   [2024-02-18, 18:54:58 EET] {taskinstance.py:2656} INFO - 1 downstream tasks 
scheduled from follow-on schedule check
   ```
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
