Re: [I] High memory retention and gradual increase over time [airflow]
LengYue12389 commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3604610868 Great, I can upgrade to 3.1.4 now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] High memory retention and gradual increase over time [airflow]
potiuk commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3602525037 Also, https://github.com/apache/airflow/pull/58944 might possibly address the remaining memory issue. I will close it provisionally but @LourTV - if you could also check the upcoming 3.1.4, where we will have the fix for the image, that would be fantastic.
Re: [I] High memory retention and gradual increase over time [airflow]
potiuk closed issue #55768: High memory retention and gradual increase over time URL: https://github.com/apache/airflow/issues/55768
Re: [I] High memory retention and gradual increase over time [airflow]
potiuk commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3476281393 Fantastic news @kaxil ! Again thanks for all the analysis (and the PR of the month @wjddn279 )
Re: [I] High memory retention and gradual increase over time [airflow]
kaxil commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3461520365

As promised in https://github.com/apache/airflow/issues/56641#issuecomment-3456893101 , here's our analysis of the impact on Celery workers, comparing 3.0.6 vs 3.1.1 (with fixes) on one of our deployments:

In Airflow 3.0.6 various tasks running on Celery failed with OOMs as the memory leaks were significant. After applying the changes in https://github.com/apache/airflow/pull/56695 to 3.0.6 or directly upgrading to 3.1.1, the memory stayed mostly flat and there were 0 task failures.

---

**Celery Worker** with 4 GB memory on **Airflow 3.0.6**
https://github.com/user-attachments/assets/ba5b96a6-df9b-45c3-b9c6-8019c8378c36

**Celery Worker** with 4 GB memory on **Airflow 3.1.1**
https://github.com/user-attachments/assets/ad7b520b-6bb8-4f55-bbde-8fe98f895b0d
Re: [I] High memory retention and gradual increase over time [airflow]
vertIcod3r commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3455183387 I can at least confirm that I am using LocalExecutor
Re: [I] High memory retention and gradual increase over time [airflow]
tirkarthi commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3443725120

We are also experiencing high memory usage for api-server that grows over time, reaching the memory limit assigned to the pod. The server uses `airflow api-server` with default configurations on Kubernetes. The scheduler uses KubernetesExecutor and we didn't see noticeable issues there. I tried locally using memray with the command below and 500 dags. The server started at 25MB, then went to 95MB and 175MB, and stayed there. Here is the command, which gives a live view of the memory usage, in case someone wants to run it in a different setup.

```
PYTHONMALLOC=malloc memray run --live -m uvicorn airflow.api_fastapi.main:app --workers 1 --port 8000
```

https://github.com/user-attachments/assets/98b4d51d-e222-4b5c-a277-e0f7f51a609a
Re: [I] High memory retention and gradual increase over time [airflow]
wjddn279 commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3401275826 @potiuk I'll check it
Re: [I] High memory retention and gradual increase over time [airflow]
lucidumio commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3371985650 @alkismavridis Yes, we ran into the same issue when using `num_runs` to control the scheduler. Attached is the cron script we built to monitor the memory usage and restart the scheduler container when the usage is high. It has been running well for the past few days. [memory_watchdog.sh](https://github.com/user-attachments/files/22724267/memory_watchdog.sh)
Re: [I] High memory retention and gradual increase over time [airflow]
dstandish commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3402740293 Those who are echoing that they see the problem as well, please share executor info -- there seemed to be some suggestion that local executor may be implicated.
Re: [I] High memory retention and gradual increase over time [airflow]
wjddn279 commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3374894204

@kaxil I am still in the process of investigating, but to summarize my findings so far: the increase in memory usage mainly comes from the worker processes created inside the scheduler container. The scheduler itself runs as a single process, so its own memory growth is limited and does not contribute significantly. However, since 32 worker processes are forked by default, their combined effect amplifies the overall memory usage roughly by a factor of 32.

To investigate further, I used Python’s tracemalloc and modified the Airflow code to trace objects consuming large amounts of memory within the worker processes.

[tracemalloc_log.txt](https://github.com/user-attachments/files/22733403/tracemalloc_log.txt)

The investigation showed that most of the memory usage originates from library imports. Since each process loads the same libraries independently, the total footprint scales almost linearly with the number of workers.

After removing certain heavy imports and rerunning the system, I observed a significant reduction in per-worker memory usage. A container that previously failed within 30 minutes under an 8 GiB memory limit was able to run for over two hours without issues. (This test environment is quite demanding, as 100 DAGs are triggered every minute.) However, the total memory used by all workers still continued to increase over time.

https://github.com/user-attachments/assets/566e707b-d66d-4ecb-b389-09cbc522609d

As shown in the attached figure, there are large variations in PSS between workers, and eventually all workers cross into higher memory usage regions.

My current hypothesis is as follows:

- In the LocalExecutor, workers are initially forked from the scheduler process, so they share memory pages through copy-on-write (COW). When additional libraries are imported or specific logic is executed later, those shared pages are duplicated, leading to higher private memory usage.
- In addition to the import-related issue, I suspect that scheduler objects inherited by the workers are gradually modified during execution, causing small but steady PSS growth as copy-on-write pages are created.
- In other executor types (e.g., CeleryExecutor, KubernetesExecutor), workers are launched as independent processes, so every library import directly adds to total memory usage without COW benefits.

My hypothesis is that some heavy libraries are still present, and while they are initially loaded into the scheduler’s shared memory during forking, certain logic causes them to be re-imported later, resulting in private memory allocations per process.

I will continue to trace the problematic libraries and will open separate issues as I identify them. Please let me know if you have any questions or different perspectives on my findings so far.
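For readers who want to try the tracemalloc approach mentioned above, here is a minimal sketch of comparing two snapshots to see which source lines changed their allocated size the most. It uses only the standard library and is not the modified Airflow code from the investigation; the helper name `top_changes` and the stand-in workload are assumptions made for illustration.

```python
import tracemalloc


def top_changes(before, after, limit=5):
    """Return up to `limit` per-line statistics with the largest size change.

    `compare_to` sorts its result by the absolute size difference first,
    so the head of the list is where memory moved the most.
    """
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    tracemalloc.start(25)  # keep up to 25 frames per allocation traceback
    before = tracemalloc.take_snapshot()

    # Hypothetical workload standing in for the code under investigation.
    retained = [bytes(1024) for _ in range(10_000)]

    after = tracemalloc.take_snapshot()
    for stat in top_changes(before, after):
        print(stat)
    tracemalloc.stop()
```

Taking the first snapshot after imports have finished and the second after the process has run for a while helps separate one-time import cost from growth during execution.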
Re: [I] High memory retention and gradual increase over time [airflow]
kaxil commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3353598174

A temporary workaround would be to set [`AIRFLOW__SCHEDULER__NUM_RUNS`](https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#num-runs) to a small value like `100`:

```ini
[scheduler]
num_runs = 100
```

This forces the scheduler to exit and restart after 100 scheduling loops, which clears the accumulated memory. If you're running in Kubernetes with proper health probes, the scheduler pod will automatically restart.

---

The issue stems from **unbounded growth of the `DBDagBag` cache** inside the scheduler. This wasn't a significant problem in Airflow 2.x, but in 3.x with the introduction of DAG versioning, the problem is amplified:

1. **`DBDagBag._dags` has no size limit**: The cache at [`airflow-core/src/airflow/models/dagbag.py:49`](https://github.com/apache/airflow/blob/964997a7a6da5041fb19e48bc31866ffc6fe7bc7/airflow-core/src/airflow/models/dagbag.py#L49) is an unbounded dictionary that stores deserialized DAGs indefinitely.
2. **DAG versioning multiplies cache entries**: In 3.x, each DAG version gets a unique ID. The cache is keyed by `dag_version_id` (not `dag_id`), meaning:
   - Every DAG change creates a new version with a new cache entry
   - For dynamic DAGs that change frequently, this creates a new entry on every parse
   - Old versions are **never evicted** from the cache
   - Each `SerializedDAG` object holds full task dictionaries, dependencies, and metadata
   - This could explain the ~10GB growth you're seeing from the scheduler alone
   - It is especially bad if you use dynamic DAGs -- that would lead to more serialized DAG versions.

Setting `num_runs` forces a scheduler restart, which clears the Python process memory and resets all caches, effectively working around the unbounded cache growth for now until we fix it properly.
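To make the difference between an unbounded dictionary and a bounded cache concrete, here is a small sketch of a least-recently-used cache keyed by a version id. This is not Airflow's `DBDagBag` and not the fix the maintainers planned; the class name `BoundedVersionCache` and the `max_size` parameter are hypothetical and only illustrate the general eviction idea.

```python
from collections import OrderedDict


class BoundedVersionCache:
    """LRU cache keyed by a version id (hypothetical illustration only)."""

    def __init__(self, max_size=128):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._max_size = max_size
        self._entries = OrderedDict()

    def get(self, version_id):
        """Return the cached value or None, marking a hit as recently used."""
        if version_id not in self._entries:
            return None
        self._entries.move_to_end(version_id)
        return self._entries[version_id]

    def put(self, version_id, value):
        """Store a value and evict least recently used entries over the limit."""
        self._entries[version_id] = value
        self._entries.move_to_end(version_id)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._entries)
```

With a plain dictionary, every new version id adds an entry that is never removed; with a bound, memory held by the cache is limited to at most `max_size` deserialized objects, at the cost of reloading an evicted version if it is requested again.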
Re: [I] High memory retention and gradual increase over time [airflow]
lucidumio commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3344209617 I am having the same issue too. We also run Airflow in Docker. The same DAGs ran well in Airflow 2 for years, using less than 15 GB of memory. After migrating to 3.1.0, the scheduler alone takes up over 128 GB of memory within 2 days. This is certainly unacceptable in a production environment. We have to roll back to Airflow 2.
Re: [I] High memory retention and gradual increase over time [airflow]
alkismavridis commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3371858574 Plot twist: The workaround seems to be harmful for us. We run the scheduler in a docker container. When the `num_runs` loops are finished, the scheduler stops and thus our docker health check (`airflow jobs check --job-type SchedulerJob --local || bash -c 'kill -s 15 -1 && (sleep 10; kill -s 9 -1)'`) kills the container, so that the whole docker container can restart. The problem is that active tasks also get killed this way. This is a big problem. I am looking for a way to restart only the scheduler and not the whole docker container. Maybe I will write a small bash script with an endless while loop that runs the scheduler with `num_runs` set, and use this script as the entry point of my docker container. If anybody has some other idea, I am very happy to hear it 🙏
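The "endless loop as entry point" idea above can be sketched as follows, written in Python rather than bash. It assumes `num_runs` is already configured so that the scheduler exits on its own; the function name `supervise` and its parameters are hypothetical. Whether running tasks survive a scheduler exit depends on the executor, and this sketch does not address that.

```python
import subprocess
import time


def supervise(command, restart_delay=5.0, max_runs=None, on_exit=None):
    """Run `command` repeatedly, starting it again each time it exits.

    `max_runs=None` loops until the supervisor itself is stopped.
    `on_exit`, if given, is called with the exit code after every run.
    Returns the number of completed runs.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        # Blocks until the child process exits.
        returncode = subprocess.run(command).returncode
        runs += 1
        if on_exit is not None:
            on_exit(returncode)
        time.sleep(restart_delay)
    return runs
```

As a container entry point this would be called as `supervise(["airflow", "scheduler"])`, so that only the scheduler process restarts while the container keeps running. Signal forwarding to the child on container shutdown is left out here and would need to be handled in a real setup.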
Re: [I] High memory retention and gradual increase over time [airflow]
wjddn279 commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3356570158 @LourTV Looking at the query results, I can see that the DAG version keeps increasing. Could it be that there are multiple DAG files generating the DAG ID `index_ibm_cloud_metrics_in_elasticsearch`? If the DAG files are parsed alternately, the hash value will change each time, causing the version to keep increasing, which would lead to the issue @kaxil mentioned. It seems necessary to check the value of the `data` column in the `serialized_dag` table to identify what has changed and prevent the version from increasing needlessly.
Re: [I] High memory retention and gradual increase over time [airflow]
lucidumio commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3391961070

> > Maybe I write a small bash script with an endless while loop that runs the scheduler with a "num_run", and use this script as the entry point of my docker container.
>
> You can add a script in your `healthcheck` to check for number of running tasks to ensure you only kill the scheduler when no tasks are running.

This is exactly what we did in our bash script. It is run by a cron job every 3 minutes and checks for active tasks before restarting the scheduler.
Re: [I] High memory retention and gradual increase over time [airflow]
jedcunningham commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3353180527 Can someone check if there is growth in the `dag_version` or `serialized_dag` tables in your environment when memory use climbs? Row count is sufficient.
Re: [I] High memory retention and gradual increase over time [airflow]
lucidumio commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3353988457

Some quick feedback: I just tried setting `num_runs` to 100. It did restart the scheduler, but the scheduler container became "unhealthy" after the restart, which concerns me. Instead of changing this setting, we built a cron script to:

# - Monitor system memory usage
# - Restart Airflow scheduler if usage exceeds threshold AND no tasks are running
# - Force restart if memory exceeds critical threshold
# - Log all actions with timestamp and memory/task details

We used docker-compose restart to bounce the container so the health status won't change. Hopefully this works till the fix is ready.
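The decision rules listed for the watchdog can be sketched as a small pure function. This is not the attached `memory_watchdog.sh`; the threshold values (80% and 95%), the names `Action`, `decide` and `describe`, and the log format are all assumptions chosen for illustration.

```python
from datetime import datetime
from enum import Enum


class Action(Enum):
    NONE = "none"
    RESTART = "restart"
    FORCE_RESTART = "force_restart"


def decide(memory_pct, running_tasks, threshold=80.0, critical=95.0):
    """Choose an action from memory usage (percent) and active task count."""
    if memory_pct > critical:
        # Above the critical level, restart even if tasks are running.
        return Action.FORCE_RESTART
    if memory_pct > threshold and running_tasks == 0:
        # Above the normal threshold, restart only when nothing is running.
        return Action.RESTART
    return Action.NONE


def describe(now, memory_pct, running_tasks, action):
    """Format one log line with timestamp, memory and task details."""
    return (
        f"{now.isoformat()} mem={memory_pct:.1f}% "
        f"tasks={running_tasks} action={action.value}"
    )
```

Keeping the decision separate from the code that measures memory and restarts the container makes the rules easy to test before wiring them into a cron job.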
Re: [I] High memory retention and gradual increase over time [airflow]
wjddn279 commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3356590885 Like in @alkismavridis ’s case, I also observed that even though the number of DAGs is not large and the DAG versions are not continuously increasing, the scheduler’s memory usage keeps growing. Therefore, I infer that there is another type of memory leak besides the cause mentioned by kaxil. When deploying Airflow with LocalExecutor using Docker Compose, I confirmed that worker processes are created as child processes of the scheduler, and the allocated memory of both the scheduler and worker processes keeps increasing. I will continue investigating the root cause, and once I find it, I will share the details.
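For anyone who wants to track per-process memory of the scheduler and its child workers over time, one option on Linux is to read the proportional set size (PSS) from `/proc/<pid>/smaps_rollup`. The sketch below is an assumption about how this could be done, not the method used in this thread; the function names are hypothetical, and the file is only present on reasonably recent Linux kernels.

```python
def parse_pss_kib(smaps_rollup_text):
    """Extract the Pss value (reported in kB) from smaps_rollup content."""
    for line in smaps_rollup_text.splitlines():
        # Match only the total "Pss:" line, not Pss_Anon, Pss_File, etc.
        if line.startswith("Pss:"):
            return int(line.split()[1])
    raise ValueError("no Pss line found")


def read_pss_kib(pid):
    """Read the PSS of a running process (Linux only)."""
    with open(f"/proc/{pid}/smaps_rollup") as handle:
        return parse_pss_kib(handle.read())
```

PSS divides shared pages among the processes sharing them, so summing it over the scheduler and its forked workers avoids counting copy-on-write pages several times.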
Re: [I] High memory retention and gradual increase over time [airflow]
LengYue12389 commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3326137879 I also encountered this problem: the memory usage of the dag-processor keeps increasing.
Re: [I] High memory retention and gradual increase over time [airflow]
kaxil commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3373423961

> Maybe I write a small bash script with an endless while loop that runs the scheduler with a "num_run", and use this script as the entry point of my docker container.

You can add a script in your `healthcheck` to check for number of running tasks to ensure you only kill the scheduler when no tasks are running.
Re: [I] High memory retention and gradual increase over time [airflow]
LengYue12389 commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3326158709 I also encountered this problem, it's also version 3.0.6
Re: [I] High memory retention and gradual increase over time [airflow]
alkismavridis commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3353193422 @jedcunningham will check it tomorrow and let you know.
Re: [I] High memory retention and gradual increase over time [airflow]
alkismavridis commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3377308914 In case it helps, the active PIDs (numbers are according to `docker stats`) also go up as the memory goes up. Maybe processes are not properly closed?
Re: [I] High memory retention and gradual increase over time [airflow]
lucidumio commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3357518252 Looks like when the scheduler's memory usage grows to some point, the log service will fail and you won't get any updates in the UI log page or the log file.
Re: [I] High memory retention and gradual increase over time [airflow]
LengYue12389 commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3392978693 It seems that the airflow api-server process also has a memory leak issue https://github.com/user-attachments/assets/481b89f6-a7ff-4d66-8fb8-5485329e22e1
Re: [I] High memory retention and gradual increase over time [airflow]
wjddn279 commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3404276888 @potiuk I’ve created a separate issue to discuss the problem based on the results of the analysis performed with Memray. https://github.com/apache/airflow/issues/56641
Re: [I] High memory retention and gradual increase over time [airflow]
potiuk commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3397801728 Maybe someone experiencing the memory issue could help us by running their Airflow processes with https://github.com/bloomberg/memray and getting some more insights?
Re: [I] High memory retention and gradual increase over time [airflow]
lucidumio commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3402928673

> It seems that the airflow api-server process also has a memory leak issue https://private-user-images.githubusercontent.com/79817309/500116374-481b89f6-a7ff-4d66-8fb8-5485329e22e1.png

We have the same memory leak issue for the api server as well. It has been slowly increasing to ~43GB of memory usage over the past 2 weeks. The memory usage never drops.
Re: [I] High memory retention and gradual increase over time [airflow]
kaxil commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3373428829 Are there any leads, though, if someone did a memory check on which process is taking up a lot of memory -- apart from the DagBag growth that we are aware of?
Re: [I] High memory retention and gradual increase over time [airflow]
lucidumio commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3354859393 Yes, local executor
Re: [I] High memory retention and gradual increase over time [airflow]
lucidumio commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3361470737
@alkismavridis Thanks for the suggestion!
This is our scheduler section:
```yaml
  airflow-scheduler:
    labels:
      com.datadoghq.ad.logs: '[{"source": "airflow", "service": "airflow-scheduler"}]'
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8974/health"]
      interval: 30s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully
```
The docker-compose did restart the container after the scheduler exited, but
the healthcheck failed soon.
But if we manually run the native docker-compose restart command to bounce
the container, the healthcheck is always good:
`/usr/local/bin/docker-compose restart airflow-scheduler`
Re: [I] High memory retention and gradual increase over time [airflow]
LourTV commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3360876034

@wjddn279 In our case, the significant growth in the dag_version and serialized_dag tables was caused by the use of a dynamic start_date in the DAG definition (`start_date = datetime.now() - timedelta(days=1)`). This resulted in Airflow generating a different hash on each execution, thereby creating a new DAG version every time.

After identifying this issue, we updated the DAG to use a fixed start_date. Since making this change, the number of DAG versions has stabilized and is no longer increasing.

Following this adjustment, we restarted Airflow and monitored the system for several hours to assess whether the memory usage issue persisted. Unfortunately, the results remain the same: although Airflow is no longer generating excessive DAG versions, we continue to observe a gradual increase in RAM and swap usage over time, eventually leading to an out-of-memory condition.
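The effect of a parse-time `start_date` on a content hash can be illustrated with a small standalone sketch. It does not use Airflow; the `fingerprint` function is a simplified stand-in for whatever hash Airflow computes over the serialized DAG, and the dictionary layout is a hypothetical example.

```python
import hashlib
import json
from datetime import datetime, timedelta


def fingerprint(definition):
    """Hash a definition dict via stable JSON (simplified stand-in)."""
    payload = json.dumps(definition, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def dynamic_definition(now):
    # start_date depends on the parse time, so it differs on every parse.
    return {"dag_id": "example", "start_date": now - timedelta(days=1)}


def fixed_definition():
    # start_date is a constant, so every parse yields identical content.
    return {"dag_id": "example", "start_date": datetime(2025, 1, 1)}
```

Two parses a few seconds apart give different fingerprints for `dynamic_definition` and identical ones for `fixed_definition`, which matches the behaviour described above: content that changes on every parse looks like a new version each time.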
Re: [I] High memory retention and gradual increase over time [airflow]
alkismavridis commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3359816439

I can confirm that the workaround from @kaxil is working for us. We played around a bit with the value of num_runs and figured out that something in the range of 3000 does the job for us. It restarts approximately every 50 minutes and the RAM usage stays in the range of 1G - 1.5G, which is OK.

https://github.com/user-attachments/assets/a71404c8-8160-4399-959b-1776bb664e53

@lucidumio I am just guessing here, but I assume the airflow process ends (because of num_runs), but your docker container does not restart. Your health check runs the command to check if the scheduler is running, the answer is no, and thus you end up with an unhealthy container.

The way we solve this problem is that our healthcheck command ALSO terminates the container when it finds it unhealthy. Then docker restarts it. Here is our docker-compose section for the scheduler. Please note 2 things:

- restart: always
- Our healthcheck command. The section after the OR operator kills all PIDs. Thus, the container restarts.

```yaml
  airflow-scheduler:
    <<: *airflow-common
    container_name: airflow-scheduler
    command: scheduler
    healthcheck:
      test: ["CMD-SHELL", "airflow jobs check --job-type SchedulerJob --local || bash -c 'kill -s 15 -1 && (sleep 10; kill -s 9 -1)'"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 150s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully
    mem_limit: 4000m
```
Re: [I] High memory retention and gradual increase over time [airflow]
kaxil commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3356837438 Yeah, the API server also uses the same DagBag object I mentioned, which is unbounded right now and continuously grows. We should fix it with a proper cache -- Jed & I have some ideas on how to do it without impacting performance or causing cache thrash. But some comments above suggest that there could be more than just DagBag (with dag versions + dag serialization) causing higher memory!
Re: [I] High memory retention and gradual increase over time [airflow]
lucidumio commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3356660460 It would be good to add another variable, e.g., `num_dag_versions`, to let the user control the number of dag versions to keep in cache. If the user does not need this feature, the value can be set to 1. Also, the API (Web) server seems to have a memory leak issue as well. When browsing a DAG page with a large number of tasks under the Grid view, the API server's memory usage grows quickly, and it does not drop after closing the page.
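Editor's illustration: a minimal sketch of a bounded cache of the kind discussed above, where at most a fixed number of DAG versions stay in memory and the least recently used one is evicted. `BoundedDagCache`, `max_versions`, and the `loader` callback are hypothetical names; this is not Airflow's DagBag and not the fix that was eventually merged.

```python
from collections import OrderedDict


class BoundedDagCache:
    """Keep at most max_versions entries, evicting the least recently used."""

    def __init__(self, max_versions: int) -> None:
        if max_versions < 1:
            raise ValueError("max_versions must be at least 1")
        self._max = max_versions
        self._items = OrderedDict()

    def get(self, version_id, loader):
        if version_id in self._items:
            # Mark as most recently used
            self._items.move_to_end(version_id)
            return self._items[version_id]
        dag = loader(version_id)  # e.g. deserialize from the database
        self._items[version_id] = dag
        if len(self._items) > self._max:
            # Drop the least recently used entry
            self._items.popitem(last=False)
        return dag

    def __contains__(self, version_id) -> bool:
        return version_id in self._items

    def __len__(self) -> int:
        return len(self._items)


cache = BoundedDagCache(max_versions=2)
for version in ("v1", "v2", "v3"):
    cache.get(version, loader=lambda v: {"version": v})
print(len(cache), "v1" in cache)  # 2 False
```

With `max_versions=1` only the most recently requested version is kept, which matches the "set to 1 if you do not need the feature" suggestion.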
Re: [I] High memory retention and gradual increase over time [airflow]
LourTV commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3356469122

> Can someone check if there is growth in the `dag_version` or `serialized_dag` tables in your environment when memory use climbs? Row count is sufficient.

Hello,

Since we had to roll back the Airflow version in our production environment, I executed the query in the NPRD environment. This environment currently runs only four DAGs: two scheduled every 5 minutes and two scheduled once per day. The results are as follows:

    airflow_db=# select count(*) from dag_version;
     count
    -------
     16635
    (1 row)

    airflow_db=# select count(*) from serialized_dag;
     count
    -------
     16635
    (1 row)

Additionally, while analyzing the same DAG (index_ibm_cloud_metrics_in_elasticsearch), we observed that in some runs two DAG versions were generated, whereas in other runs only one was created, despite the code remaining unchanged during the corresponding timestamps:

Image: https://github.com/user-attachments/assets/ca31c4b6-e950-431c-98db-197eda35edcb
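Editor's illustration: 16635 rows across four DAGs is roughly 4,160 versions per DAG on average, so a per-DAG breakdown can show which definitions are churning. The sketch below runs the grouping query against an in-memory SQLite stand-in with made-up rows. The `dag_id` and `version_number` columns are assumptions about the `dag_version` schema and should be checked against the real database before use.

```python
import sqlite3

# In-memory stand-in for the Airflow metadata database (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dag_version ("
    "id INTEGER PRIMARY KEY, dag_id TEXT, version_number INTEGER)"
)
rows = [("dag_a", n) for n in range(1, 6)] + [("dag_b", 1)]
conn.executemany(
    "INSERT INTO dag_version (dag_id, version_number) VALUES (?, ?)", rows
)

# Count versions per DAG, most versions first.
per_dag = conn.execute(
    "SELECT dag_id, COUNT(*) AS versions "
    "FROM dag_version GROUP BY dag_id ORDER BY versions DESC"
).fetchall()
print(per_dag)  # [('dag_a', 5), ('dag_b', 1)]
```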
Re: [I] High memory retention and gradual increase over time [airflow]
alkismavridis commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3356272562

> Can someone check if there is growth in the `dag_version` or `serialized_dag` tables in your environment when memory use climbs? Row count is sufficient.

During the "zombie" state where the scheduler has eaten 99.9% of the RAM and does nothing, I got:

Image: https://github.com/user-attachments/assets/969903ee-4671-4e62-b6ac-1af2d60fc81d
Re: [I] High memory retention and gradual increase over time [airflow]
wjddn279 commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3354375726 @lucidumio Hi, I'm interested in this issue as well, so I'm looking into it. Do you also use LocalExecutor?
Re: [I] High memory retention and gradual increase over time [airflow]
lucidumio commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3353625144 Thanks for the reply! Will changing num_runs cause any side effects or instability? For example, if a task is running or about to run when the scheduler exits, will the task get disrupted? I see that it is not recommended to change this setting in production.
Re: [I] High memory retention and gradual increase over time [airflow]
zachliu commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3353715158 maybe remotely related: https://github.com/apache/airflow/issues/50708 but in my case, the problem is the "opposite": the `dag-processor` and `celery worker` show signs of memory leak over time but the `scheduler` is totally fine 😂
Re: [I] High memory retention and gradual increase over time [airflow]
LengYue12389 commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3342223825 This issue still exists in version 3.1.0
Re: [I] High memory retention and gradual increase over time [airflow]
vertIcod3r commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3338200774 I am having the same issue. In a simple development Docker setup running 3.0.6, the Airflow scheduler and apiserver do not seem to release memory. When the scheduler reaches the limit, it stops working, as in #56045.
Re: [I] High memory retention and gradual increase over time [airflow]
alkismavridis commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3322662552 We are also affected. We are running Airflow 3.0.6 via docker compose. We are using LocalExecutor. The compose file starts a couple of containers, all of which have fairly constant memory usage over time except for the scheduler, which eats all available memory (we constrained it to 4GB) and then restarts. Restarting is actually the good scenario, because then it at least works. Sometimes it stays in a "zombie" state where it simply does nothing and no tasks are triggered for days, until we realise that something is off. When this happens, we see tons of tasks stuck in the "Queued" state in the web UI. Any help is most welcome. This makes our whole infrastructure unstable :( Image: https://github.com/user-attachments/assets/c040f2d5-e46b-4f96-a643-320025869387
Re: [I] High memory retention and gradual increase over time [airflow]
boring-cyborg[bot] commented on issue #55768: URL: https://github.com/apache/airflow/issues/55768#issuecomment-3302632413 Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
