Re: [I] High memory retention and gradual increase over time [airflow]

2025-12-02 Thread via GitHub


LengYue12389 commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3604610868

   Great, I can upgrade to 3.1.4 now


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] High memory retention and gradual increase over time [airflow]

2025-12-02 Thread via GitHub


potiuk commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3602525037

   Also possibly https://github.com/apache/airflow/pull/58944 might address the 
remaining memory issue. I will close it provisionally but @LourTV - if you 
could also check the upcoming 3.1.4, where we will have the fix for the image, 
that would be fantastic.





Re: [I] High memory retention and gradual increase over time [airflow]

2025-12-02 Thread via GitHub


potiuk closed issue #55768: High memory retention and gradual increase over time
URL: https://github.com/apache/airflow/issues/55768





Re: [I] High memory retention and gradual increase over time [airflow]

2025-11-01 Thread via GitHub


potiuk commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3476281393

   Fantastic news @kaxil ! Again thanks for all the analysis (and the PR of the 
month @wjddn279 ) 





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-29 Thread via GitHub


kaxil commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3461520365

   As promised in 
https://github.com/apache/airflow/issues/56641#issuecomment-3456893101 , here's 
our analysis on impact on Celery workers comparing 3.0.6 vs 3.1.1 (with fixes) 
with one of our deployments:
   
   In Airflow 3.0.6 various tasks running on Celery failed with OOMs as the 
memory leaks were significant. After applying changes in 
https://github.com/apache/airflow/pull/56695 to 3.0.6 or directly upgrading to 
3.1.1, the memory stayed mostly-flat and there were 0 task failures.
   
   ---
   
   **Celery Worker** with 4 GB memory on **Airflow 3.0.6**
   https://github.com/user-attachments/assets/ba5b96a6-df9b-45c3-b9c6-8019c8378c36
   
   **Celery Worker** with 4 GB memory on **Airflow 3.1.1**
   https://github.com/user-attachments/assets/ad7b520b-6bb8-4f55-bbde-8fe98f895b0d





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-28 Thread via GitHub


vertIcod3r commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3455183387

   I can at least confirm that I am using LocalExecutor





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-24 Thread via GitHub


tirkarthi commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3443725120

   We are also experiencing high memory usage for the api-server that grows 
over time until it reaches the memory limit assigned to the pod. The server 
runs `airflow api-server` with default configuration on Kubernetes. The 
scheduler uses KubernetesExecutor and we didn't see noticeable issues there. I 
tried locally using memray with the command below and 500 dags. The server 
started at 25MB, then went to 95MB and 175MB and stayed there. Here is the 
command, which gives a live view of the memory usage, in case someone wants to 
run it in a different setup.
   
   ```
   PYTHONMALLOC=malloc memray run --live -m uvicorn airflow.api_fastapi.main:app --workers 1 --port 8000
   ```
   
   https://github.com/user-attachments/assets/98b4d51d-e222-4b5c-a277-e0f7f51a609a





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-18 Thread via GitHub


wjddn279 commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3401275826

   @potiuk 
   I'll check it





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-18 Thread via GitHub


lucidumio commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3371985650

   @alkismavridis Yes, we ran into the same issue when using `num_runs` to 
control the scheduler. 
   
   Attached is the cron script we built to monitor the memory usage and restart 
the scheduler container when the usage is high. It has been running well for 
the past few days. 
   
   
[memory_watchdog.sh](https://github.com/user-attachments/files/22724267/memory_watchdog.sh)





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-18 Thread via GitHub


dstandish commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3402740293

   Those who are echoing that they see the problem as well, please share 
executor info -- there seemed to be some suggestion that local executor may be 
implicated.





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-18 Thread via GitHub


wjddn279 commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3374894204

   @kaxil 
   
   I am still in the process of investigating, but to summarize my findings so 
far:
   the increase in memory usage mainly comes from the worker processes created 
inside the scheduler container.
   
   The scheduler itself runs as a single process, so its own memory growth is 
limited and does not contribute significantly. However, since 32 worker 
processes are forked by default, their combined effect amplifies the overall 
memory usage roughly by a factor of 32.
   
   To investigate further, I used Python’s tracemalloc and modified the Airflow 
code to trace objects consuming large amounts of memory within the worker 
processes.
   
   
[tracemalloc_log.txt](https://github.com/user-attachments/files/22733403/tracemalloc_log.txt)
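
For readers who want to repeat this kind of measurement, here is a minimal sketch of a `tracemalloc` snapshot. It is not the patch that produced the attached log: the helper name `top_allocations` and the dummy `payload` workload are assumptions made for illustration, and where to call it inside a worker process is left open.

```python
import tracemalloc


def top_allocations(limit=5):
    """Return the `limit` largest allocation sites as (location, size_bytes) pairs."""
    snapshot = tracemalloc.take_snapshot()
    # Group allocations by source line; statistics() sorts biggest first.
    stats = snapshot.statistics("lineno")[:limit]
    return [
        (f"{stat.traceback[0].filename}:{stat.traceback[0].lineno}", stat.size)
        for stat in stats
    ]


if __name__ == "__main__":
    tracemalloc.start()
    # Stand-in workload; in a real worker this would be imports and task logic.
    payload = [bytes(1024) for _ in range(10_000)]
    for location, size in top_allocations():
        print(f"{size / 1024:.1f} KiB  {location}")
    tracemalloc.stop()
```

`tracemalloc` only sees allocations made after `tracemalloc.start()`, so it has to be started before the imports under suspicion run.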
   
   The investigation showed that most of the memory usage originates from 
library imports. Since each process loads the same libraries independently, the 
total footprint scales almost linearly with the number of workers. 
   
   After removing certain heavy imports and rerunning the system, I observed a 
significant reduction in per-worker memory usage. A container that previously 
failed within 30 minutes under an 8 GiB memory limit was able to run for over 
two hours without issues. (This test environment is quite demanding, as 100 
DAGs are triggered every minute.)
   
   However, the total memory used by all workers still continued to increase 
over time.
   https://github.com/user-attachments/assets/566e707b-d66d-4ecb-b389-09cbc522609d
   As shown in the attached figure, there are large variations in PSS between 
workers, and eventually all workers cross into higher memory usage regions. My 
current hypothesis is as follows:
   - In the LocalExecutor, workers are initially forked from the scheduler 
process, so they share memory pages through copy-on-write (COW). When 
additional libraries are imported or specific logic is executed later, those 
shared pages are duplicated, leading to higher private memory usage.
   - In addition to the import-related issue, I suspect that scheduler objects 
inherited by the workers are gradually modified during execution, causing small 
but steady PSS growth as copy-on-write pages are created.
   - In other executor types (e.g., CeleryExecutor, KubernetesExecutor), 
workers are launched as independent processes, so every library import directly 
adds to total memory usage without COW benefits.
   
   
   My hypothesis is that some heavy libraries are still present, and while they 
are initially loaded into the scheduler’s shared memory during forking, certain 
logic causes them to be re-imported later, resulting in private memory 
allocations per process.
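
A rough sketch of how per-worker PSS could be sampled on Linux to test this hypothesis. It assumes the kernel exposes `/proc/<pid>/smaps_rollup` and that the caller may read it; the function names are illustrative and this is not the tooling used for the figure above.

```python
from pathlib import Path


def parse_pss_kb(smaps_text):
    """Sum every `Pss:` line (in kB) found in smaps or smaps_rollup text."""
    total = 0
    for line in smaps_text.splitlines():
        # Match only the plain "Pss:" field, not Pss_Anon / Pss_File / Pss_Shmem.
        if line.startswith("Pss:"):
            total += int(line.split()[1])
    return total


def pss_kb_for_pid(pid):
    """Read PSS for one process; Linux only, needs permission to read /proc/<pid>."""
    return parse_pss_kb(Path(f"/proc/{pid}/smaps_rollup").read_text())
```

Sampling `pss_kb_for_pid` for each worker PID at intervals would show whether private pages grow steadily (copy-on-write divergence) or jump at specific imports.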
   
   I will continue to trace the problematic libraries and will open separate 
issues as I identify them.
   Please let me know if you have any questions or different perspectives on my 
findings so far.





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-18 Thread via GitHub


kaxil commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3353598174

   A temporary workaround would be to set 
[`AIRFLOW__SCHEDULER__NUM_RUNS`](https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#num-runs)
 to a small value like `100`:
   
   ```ini
   [scheduler]
   num_runs = 100
   ```
   
   This forces the scheduler to exit and restart after 100 scheduling loops, 
which clears the accumulated memory. If you're running in Kubernetes with 
proper health probes, the scheduler pod will automatically restart.
   
   ---
   
   The issue stems from **unbounded growth of the `DBDagBag` cache** inside the 
scheduler. This wasn't a significant problem in Airflow 2.x, but in 3.x with 
the introduction of DAG versioning, the problem is amplified:
   
   1. **`DBDagBag._dags` has no size limit**: The cache at 
[`airflow-core/src/airflow/models/dagbag.py:49`](https://github.com/apache/airflow/blob/964997a7a6da5041fb19e48bc31866ffc6fe7bc7/airflow-core/src/airflow/models/dagbag.py#L49)
 is an unbounded dictionary that stores deserialized DAGs indefinitely.
   
   2. **DAG versioning multiplies cache entries**: In 3.x, each DAG version 
gets a unique ID. The cache is keyed by `dag_version_id` (not `dag_id`), 
meaning:
      - Every DAG change creates a new version with a new cache entry
      - For dynamic DAGs that change frequently, this creates a new entry on 
every parse
      - Old versions are **never evicted** from the cache
      - Each `SerializedDAG` object holds full task dictionaries, dependencies, 
and metadata
      - This could explain the ~10GB growth you're seeing from the scheduler 
alone
      - It is especially bad if you use dynamic DAGs, since they lead to more 
serialized DAG versions
   
   Setting `num_runs` forces a scheduler restart, which clears the Python 
process memory and resets all caches, effectively working around the unbounded 
cache growth until we fix it properly.
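
To make the cache problem concrete, here is a minimal sketch of a size-bounded cache keyed by `dag_version_id` with least-recently-used eviction. `BoundedDagCache` and `max_size` are hypothetical names; this is not the actual `DBDagBag` code, and it is not necessarily how the fix was or will be implemented.

```python
from collections import OrderedDict


class BoundedDagCache:
    """Keep at most `max_size` deserialized DAGs, evicting the least recently used."""

    def __init__(self, max_size=64):
        self.max_size = max_size
        self._dags = OrderedDict()

    def get(self, dag_version_id):
        dag = self._dags.get(dag_version_id)
        if dag is not None:
            # Mark as most recently used so it is evicted last.
            self._dags.move_to_end(dag_version_id)
        return dag

    def put(self, dag_version_id, dag):
        self._dags[dag_version_id] = dag
        self._dags.move_to_end(dag_version_id)
        # Drop the oldest entries once the limit is exceeded.
        while len(self._dags) > self.max_size:
            self._dags.popitem(last=False)

    def __len__(self):
        return len(self._dags)
```

With a bound like this, a DAG that produces a new version on every parse would replace old entries instead of accumulating them.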
   
   





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-18 Thread via GitHub


lucidumio commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3344209617

   I am having the same issue too. We also run Airflow in dockers. The same 
DAGs ran well in Airflow 2 for years, only using less than 15 GB of memory. 
After migrating to 3.1.0, the scheduler alone takes up over 128 GB of memory in 
2 days. This is certainly unacceptable in the production environment. We have 
to roll back to Airflow 2. 





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-18 Thread via GitHub


alkismavridis commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3371858574

   Plot twist: The workaround seems to be harmful for us.
   
   We run the scheduler in a docker container. When the num_runs limit is 
reached, the scheduler stops and thus our docker health check (`airflow jobs 
check --job-type SchedulerJob --local || bash -c 'kill -s 15 -1 && (sleep 10; 
kill -s 9 -1)'`) kills the container, so that the whole docker container can 
restart.
   
   The problem is that active tasks also get killed this way. This is a big 
problem.
   
   I am looking for a way to restart only the scheduler and not the whole 
docker container.
   Maybe I will write a small bash script with an endless while loop that runs 
the scheduler with `num_runs`, and use this script as the entry point of my 
docker container.
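
A sketch of that entrypoint idea, written in Python rather than bash. It relaunches a command in a loop and passes the limit through the `AIRFLOW__SCHEDULER__NUM_RUNS` environment variable mentioned earlier in the thread. The function name, the pause, and the `max_launches` parameter (there only so the loop can be bounded for testing) are assumptions. Whether running tasks survive a scheduler exit depends on the executor and has not been verified here.

```python
import os
import subprocess
import sys
import time


def run_forever(command, num_runs=3000, pause_seconds=5.0, max_launches=None):
    """Relaunch `command` each time it exits; return the exit codes seen."""
    env = dict(os.environ, AIRFLOW__SCHEDULER__NUM_RUNS=str(num_runs))
    exit_codes = []
    while max_launches is None or len(exit_codes) < max_launches:
        # Blocks until the scheduler exits, e.g. after num_runs loops.
        exit_code = subprocess.run(command, env=env).returncode
        exit_codes.append(exit_code)
        print(f"command exited with code {exit_code}; relaunching", file=sys.stderr)
        time.sleep(pause_seconds)
    return exit_codes
```

As a container entrypoint this would be called as `run_forever(["airflow", "scheduler"])`, so the container's main process stays alive while only the scheduler process is replaced.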
   
   If anybody has some other idea, I am very happy to hear it 🙏 





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-18 Thread via GitHub


wjddn279 commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3356570158

   @LourTV 
   Looking at the query results, I can see that the DAG version keeps 
increasing.
   Could it be that there are multiple DAG files generating the DAG ID 
`index_idm_claude_metrics_in_elasticsearch`?
   If the DAG files are alternately parsed, the hash value will change, causing 
the version to keep increasing, which would lead to the issue @kaxil mentioned.
   
   It seems necessary to check the value of the `data` column in the 
`serialized_dag` table to identify what has changed and prevent the version 
from increasing meaninglessly.





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-18 Thread via GitHub


lucidumio commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3391961070

   > > Maybe I write a small bash script with an endless while loop that runs 
the scheduler with a "num_run", and use this script as the entry point of my 
docker container.
   > 
   > You can add a script in your `healthcheck` to check for number of running 
tasks to ensure you only kill the scheduler when no tasks are running.
   
   This is exactly what we did in our bash script. It is run by a cron job 
every 3 minutes and checks for active tasks before restarting the scheduler. 





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-17 Thread via GitHub


jedcunningham commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3353180527

   Can someone check if there is growth in the `dag_version` or 
`serialized_dag` tables in your environment when memory use climbs? Row count 
is sufficient.
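
One way to collect those row counts, sketched against a generic DB-API 2.0 connection. The in-memory SQLite tables in the demo are stand-ins; in a real deployment the connection would point at the Airflow metadata database, and only the two table names come from this thread.

```python
import sqlite3


def table_row_counts(connection, tables=("dag_version", "serialized_dag")):
    """Return {table_name: row_count} using any DB-API 2.0 connection."""
    counts = {}
    cursor = connection.cursor()
    for table in tables:
        # Table names are fixed identifiers here, never user input.
        cursor.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cursor.fetchone()[0]
    cursor.close()
    return counts


if __name__ == "__main__":
    # Demo with empty stand-in tables; replace with a real metadata DB connection.
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE dag_version (id INTEGER)")
    demo.execute("CREATE TABLE serialized_dag (id INTEGER)")
    print(table_row_counts(demo))
```

Running it periodically and comparing the counts against the memory graph would show whether version growth and memory growth move together.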





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-17 Thread via GitHub


lucidumio commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3353988457

   Quick feedback: I just tried setting `num_runs` to 100. It did restart the 
scheduler, but the scheduler container became "unhealthy" after the restart, 
which concerns me.
   
   Instead of changing this setting, we built a cron script to:
   
   - Monitor system memory usage
   - Restart the Airflow scheduler if usage exceeds a threshold AND no tasks 
are running
   - Force a restart if memory exceeds a critical threshold
   - Log all actions with timestamp and memory/task details
   
   We used docker-compose restart to bounce the container so the health status 
won't change. Hopefully this works till the fix is ready. 
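
The decision rules listed above could look roughly like this. It is a sketch of the logic only, not the actual cron script: the threshold values, the names, and the log format are assumptions, and gathering the memory and task numbers is left out.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Thresholds:
    restart_percent: float = 80.0   # hypothetical soft limit
    critical_percent: float = 95.0  # hypothetical hard limit


def decide(memory_percent, running_tasks, thresholds=Thresholds()):
    """Return 'force_restart', 'restart' or 'wait' for one watchdog tick."""
    if memory_percent >= thresholds.critical_percent:
        # Critical: restart even if tasks are still running.
        return "force_restart"
    if memory_percent >= thresholds.restart_percent and running_tasks == 0:
        return "restart"
    return "wait"


def format_log(now, action, memory_percent, running_tasks):
    """Build one timestamped log line with memory and task details."""
    return (
        f"{now.isoformat()} action={action} "
        f"memory={memory_percent:.1f}% running_tasks={running_tasks}"
    )
```

A cron job would call `decide` with the current readings and trigger the container restart only for the two restart outcomes.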
   
   
   
   





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-17 Thread via GitHub


wjddn279 commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3356590885

   Like in @alkismavridis's case, I also observed that even though the number 
of DAGs is not large and the DAG versions are not continuously increasing, the 
scheduler’s memory usage keeps growing. Therefore, I infer that there is 
another type of memory leak besides the cause mentioned by kaxil.
   
   When deploying Airflow with LocalExecutor using Docker Compose, I confirmed 
that worker processes are created as child processes of the scheduler, and the 
allocated memory of both the scheduler and worker processes keeps increasing.
   
   I will continue investigating the root cause, and once I find it, I will 
share the details.
   





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-17 Thread via GitHub


LengYue12389 commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3326137879

   
   I also encountered this problem: the memory usage of the dag-processor 
keeps increasing.





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-17 Thread via GitHub


kaxil commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3373423961

   > Maybe I write a small bash script with an endless while loop that runs the 
scheduler with a "num_run", and use this script as the entry point of my docker 
container.
   
   You can add a script in your `healthcheck` to check for number of running 
tasks to ensure you only kill the scheduler when no tasks are running.
   
   





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-17 Thread via GitHub


LengYue12389 commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3326158709

   I also encountered this problem, also on version 3.0.6.





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-17 Thread via GitHub


alkismavridis commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3353193422

   @jedcunningham will check it tomorrow and let you know. 





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-17 Thread via GitHub


alkismavridis commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3377308914

   In case it helps, the active PIDs (numbers are according to `docker stats`) 
also go up as the memory goes up.
   
   Maybe processes are not properly closed?





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-17 Thread via GitHub


lucidumio commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3357518252

   Looks like when the scheduler's memory usage grows to some point, the log 
service will fail and you won't get any updates in the UI log page or the log 
file. 





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-17 Thread via GitHub


LengYue12389 commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3392978693

   It seems that the airflow api-server process also has a memory leak issue
   https://github.com/user-attachments/assets/481b89f6-a7ff-4d66-8fb8-5485329e22e1





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-17 Thread via GitHub


wjddn279 commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3404276888

   @potiuk 
   I’ve created a separate issue to discuss the problem based on the results of 
the analysis performed with Memray.
   https://github.com/apache/airflow/issues/56641





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-17 Thread via GitHub


potiuk commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3397801728

   Maybe someone experiencing the memory issue could help us by running their 
Airflow processes with https://github.com/bloomberg/memray and getting some 
more insights?





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-14 Thread via GitHub


lucidumio commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3402928673

   > It seems that the airflow api-server process also has a memory leak issue 
   > https://github.com/user-attachments/assets/481b89f6-a7ff-4d66-8fb8-5485329e22e1
   
   We have the same memory leak issue for the api server as well. It has been 
slowly increasing to ~43GB of memory usage over the past 2 weeks. The memory 
usage never drops.  





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-06 Thread via GitHub


kaxil commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3373428829

   Any leads, though, if someone did a memory check on which process is taking 
up a lot of memory, apart from the DagBag growth that we are aware of?





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-02 Thread via GitHub


lucidumio commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3354859393

   Yes, local executor





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-02 Thread via GitHub


lucidumio commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3361470737

   @alkismavridis Thanks for the suggestion! 
   
   This is our scheduler section:
   
   `  airflow-scheduler:
        labels:
          com.datadoghq.ad.logs: '[{"source": "airflow", "service": "airflow-scheduler"}]'
        <<: *airflow-common
        command: scheduler
        healthcheck:
          test: ["CMD", "curl", "--fail", "http://localhost:8974/health"]
          interval: 30s
          timeout: 10s
          retries: 5
        restart: always
        depends_on:
          <<: *airflow-common-depends-on
          airflow-init:
            condition: service_completed_successfully`
   
   The docker-compose did restart the container after the scheduler exited, but 
the healthcheck failed soon afterwards. 
   
   But if we manually run the native docker-compose restart command to bounce 
the container, the healthcheck is always good:
   
   `/usr/local/bin/docker-compose restart airflow-scheduler`





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-02 Thread via GitHub


LourTV commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3360876034

   @wjddn279 In our case, the significant growth in the `dag_version` and 
`serialized_dag` tables was caused by the use of a dynamic `start_date` in the 
DAG definition (`start_date = datetime.now() - timedelta(days=1)`). This 
resulted in Airflow generating a different hash on each execution, thereby 
creating a new DAG version every time.
   
   After identifying this issue, we updated the DAG to use a fixed start_date. 
Since making this change, the number of DAG versions has stabilized and is no 
longer increasing.
   
   Following this adjustment, we restarted Airflow and monitored the system for 
several hours to assess whether the memory usage issue persisted. 
Unfortunately, the results remain the same: although Airflow is no longer 
generating excessive DAG versions, we continue to observe a gradual increase in 
RAM and swap usage over time, eventually leading to an out-of-memory condition.
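
   For anyone hitting the same versioning problem, here is a minimal sketch of 
why a parse-time-dependent start_date produces a new hash on every parse. It 
uses only the standard library; `fingerprint`, `dynamic_dag` and `fixed_dag` 
are hypothetical stand-ins, and the hashing is an assumption made for 
illustration, not Airflow's actual serialization or hashing code.

```python
import hashlib
import json
from datetime import datetime, timedelta


def fingerprint(dag_fields: dict) -> str:
    """Hash a JSON dump of the fields; a stand-in for a serialized-DAG hash."""
    payload = json.dumps(dag_fields, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def dynamic_dag(parse_time: datetime) -> dict:
    # Mirrors start_date = datetime.now() - timedelta(days=1):
    # the value depends on when the file is parsed.
    return {"dag_id": "example", "start_date": parse_time - timedelta(days=1)}


def fixed_dag(parse_time: datetime) -> dict:
    # A fixed start_date is identical on every parse; parse_time is unused.
    return {"dag_id": "example", "start_date": datetime(2025, 1, 1)}


# Two parses of the same, unchanged file, 30 seconds apart.
first = datetime(2025, 10, 2, 12, 0, 0)
second = first + timedelta(seconds=30)

dynamic_changed = fingerprint(dynamic_dag(first)) != fingerprint(dynamic_dag(second))
fixed_changed = fingerprint(fixed_dag(first)) != fingerprint(fixed_dag(second))

print("dynamic start_date changes the hash:", dynamic_changed)
print("fixed start_date changes the hash:", fixed_changed)
```

   In this sketch only the dynamic variant yields a different fingerprint 
between the two parses, which matches the behaviour we saw before switching to 
a fixed start_date.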





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-02 Thread via GitHub


alkismavridis commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3359816439

   I can confirm that the workaround from @kaxil is working for us. We played 
around a bit with the value of num_runs and found that something in the 
range of 3000 does the job for us. The scheduler restarts approximately every 
50 minutes and the RAM usage stays in the range of 1 GB to 1.5 GB, which is OK.
   
   https://github.com/user-attachments/assets/a71404c8-8160-4399-959b-1776bb664e53
   
   @lucidumio I am just guessing here, but I assume the airflow process ends 
(because of num_runs), but your Docker container does not restart. Your health 
check runs the command to check if the scheduler is running, the answer is no, 
and thus you end up with an unhealthy container.
   
   The way we solve this problem is that our healthcheck command ALSO 
terminates the container when it finds it unhealthy. Then Docker restarts it.
   
   Here is our docker-compose section for the scheduler. Please note two things: 
   - restart: always
   - Our healthcheck command. The section after the OR operator kills all 
PIDs, so the container restarts.
   
   ```yaml
   airflow-scheduler:
     <<: *airflow-common
     container_name: airflow-scheduler
     command: scheduler
     healthcheck:
       test: ["CMD-SHELL", "airflow jobs check --job-type SchedulerJob --local || bash -c 'kill -s 15 -1 && (sleep 10; kill -s 9 -1)'"]
       interval: 30s
       timeout: 10s
       retries: 5
       start_period: 150s
     restart: always
     depends_on:
       <<: *airflow-common-depends-on
       airflow-init:
         condition: service_completed_successfully
     mem_limit: 4000m
   ```
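
   The "check, and terminate on failure" logic of that healthcheck can also be 
sketched in Python. This is only an illustration of the control flow: 
`check_or_terminate` is a hypothetical helper, and the demo uses harmless 
stand-in commands so it can run anywhere.

```python
import subprocess
import sys
from typing import Callable, Sequence


def check_or_terminate(check_cmd: Sequence[str], terminate: Callable[[], None]) -> bool:
    """Run the check; if it exits non-zero, call terminate() and report unhealthy."""
    healthy = subprocess.run(list(check_cmd), check=False).returncode == 0
    if not healthy:
        terminate()
    return healthy


# Harmless stand-ins: one check that succeeds and one that fails.
terminations = []
ok = check_or_terminate(
    [sys.executable, "-c", "raise SystemExit(0)"],
    lambda: terminations.append("ok-check"),
)
bad = check_or_terminate(
    [sys.executable, "-c", "raise SystemExit(1)"],
    lambda: terminations.append("failed-check"),
)
print(ok, bad, terminations)
```

   In a real container the check command would be the one from the compose 
file above (`airflow jobs check --job-type SchedulerJob --local`), and the 
terminate callback would send the signals that the `kill` part of the 
healthcheck sends.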





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-01 Thread via GitHub


kaxil commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3356837438

   Yeah, the API server also uses the same DagBag object I mentioned, which is 
unbounded right now and grows continuously. We should fix it with a proper 
cache -- Jed & I have some ideas on how to do it without impacting performance 
or causing cache thrashing.
   
   But some comments above suggest that there could be more than just DagBag 
(with dag versions + dag serialization) causing higher memory!





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-01 Thread via GitHub


lucidumio commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3356660460

   It would be good to add another variable, e.g., `num_dag_versions`, to let 
users control the number of dag versions to keep in cache. If a user does 
not need this feature, the value can be set to 1.
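
   To make the idea concrete, here is a minimal sketch of a size-capped cache 
with least-recently-used eviction. `BoundedDagCache`, `max_entries` and the 
`load` callback are hypothetical names for illustration only; this is not 
Airflow's DagBag and not necessarily how a fix would be implemented.

```python
from collections import OrderedDict
from typing import Callable, List


class BoundedDagCache:
    """Keep at most max_entries deserialized DAG versions in memory."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._entries = OrderedDict()  # version id -> cached DAG object

    def get(self, version_id: str, load: Callable[[str], object]) -> object:
        if version_id in self._entries:
            # Cache hit: mark this version as most recently used.
            self._entries.move_to_end(version_id)
            return self._entries[version_id]
        # Cache miss: load the version, then evict the oldest entry if over the cap.
        dag = load(version_id)
        self._entries[version_id] = dag
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)
        return dag

    def cached_ids(self) -> List[str]:
        return list(self._entries)

    def __len__(self) -> int:
        return len(self._entries)


cache = BoundedDagCache(max_entries=2)
for version_id in ["v1", "v2", "v1", "v3"]:
    cache.get(version_id, load=lambda vid: {"version": vid})
print(cache.cached_ids())
```

   With a cap of 2, the least recently used version ("v2") is evicted when 
"v3" is loaded, so memory stays bounded no matter how many versions exist in 
the database. A cap of 1 would keep only the latest version that was requested.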
   
   Also, the API (Web) server seems to have a memory leak issue as well. When 
browsing a DAG page with a large number of tasks in the Grid view, the API 
server's memory usage grows quickly, but it does not drop after closing the 
page. 
   
   





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-01 Thread via GitHub


LourTV commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3356469122

   > Can someone check if there is growth in the `dag_version` or 
`serialized_dag` tables in your environment when memory use climbs? Row count 
is sufficient.
   
   Hello,
   Since we had to roll back the Airflow version in our production environment, 
I executed the query in the NPRD environment. This environment currently runs 
only four DAGs: two scheduled every 5 minutes and two scheduled once per day. 
The results are as follows:
   
   airflow_db=# select count(*) from dag_version;
    count
   -------
    16635
   (1 row)
   
   airflow_db=# select count(*) from serialized_dag;
    count
   -------
    16635
   (1 row)
   
   
   Additionally, while analyzing the same DAG 
(index_ibm_cloud_metrics_in_elasticsearch), we observed that in some runs, two 
DAG versions were generated, whereas in other runs only one was created, 
despite the code remaining unchanged during the corresponding timestamps:
   
   https://github.com/user-attachments/assets/ca31c4b6-e950-431c-98db-197eda35edcb
   
   





Re: [I] High memory retention and gradual increase over time [airflow]

2025-10-01 Thread via GitHub


alkismavridis commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3356272562

   > Can someone check if there is growth in the `dag_version` or 
`serialized_dag` tables in your environment when memory use climbs? Row count 
is sufficient.
   
   During the "zombie" state where the scheduler has eaten 99.9% of the RAM and 
does nothing, I got:
   
   https://github.com/user-attachments/assets/969903ee-4671-4e62-b6ac-1af2d60fc81d
   






Re: [I] High memory retention and gradual increase over time [airflow]

2025-09-30 Thread via GitHub


wjddn279 commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3354375726

   @lucidumio 
   Hi, I'm interested in this issue as well, so I'm looking into it.
   Do you also use LocalExecutor?





Re: [I] High memory retention and gradual increase over time [airflow]

2025-09-30 Thread via GitHub


lucidumio commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3353625144

   Thanks for the reply! Will changing num_runs cause any side effects or 
instability issues? For example, if a task is running or about to run when the 
scheduler exits, will it be disrupted? I see that it is not recommended to 
change this setting in production. 





Re: [I] High memory retention and gradual increase over time [airflow]

2025-09-30 Thread via GitHub


zachliu commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3353715158

   maybe remotely related: https://github.com/apache/airflow/issues/50708
   but in my case, the problem is the "opposite": the `dag-processor` and 
`celery worker` show signs of memory leak over time but the `scheduler` is 
totally fine 😂 





Re: [I] High memory retention and gradual increase over time [airflow]

2025-09-27 Thread via GitHub


LengYue12389 commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3342223825

   This issue still exists in version 3.1.0





Re: [I] High memory retention and gradual increase over time [airflow]

2025-09-26 Thread via GitHub


vertIcod3r commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3338200774

   I am having the same issue. In a simple development Docker setup running 
Airflow 3.0.6, the scheduler and API server do not seem to release memory. When 
the scheduler reaches the limit, it stops working, as in #56045.





Re: [I] High memory retention and gradual increase over time [airflow]

2025-09-25 Thread via GitHub


alkismavridis commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3322662552

   We are also affected. We are running Airflow 3.0.6 via docker compose. We 
are using LocalExecutor.
   The compose file starts a couple of containers, all of which have fairly 
constant memory usage over time except for the scheduler, which eats all 
available memory (we constrained it to 4GB) and then restarts.
   
   Restarting is actually the good scenario, because then it at least works. 
Sometimes it stays in a "zombie" state where it simply does nothing and no 
tasks are triggered for days, until we realise that something is off. 
When this happens, we see in the web UI tons of tasks stuck in the "Queued" 
state. 
   
   Any help is most welcome. This makes our whole infrastructure unstable :(
   
   https://github.com/user-attachments/assets/c040f2d5-e46b-4f96-a643-320025869387
   





Re: [I] High memory retention and gradual increase over time [airflow]

2025-09-17 Thread via GitHub


boring-cyborg[bot] commented on issue #55768:
URL: https://github.com/apache/airflow/issues/55768#issuecomment-3302632413

   Thanks for opening your first issue here! Be sure to follow the issue 
template! If you are willing to raise a PR to address this issue, please do so; 
there is no need to wait for approval.
   

