Re: [D] Massive metadata table even after clean with CLI [airflow]

2025-07-11 Thread via GitHub


GitHub user Thelma345 edited a comment on the discussion: Massive metadata 
table even after clean with CLI

You're likely dealing with table bloat, which is common in PostgreSQL even 
after deleting rows. Commands like VACUUM and REINDEX don’t always reclaim 
space on disk. To free up actual storage, try running VACUUM FULL, just note it 
will lock the table during execution. Also, make sure autovacuum is properly 
configured and running. In AWS RDS, some vacuum operations may behave 
differently due to managed settings.

For large tables like public.job, consider table partitioning in future setups 
to simplify cleanup.

Just like a well-maintained database helps performance, clean and organized 
spaces help students focus and succeed. You can explore a practical example of 
this kind of consistent maintenance 
[here](https://naecleaningsolutions.com/school-cleaning-services-austin/), 
where keeping learning environments clean plays a big role in daily success.

GitHub link: 
https://github.com/apache/airflow/discussions/52889#discussioncomment-13737216


This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]



Re: [D] Massive metadata table even after clean with CLI [airflow]

2025-07-11 Thread via GitHub


GitHub user Thelma345 added a comment to the discussion: Massive metadata table 
even after clean with CLI

You're likely dealing with table bloat, which is common in PostgreSQL even 
after deleting rows. Commands like VACUUM and REINDEX don’t always reclaim 
space on disk. To free up actual storage, try running VACUUM FULL, just note it 
will lock the table during execution. Also, make sure autovacuum is properly 
configured and running. In AWS RDS, some vacuum operations may behave 
differently due to managed settings.

For large tables like public.job, consider table partitioning in future setups 
to simplify cleanup.

Just like a well-maintained database helps performance, clean and organized 
spaces help students focus and succeed. You can explore a practical example of 
this kind of consistent maintenance here, where keeping learning environments 
clean plays a big role in daily success.

GitHub link: 
https://github.com/apache/airflow/discussions/52889#discussioncomment-13737216


This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]



Re: [D] Massive metadata table even after clean with CLI [airflow]

2025-07-08 Thread via GitHub


GitHub user joaofernandes5 added a comment to the discussion: Massive metadata 
table even after clean with CLI

Hi @potiuk. I'm sorry for the confusion, I was saying table _job_ but I was 
looking into the **log** table.
I'm using the following query to check my table size:
```
SELECT 
schemaname || '.' || relname AS table_name,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
pg_size_pretty(pg_relation_size(relid)) AS table_size,
pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) AS 
index_size
FROM pg_catalog.pg_statio_user_tables 
ORDER BY pg_total_relation_size(relid) DESC;
```

The results shows that I really do not have any data before my cut-off date:
```
airflow-metadata=> SELECT COUNT(*) 
FROM public.log 
WHERE dttm < '2025-03-01';
 count 
---
 0
(1 row)

airflow-metadata=> SELECT COUNT(*) 
FROM public.log 
WHERE dttm > '2025-03-01';
  count  
-
 7896329
(1 row)
```

But it still with 20gb:
   
size  | table_size | index_size 
---+++
 public.log | 21 GB 
| 17 GB  | 4108 MB


Do you think that it is correct? It is really a really great amount, but logs 
are usually really have. Maybe I should reduce my cut-off date..


GitHub link: 
https://github.com/apache/airflow/discussions/52889#discussioncomment-13695812


This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]



Re: [D] Massive metadata table even after clean with CLI [airflow]

2025-07-08 Thread via GitHub


GitHub user joaofernandes5 added a comment to the discussion: Massive metadata 
table even after clean with CLI

Hi @DjVinnii,
Thanks for your time :))
I'm sorry for not making it clear, I've already used both drop-archived and 
skip-archived. However, the problem continues.
It is really strange because it was 22gb before running the command and it 
still 22gb after running.

GitHub link: 
https://github.com/apache/airflow/discussions/52889#discussioncomment-13695289


This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]



Re: [D] Massive metadata table even after clean with CLI [airflow]

2025-07-05 Thread via GitHub


GitHub user potiuk added a comment to the discussion: Massive metadata table 
even after clean with CLI

For that purpose 
https://airflow.apache.org/docs/apache-airflow/2.10.5/database-erd-ref.html 
exists - while we do not advice users to rely on that DB structure (because it 
will change over time in non-compatible ways) - we make a snapshhot of the DB 
structure with every version of Airlfow so that it can aid you in similar 
investigations.

GitHub link: 
https://github.com/apache/airflow/discussions/52889#discussioncomment-13667897


This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]



Re: [D] Massive metadata table even after clean with CLI [airflow]

2025-07-05 Thread via GitHub


GitHub user potiuk added a comment to the discussion: Massive metadata table 
even after clean with CLI

That's one option - but then it would be a separate table.  

How exactly are you checking size of public.job table (i.e. how do you **know** 
it takes 20 GB and how did you compare it to before ? It might simply be that 
you have a lot of entris in the job table which are older than the date you 
specified  - because of some bug or maybe some failed attempts of yours to 
generate a lot of jobs. I'd also advise you to take a look at the content of 
that table - it's likely for some reason you could have generated A LOT of jobs 
and all of them are "after" the cut-off date you specified.

You might have a state of the DB that typical "tools" provided by Airflow CLI 
might be not enough and you migh have to employ your deployment manager 
investigative efforts to find out what's in the DB and why it is so big, 
because likely you got into some unusal state. 

Also it might be that you generate so many jobs that if you want to keep all of 
them since 1st of March the simply WILL take that much space. Not very likely 
for a JOB table but well - we do not know your case. In which case you either 
should keep much shorter history or rethink the way why so many jobs are 
created.

Simply counting how many entries are in the table and looking at the data and 
aggregating it- by day, type or smth could show you where you might try to 
start investigate where the numbers are coming from - whether it's normal and 
expected, or whether you had some unusual event that generated it.

All on you to get and run some DB queries to check it.

GitHub link: 
https://github.com/apache/airflow/discussions/52889#discussioncomment-13667890


This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]



Re: [D] Massive metadata table even after clean with CLI [airflow]

2025-07-04 Thread via GitHub


GitHub user DjVinnii added a comment to the discussion: Massive metadata table 
even after clean with CLI

Hi @joaofernandes5! The `airflow db clean` command will archive the data 
instead of directly deleting it. After running `airflow db clean` you could run 
`airflow db drop-archived`. It is also possible to use the --skip-archive flag 
with the `airflow db clean` command to skip archival and immediately remove the 
data. I'd suggest having a look at the [Airflow CLI 
documentation](https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#db)
 to see all the possibilities.

GitHub link: 
https://github.com/apache/airflow/discussions/52889#discussioncomment-13666345


This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]