Re: [D] Massive metadata table even after clean with CLI [airflow]
GitHub user Thelma345 edited a comment on the discussion: Massive metadata table even after clean with CLI You're likely dealing with table bloat, which is common in PostgreSQL even after deleting rows. Commands like VACUUM and REINDEX don’t always reclaim space on disk. To free up actual storage, try running VACUUM FULL, just note it will lock the table during execution. Also, make sure autovacuum is properly configured and running. In AWS RDS, some vacuum operations may behave differently due to managed settings. For large tables like public.job, consider table partitioning in future setups to simplify cleanup. Just like a well-maintained database helps performance, clean and organized spaces help students focus and succeed. You can explore a practical example of this kind of consistent maintenance [here](https://naecleaningsolutions.com/school-cleaning-services-austin/), where keeping learning environments clean plays a big role in daily success. GitHub link: https://github.com/apache/airflow/discussions/52889#discussioncomment-13737216 This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
Re: [D] Massive metadata table even after clean with CLI [airflow]
GitHub user Thelma345 added a comment to the discussion: Massive metadata table even after clean with CLI You're likely dealing with table bloat, which is common in PostgreSQL even after deleting rows. Commands like VACUUM and REINDEX don’t always reclaim space on disk. To free up actual storage, try running VACUUM FULL, just note it will lock the table during execution. Also, make sure autovacuum is properly configured and running. In AWS RDS, some vacuum operations may behave differently due to managed settings. For large tables like public.job, consider table partitioning in future setups to simplify cleanup. Just like a well-maintained database helps performance, clean and organized spaces help students focus and succeed. You can explore a practical example of this kind of consistent maintenance here, where keeping learning environments clean plays a big role in daily success. GitHub link: https://github.com/apache/airflow/discussions/52889#discussioncomment-13737216 This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
Re: [D] Massive metadata table even after clean with CLI [airflow]
GitHub user joaofernandes5 added a comment to the discussion: Massive metadata table even after clean with CLI Hi @potiuk. I'm sorry for the confusion, I was saying table _job_ but I was looking into the **log** table. I'm using the following query to check my table size: ``` SELECT schemaname || '.' || relname AS table_name, pg_size_pretty(pg_total_relation_size(relid)) AS total_size, pg_size_pretty(pg_relation_size(relid)) AS table_size, pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) AS index_size FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC; ``` The results shows that I really do not have any data before my cut-off date: ``` airflow-metadata=> SELECT COUNT(*) FROM public.log WHERE dttm < '2025-03-01'; count --- 0 (1 row) airflow-metadata=> SELECT COUNT(*) FROM public.log WHERE dttm > '2025-03-01'; count - 7896329 (1 row) ``` But it still with 20gb: size | table_size | index_size ---+++ public.log | 21 GB | 17 GB | 4108 MB Do you think that it is correct? It is really a really great amount, but logs are usually really have. Maybe I should reduce my cut-off date.. GitHub link: https://github.com/apache/airflow/discussions/52889#discussioncomment-13695812 This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
Re: [D] Massive metadata table even after clean with CLI [airflow]
GitHub user joaofernandes5 added a comment to the discussion: Massive metadata table even after clean with CLI Hi @DjVinnii, Thanks for your time :)) I'm sorry for not making it clear, I've already used both drop-archived and skip-archived. However, the problem continues. It is really strange because it was 22gb before running the command and it still 22gb after running. GitHub link: https://github.com/apache/airflow/discussions/52889#discussioncomment-13695289 This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
Re: [D] Massive metadata table even after clean with CLI [airflow]
GitHub user potiuk added a comment to the discussion: Massive metadata table even after clean with CLI For that purpose https://airflow.apache.org/docs/apache-airflow/2.10.5/database-erd-ref.html exists - while we do not advice users to rely on that DB structure (because it will change over time in non-compatible ways) - we make a snapshhot of the DB structure with every version of Airlfow so that it can aid you in similar investigations. GitHub link: https://github.com/apache/airflow/discussions/52889#discussioncomment-13667897 This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
Re: [D] Massive metadata table even after clean with CLI [airflow]
GitHub user potiuk added a comment to the discussion: Massive metadata table even after clean with CLI That's one option - but then it would be a separate table. How exactly are you checking size of public.job table (i.e. how do you **know** it takes 20 GB and how did you compare it to before ? It might simply be that you have a lot of entris in the job table which are older than the date you specified - because of some bug or maybe some failed attempts of yours to generate a lot of jobs. I'd also advise you to take a look at the content of that table - it's likely for some reason you could have generated A LOT of jobs and all of them are "after" the cut-off date you specified. You might have a state of the DB that typical "tools" provided by Airflow CLI might be not enough and you migh have to employ your deployment manager investigative efforts to find out what's in the DB and why it is so big, because likely you got into some unusal state. Also it might be that you generate so many jobs that if you want to keep all of them since 1st of March the simply WILL take that much space. Not very likely for a JOB table but well - we do not know your case. In which case you either should keep much shorter history or rethink the way why so many jobs are created. Simply counting how many entries are in the table and looking at the data and aggregating it- by day, type or smth could show you where you might try to start investigate where the numbers are coming from - whether it's normal and expected, or whether you had some unusual event that generated it. All on you to get and run some DB queries to check it. GitHub link: https://github.com/apache/airflow/discussions/52889#discussioncomment-13667890 This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
Re: [D] Massive metadata table even after clean with CLI [airflow]
GitHub user DjVinnii added a comment to the discussion: Massive metadata table even after clean with CLI Hi @joaofernandes5! The `airflow db clean` command will archive the data instead of directly deleting it. After running `airflow db clean` you could run `airflow db drop-archived`. It is also possible to use the --skip-archive flag with the `airflow db clean` command to skip archival and immediately remove the data. I'd suggest having a look at the [Airflow CLI documentation](https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#db) to see all the possibilities. GitHub link: https://github.com/apache/airflow/discussions/52889#discussioncomment-13666345 This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
