*Why not partitions for Job?*

I don't think partitioning is a good idea for the Jobs table - simply because the "job" and "partition" logic are incompatible.

The problem is that any "reasonable" partitioning you could come up with for jobs to speed up pruning cannot be "static" - and static assignment is what makes partitioning really work. Partitioning only makes sense when a record can be assigned to a partition at creation time and never changes it. A Job row, by its nature, would have to change partitions for the partitioning to be useful: something that you cannot prune now, because it is "active" now, should in the future become "completed" and ready to be pruned, so it should move to a partition that can be pruned. So you logically need:

* a "partition" (or partitions, or some other bucket of jobs) that keeps the currently active jobs
* other "partitions" (or buckets) that keep only the jobs that are no longer active and can be safely pruned

Since each job might take an arbitrarily long time to go from "active" to "completed", what you really need is a way for jobs to move from the "active" partition to a non-active one. That is not how partitioning works: partitioning is static by definition, and moving rows between partitions is effectively the same as deleting them from one table and inserting them into another - partitions can be thought of as completely separate tables that share the same structure but not indexes. Such a delete/insert pattern for Job rows would completely destroy performance and would be worse than your WAL issue today.
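To make that concrete, here is a purely illustrative sketch (not a proposal - the table name, columns and states below are simplified placeholders, not the real Airflow model) of what a state-based partitioning of Job would look like and where it breaks down:

    -- Hypothetical, simplified "job" table, list-partitioned by state:
    CREATE TABLE job_example (
        id               BIGINT      NOT NULL,
        state            TEXT        NOT NULL,
        latest_heartbeat TIMESTAMPTZ,
        PRIMARY KEY (id, state)   -- the partition key must be part of the PK
    ) PARTITION BY LIST (state);

    CREATE TABLE job_example_active PARTITION OF job_example
        FOR VALUES IN ('running');
    CREATE TABLE job_example_done PARTITION OF job_example
        FOR VALUES IN ('success', 'failed');

    -- Pruning the "done" bucket would indeed be cheap:
    --   DROP TABLE job_example_done;
    -- but the moment a job finishes, this UPDATE has to move the row across
    -- partitions, and Postgres executes that as a DELETE from the "active"
    -- partition plus an INSERT into the "done" one - exactly the
    -- delete/insert churn described above:
    UPDATE job_example SET state = 'success' WHERE id = 42;

(It also shows the other problem discussed earlier in this thread: the primary key has to include the partition key, so "id" alone stops being unique.)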
*How to possibly achieve what you want with UNLOGGED?*

> and if we set it to unlogged, we lose the replication feature

Yeah - that's the trade-off I mentioned before for the JOB table. You would definitely lose the "instant" and easy physical block replication capabilities, however you could instead implement a way to track "current" jobs and restore them when needed in case of fail-over (which, I understand, would complicate your replication failover mechanism a bit).

I do not know how difficult it would be, and I have no "good" experience with managing Postgres, but with Postgres's reputation I would be surprised if there was no feature you could use. I looked it up a bit and ... I think there is one: fine-grained, logical replication - https://www.postgresql.org/docs/current/logical-replication.html. From what I understand, logical replication can work in parallel with physical replication. And unlike physical block-based replication it is more fine-grained: it can select not only which rows are published (WHERE) but also which operations - for example, you can tell it to publish only jobs in a given state, and to skip certain operations (say DELETE).

I think your case could be handled nicely for Job by a combination of several things (a rough SQL sketch follows the list):

a) mark the JOB table as UNLOGGED
b) enable logical replication for the JOB table from primary to replica, publishing only INSERT and UPDATE but not DELETE and TRUNCATE
c) (here I am theorising a bit, because I am not sure how it would work) subscribe on the replica side to receive those INSERTs and UPDATEs and apply them to the replica's JOB table. I **guess** it would work even though the JOB table is unlogged (though, reading up on the UNLOGGED feature, the table is in a weird Schrödinger state on the replica - it exists, but any access to it there yields an error - so maybe you would have to subscribe into a JOB_REPLICA table or something like that instead)
d) run "airflow db clean jobs" (or an equivalent query) periodically on BOTH the primary and the replica. That avoids all the WAL / logical-replication overload, because the table is UNLOGGED and DELETE operations are not replicated. Effectively you get independent "pruning" of data on both primary and replica (and it scales easily to as many replicas as you want)
e) if you can subscribe directly into the replica's JOB table, you are done: at the moment of failover all the "current" rows will be in sync (up to logical replication lag). The set of "completed" jobs might differ between primary and replica, but that is not an issue
f) if you can only subscribe into another table, then failover needs to be paired with renaming the JOB_REPLICA table to JOB, but that is practically a zero-cost operation
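To make b) - d) a bit more concrete, here is a rough sketch of the SQL involved. The publication/subscription names, the connection string and the retention query are made-up placeholders, and - as said in c) - whether this plays well with an UNLOGGED table is exactly the part that needs a PoC, since logical decoding reads the WAL that unlogged tables skip:

    -- On the primary: publish only INSERT and UPDATE for the job table,
    -- never DELETE or TRUNCATE (point b).
    -- (PostgreSQL 15+ additionally allows a row filter such as
    --  FOR TABLE job WHERE (state = 'running'), with some replica-identity
    --  caveats for UPDATEs.)
    CREATE PUBLICATION job_pub
        FOR TABLE job
        WITH (publish = 'insert, update');

    -- On the replica side (point c): this side needs a writable table
    -- matching the published one, into which the changes are applied.
    CREATE SUBSCRIPTION job_sub
        CONNECTION 'host=primary-host dbname=airflow user=replicator'
        PUBLICATION job_pub;

    -- Point d): prune independently on BOTH sides ("airflow db clean" or an
    -- equivalent query) - DELETEs are never published, so each side cleans
    -- up its own completed jobs:
    DELETE FROM job
    WHERE state NOT IN ('running')
      AND latest_heartbeat < now() - interval '30 days';

The nice property here is that retention never travels over replication at all: combined with UNLOGGED (point a), the pruning DELETEs generate no WAL and are not replicated, which is what removes the amplification you are seeing today.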
All of that can be done without changing Airflow code.

J.

On Tue, Dec 30, 2025 at 6:51 AM Natanel <[email protected]> wrote: > That is a good idea, however the larger and more problematic table is the > job table, we are using replication across our dB's for disaster recovery, > and if we set it to unlogged, we lose the replication feature, for the log > table it is fine, as we do not mind losing audit logs when we switch that > active db, however, for the job it will cause problems. > > As the scheduler, dag processor, triggerer and webserver/apiserver, along > with all the running task data will be lost, and might cause those jobs to > crash when they try to update the latest_heartbeat field, or the state. > > I do not think we can solve this issue without some code changes to make > the tables partition friendly. > > Any other suggestions that might avoid a code change will help as it will > allow us to fix the issue faster than upgrading a major release, but I do > think that the code change may benefit some of the community. > > On Tue, Dec 30, 2025, 07:02 Rahul Vats <[email protected]> wrote: > > > +1 to Jarek's suggestion. The UNLOGGED table approach for log and job > > tables makes sense for high-volume deployments. Pairing with external log > > storage provides a safety net for recovery scenarios. > > > > Regards, > > Rahul Vats > > > > On Tue, 30 Dec 2025 at 01:50, Jarek Potiuk <[email protected]> wrote: > > > > > Wait.. Actually ... Stupid me! > > > > > > I just realized setting the "unlogged" for the two tables means that > you > > do > > > not have to do any partitioning at all ....... Because you will not > have > > > all the WAL issues AT ALL. > > > > > > So it's not really about making things "partitioning friendly" but > simply > > > "mark those two tables as unlogged" and make sure that you handle > > recovery > > > when plug is pulled. No partitioning needed at all. Might be connected > > with > > > the "send audit logs somewhere". > > > > > > So - we might not need **any** change to airflow, just a documentation > > > (maybe even unofficial blog of some sort) - describing how to mark > those > > > two tables as unlogged and how to handle recovery in case of replica > > > failover.
> > > > > > > > > > > > On Mon, Dec 29, 2025 at 9:15 PM Jarek Potiuk <[email protected]> wrote: > > > > > > > > But eg lookiups to log table might use ti key attrs . > > > > > > > > Yeah - that would be the price to pay. Trade-offs, trade-offs > > everywhere. > > > > > > > > Such queries over a multi-partition table would have to do full table > > > scan > > > > - they will not be able to use indexes (unless the queries use > > partition > > > > key). So they would **work** but would be **slow** if the table gets > > too > > > > big. > > > > However I would argue that if someone would like to partition those > > > > tables, they will frequently and aggressively prune the old > partitions > > > > (this is the reason why they want to have it) - which will > effectively > > > keep > > > > those tables "small" (relatively). And even if full table scan is > used, > > > if > > > > there is a question about task_id audit log, that would be fine. > > > > > > > > Actually - I strongly believe - and we had this discussion in the > past > > - > > > > that the log table is quite a bit of abomination, because it's not a > > > "true" > > > > audit log if it is kept in modifiable database, and anyone who want > to > > > make > > > > it a "true" audit log will have to effectively send those log entries > > to > > > a > > > > "write-only" storage and query the audit logs there. And that nicely > > fits > > > > into "unlogged" pattern -> you could add a trigger in your DB to > > > > automatically send audit logs somewhere "write-only" and aggressively > > > prune > > > > the old data (i.e. partitions). > > > > > > > > So - from the side of airflow that would mean that those tables are > > > > "partitioning friendly", but not to implement partitioning itself. > > > Leaving > > > > "partitioning|" and "handling recovery" and "handling audit logs" up > to > > > the > > > > Deployment Manager. Which IMHO should happen anyway for "log" table > and > > > > would be nice pattern to describe as a way to achieve "true" audit > log. > > > > > > > > J,. > > > > > > > > > > > > > > > > On Mon, Dec 29, 2025 at 9:02 PM Daniel Standish via dev < > > > > [email protected]> wrote: > > > > > > > >> The problem with partitioning them is that when querying you would > > need > > > to > > > >> know the partition. > > > >> But eg lookiups to log table might use ti key attrs . > > > >> > > > >> I would try just trimming log and job table in small batches > > frequently > > > >> using an appropriate index, before thinking about the complexity of > > > >> partitioning. > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> On Mon, Dec 29, 2025 at 11:24 AM Jarek Potiuk <[email protected]> > > wrote: > > > >> > > > >> > I have a feeling that making those **two** tables > partition-friendly > > > >> should > > > >> > be easy - and maybe that's all that we need. That would make it > > > possible > > > >> > for you to use the partitioning, (where it would not be necessary > > for > > > >> > everyone). > > > >> > There might be a few ways of doing it without losing uniqueness. > For > > > >> > example Job ID and audit log ID could follow certain conventions > - > > > and > > > >> > always start with a partition key) thus providing uniqueness we > > need. > > > >> > > > > >> > Those two tables are pretty specific and neither job id nor log id > > > >> impact > > > >> > anything else in our data model. 
> > > >> > > > > >> > I think such a change to make those two tables "partition > friendly" > > > >> could > > > >> > be accepted, but we could only say it after seeing a POC. > > > >> > > > > >> > J. > > > >> > > > > >> > > > > >> > On Mon, Dec 29, 2025 at 8:01 PM <[email protected]> wrote: > > > >> > > > > >> > > I do think that the second option is best, it is also what we > > wanted > > > >> to > > > >> > > do, the only reason we did not do that is because from our > tests, > > > sql > > > >> > > alchemy sometimes breaks as it expects certain constraints which > > are > > > >> not > > > >> > > there, mainly for update queries, select works well, if I am not > > > >> > mistaken, > > > >> > > there are 3 or 4 large tables, job being the largest (7 times > > larger > > > >> than > > > >> > > the second largest), the question is, will such a pull request > be > > > >> > approved? > > > >> > > > > > >> > > As we do lose the unique constraint (as it becomes per > partition), > > > >> though > > > >> > > it is a sequence that most likely won't repeat until the > previous > > > has > > > >> > been > > > >> > > deleted, but if not, we might query unrelated job or log data, > and > > > so > > > >> > > changes in the api server are also needed, creating the pr is > not > > a > > > >> > > problem, the question is how beneficial will it be, as if it is > > done > > > >> to > > > >> > > those problematic tables, it means that the preferred way to > > manage > > > >> > > retention is from the db, and can be an optional alembic script. > > > >> > > > > > >> > > I do not want to just push a fix that will add more complexity > > than > > > >> the > > > >> > > benefit it will bring. > > > >> > > > > > >> > > If we do go the pr way, a separate discussion is probably needed > > to > > > >> > decide > > > >> > > how it should be done (most likely an additional airflow command > > to > > > >> turn > > > >> > on > > > >> > > or off the partitions and retention) > > > >> > > > > > >> > > Doing it as a custom solution has caused problems with > sqlalchemy, > > > >> and we > > > >> > > do not want to do so as if later an alembic script relies on the > > > >> primary > > > >> > > key in some way, we will need to fix it manually and deal with > the > > > >> > problems > > > >> > > it may cause when we update. > > > >> > > > > > >> > > > On 29 Dec 2025, at 21:36, Jarek Potiuk <[email protected]> > > wrote: > > > >> > > > > > > >> > > > Another option is that you ( could make ONLY for those two > > > tables - > > > >> > > > partition-friendly. And do not partition anything else. I > > think > > > >> that > > > >> > > > **could** be possible - both have datetime fields that could > be > > > >> used as > > > >> > > > partition keys - you would have to assess if you can do it as > > your > > > >> > > "custom" > > > >> > > > solution or whether it would require some changes to airflow > > > models. > > > >> > But > > > >> > > I > > > >> > > > can't see foreign key problems if ONLY those two tables are > > > >> > partitioned, > > > >> > > so > > > >> > > > likely you could do it yourself in your DB. > > > >> > > > > > > >> > > > In this case - maybe "let's solve those tables that are > > > >> problematic" is > > > >> > > > easier to do than "let's apply partitioning to everything". 
> > > >> > > > > > > >> > > >> On Mon, Dec 29, 2025 at 7:31 PM Jarek Potiuk < > [email protected] > > > > > > >> > wrote: > > > >> > > >> > > > >> > > >> Purely theoretically, you could change the log and job tables > > to > > > be > > > >> > > >> unlogged - and thus avoid WAL for them. > > > >> > > >> > > > >> > > >> The drawback of this: > > > >> > > >> > > > >> > > >> * the data in those tables will be lost if you pull the plug > or > > > >> kill > > > >> > -9 > > > >> > > >> the primary postgres server > > > >> > > >> * the tables are not available (at all) in replicas - so in > > case > > > >> of a > > > >> > > >> fallback, you would have to have a manual data import/export > > for > > > >> those > > > >> > > >> tables on fail-over, rather than rely on replicas being > "ready > > to > > > >> > > fallback > > > >> > > >> immediately". Or accept data loss. > > > >> > > >> * the data from those tables will not be present in backups > > > >> > > >> > > > >> > > >> I am not 100% sure, but I believe loss of data in both tables > > is > > > >> not > > > >> > > >> really catastrophic for Airflow, so maybe it's acceptable > risk > > > (but > > > >> > > likely > > > >> > > >> you should do a disaster-recovery test to see what happens > and > > > how > > > >> to > > > >> > > >> recover in case, indeed, someone pulls the plug on your > > postgres > > > >> > server. > > > >> > > >> > > > >> > > >> J, > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> > > >>> On Mon, Dec 29, 2025 at 7:16 PM Natanel < > > > [email protected]> > > > >> > > wrote: > > > >> > > >>> > > > >> > > >>> Yes, the problem is the manual deletions, we have tried it, > it > > > >> > > resulted in > > > >> > > >>> the same exact issue, as the scheduled db procedures which > > clean > > > >> up > > > >> > the > > > >> > > >>> rows marked as deleted actually get deleted, and so it takes > > up > > > >> > > storage, > > > >> > > >>> yet it does not solve the WAL problem, the problematic table > > is > > > >> > > actually > > > >> > > >>> not task_instance, it is relatively small, the log and job > > > tables > > > >> are > > > >> > > the > > > >> > > >>> biggest tables (the problem with them is the primary key > > change > > > >> > > required), > > > >> > > >>> by a multiple of 10 (or more, cluster dependant). > > > >> > > >>> > > > >> > > >>> The smaller batches might solve the issue, however, it seems > > to > > > >> just > > > >> > > delay > > > >> > > >>> the problem a little rather than solve it, as deleting data > > > with a > > > >> > > delete > > > >> > > >>> query (especially a lot of data) is not a very light > > operation, > > > >> and > > > >> > so > > > >> > > I > > > >> > > >>> think that this is the main issue. > > > >> > > >>> > > > >> > > >>> It would be nice if we could use partitions instead, as it > is > > a > > > >> > lighter > > > >> > > >>> operation, and does not require us to maintain a query and > > > manage > > > >> our > > > >> > > db, > > > >> > > >>> I > > > >> > > >>> have thought about changing the models, most of the changes > > are > > > >> > > relatively > > > >> > > >>> simple, for some it is just removing the foreign key and > > relying > > > >> on > > > >> > ORM > > > >> > > >>> level constraints, for others, it requires adding a pre > query > > to > > > >> have > > > >> > > the > > > >> > > >>> same constrains but I do not like that idea, maybe there is > > > >> another > > > >> > > way to > > > >> > > >>> make airflow "partition-friendly"? 
> > > >> > > >>> > > > >> > > >>> I can't think of a nice way to do so, maybe it does not > exist, > > > as > > > >> the > > > >> > > db > > > >> > > >>> clean is as simple as a delete query gets, yet when there > is a > > > >> lot of > > > >> > > >>> data, > > > >> > > >>> it is all duplicated in WALs. > > > >> > > >>> > > > >> > > >>> On Mon, Dec 29, 2025, 19:40 Daniel Standish via dev < > > > >> > > >>> [email protected]> > > > >> > > >>> wrote: > > > >> > > >>> > > > >> > > >>>> Have you looked at doing manual deletions? I.e. writing > your > > > own > > > >> > sql? > > > >> > > >>>> > > > >> > > >>>> The db clean command is probably not "optimal" for all > > > scenarios. > > > >> > > >>>> > > > >> > > >>>> So for example, if the main problem table for you is > > > >> task_instance, > > > >> > > you > > > >> > > >>>> could periodically delete TI records in smaller batches > using > > > >> some > > > >> > > >>>> appropriate index (whether it exists now or you add it). > > Then > > > >> maybe > > > >> > > you > > > >> > > >>>> would not stress the db as hard. > > > >> > > >>>> > > > >> > > >>>> Airflow isn't designed to use partitions so, you may not > get > > > good > > > >> > > >>> results > > > >> > > >>>> with that approach. > > > >> > > >>>> > > > >> > > >>>> > > > >> > > >>>> > > > >> > > >>>> On Mon, Dec 29, 2025 at 7:32 AM Natanel < > > > [email protected] > > > >> > > > > >> > > >>> wrote: > > > >> > > >>>> > > > >> > > >>>>> Hello everyone, after having issues with the 'airflow db > > > clean' > > > >> > > >>> command, > > > >> > > >>>>> where due to the amount of dags and tasks that are running > > > every > > > >> > day > > > >> > > >>> in > > > >> > > >>>> our > > > >> > > >>>>> deployments, we get a lot of new data every day, which is > > > >> stored in > > > >> > > >>> the > > > >> > > >>>>> database, and when we delete the data, due to the way > PGSQL > > > >> works, > > > >> > > the > > > >> > > >>>>> WAL's get replicated to both the archive storage and the > > main > > > >> data > > > >> > > >>>> storage > > > >> > > >>>>> of the db instance, which in turn, causes a significant > jump > > > in > > > >> cpu > > > >> > > >>>> usage, > > > >> > > >>>>> ram usage and disk usage, whenever we run the command, > which > > > >> causes > > > >> > > >>> all > > > >> > > >>>>> kinds of issues, we even had it once fill up the db > storage, > > > and > > > >> > > >>> causing > > > >> > > >>>>> the database to be unresponsive, forcing us to move to our > > > >> backup > > > >> > > >>>> database, > > > >> > > >>>>> after we haven't ran the command for a few months due to > > human > > > >> > error. > > > >> > > >>>>> > > > >> > > >>>>> As of now, I know that this is the accepted and widely > used > > > way > > > >> of > > > >> > > >>>> managing > > > >> > > >>>>> the airflow database's size, however, we noticed that it > may > > > >> cause > > > >> > > >>> issues > > > >> > > >>>>> in certain cases, just like in our case, where if the db > has > > > not > > > >> > been > > > >> > > >>>>> cleaned up for a while, cleaning it can be problematic. 
> > > >> > > >>>>> > > > >> > > >>>>> We decided to try and partition the table, and use pgsql's > > > >> built in > > > >> > > >>>>> retention of partitions, which does not issue a DELETE > > query, > > > >> and > > > >> > is > > > >> > > >>>>> lighter and faster, while being simpler to use, however, > we > > > have > > > >> > > >>>>> encountered issues due to having Foreign Key constraints > in > > > some > > > >> > > >>> tables, > > > >> > > >>>>> having to duplicate such keys and other than forcing code > > > >> changes > > > >> > (as > > > >> > > >>> the > > > >> > > >>>>> foreign key must include the partitioned key, as the > > > partitioned > > > >> > key > > > >> > > >>> must > > > >> > > >>>>> be part of the primary key), while also having the issue > of > > > >> > > sqlalchemy > > > >> > > >>>>> breaking once we change the primary key, with the addition > > of > > > >> the > > > >> > > >>>>> constraints on the primary key breaking. > > > >> > > >>>>> > > > >> > > >>>>> And in Mysql, due to the foreign keys, it is not possible > to > > > >> > > partition > > > >> > > >>>>> tables which include them, as it is not supported yet > > > >> (according to > > > >> > > >>> this > > > >> > > >>>>> < > > > >> > > >>>>> > > > >> > > >>>> > > > >> > > >>> > > > >> > > > > > >> > > > > >> > > > > > > https://dev.mysql.com/doc/refman/8.4/en/partitioning-limitations-storage-engines.html > > > >> > > >>>>>> > > > >> > > >>>>> ). > > > >> > > >>>>> > > > >> > > >>>>> Has anyone else tried to use the databases built in > > partition > > > >> > > >>> retention > > > >> > > >>>>> system instead of the 'airflow db clean' command? > > > >> > > >>>>> > > > >> > > >>>>> Thanks, Natanel. > > > >> > > >>>>> > > > >> > > >>>> > > > >> > > >>> > > > >> > > >> > > > >> > > > > > >> > > > > > --------------------------------------------------------------------- > > > >> > > To unsubscribe, e-mail: [email protected] > > > >> > > For additional commands, e-mail: [email protected] > > > >> > > > > > >> > > > > > >> > > > > >> > > > > > > > > > >
