Thanks Vikram,

This is a crucial AIP for Airflow 3.3+. I skimmed through it and will
provide more comments over the coming days, but it very much looks like
what I imagined for state management in Airflow.
It has about the right abstraction layer, focusing on building
infrastructure that serves the previously articulated use cases and
likely supports other use cases we are not yet aware of. I really like how
it maps the "generic" interface into those cases.

I have this old "rule of thumb": you need at least three use cases to be
able to design a truly reusable infrastructure API/component. Here we
have 3 use cases it will serve :)

Jl


On Sat, Mar 21, 2026 at 8:44 PM Vikram Koka via dev <[email protected]>
wrote:

> Dear Airflowers,
>
> Over the last several months, there have been a lot of discussions in the
> devlist around improvements needed for long running jobs outside of Airflow
> (raised by XD and others), and about improved event triggering (raised by
> Jake and others). XD, Jake, and I have gotten together and collaborated on
> a unified approach for Task State Management within Airflow which we would
> like to propose.
>
> Apache Airflow has been built around stateless, idempotent tasks, and this
> has served the community incredibly well. But as production AI and data
> workloads have grown more sophisticated, a clear gap has emerged that the
> community has been working around for a while.
>
> Three patterns keep coming up. An incremental operator needs to know where
> it left off last time, so it does not reprocess data it has already
> handled. An operator running a Databricks or EMR job needs to survive a
> worker disruption without cancelling a job that was 90% complete and
> starting over from scratch. A long-running async task processing thousands
> of files needs to checkpoint its progress so a retry picks up where it left
> off, not from the beginning.
>
> All three patterns are forcing users into the same workarounds today:
> generally bending XCom beyond its intended purpose, or building their own
> state persistence outside of Airflow entirely.
>
> We think we can do better. AIP-XX: Task State Management is a new
> foundation AIP that addresses all three patterns through a single, minimal,
> pluggable framework. Built on top of the Execution API from AIP-72, with
> full async support consistent with AIP-98, Task State is deliberately and
> cleanly separate from XCom, with different scoping, different lifecycle
> semantics, and different garbage collection mechanics. It also provides the
> foundation for a simplified AIP-93 (Asset Watermarking)
> <
> https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-93+Asset+Watermarks+and+State+Variables
> >
> and for long running remote operations using either the AIP-tbd Persistent
> Parameter for Airflow Operators
> <
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333
> >
> or AIP-96 (Resumable Operators)
> <
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-96+Resumable+Operators
> >
> .
>
> Full draft is on Confluence as Draft AIP-xx: Task State Management
> <
> https://cwiki.apache.org/confluence/display/AIRFLOW/Draft%3A+AIP-xx%3A+Task+State+Management
> >
>
> We would love to hear your thoughts. Please comment on the AIP doc.
>
> Best regards,
> Vikram, XD, and Jake
> --
>
> Vikram Koka
> Chief Strategy Officer
> Email: [email protected]
>
>
> <https://www.astronomer.io/>
>
