Guangyang and I have created a draft AIP, which you can find here:
https://cwiki.apache.org/confluence/display/AIRFLOW/%5BDRAFT%5D+AIP-93.
Curious to get folks' thoughts and opinions!
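
For anyone skimming the thread, here is a rough sketch of the two interface
shapes discussed in the quoted messages below: a generic store where the
caller picks the namespace and name, versus state saved on an existing
object via something like `self.save_state(key...)`. Every name here
(`StateBackend`, `InMemoryBackend`, `StatefulMixin`, `PollingTrigger`,
`load_new_rows`) is hypothetical and made up for illustration. None of it
is an Airflow API or taken from the AIP draft.

```python
from typing import Any, Dict, Optional, Protocol, Tuple


class StateBackend(Protocol):
    """Hypothetical pluggable storage interface (could be S3, a database, ...)."""

    def get(self, namespace: str, name: str) -> Optional[Any]: ...

    def set(self, namespace: str, name: str, value: Any) -> None: ...


class InMemoryBackend:
    """Toy backend for illustration; a real one would persist durably."""

    def __init__(self) -> None:
        self._data: Dict[Tuple[str, str], Any] = {}

    def get(self, namespace: str, name: str) -> Optional[Any]:
        return self._data.get((namespace, name))

    def set(self, namespace: str, name: str, value: Any) -> None:
        self._data[(namespace, name)] = value


# Style 1: generic interface, the caller controls namespace and name.
def load_new_rows(backend: StateBackend, rows: list) -> list:
    """Return rows with an id above the stored watermark, then advance it."""
    watermark = backend.get("orders_sync", "last_id") or 0
    new_rows = [r for r in rows if r["id"] > watermark]
    if new_rows:
        backend.set("orders_sync", "last_id", max(r["id"] for r in new_rows))
    return new_rows


# Style 2: state attached to an existing object (trigger, task, ...).
class StatefulMixin:
    state_backend: StateBackend
    state_namespace: str

    def save_state(self, key: str, value: Any) -> None:
        self.state_backend.set(self.state_namespace, key, value)

    def load_state(self, key: str, default: Any = None) -> Any:
        value = self.state_backend.get(self.state_namespace, key)
        return default if value is None else value


class PollingTrigger(StatefulMixin):
    """Stand-in for a trigger that accumulates past API responses."""

    def __init__(self, backend: StateBackend, trigger_id: str) -> None:
        self.state_backend = backend
        self.state_namespace = f"trigger:{trigger_id}"

    def record_response(self, response: Any) -> list:
        # Build a new list so the stored history is never mutated in place.
        history = self.load_state("responses", []) + [response]
        self.save_state("responses", history)
        return history
```

In this sketch both styles sit on the same backend, so the choice is mostly
about who owns the key: the user (style 1) or the object (style 2).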


On Fri, Jul 25, 2025 at 9:12 PM Guangyang Li <lig...@gmail.com> wrote:

> Jake and I have put down the details in this AIP draft doc
> <https://docs.google.com/document/d/1gnGpTDhTpxpC48-kvr3jxL4GWyagMKzItWV-30GZu2U/edit?tab=t.0>.
>
> Instead of calling this feature state persisting, we decided to call it
> asset watermark, as it's mainly an enhancement of asset-oriented
> processing. You can find the details and examples in the doc. Please
> comment. We plan to convert it to an official AIP doc later once it's
> stable.
>
>
> Guangyang
>
> On Thu, Jun 12, 2025 at 4:48 PM Karen Braganza <karenbraganz...@gmail.com>
> wrote:
>
> > I think the process_state model would be useful in the HttpEventTrigger
> > that I am working on. The HttpEventTrigger sends requests to an API and
> > triggers an event based on a user-defined response_check function. If the
> > response_check function needs to evaluate multiple API responses
> > cumulatively, it would be useful to store and retrieve past API
> > responses. It would also be useful for task instances to retrieve the
> > process_state data. For the HttpEventTrigger, this would mean enabling
> > task instances to retrieve and act on API response data received within
> > the trigger.
> >
> > I'm sure there would be similar use cases in other EventTriggers as well.
> >
> > On Thu, Jun 12, 2025 at 1:26 PM Daniel Standish
> > <daniel.stand...@astronomer.io.invalid> wrote:
> >
> > > Alright since I was summoned...
> > >
> > > When I was an Airflow user, I did a lot of incremental processes.
> > > Pretty much everything was incremental.  Data warehousing / analytics
> > > shop / e-commerce reporting / integrations, this kind of thing.
> > >
> > > One common use case is implementing something like a Fivetran, which I
> > > did a few times.
> > >
> > > For me, execution date was almost entirely useless.  Execution date is
> > > there for partition-driven workloads.
> > >
> > > For incremental, you need to track your state somehow.
> > >
> > > That's why I experimented with various state storage interfaces, and
> > > developed a watermark operator, which we used a lot.  And I demoed a
> > > version of them here <https://github.com/apache/airflow/pull/19051>,
> > > and authored AIP-30
> > > <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence>.
> > >
> > > I wrote AIP-30 when I was still contributing to Airflow for funsies,
> > > and didn't get a ton of engagement on it so it sort of languished.
> > > Then when I became a full-time Airflow dev, there were other priorities.
> > >
> > > But to me the use case is still pretty obvious.  Nothing we have added
> > > since then really explicitly supports incremental workflows.
> > >
> > > To me the question is (as it was then, and I think I mentioned this in
> > > the AIP): do you provide a generic interface where the user controls
> > > the namespace and name of the state you are trying to persist?  Or do
> > > you instead provide mechanisms to store state on existing objects, so
> > > that e.g. on a trigger, on a task, on whatever, you can do
> > > `self.save_state(key...)` etc.?  In my proposal I think I leaned
> > > towards generic, and it seems Jake leans the same way.  There are pros
> > > and cons.
> > >
> > > In terms of the underlying storage mechanism, it seems pretty
> > > reasonable to allow this to be pluggable like everything else.  I used
> > > different "backends" at different times -- s3, or database.  Typically
> > > you don't need mega low latency with the type of tasks Airflow is used
> > > for.
> > >
> >
>
