Guangyang and I have created a draft AIP, which you can find here: https://cwiki.apache.org/confluence/display/AIRFLOW/%5BDRAFT%5D+AIP-93. Curious to get folks' thoughts and opinions!

On Fri, Jul 25, 2025 at 9:12 PM Guangyang Li <lig...@gmail.com> wrote:

> Jake and I have put down the details in this AIP draft doc
> <https://docs.google.com/document/d/1gnGpTDhTpxpC48-kvr3jxL4GWyagMKzItWV-30GZu2U/edit?tab=t.0>.
>
> Instead of calling this feature state persisting, we decided to call it
> asset watermark, as it's mainly an enhancement of asset-oriented
> processing. You can find the details and examples in the doc. Please
> comment. We plan to convert it to an official AIP doc later once it's
> stable.
>
> Guangyang
>
> On Thu, Jun 12, 2025 at 4:48 PM Karen Braganza <karenbraganz...@gmail.com>
> wrote:
>
> > I think the process_state model would be useful in the HttpEventTrigger
> > that I am working on. The HttpEventTrigger sends requests to an API and
> > triggers an event based on a user-defined response_check function. If
> > the response_check function needs to evaluate multiple API responses
> > cumulatively, it would be useful to store and retrieve past API
> > responses. It would also be useful for task instances to retrieve the
> > process_state data. For the HttpEventTrigger, this would mean enabling
> > task instances to retrieve and act on API response data received within
> > the trigger.
> >
> > I'm sure there would be similar use cases in other EventTriggers as well.
> >
> > On Thu, Jun 12, 2025 at 1:26 PM Daniel Standish
> > <daniel.stand...@astronomer.io.invalid> wrote:
> >
> > > Alright since I was summoned...
> > >
> > > When I was an airflow user, I did a lot of incremental processes.
> > > Pretty much everything was incremental. Data warehousing / analytics
> > > shop / e-commerce reporting / integrations this kind of thing.
> > >
> > > One common use case is implementing something like a fivetran, which I
> > > did a few times.
> > >
> > > For me, execution date was almost entirely useless. Execution date is
> > > there for partition-driven workloads.
> > >
> > > For incremental, you need to track your state somehow.
> > >
> > > That's why I experimented with various state storage interfaces, and
> > > developed a watermark operator, which we used a lot. And I demoed a
> > > version of them here <https://github.com/apache/airflow/pull/19051>,
> > > and authored AIP-30
> > > <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-30%3A+State+persistence>.
> > >
> > > I wrote AIP-30 when I was still contributing to Airflow for funsies,
> > > and didn't get a ton of engagement on it so it sort of languished, then
> > > when I became full time airflow dev, there were other priorities.
> > >
> > > But to me the use case is still pretty obvious. Nothing we have added
> > > since then really explicitly supports incremental workflows.
> > >
> > > To me the question is (as it was then, and I think I mentioned this in
> > > the AIP), do you provide a generic interface where user controls
> > > namespace and name of the state you are trying to persist? Or instead
> > > do you provide mechanisms to store state on existing objects. So e.g.
> > > on trigger, on task, on whatever, you can do
> > > `self.save_state(key...)` etc. In my proposal I think I leaned towards
> > > generic, and it seems Jake leans the same way. There are pros and cons.
> > >
> > > In terms of the underlying storage mechanism, it seems pretty
> > > reasonable to allow this to be pluggable like everything else. I used
> > > different "backends" at different times -- s3, or database. Typically
> > > you don't need mega low latency with the type of tasks Airflow is used
> > > for.
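
For anyone skimming the thread, here is a rough sketch of the "generic" option described above: the caller picks the namespace and key, and the storage sits behind a swappable backend. Every name in it (StateBackend, InMemoryBackend, JsonFileBackend, load_increment, the "orders_sync" namespace, the updated_at field) is hypothetical and made up for illustration. None of it is an Airflow API, and it is not what AIP-30 or the AIP-93 draft actually specify. The JSON file backend only stands in for something like S3 or a database.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any


class StateBackend(ABC):
    """Hypothetical pluggable storage interface (not an Airflow API)."""

    @abstractmethod
    def get(self, namespace: str, key: str, default: Any = None) -> Any:
        ...

    @abstractmethod
    def set(self, namespace: str, key: str, value: Any) -> None:
        ...


class InMemoryBackend(StateBackend):
    """Keeps state in a dict; only useful for tests and demos."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, namespace: str, key: str, default: Any = None) -> Any:
        return self._data.get((namespace, key), default)

    def set(self, namespace: str, key: str, value: Any) -> None:
        self._data[(namespace, key)] = value


class JsonFileBackend(StateBackend):
    """Keeps all state in one JSON file; a stand-in for S3 or a database."""

    def __init__(self, path: Path) -> None:
        self._path = path

    def _load(self) -> dict:
        if not self._path.exists():
            return {}
        return json.loads(self._path.read_text())

    def get(self, namespace: str, key: str, default: Any = None) -> Any:
        return self._load().get(namespace, {}).get(key, default)

    def set(self, namespace: str, key: str, value: Any) -> None:
        data = self._load()
        data.setdefault(namespace, {})[key] = value
        self._path.write_text(json.dumps(data))


def load_increment(rows, backend: StateBackend, namespace: str, key: str = "watermark"):
    """Return rows newer than the stored watermark, then advance the watermark."""
    watermark = backend.get(namespace, key, default=0)
    new_rows = [row for row in rows if row["updated_at"] > watermark]
    if new_rows:
        backend.set(namespace, key, max(row["updated_at"] for row in new_rows))
    return new_rows


if __name__ == "__main__":
    source_rows = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
    backend = InMemoryBackend()
    first = load_increment(source_rows, backend, "orders_sync")   # both rows
    second = load_increment(source_rows, backend, "orders_sync")  # nothing new
    print(len(first), len(second))
```

The second call returns nothing because the watermark was advanced by the first one, which is the incremental behaviour that execution date does not give you. Swapping InMemoryBackend for JsonFileBackend leaves load_increment unchanged, which is the point of keeping the backend pluggable. The alternative in the thread, state attached to an existing object via something like `self.save_state(key...)`, would hide the namespace from the caller rather than expose it.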