I agree with TP wholeheartedly. The basic reason XCom is deleted when a
task is restarted is to preserve idempotency. If we allow XCom to be used
to break idempotency (that's basically what per-task state is about),
then XCom will stop serving its purpose.

And of course - we are in a new "world" where we no longer support only
idempotent tasks. Various optimisations and different kinds of workloads
require breaking the "old" idempotency rules we used to have when Airflow
was used mainly for ETL. Deletion of XCom state was questioned back then
too, because people **wanted** to use XCom in other ways. But we held
firm, and I think that was a good choice.

Repurposing XCom to do "something" else might seem like a good idea -
even for Apple, because they could internally agree on some convention
and use it as a "solution". But when you look at Airflow as a product,
repurposing XCom to also do something else (i.e. storing state) seems a
bit "lazy" and "short-cut-y".

What does it save if you do it this way? A few things:

* not having to do a database migration to implement the new feature
* not having to define a clear API where state can be stored for
various purposes on different levels (Task Instance, Task, Task Group
maybe, Dag, Team eventually)
* not having to think about and prepare for all the various use cases
people would really like to use it for
* not having to write the use-case documentation explaining how you can
use state
* not having to write all the test cases making sure that all those use
cases are served well
* not thinking too much about the performance and security implications
of those ("XCom has it already sorted out, I am sure it's going to be
fine")

Yes, it can be done way faster this way, and I understand some commercial
users could have chosen this way as a shortcut to handle a specific use
case they had in mind. This is absolutely understandable, and this is what
I would even expect a for-profit company to do to increase so-called
"time-to-market" and start reaping the benefits of it faster.

But should we do it the same way in Airflow? We are not a for-profit
company; time-to-market of such a feature is secondary compared to
stability, maintainability and having a "product" vision.
I consider all the above points absolutely crucial properties of a
"product" - which Airflow is. They might not be needed in a "solution", but
a good "product" absolutely requires all those things.

When we switched to Airflow 3, one of the ideas was to remove all the bad
"solution-y" decisions we made in the past that slowed us down in general
and - more importantly - turned us (as Daniel used to say) into
"back-compatibility engineers".

Does it mean it will take longer and require more dedication, effort
and discussion to agree on the scope? Absolutely. Is this a bad thing? I
don't think so.

J.


On Sun, Nov 16, 2025 at 9:43 AM Tzu-ping Chung via dev <
[email protected]> wrote:

> What is the motivation behind storing internal state in a task, instead of
> splitting the logic on state boundaries into multiple tasks? That’s what
> the task abstraction is meant for, and you wouldn’t need a separate
> mechanism for that—regular XCom would just work.
>
> While storing state is a legitimate use case, I feel this particular idea
> would have a more negative impact on encouraging people to do too many
> things in one task. I’d even argue the examples given in the Confluence
> document are already so.
>
> TP
>
>
> > On 14 Nov 2025, at 08:32, Xiaodong Deng <[email protected]> wrote:
> >
> > Hi folks!
> >
> > We would like to propose a new feature in Airflow: a boolean
> > parameter, "persist_xcom_through_retry", in all Airflow Operators.
> > Our team added this feature in our internal fork a few years back, and it
> > has been benefiting our users extensively.
> >
> > *I have created an AIP at
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333*.
> > Below is a summary (in the complete AIP, we have a more detailed problem
> > statement and quite a few interesting use-case examples):
> >
> >
> >
> >
> > *Traditionally, XCom is defined as “a mechanism that lets Tasks talk to
> > each other”. However, XCom also has the capacity and potential to help
> > persist and manage task state within a task itself. Currently, Apache
> > Airflow automatically clears a task instance’s XCom data when it is
> > retried. This behavior, while ensuring clean state for retry attempts,
> > creates limitations:*
> >
> >   - *Loss of Internal Progress: Tasks that have internal checkpointing or
> >   progress tracking lose all intermediate state on retry, forcing restart
> >   from the beginning.*
> >   - *Resource State Loss: Tasks cannot maintain state about allocated
> >   resources (compute instances, downstream job IDs, etc.) across retry
> >   attempts, leading to redundant expensive setup operations.*
> >   - *No Recovery/Resume Capability: There's no way for tasks to resume
> >   from internal checkpoints when transient failures occur during
> >   long-running atomic operations.*
> >   - *Poor User Experience: users must implement external state management
> >   systems to work around this limitation, adding complexity to DAG
> >   authoring.*
> >
> >
> > *This proposal aims at extending the capacity of XCom by allowing
> > persisting a Task Instance’s XCom through its retries, enabling users to
> > build more resilient and efficient pipelines. This is particularly useful
> > for the type of tasks which are atomic (so one such task cannot be split
> > into multiple tasks) and need to manage internal state or checkpoints. *
> >
> >
> > We look forward to your feedback and thoughts. Thanks!
> >
> >
> > Regards,
> >
> > XD
>
