Proposed Alternative: Complete and propose a regular "state" storage proposal. There were plenty of discussions about that, including the Asset Watermarks that Ash mentioned. I think the best way is to lead that discussion to completion and, as a result, come up with state management that can be used in this case as well.
As mentioned in my previous mail, my thinking is that we are not in a "time-to-market" game. We are more in a "deliver a good product" one. If it takes more time, so be it, but let's do it properly. There is not much to lose by having it later, but there is a lot to lose collectively if our users start misusing a half-baked feature that misleads them into doing something we do not want them to do.

J.

On Tue, Nov 18, 2025 at 11:25 PM Xiaodong Deng <[email protected]> wrote:

> In addition, I understand we would like to stick to certain design/principles. However, if that is blocking certain reasonable use cases, either alternative solutions need to be provided or "principles" need to be adjusted.
>
> That's what I'm hoping for here.
>
> Thanks again!
>
> Regards,
> XD
>
> On 2025/11/18 22:20:36 Xiaodong Deng wrote:
> > Thanks for your valuable feedback, folks.
> >
> > Hi @TP,
> >
> > There are cases where breaking down to multiple tasks is not feasible or not the best option. For example, use case 1 that I shared in the Confluence doc appendix.
> >
> > There are also examples where splitting into multiple tasks may seem to make sense but may cause negative side effects. In use cases 2 and 4 in the Confluence doc appendix, I shared why we do it in a single task instead of splitting them into two tasks.
> >
> > Some tasks are simply atomic.
> >
> > Hi @Jarek,
> >
> > I'm glad we are talking about idempotency. That's exactly why sometimes we cannot break down some tasks. In the "Problem Examples" section in the Confluence doc, I covered that to some extent.
> >
> > Would love to discuss more on this, or learn from you about any alternative solutions which can become available to Airflow users in a timely manner.
> >
> > Many thanks!
> >
> > Regards,
> > XD
> >
> > On 2025/11/16 09:48:10 Jarek Potiuk wrote:
> > > I agree with TP wholeheartedly. The basic reason why XCom is deleted when restarting is to maintain idempotency principles.
> > > And if we allow XCom to be used to break idempotency (that's basically what state per task is about), then XCom will stop serving its purpose.
> > >
> > > And of course, we are in the new "world" where we are not only supporting idempotent tasks. Various optimisations and different kinds of workloads require breaking the "old" idempotency rules we used to have when Airflow was used mainly for ETL. Deletion of XCom state was also questioned back then because people **wanted** to use XCom in other ways. But we held strongly and I think that was a good choice.
> > >
> > > Repurposing XCom to do "something" else might seem like a good idea, even for Apple, because they could internally agree on some convention and use it as a "solution". But when you look at Airflow as a product, repurposing XCom to also do something else (i.e. storing state) seems a bit "lazy" and "short-cut-y".
> > >
> > > What does it save if you do it this way? A few things:
> > >
> > > * not having to do a database migration to implement a new feature
> > > * avoiding having a clearly defined API where state can be stored for various purposes on different levels (Task Instance, Task, Task Group maybe, Dag, Team eventually)
> > > * avoiding thinking about and preparing for all the various use cases that people would really like to use it for
> > > * avoiding writing the use-case documentation explaining how you can use state
> > > * avoiding writing all the test cases making sure that all those use cases are served
> > > * not thinking too much about performance and security implications of those ("XCom has it already sorted out, I am sure it's going to be fine")
> > >
> > > Yes, it can be done way faster this way, and I understand some commercial users could have chosen this way as a shortcut to handle a specific use case they had in mind.
> > > This is absolutely understandable, and this is what I would even expect a for-profit company to do to improve the so-called "time-to-market" and start reaping the benefits of it faster.
> > >
> > > But should we do it in Airflow the same way? We are not a for-profit company; time-to-market of such a feature is secondary compared to stability, maintainability and having a "product" vision.
> > > I consider all the above points absolutely crucial properties of a "product", which Airflow is. They might not be needed in a "solution", but having a good "product" absolutely requires all those things.
> > >
> > > When we switched to Airflow 3, one of the ideas was to remove all the bad "solution-y" decisions we made in the past that slowed us down in general and, more importantly, turned us (as Daniel used to say) into "back-compatibility engineers".
> > >
> > > Does it mean it will take longer and require more dedication, effort and discussion to agree on the scope? Absolutely. Is this a bad thing? I don't think so.
> > >
> > > J.
> > >
> > > On Sun, Nov 16, 2025 at 9:43 AM Tzu-ping Chung via dev <[email protected]> wrote:
> > >
> > > > What is the motivation behind storing internal state in a task, instead of splitting the logic on state boundaries into multiple tasks? That’s what the task abstraction is supposed to be for, and you wouldn’t need a separate mechanism for that; regular XCom would just work.
> > > >
> > > > While storing state is a legitimate use case, I feel this particular idea would have a more negative impact by encouraging people to do too many things in one task. I’d even argue the examples given in the Confluence document are already so.
> > > >
> > > > TP
> > > >
> > > > > On 14 Nov 2025, at 08:32, Xiaodong Deng <[email protected]> wrote:
> > > > >
> > > > > Hi folks!
> > > > >
> > > > > We would like to propose a new feature in Airflow, a boolean parameter "persist_xcom_through_retry" in all Airflow Operators. Our team added this feature in our internal fork a few years back, and it has been benefiting our users extensively.
> > > > >
> > > > > *I have created an AIP at https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333*. Below is a summary (in the complete AIP, we have a more detailed problem statement and quite a few interesting use-case examples):
> > > > >
> > > > > *Traditionally, XCom is defined as “a mechanism that lets Tasks talk to each other”. However, XCom also has the capacity and potential to help persist and manage task state within a task itself. Currently, Apache Airflow automatically clears a task instance’s XCom data when it is retried. This behavior, while ensuring clean state for retry attempts, creates limitations:*
> > > > >
> > > > > - *Loss of Internal Progress: Tasks that have internal checkpointing or progress tracking lose all intermediate state on retry, forcing restart from the beginning.*
> > > > > - *Resource State Loss: Tasks cannot maintain state about allocated resources (compute instances, downstream job IDs, etc.) across retry attempts, leading to redundant expensive setup operations.*
> > > > > - *No Recovery/Resume Capability: There's no way for tasks to resume from internal checkpoints when transient failures occur during long-running atomic operations.*
> > > > > - *Poor User Experience: Users must implement external state management systems to work around this limitation, adding complexity to DAG authoring.*
> > > > >
> > > > > *This proposal aims at extending the capacity of XCom by allowing persisting a Task Instance’s XCom through its retries, enabling users to build more resilient and efficient pipelines. This is particularly useful for the type of tasks which are atomic (so one such task cannot be split into multiple tasks) and need to manage internal state or checkpoints.*
> > > > >
> > > > > We look forward to your feedback and thoughts. Thanks!
> > > > >
> > > > > Regards,
> > > > >
> > > > > XD
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [email protected]
> > > > For additional commands, e-mail: [email protected]
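
For readers following the thread, the "Loss of Internal Progress" limitation in the quoted proposal can be made concrete with a small Python sketch. It does not use any Airflow API and takes no side on where such state should live (XCom or a dedicated state store). `StateStore`, `copy_batches` and `run_with_retries` are hypothetical names invented for this illustration, and the in-memory dict only stands in for whatever per-task storage would survive a retry.

```python
from typing import Callable, Dict, List, Optional


class StateStore:
    """Hypothetical per-task state that survives retries (in-memory stand-in)."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def get(self, key: str, default: int = 0) -> int:
        return self._data.get(key, default)

    def set(self, key: str, value: int) -> None:
        self._data[key] = value


def copy_batches(state: StateStore, total: int,
                 fail_at: Optional[int], done: List[int]) -> None:
    """Process batches, resuming from the last checkpoint instead of batch 0."""
    start = state.get("next_batch")
    for batch in range(start, total):
        if batch == fail_at:
            raise RuntimeError(f"transient failure at batch {batch}")
        done.append(batch)
        state.set("next_batch", batch + 1)  # checkpoint after each batch


def run_with_retries(task: Callable[[int], None], retries: int) -> int:
    """Call task(attempt) until it succeeds; return the number of attempts used."""
    for attempt in range(1, retries + 2):
        try:
            task(attempt)
            return attempt
        except RuntimeError:
            if attempt == retries + 1:
                raise
    raise AssertionError("unreachable")


# Usage: the first attempt fails at batch 3, the retry resumes from batch 3.
state = StateStore()
processed: List[int] = []


def task(attempt: int) -> None:
    fail_at = 3 if attempt == 1 else None  # simulate one transient failure
    copy_batches(state, total=5, fail_at=fail_at, done=processed)


attempts = run_with_retries(task, retries=2)
print(attempts, processed)
```

If the store were wiped between attempts, as happens to XCom on retry today according to the proposal, the second attempt would start again from batch 0 and repeat the first three batches.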
