Hi folks! We would like to propose a new feature in Airflow: a boolean parameter, "persist_xcom_through_retry", available on all Airflow Operators. Our team added this feature to our internal fork a few years ago, and it has benefited our users extensively.

I have created an AIP at
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=399278333

Below is a summary; the complete AIP has a more detailed problem statement and several use-case examples.

Traditionally, XCom is defined as “a mechanism that lets Tasks talk to each other”. However, XCom also has the capacity to persist and manage state within a single task. Currently, Apache Airflow automatically clears a task instance’s XCom data when it is retried. This behavior ensures a clean state for each retry attempt, but it creates limitations:

- Loss of Internal Progress: Tasks that have internal checkpointing or progress tracking lose all intermediate state on retry, forcing a restart from the beginning.
- Resource State Loss: Tasks cannot maintain state about allocated resources (compute instances, downstream job IDs, etc.) across retry attempts, leading to redundant, expensive setup operations.
- No Recovery/Resume Capability: There is no way for tasks to resume from internal checkpoints when transient failures occur during long-running atomic operations.
- Poor User Experience: Users must implement external state management systems to work around this limitation, adding complexity to DAG authoring.

This proposal aims to extend XCom by allowing a Task Instance’s XCom to persist through its retries, enabling users to build more resilient and efficient pipelines. This is particularly useful for tasks that are atomic (so the task cannot be split into multiple tasks) and need to manage internal state or checkpoints.

We look forward to your feedback and thoughts.

Thanks!

Regards,
XD
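
P.S. To make the intended semantics concrete, here is a toy sketch in plain Python. It does not use Airflow's real APIs: ToyXComStore, run_with_retries, and make_chunked_task are hypothetical names invented for illustration, and only the flag name persist_xcom_through_retry comes from the proposal. It assumes a task that processes chunks in order, checkpoints its position, and hits one transient failure on its first attempt.

```python
from typing import Any, Callable, Dict, List


class ToyXComStore:
    """In-memory stand-in for XCom storage (hypothetical, not Airflow's API)."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, Any]] = {}

    def push(self, task_id: str, key: str, value: Any) -> None:
        self._data.setdefault(task_id, {})[key] = value

    def pull(self, task_id: str, key: str, default: Any = None) -> Any:
        return self._data.get(task_id, {}).get(key, default)

    def clear(self, task_id: str) -> None:
        self._data.pop(task_id, None)


def run_with_retries(
    task_id: str,
    fn: Callable[[ToyXComStore, int], Any],
    store: ToyXComStore,
    retries: int,
    persist_xcom_through_retry: bool = False,
) -> Any:
    """Run fn up to retries + 1 times, clearing the task's XCom before each
    retry unless persist_xcom_through_retry is set (the proposed behavior)."""
    for attempt in range(1, retries + 2):
        if attempt > 1 and not persist_xcom_through_retry:
            store.clear(task_id)  # mimics today's clear-on-retry behavior
        try:
            return fn(store, attempt)
        except RuntimeError:
            if attempt == retries + 1:
                raise  # out of retries


def make_chunked_task(
    task_id: str, total_chunks: int, fail_at_chunk: int
) -> Callable[[ToyXComStore, int], List[int]]:
    """Build an atomic task that checkpoints its progress in the toy store
    and fails once, on attempt 1, just before processing fail_at_chunk."""
    work_log: List[int] = []  # every chunk actually processed, across attempts

    def task(store: ToyXComStore, attempt: int) -> List[int]:
        start = store.pull(task_id, "next_chunk", default=0)
        for chunk in range(start, total_chunks):
            if attempt == 1 and chunk == fail_at_chunk:
                raise RuntimeError("simulated transient failure")
            work_log.append(chunk)
            store.push(task_id, "next_chunk", chunk + 1)  # checkpoint
        return work_log

    return task


if __name__ == "__main__":
    for persist in (False, True):
        log = run_with_retries(
            "load", make_chunked_task("load", 5, 3), ToyXComStore(),
            retries=1, persist_xcom_through_retry=persist,
        )
        print(f"persist={persist}: chunks processed = {log}")
```

In this toy model, the run without persistence reprocesses chunks 0 through 2 after the failure, while the run with persist_xcom_through_retry=True resumes from chunk 3. How the real parameter would be wired into operators is described in the AIP, not here.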
