jroachgolf84 commented on code in PR #68136:
URL: https://github.com/apache/airflow/pull/68136#discussion_r3367364175


##########
airflow-core/docs/core-concepts/resumable-tasks.rst:
##########
@@ -0,0 +1,187 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+.. _concepts-resumable-tasks:
+
+Resumable Tasks
+===============
+
+.. versionadded:: 3.3.0
+
+Many data engineering workflows involve submitting work to an external system and waiting for it
+to complete. A Spark job, a BigQuery query, a Kubernetes batch pod, an EMR step: these are all
+tasks where the real work happens outside Airflow, and the operator's job is mostly to submit,
+poll, and collect the result.
+
+These tasks share a common failure mode. In classic operator cases, the worker slot is held for the
+entire polling duration, and if the worker process is restarted or the host is preempted, the task
+retries from scratch, losing all the progress made. Depending on the operator, that means the external
+job is submitted again, creating a duplicate run in context of the external system.
+
+Airflow recommends three approaches for handling long-running external work. Understanding the trade-offs
+between them helps you choose the right one for your situation.
+
+.. _concepts-resumable-tasks-deferrable:
+
+Deferrable Operators
+--------------------
+
+A deferrable operator pauses itself at the point where it would otherwise start polling, hands
+the polling work to the Triggerer component, and releases its worker slot. When the external
+condition is met, the Triggerer wakes the task and the worker resumes from where the operator
+left off.
+
+This is the most resource-efficient option. A single Triggerer process can concurrently watch
+thousands of conditions, so the rest of the worker pool stays free for other tasks.
+
+The trade-offs are:
+
+* A Triggerer component must be running. Deployments that do not include a Triggerer cannot use this pattern.
+* Writing a custom deferrable operator requires implementing a separate ``Trigger`` class in

Review Comment:
   ```suggestion
   * Writing a custom deferrable operator requires implementing a ``Trigger`` class in
   ```



##########
task-sdk/docs/resumable-job-mixin.rst:
##########
@@ -0,0 +1,167 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+.. _sdk-resumable-job-mixin:
+
+ResumableJobMixin
+=================
+
+.. versionadded:: 3.3.0
+
+:class:`~airflow.sdk.ResumableJobMixin` is a mixin for operators that submit long-running jobs
+to an external system and poll for its completion. It makes the operator crash-safe by persisting
+the external job identifier to the task state store before polling begins. If the worker is restarted

Review Comment:
   ```suggestion
   the external job identifier to task state store before polling begins. If the worker is restarted
   ```
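
The crash-safety idea in this paragraph can be illustrated with a minimal sketch in plain Python. It does not use the real `ResumableJobMixin` API, which this hunk does not show; `InMemoryStore`, `FakeExternalSystem`, and `run_job` are hypothetical names, and the store is only a stand-in for the task state store. The point is the ordering: the job identifier is persisted before polling starts, so a retry reattaches to the existing job instead of submitting a duplicate.

```python
class InMemoryStore:
    """Hypothetical stand-in for the task state store; it outlives a single attempt."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class FakeExternalSystem:
    """Hypothetical external system that records every submission."""

    def __init__(self):
        self.submitted = []

    def submit(self):
        job_id = f"job-{len(self.submitted) + 1}"
        self.submitted.append(job_id)
        return job_id

    def is_done(self, job_id):
        # Always finished in this sketch, so the polling loop exits immediately.
        return True


def run_job(store, system, crash_while_polling=False):
    # Reuse a previously persisted job identifier if one exists.
    job_id = store.get("job_id")
    if job_id is None:
        job_id = system.submit()
        # Persist BEFORE polling so a crash cannot lose the identifier.
        store.set("job_id", job_id)

    if crash_while_polling:
        raise RuntimeError("simulated worker restart")

    while not system.is_done(job_id):
        pass  # a real operator would sleep between polls
    return job_id


store = InMemoryStore()
system = FakeExternalSystem()

# First attempt: the job is submitted, then the worker "crashes" while polling.
try:
    run_job(store, system, crash_while_polling=True)
except RuntimeError:
    pass

# Retry: the stored identifier is found, so nothing is submitted again.
result = run_job(store, system)
```

After the retry, `system.submitted` still holds a single entry, which is the property the paragraph describes.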



##########
airflow-core/docs/core-concepts/resumable-tasks.rst:
##########
@@ -0,0 +1,187 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+.. _concepts-resumable-tasks:
+
+Resumable Tasks
+===============
+
+.. versionadded:: 3.3.0
+
+Many data engineering workflows involve submitting work to an external system and waiting for it
+to complete. A Spark job, a BigQuery query, a Kubernetes batch pod, an EMR step: these are all
+tasks where the real work happens outside Airflow, and the operator's job is mostly to submit,
+poll, and collect the result.
+
+These tasks share a common failure mode. In classic operator cases, the worker slot is held for the
+entire polling duration, and if the worker process is restarted or the host is preempted, the task
+retries from scratch, losing all the progress made. Depending on the operator, that means the external
+job is submitted again, creating a duplicate run in context of the external system.
+
+Airflow recommends three approaches for handling long-running external work. Understanding the trade-offs
+between them helps you choose the right one for your situation.
+
+.. _concepts-resumable-tasks-deferrable:
+
+Deferrable Operators
+--------------------
+
+A deferrable operator pauses itself at the point where it would otherwise start polling, hands
+the polling work to the Triggerer component, and releases its worker slot. When the external
+condition is met, the Triggerer wakes the task and the worker resumes from where the operator
+left off.
+
+This is the most resource-efficient option. A single Triggerer process can concurrently watch
+thousands of conditions, so the rest of the worker pool stays free for other tasks.
+
+The trade-offs are:
+
+* A Triggerer component must be running. Deployments that do not include a Triggerer cannot use this pattern.
+* Writing a custom deferrable operator requires implementing a separate ``Trigger`` class in
+  addition to the operator itself.
+* The polling logic runs inside the Triggerer's async event loop. Blocking calls inside a
+  Trigger stall the entire Triggerer process.
+
+If a deferrable operator already exists for your use case, or your team is comfortable
+implementing one, this is the recommended path considering its resource efficiency.
+
+For more details, see :doc:`../authoring-and-scheduling/deferring`.
+
+.. _concepts-resumable-tasks-resumable:
+
+Resumable Tasks
+---------------
+
+A resumable task uses the task state store to persist a checkpoint before
+it would otherwise lose progress. On retry, the task reads that checkpoint and continues from
+where it left off rather than starting over.
+
+The worker slot is held for the full duration of the task, the same as a standard synchronous
+operator. The benefit is crash safety and continuity, not resource efficiency.
+
+Resumable tasks are useful when:
+
+* No deferrable operator exists for your external system and writing one is not practical.
+* You want crash recovery without operating a Triggerer.
+* The task processes work incrementally (for example, reading files from a list or paginating
+  through API results) and should be able to resume from its last completed batch.
+
+**General pattern**
+
+The task reads a checkpoint from ``task_store`` at the start, does some work, writes an updated
+checkpoint, and either continues or finishes. On the next run (whether due to a retry after a
+crash or a deliberate reschedule), it reads the checkpoint again and picks up from there.
+
+.. code-block:: python
+
+    from airflow.sdk import dag, task
+
+
+    @dag(schedule=None)
+    def process_files_dag():
+
+        @task(retries=5)
+        def process_files(context=None):
+            task_store = context["task_store"]
+            files = ["a.csv", "b.csv", "c.csv", "d.csv"]
+
+            last_processed = task_store.get("last_processed")
+            start_index = 0
+            if last_processed is not None:
+                start_index = files.index(last_processed) + 1
+
+            for file in files[start_index:]:
+                # ... process the file ...
+                task_store.set("last_processed", file)
+
+        process_files()
+
+
+    process_files_dag()
+
+This pattern works without any additional work, just plain old context. The state store is just

Review Comment:
   ```suggestion
   This pattern works without any additional work, relying only on ``context``. The state store is just
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
