Re: [PR] Add `ResumableJobMixin` with `SparkSubmitOperator` as a case study for surviving worker failures (standalone) [airflow]

via GitHub Mon, 25 May 2026 01:52:18 -0700


amoghrajesh commented on code in PR #67118:
URL: https://github.com/apache/airflow/pull/67118#discussion_r3297098490



##########
providers/apache/spark/src/airflow/providers/apache/spark/operators/spark_submit.py:
##########
@@ -198,8 +221,63 @@ def execute(self, context: Context) -> None:
             self.conf = 
inject_transport_information_into_spark_properties(self.conf, context)
         if self._hook is None:
             self._hook = self._get_hook()
+        if self._hook._should_track_driver_status:
+            return self.execute_resumable(context)

Review Comment:
   @Alexhans thanks for chiming in. Yes, exactly thats the intent. 
`GlueJobOperator``resume_glue_job_on_retry` is solving the same problem but 
with two rough edges that task_state addresses cleanly:
   
   * XCom is shared across all TIs in a DAG run — using it for retry continuity 
is a _workaround_. `task_state` is scoped to a single TI across its retries, 
which is exactly the right scope for "what job did I submit last time."
   "Scan all job runs as fallback" exists because there's no reliable place to 
store the ID before the worker dies. With `task_state`, the ID is persisted 
before polling starts, ie: no scan needed, no ambiguity about which run belongs 
to this retry.
   * Migrating Glue to `ResumableJobMixin` can be a probably nice follow-up — 
drop the XCom write + scan logic, implement the six abstract methods for 
resumable, and the retry behaviour becomes standard. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Add `ResumableJobMixin` with `SparkSubmitOperator` as a case study for surviving worker failures (standalone) [airflow]

Reply via email to