nailo2c commented on PR #67473:
URL: https://github.com/apache/airflow/pull/67473#issuecomment-4637017647

   I tested the `SparkSubmitOperator.on_kill()` path against a Kerberized 
Dataproc/YARN cluster, and the YARN kill request failed because the RM REST 
`PUT` was sent without Kerberos auth.
   
   What passed:
   
   - Kerberos RM REST polling for a successful YARN app.
   - Crash recovery when the original YARN app already succeeded: retry skipped 
resubmission.
   - Crash recovery while the original YARN app was still running: retry 
reconnected to the same app.
   - Crash recovery after the original YARN app was killed/failed: retry 
submitted a fresh app.
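
The three crash-recovery outcomes above can be read as a small decision table keyed on the RM-reported `state` and `finalStatus`. The sketch below is only an illustration of that table, not the PR's implementation: `RecoveryAction` and `decide_recovery` are hypothetical names, and treating every non-successful terminal state as "submit fresh" is an assumption.

```python
from enum import Enum


class RecoveryAction(Enum):
    """Hypothetical labels for what a retried task does with the old YARN app."""

    SKIP_RESUBMIT = "skip_resubmit"
    RECONNECT = "reconnect"
    SUBMIT_FRESH = "submit_fresh"


# YARN application states in which the app has not yet reached a terminal state.
_ACTIVE_STATES = frozenset({"NEW", "NEW_SAVING", "SUBMITTED", "ACCEPTED", "RUNNING"})


def decide_recovery(state: str, final_status: str) -> RecoveryAction:
    """Map the RM-reported state of the original app to a retry action."""
    if state == "FINISHED" and final_status == "SUCCEEDED":
        # The original app already succeeded: nothing to resubmit.
        return RecoveryAction.SKIP_RESUBMIT
    if state in _ACTIVE_STATES:
        # The original app is still alive: track it instead of submitting again.
        return RecoveryAction.RECONNECT
    # Assumption: KILLED, FAILED, or FINISHED without success all resubmit.
    return RecoveryAction.SUBMIT_FRESH
```

For example, `decide_recovery("RUNNING", "UNDEFINED")` yields `RecoveryAction.RECONNECT`, matching the "reconnected to the same app" case.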
   
   What failed:
   
   ```text
   YARN application submitted: application_1780697276912_0017
   Tracking YARN application application_1780697276912_0017 via ResourceManager 
REST API polling
   YARN application application_1780697276912_0017 status: RUNNING
   YARN kill request for application_1780697276912_0017 returned HTTP 401
   YARN application application_1780697276912_0017 is still RUNNING
   YARN application application_1780697276912_0017 status: FINISHED
   ```
   
   The RM state confirmed that the app was not killed; it completed naturally:
   
   ```json
   {
     "id": "application_1780697276912_0017",
     "state": "FINISHED",
     "finalStatus": "SUCCEEDED",
     "name": "airflow-67473-kill"
   }
   ```
   
   So the local task process received `SIGTERM` and Airflow attempted to kill
the YARN app, but the Kerberized RM returned `401`. The YARN app kept running
and eventually finished successfully instead of becoming `KILLED`.
   
   This appears to come from `SparkSubmitOperator.on_kill()` calling the public 
`hook.kill_yarn_application()`, while that public method does not pass 
`auth=self._resolved_yarn_rm_auth` to the REST `PUT`. The existing private 
`_kill_yarn_application()` path does pass the resolved auth.
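
To make the difference between the two paths concrete, here is a toy model of the behavior described above. It is a sketch under assumptions, not the hook's real code: `FakeKerberizedRM` and `ToyHook` are hypothetical stand-ins, the host `rm.example:8088` is made up, and the method bodies are simplified so that they return the HTTP status. Only the method names, `_resolved_yarn_rm_auth`, and the presence or absence of `auth=` mirror what I observed.

```python
from typing import Any, Optional


class FakeKerberizedRM:
    """Stand-in for a Kerberized RM: rejects PUTs that carry no auth."""

    def put(self, url: str, json: dict, auth: Optional[Any] = None) -> int:
        # 202 Accepted when credentials are present, 401 Unauthorized otherwise.
        return 202 if auth is not None else 401


class ToyHook:
    """Hypothetical model of the two kill paths (not the real hook)."""

    def __init__(self, rm: FakeKerberizedRM, auth: Any) -> None:
        self._rm = rm
        self._resolved_yarn_rm_auth = auth

    def _state_url(self, application_id: str) -> str:
        # RM REST endpoint for changing an application's state.
        return f"http://rm.example:8088/ws/v1/cluster/apps/{application_id}/state"

    def kill_yarn_application(self, application_id: str) -> int:
        # Public path as observed: the PUT is sent without auth.
        return self._rm.put(self._state_url(application_id), json={"state": "KILLED"})

    def _kill_yarn_application(self, application_id: str) -> int:
        # Private path: the resolved auth is passed to the PUT.
        return self._rm.put(
            self._state_url(application_id),
            json={"state": "KILLED"},
            auth=self._resolved_yarn_rm_auth,
        )
```

In this model the public method gets `401` and the private one gets `202`, which is the same split seen in the logs.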
   
   I also verified the proposed fix locally by making the public method
delegate to the existing private method:
   
   ```python
   def kill_yarn_application(self, application_id: str) -> None:
       """Public alias for ResumableJobMixin / operator on_kill paths."""
       self._kill_yarn_application(application_id)
   ```
   
   With that change in place, the same `SIGTERM` / `on_kill()` path sent an
   authenticated RM REST request and killed the YARN app:
   
   ```text
   YARN kill request for application_1780697276912_0019 returned HTTP 202
   YARN application application_1780697276912_0019 status: KILLED
   RuntimeError: YARN application application_1780697276912_0019 ended with 
state: KILLED, final status: KILLED
   ```
   
   ```json
   {
     "id": "application_1780697276912_0019",
     "state": "KILLED",
     "finalStatus": "KILLED",
     "name": "airflow-67473-kill"
   }
   ```
   
   It would also be worth adding an operator-level `on_kill()` test, since a 
hook-level `on_kill()` test does not cover this operator path.
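
A rough shape for such a test, using only `unittest.mock`. This is a sketch with hypothetical stand-ins (`FixedToyHook`, `ToyOperator`, and the `http` client are not Airflow classes); a real test would instantiate `SparkSubmitOperator` and patch whatever HTTP call the hook actually makes, which I have not shown here.

```python
from unittest import mock


class FixedToyHook:
    """Hypothetical stand-in for the hook with the delegating fix applied."""

    def __init__(self, http, auth) -> None:
        self._http = http
        self._resolved_yarn_rm_auth = auth

    def _kill_yarn_application(self, application_id: str) -> None:
        self._http.put(
            f"/ws/v1/cluster/apps/{application_id}/state",
            json={"state": "KILLED"},
            auth=self._resolved_yarn_rm_auth,
        )

    def kill_yarn_application(self, application_id: str) -> None:
        # Public method delegates to the authenticated private path.
        self._kill_yarn_application(application_id)


class ToyOperator:
    """Hypothetical stand-in operator: on_kill() uses the public hook method."""

    def __init__(self, hook, application_id: str) -> None:
        self._hook = hook
        self._application_id = application_id

    def on_kill(self) -> None:
        self._hook.kill_yarn_application(self._application_id)


def test_on_kill_sends_authenticated_put() -> None:
    http = mock.Mock()
    auth = object()  # sentinel standing in for the resolved Kerberos auth
    operator = ToyOperator(FixedToyHook(http, auth), "application_1780697276912_0019")

    operator.on_kill()

    # The kill must go out exactly once, and it must carry the resolved auth.
    http.put.assert_called_once()
    assert http.put.call_args.kwargs["auth"] is auth
    assert http.put.call_args.kwargs["json"] == {"state": "KILLED"}
```

The key point is that the assertion is made from the operator's `on_kill()` entry point, so a public method that drops `auth=` would fail the test even if the private method is correct.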

