nailo2c commented on PR #67473:
URL: https://github.com/apache/airflow/pull/67473#issuecomment-4637017647
I tested the `SparkSubmitOperator.on_kill()` path against a Kerberized
Dataproc/YARN cluster, and the YARN kill request failed because the RM REST
`PUT` was sent without Kerberos auth.
What passed:
- Kerberos RM REST polling for a successful YARN app.
- Crash recovery when the original YARN app already succeeded: retry skipped
resubmission.
- Crash recovery while the original YARN app was still running: retry
reconnected to the same app.
- Crash recovery after the original YARN app was killed/failed: retry
submitted a fresh app.
What failed:
```text
YARN application submitted: application_1780697276912_0017
Tracking YARN application application_1780697276912_0017 via ResourceManager REST API polling
YARN application application_1780697276912_0017 status: RUNNING
YARN kill request for application_1780697276912_0017 returned HTTP 401
YARN application application_1780697276912_0017 is still RUNNING
YARN application application_1780697276912_0017 status: FINISHED
```
The RM state confirmed that the app was not killed; it completed naturally:
```json
{
  "id": "application_1780697276912_0017",
  "state": "FINISHED",
  "finalStatus": "SUCCEEDED",
  "name": "airflow-67473-kill"
}
```
So the local task process received `SIGTERM` and Airflow attempted to kill the
YARN app, but the Kerberized RM rejected the request with `401`. The YARN app
kept running and eventually finished successfully instead of becoming `KILLED`.
This appears to come from `SparkSubmitOperator.on_kill()` calling the public
`hook.kill_yarn_application()`, while that public method does not pass
`auth=self._resolved_yarn_rm_auth` to the REST `PUT`. The existing private
`_kill_yarn_application()` path does pass the resolved auth.
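A minimal sketch of that difference, using stand-ins rather than the real hook. `FakeKerberizedRM`, `SketchHook`, and the `rm.example` host are hypothetical; only the two method names and `_resolved_yarn_rm_auth` come from the description above, and the URL shape follows the YARN RM REST application-state endpoint:

```python
from typing import Any, Optional


class FakeKerberizedRM:
    """Hypothetical stand-in for a Kerberized RM: rejects unauthenticated PUTs."""

    def __init__(self) -> None:
        self.state = "RUNNING"

    def put(self, url: str, json: dict, auth: Optional[Any] = None) -> int:
        if auth is None:
            return 401  # no credentials were attached to the request
        self.state = json["state"]
        return 202  # kill request accepted


class SketchHook:
    """Hypothetical hook; not the real Airflow class."""

    def __init__(self, rm: FakeKerberizedRM, auth: Any) -> None:
        self._rm = rm
        self._resolved_yarn_rm_auth = auth

    @staticmethod
    def _state_url(application_id: str) -> str:
        return f"http://rm.example:8088/ws/v1/cluster/apps/{application_id}/state"

    def kill_yarn_application(self, application_id: str) -> int:
        # Mirrors the reported behaviour: the PUT is sent without auth.
        return self._rm.put(self._state_url(application_id), json={"state": "KILLED"})

    def _kill_yarn_application(self, application_id: str) -> int:
        # Mirrors the private path: the resolved auth is passed along.
        return self._rm.put(
            self._state_url(application_id),
            json={"state": "KILLED"},
            auth=self._resolved_yarn_rm_auth,
        )
```

Against this fake RM, the public method gets `401` and leaves the app `RUNNING`, while the private method gets `202` and moves it to `KILLED`, which matches the logs in this comment.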
I also verified the proposed fix locally by making the public method delegate
to the existing private method:
```python
def kill_yarn_application(self, application_id: str) -> None:
    """Public alias for ResumableJobMixin / operator on_kill paths."""
    self._kill_yarn_application(application_id)
```
With that change in place, the same `SIGTERM` / `on_kill()` path sent an
authenticated RM REST request and killed the YARN app:
```text
YARN kill request for application_1780697276912_0019 returned HTTP 202
YARN application application_1780697276912_0019 status: KILLED
RuntimeError: YARN application application_1780697276912_0019 ended with state: KILLED, final status: KILLED
```
```json
{
  "id": "application_1780697276912_0019",
  "state": "KILLED",
  "finalStatus": "KILLED",
  "name": "airflow-67473-kill"
}
```
It would also be worth adding an operator-level `on_kill()` test, since a
hook-level `on_kill()` test does not cover this operator path.
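A rough shape for such a test, as a sketch only. `SketchOperator` and `SketchHook` are hypothetical stand-ins so the example runs without Airflow; a real test would instantiate `SparkSubmitOperator` and patch the HTTP call the hook actually makes, which is not shown in this comment:

```python
from unittest import mock


class SketchHook:
    """Hypothetical hook whose public kill method delegates to the private one."""

    def __init__(self, session, auth):
        self._session = session
        self._resolved_yarn_rm_auth = auth

    def _kill_yarn_application(self, application_id: str) -> None:
        self._session.put(
            f"http://rm.example:8088/ws/v1/cluster/apps/{application_id}/state",
            json={"state": "KILLED"},
            auth=self._resolved_yarn_rm_auth,
        )

    def kill_yarn_application(self, application_id: str) -> None:
        self._kill_yarn_application(application_id)


class SketchOperator:
    """Hypothetical stand-in for the operator; on_kill() calls the public method."""

    def __init__(self, hook: SketchHook, application_id: str) -> None:
        self._hook = hook
        self._application_id = application_id

    def on_kill(self) -> None:
        self._hook.kill_yarn_application(self._application_id)


def test_operator_on_kill_sends_authenticated_put() -> None:
    session = mock.Mock()
    auth = object()
    operator = SketchOperator(SketchHook(session, auth), "application_1_0001")

    operator.on_kill()

    # Exactly one PUT, and it must carry the resolved auth object.
    session.put.assert_called_once()
    assert session.put.call_args.kwargs["auth"] is auth
    assert session.put.call_args.kwargs["json"] == {"state": "KILLED"}
```

The point of driving the assertion through `on_kill()` rather than through the hook is that it fails if the operator is ever wired to a method that drops the auth argument.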