tardunge opened a new pull request, #63938:
URL: https://github.com/apache/airflow/pull/63938
Adds a `RayKubernetesOperator` that manages RayJob custom resources on
Kubernetes via the KubeRay operator, enabling orchestration of Ray workloads
(distributed ML training, batch inference, data processing) directly from
Airflow DAGs.
**Motivation:** The CNCF Kubernetes provider supports Spark jobs via
`SparkKubernetesOperator` but has no equivalent for Ray. The only Ray-related
code in Airflow is in the Google Cloud provider (for GCP's managed Ray
service), which doesn't help users running KubeRay on self-managed clusters.
We've been running this operator in production for ~8 months.
**New components:**
- `RayKubernetesOperator`: creates and monitors RayJob custom resources
  - YAML/JSON application file or `template_spec` dict input
  - Reattach on scheduler restart (label-based job discovery)
  - Dual timeouts: one for cluster startup, one for job execution
  - Log streaming from the Ray head pod via `PodManager`
  - Automatic cleanup on completion/failure (`delete_on_termination`)
  - RFC 1123 compliant name sanitization with optional random suffix
- `RayObjectLauncher`: manages the RayJob custom resource lifecycle (create, status, delete)
  - Tenacity retry for transient 409 conflicts
- `RayJobStatus` / `RayJobDeploymentStatus` constants
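
To illustrate the name sanitization bullet, here is a minimal sketch of how a job name could be made RFC 1123 compliant. It is not the code from this PR: the helper name `sanitize_name`, its parameters, the `rayjob` fallback, and the choice of the 63-character DNS label form are all assumptions made for the example.

```python
import re
import secrets
import string

MAX_LABEL_LEN = 63  # RFC 1123 DNS label length limit


def sanitize_name(name: str, add_suffix: bool = False, suffix_len: int = 8) -> str:
    """Hypothetical helper: derive an RFC 1123 label from an arbitrary string."""
    # Lowercase, then replace each run of disallowed characters with one hyphen.
    cleaned = re.sub(r"[^a-z0-9-]+", "-", name.lower()).strip("-")

    suffix = ""
    if add_suffix:
        # Random lowercase alphanumeric suffix to avoid collisions between runs.
        alphabet = string.ascii_lowercase + string.digits
        suffix = "-" + "".join(secrets.choice(alphabet) for _ in range(suffix_len))

    # Truncate the base so base + suffix fits, then drop any hyphen that the
    # truncation left at the end (labels must end with an alphanumeric).
    base = cleaned[: MAX_LABEL_LEN - len(suffix)].rstrip("-")
    if not base:
        base = "rayjob"  # assumed fallback when nothing usable remains
    return base + suffix
```

For example, `sanitize_name("My_Ray Job!")` yields `my-ray-job`, and passing `add_suffix=True` appends a hyphen plus eight random characters while keeping the result within 63 characters.
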
**No new dependencies** — uses `kubernetes` client (already required) and
`tenacity` (already required by the provider).
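
The retry on transient 409 conflicts can be pictured with the sketch below. The PR uses `tenacity` for this; the example deliberately swaps in a plain standard-library loop so that it runs without any third-party package. `ApiError` is a stand-in for a Kubernetes client exception that exposes an HTTP `status` attribute, and `call_with_conflict_retry` and its parameters are hypothetical names, not the launcher's API.

```python
import time


class ApiError(Exception):
    """Stand-in for a Kubernetes API exception carrying an HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_conflict_retry(func, max_attempts: int = 3, base_delay: float = 0.0):
    """Call func(), retrying only when it fails with a 409 Conflict."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ApiError as exc:
            # Any other status, or the last attempt, is re-raised unchanged.
            if exc.status != 409 or attempt == max_attempts:
                raise
            # Linear backoff between attempts (no wait when base_delay is 0).
            time.sleep(base_delay * attempt)
```

Only 409 responses are retried here, so errors such as 404 or 403 surface immediately instead of being masked by the retry loop.
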
---
##### Was generative AI tooling used to co-author this PR?
- [X] Yes — Claude Code (Opus 4.6)
Generated-by: Claude Code (Opus 4.6) following [the
guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)