1fanwang opened a new pull request, #68480:
URL: https://github.com/apache/airflow/pull/68480
## Why
`KubernetesExecutor` creates worker pods one at a time within each scheduler
loop. In
`sync()` the dequeued tasks are handed to `run_next()`, which builds the pod
and calls
`create_namespaced_pod` **synchronously**, blocking on the API response
before the next
pod is created. (`run_pod_async` refers only to the watcher tracking the pod
afterwards —
the create call itself blocks.)
When per-create latency is high — API round-trip plus mutating admission
webhooks, common
in larger clusters — this serializes the scheduler loop and caps how many
pods a loop can
launch. Raising `worker_pods_creation_batch_size` does not help, because the
dequeued batch
is still created sequentially.
## What changed
Opt-in concurrent creation, off by default:
- `[kubernetes_executor] async_pod_creation` (bool, default `False`) — when
enabled, the
pods dequeued in a scheduler loop are built synchronously (pod-mutation
hook and
reconciliation unchanged), then their create calls are issued concurrently
using the
asynchronous Kubernetes client (`kubernetes_asyncio`, already a provider
dependency),
bounded by a semaphore.
- `[kubernetes_executor] pod_creation_max_concurrency` (int, default `0` →
falls back to
`worker_pods_creation_batch_size`) — caps simultaneous create requests so
the burst
respects API-server priority-and-fairness / rate limits.
- The sequential path stays the default; its create behaviour is unchanged.
The async
`ApiException` is converted to the sync client's `ApiException`, so the
existing retry /
exceeded-quota / 429-backoff handling is reused untouched by both paths.
- New `pod_creation_batch_duration` timer (and `pod_creation_batch_size`
gauge), emitted by
both paths, measure per-loop batch creation time.
## E2E validation
Live `kind` cluster, full Airflow deployed via the official Helm chart with
`KubernetesExecutor`, a fan-out DAG, and a mutating admission webhook
injecting 200ms
per-create latency on worker pods (modelling real API + admission latency,
since kind
admits in sub-ms). Batch wall-time measured from the webhook's admission
timestamps,
independent of Airflow's own metrics.
Per scheduler-loop batch (what `pod_creation_batch_duration` measures):
| metric | serial (`async_pod_creation=False`) | async (`=True`) |
|---|---|---|
| in-loop batch create time | ~3.3s | ~0.11s |
| concurrent admissions (<50ms apart) | 0 / 29 | 24 / 29 |
End-to-end for a 30-task fan-out: **7.95s → 3.06s** (~2.5x). The end-to-end
factor is
bounded by how many tasks the scheduler queues per loop; the per-loop create
itself is
~30x faster, and the speedup scales with batch size per loop (the large
single-DAG burst
case).
## Tests
- The existing sequential create / publish-retry / 403-quota / 409 / 410 /
429 / 500 cases
pass unchanged — no behaviour change on the default path.
- New: concurrent creation creates every pod; the semaphore bound is
respected; the
async→sync `ApiException` conversion drives the existing quota-requeue and
429-`create_pods_after` backoff; a build/create failure for one pod
isolates to that task
while the rest of the batch still create; config parsing.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]