1fanwang opened a new pull request, #68480:
URL: https://github.com/apache/airflow/pull/68480

   ## Why
   
   `KubernetesExecutor` creates worker pods one at a time within each scheduler 
loop. In
   `sync()` the dequeued tasks are handed to `run_next()`, which builds the pod 
and calls
   `create_namespaced_pod` **synchronously**, blocking on the API response 
before the next
   pod is created. (`run_pod_async` refers only to the watcher tracking the pod 
afterwards —
   the create call itself blocks.)
   
   When per-create latency is high — API round-trip plus mutating admission 
webhooks, common
   in larger clusters — this serializes the scheduler loop and caps how many 
pods a loop can
   launch. Raising `worker_pods_creation_batch_size` does not help, because the 
dequeued batch
   is still created sequentially.
   
   ## What changed
   
   Opt-in concurrent creation, off by default:
   
   - `[kubernetes_executor] async_pod_creation` (bool, default `False`) — when 
enabled, the
     pods dequeued in a scheduler loop are built synchronously (pod-mutation 
hook and
     reconciliation unchanged), then their create calls are issued concurrently 
using the
     asynchronous Kubernetes client (`kubernetes_asyncio`, already a provider 
dependency),
     bounded by a semaphore.
   - `[kubernetes_executor] pod_creation_max_concurrency` (int, default `0` → 
falls back to
     `worker_pods_creation_batch_size`) — caps simultaneous create requests so 
the burst
     respects API-server priority-and-fairness / rate limits.
   - The sequential path stays the default; its create behaviour is unchanged. 
The async
     `ApiException` is converted to the sync client's `ApiException`, so the 
existing retry /
     exceeded-quota / 429-backoff handling is reused untouched by both paths.
   - New `pod_creation_batch_duration` timer (and `pod_creation_batch_size` 
gauge), emitted by
     both paths, measure per-loop batch creation time.
   
   ## E2E validation
   
   Live `kind` cluster, full Airflow deployed via the official Helm chart with
   `KubernetesExecutor`, a fan-out DAG, and a mutating admission webhook 
injecting 200ms
   per-create latency on worker pods (modelling real API + admission latency, 
since kind
   admits in sub-ms). Batch wall-time measured from the webhook's admission 
timestamps,
   independent of Airflow's own metrics.
   
   Per scheduler-loop batch (what `pod_creation_batch_duration` measures):
   
   | metric | serial (`async_pod_creation=False`) | async (`=True`) |
   |---|---|---|
   | in-loop batch create time | ~3.3s | ~0.11s |
   | concurrent admissions (<50ms apart) | 0 / 29 | 24 / 29 |
   
   End-to-end for a 30-task fan-out: **7.95s → 3.06s** (~2.5x). The end-to-end 
factor is
   bounded by how many tasks the scheduler queues per loop; the per-loop create 
itself is
   ~30x faster, and the speedup scales with batch size per loop (the large 
single-DAG burst
   case).
   
   ## Tests
   
   - The existing sequential create / publish-retry / 403-quota / 409 / 410 / 
429 / 500 cases
     pass unchanged — no behaviour change on the default path.
   - New: concurrent creation creates every pod; the semaphore bound is 
respected; the
     async→sync `ApiException` conversion drives the existing quota-requeue and
     429-`create_pods_after` backoff; a build/create failure for one pod 
isolates to that task
     while the rest of the batch still create; config parsing.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to