akshayjadiyanv opened a new pull request, #38701:
URL: https://github.com/apache/beam/pull/38701
## Summary
Adds opt-in support for NVIDIA Dynamo (`ai-dynamo[vllm]`) as the underlying
engine for `VLLMCompletionsModelHandler` and `VLLMChatModelHandler`. When
`use_dynamo=True`, the handler launches a `dynamo.frontend` process as the
OpenAI-compatible local endpoint and a separate `dynamo.vllm` worker, instead
of `vllm.entrypoints.openai.api_server`. Existing native-vLLM behavior is
unchanged when the flag is absent.
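As a rough illustration of the switch described above, the sketch below picks which processes to launch based on the flag. `planned_commands` is a hypothetical helper, not code from this PR, and passing `--model` to `dynamo.vllm` is an assumption; ports and other flags are omitted.

```python
import sys
from typing import List

NATIVE_MODULE = "vllm.entrypoints.openai.api_server"
DYNAMO_FRONTEND_MODULE = "dynamo.frontend"
DYNAMO_WORKER_MODULE = "dynamo.vllm"


def planned_commands(model_name: str,
                     use_dynamo: bool = False) -> List[List[str]]:
    """Return the argv lists a handler might launch (hypothetical helper)."""
    python = sys.executable
    if not use_dynamo:
        # Native path: a single OpenAI-compatible vLLM server process.
        return [[python, "-m", NATIVE_MODULE, "--model", model_name]]
    # Dynamo path: a frontend serving the OpenAI-compatible endpoint,
    # plus a separate vLLM worker process.
    # Assumption: the worker accepts --model like the native server does.
    return [
        [python, "-m", DYNAMO_FRONTEND_MODULE],
        [python, "-m", DYNAMO_WORKER_MODULE, "--model", model_name],
    ]
```

With the flag absent this yields one command; with `use_dynamo=True` it yields two, matching the frontend/worker split.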
This supersedes #36966 (now closed) and rebases the same approach onto
current master, preserving the recent batching-kwargs additions to the
ModelHandler base.
## Embedded mode scope and limitations
This change adds **embedded, single-worker Dynamo**: Beam launches one
Dynamo frontend and one vLLM worker per Beam worker, as local processes on
that worker. The following Dynamo features are **not active** in embedded mode:
- KV-aware routing (defaults to `--router-mode round-robin` and
`--no-router-kv-events`).
- Disaggregated prefill / decode workers.
- KVBM offload across nodes.
- The Dynamo Planner (autoscaling) and Grove orchestration.
Embedded Dynamo also requires an etcd-style discovery service. When
`ETCD_ENDPOINTS` is unset, Beam starts a local `etcd` process (requires the
`etcd` binary in the worker container); when set, Beam uses the external
discovery service.
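The discovery decision above can be sketched as follows. `resolve_discovery` is a hypothetical helper for illustration, not the PR's implementation; treating an empty `ETCD_ENDPOINTS` as unset and raising when the `etcd` binary is missing are both assumptions.

```python
import os
import shutil
from typing import Callable, Mapping, Optional, Tuple


def resolve_discovery(
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Tuple[str, str]:
    """Decide between an external and a local etcd (hypothetical helper).

    Returns ("external", endpoints) when ETCD_ENDPOINTS is set, otherwise
    ("local", path_to_etcd_binary).
    """
    env = os.environ if env is None else env
    endpoints = env.get("ETCD_ENDPOINTS")
    if endpoints:
        # An external discovery service is configured; start nothing locally.
        return ("external", endpoints)
    etcd_path = which("etcd")
    if etcd_path is None:
        # Assumption: fail early if the worker container lacks the binary.
        raise RuntimeError(
            "ETCD_ENDPOINTS is unset and no 'etcd' binary was found on PATH")
    return ("local", etcd_path)
```

The `env` and `which` parameters exist only so the decision can be exercised without touching the real environment.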
## API additions
`VLLMCompletionsModelHandler` and `VLLMChatModelHandler` gain two
**keyword-only** parameters:
- `use_dynamo: bool = False` — opt in to Dynamo.
- `dynamo_frontend_kwargs: Optional[dict[str, Optional[str]]] = None` —
extra kwargs forwarded to `dynamo.frontend`.
When `use_dynamo=True`, the existing `vllm_server_kwargs` are forwarded to
`dynamo.vllm` instead of `vllm.entrypoints.openai.api_server`. Sensible Dynamo
defaults are layered in so users only need `use_dynamo=True` for a working
setup.
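One way to picture the layered defaults is a plain dictionary merge followed by conversion to command-line flags. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the default values are taken from the routing flags listed earlier in this description, and reading a `None` value as a flag without an argument is inferred from the `Optional[str]` type rather than confirmed by the PR.

```python
from typing import Dict, List, Optional

KwargMap = Dict[str, Optional[str]]

# Assumed frontend defaults, based on the flags named in this description.
ASSUMED_FRONTEND_DEFAULTS: KwargMap = {
    "router-mode": "round-robin",
    "no-router-kv-events": None,  # None: flag takes no value (assumption)
}


def layer_kwargs(defaults: KwargMap,
                 overrides: Optional[KwargMap]) -> KwargMap:
    """Return defaults with user-supplied kwargs layered on top."""
    merged = dict(defaults)
    merged.update(overrides or {})
    return merged


def to_cli_args(kwargs: KwargMap) -> List[str]:
    """Turn {'name': value} into ['--name', value], skipping None values."""
    args: List[str] = []
    for name, value in kwargs.items():
        args.append(f"--{name}")
        if value is not None:
            args.append(value)
    return args
```

User-supplied entries win over defaults, so a caller who passes nothing still gets a complete flag set.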
`validate_inference_args` is now a no-op on both handlers, so OpenAI-style
request kwargs (e.g. `max_tokens`) can be passed via
`RunInference(model_handler, inference_args={...})`.
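To show why accepting these kwargs matters, the sketch below builds an OpenAI-style completions request body from `inference_args`. `build_completion_request` is a hypothetical helper; how the handlers actually forward the arguments is not shown in this description, so the merge order here is an assumption.

```python
from typing import Any, Dict, Optional


def build_completion_request(
    model_name: str,
    prompt: str,
    inference_args: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge per-request kwargs into a completions payload (hypothetical)."""
    # Start from the caller's request kwargs, e.g. {"max_tokens": 64}.
    payload: Dict[str, Any] = dict(inference_args or {})
    # Assumption: the handler-owned fields always win over inference_args.
    payload["model"] = model_name
    payload["prompt"] = prompt
    return payload
```

For example, `build_completion_request("Qwen/Qwen3-0.6B", "Hello", {"max_tokens": 64})` produces a payload carrying `model`, `prompt`, and `max_tokens`.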
## Example pipeline
`apache_beam/examples/inference/vllm_text_completion.py` gains
`--use_dynamo` and `--max_tokens` flags. Without `--use_dynamo`, behavior is
unchanged from current master.
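The two flags could be declared along these lines. This is a sketch, not the example's actual code: the flag names come from this description, while `parse_example_args`, the `store_true` action, the `int` type, and the `None` default are assumptions.

```python
import argparse
from typing import List, Optional, Tuple


def parse_example_args(
    argv: Optional[List[str]] = None,
) -> Tuple[argparse.Namespace, List[str]]:
    """Parse the example's own flags and pass the rest through (sketch)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--use_dynamo",
        action="store_true",
        default=False,
        help="Serve the model through Dynamo instead of native vLLM.")
    parser.add_argument(
        "--max_tokens",
        type=int,
        default=None,
        help="Optional cap on generated tokens per request.")
    # Unknown flags (e.g. runner options) are returned for the pipeline.
    return parser.parse_known_args(argv)
```

Leaving `--use_dynamo` off keeps the default of `False`, which corresponds to the unchanged native-vLLM path.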
## Validation
- Unit tests: new `vllm_inference_test.py` covers (a) native vLLM still
launches a single server process; (b) Dynamo launches two processes with
separate frontend/engine kwargs; (c) `validate_inference_args` accepts OpenAI
request kwargs.
- End-to-end smoke: validated the runtime code on a GCP T4 VM (DirectRunner)
with `Qwen/Qwen3-0.6B` running through `ai-dynamo[vllm]` 0.7.0 + the embedded
etcd path.
- Lint: yapf, pylint (10.00/10), flake8, isort all clean for the touched
files.
## Not in this change (deliberate follow-ups)
- **Dataflow IT for the Dynamo path**: a commented-out block is included in
`sdks/python/test-suites/dataflow/common.gradle` documenting how to enable it.
Enabling it requires updating `vllm.dockerfile.old` to install `etcd` and
`ai-dynamo[vllm]`, and provisioning an L4 (or larger) GPU pool. The existing two
vLLM Dataflow ITs are unchanged.
- **Dockerfile changes**: `vllm.dockerfile.old` is intentionally left alone.
Bumping the Beam version and adding `ai-dynamo[vllm]` to the existing
2.58.1-based image has non-trivial dependency-resolution risk and should be
handled alongside enabling the Dataflow IT.
## Credits
Builds on the `users/damccorm/dynamo` branch and PR #36966. The runtime/API
design here mirrors that work, rebased onto current master and scoped to
embedded mode with explicit defaults.