akshayjadiyanv opened a new pull request, #38701:
URL: https://github.com/apache/beam/pull/38701

   ## Summary
   
   Adds opt-in support for NVIDIA Dynamo (`ai-dynamo[vllm]`) as the underlying 
engine for `VLLMCompletionsModelHandler` and `VLLMChatModelHandler`. When 
`use_dynamo=True`, the handler launches a `dynamo.frontend` process as the 
OpenAI-compatible local endpoint and a separate `dynamo.vllm` worker, instead 
of `vllm.entrypoints.openai.api_server`. Existing native-vLLM behavior is 
unchanged when the flag is absent.
   
   This supersedes #36966 (now closed) and rebases the same approach onto 
current master, preserving the recent batching-kwargs additions to the 
ModelHandler base.
   
   ## Embedded mode scope and limitations
   
   This change adds **embedded, single-worker Dynamo**: each Beam worker 
launches one Dynamo frontend and one vLLM worker as local processes on the 
same machine. The following Dynamo features are **not active** in embedded mode:
   
   - KV-aware routing (defaults to `--router-mode round-robin` and 
`--no-router-kv-events`).
   - Disaggregated prefill / decode workers.
   - KVBM offload across nodes.
   - The Dynamo Planner (autoscaling) and Grove orchestration.
   
   Embedded Dynamo also requires an etcd-style discovery service. When 
`ETCD_ENDPOINTS` is unset, Beam starts a local `etcd` process (requires the 
`etcd` binary in the worker container); when set, Beam uses the external 
discovery service.
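
The discovery choice described above can be sketched as a small pure function. This is a minimal illustration, not the code in this PR: `resolve_discovery` and `LOCAL_ETCD_ENDPOINT` are hypothetical names, the local address is assumed to be etcd's stock client URL, and treating an empty `ETCD_ENDPOINTS` value the same as an unset one is also an assumption.

```python
import os
from typing import Mapping, Optional, Tuple

# Assumption: etcd's stock client URL. The PR description does not say which
# address the locally started etcd process listens on.
LOCAL_ETCD_ENDPOINT = "http://localhost:2379"


def resolve_discovery(
        env: Optional[Mapping[str, str]] = None) -> Tuple[str, bool]:
    """Returns (endpoints, start_local_etcd) for a hypothetical launcher."""
    env = os.environ if env is None else env
    # Assumption: an empty or whitespace-only value counts as unset.
    endpoints = env.get("ETCD_ENDPOINTS", "").strip()
    if endpoints:
        # An external discovery service is configured, so no etcd is spawned.
        return endpoints, False
    # Nothing configured: the launcher would start a local etcd process,
    # which needs the etcd binary in the worker container.
    return LOCAL_ETCD_ENDPOINT, True
```

Passing the environment as an argument keeps the decision testable without touching `os.environ`.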
   
   ## API additions
   
   `VLLMCompletionsModelHandler` and `VLLMChatModelHandler` gain two 
**keyword-only** parameters:
   
   - `use_dynamo: bool = False` — opt in to Dynamo.
   - `dynamo_frontend_kwargs: Optional[dict[str, Optional[str]]] = None` — 
extra kwargs forwarded to `dynamo.frontend`.
   
   When `use_dynamo=True`, the existing `vllm_server_kwargs` are forwarded to 
`dynamo.vllm` instead of `vllm.entrypoints.openai.api_server`. Sensible Dynamo 
defaults are layered in so users only need `use_dynamo=True` for a working 
setup.
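
One way to picture the default layering is the sketch below. It is an illustration under stated assumptions rather than the handler's actual code: `build_command` and `DEFAULT_FRONTEND_KWARGS` are hypothetical names, the `python -m <module>` invocation is assumed, dictionary keys are assumed to map directly to `--key` flags, and a `None` value is assumed to mean a flag with no argument (suggested by the `Optional[str]` value type). The two default entries are the frontend defaults listed in the limitations section.

```python
import sys
from typing import Dict, List, Optional

# Frontend defaults named in the PR description for embedded mode.
DEFAULT_FRONTEND_KWARGS: Dict[str, Optional[str]] = {
    "router-mode": "round-robin",
    "no-router-kv-events": None,  # assumption: None means a bare flag
}


def build_command(
        module: str,
        defaults: Dict[str, Optional[str]],
        user_kwargs: Optional[Dict[str, Optional[str]]] = None) -> List[str]:
    """Builds an argv list, letting user kwargs override the defaults."""
    # Later entries win, so user-supplied values replace defaults per key.
    merged = {**defaults, **(user_kwargs or {})}
    cmd = [sys.executable, "-m", module]
    for key, value in merged.items():
        cmd.append(f"--{key}")
        if value is not None:
            cmd.append(value)
    return cmd
```

With this shape, calling `build_command("dynamo.frontend", DEFAULT_FRONTEND_KWARGS)` with no user kwargs yields a command that carries only the defaults, which matches the stated goal that `use_dynamo=True` alone should be enough.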
   
   `validate_inference_args` is now a no-op on both handlers, so OpenAI-style 
request kwargs (e.g. `max_tokens`) can be passed via 
`RunInference(model_handler, inference_args={...})`.
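
The effect of the no-op can be shown with stand-in classes. This sketch does not use Beam itself: `BaseHandler` and `PermissiveHandler` are hypothetical, and it assumes the inherited behavior is to reject any non-empty `inference_args`, which the description implies but does not state.

```python
from typing import Any, Dict, Optional


class BaseHandler:
    """Stand-in for a base handler that rejects any inference_args."""

    def validate_inference_args(
            self, inference_args: Optional[Dict[str, Any]]) -> None:
        if inference_args:
            raise ValueError(
                "inference_args were provided, but none are expected")


class PermissiveHandler(BaseHandler):
    """Stand-in for a handler whose validation is a no-op."""

    def validate_inference_args(
            self, inference_args: Optional[Dict[str, Any]]) -> None:
        # Accept everything; invalid request kwargs would surface later,
        # when the serving endpoint handles the request.
        pass
```

The trade-off in this design is that a misspelled request kwarg is no longer caught at pipeline construction and only surfaces when the request reaches the server.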
   
   ## Example pipeline
   
   `apache_beam/examples/inference/vllm_text_completion.py` gains 
`--use_dynamo` and `--max_tokens` flags. Without `--use_dynamo`, behavior is 
unchanged from current master.
   
   ## Validation
   
   - Unit tests: new `vllm_inference_test.py` covers (a) native vLLM still 
launches a single server process; (b) Dynamo launches two processes with 
separate frontend/engine kwargs; (c) `validate_inference_args` accepts OpenAI 
request kwargs.
   - End-to-end smoke: validated the runtime code on a GCP T4 VM (DirectRunner) 
with `Qwen/Qwen3-0.6B` running through `ai-dynamo[vllm]` 0.7.0 + the embedded 
etcd path.
   - Lint: yapf, pylint (10.00/10), flake8, isort all clean for the touched 
files.
   
   ## Not in this change (deliberate follow-ups)
   
   - **Dataflow IT for the Dynamo path**: a commented-out block in 
`sdks/python/test-suites/dataflow/common.gradle` documents how to enable it. 
Enabling it requires updating `vllm.dockerfile.old` to install `etcd` and 
`ai-dynamo[vllm]`, and provisioning an L4 (or larger) GPU pool. The existing two 
vLLM Dataflow ITs are unchanged.
   - **Dockerfile changes**: `vllm.dockerfile.old` is intentionally left alone. 
Bumping the Beam version and adding `ai-dynamo[vllm]` to the existing 
2.58.1-based image has non-trivial dependency-resolution risk and should be 
handled alongside enabling the Dataflow IT.
   
   ## Credits
   
   Builds on the `users/damccorm/dynamo` branch and PR #36966. The runtime/API 
design here mirrors that work, rebased onto current master and scoped to 
embedded mode with explicit defaults.

