[PR] Bump vllm from 0.10.1.1 to 0.23.0 in /sdks/python/container/ml/py312 [beam]

via GitHub Wed, 17 Jun 2026 07:34:53 -0700


dependabot[bot] opened a new pull request, #39003:
URL: https://github.com/apache/beam/pull/39003


   Bumps [vllm](https://github.com/vllm-project/vllm) from 0.10.1.1 to 0.23.0.
   <details>
   <summary>Release notes</summary>
   <p><em>Sourced from <a 
href="https://github.com/vllm-project/vllm/releases">vllm's 
releases</a>.</em></p>
   <blockquote>
   <h2>v0.23.0</h2>
   <h1>vLLM v0.23.0 Release Notes</h1>
   <p>Please note that MiniMax M3 is not yet supported in this version. For M3 
usage guides, follow the <a 
href="https://recipes.vllm.ai/MiniMaxAI/MiniMax-M3">vLLM recipe</a>.</p>
   <h2>Highlights</h2>
   <p>This release features 408 commits from 200 contributors (63 new)!</p>
   <ul>
   <li><strong>DeepSeek-V4 matures across backends</strong>: Following its 
introduction in v0.22.0, DeepSeek-V4 received another large hardening and 
optimization pass. Its sparse MLA metadata is now decoupled from DeepSeek-V3.2 
(<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44699";>#44699</a>), 
it gained a TRTLLM-gen attention kernel (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43827";>#43827</a>), 
EPLB support for the Mega-MoE (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43339";>#43339</a>), 
selective prefix-cache retention for sliding-window KV cache (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43447";>#43447</a>), 
and an index-share feature for DSA MTP (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44420";>#44420</a>). 
The model was also detached from <code>torch.compile</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43746";>#43746</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43891">#43891</a>), 
its attention and RoPE paths were 
refactored (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44569";>#44569</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44262";>#44262</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43926";>#43926</a>), 
and an XPU attention decode path was added (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42953";>#42953</a>).</li>
   <li><strong>Model Runner V2 expands to more dense models</strong>: MRv2 is 
now selected by default for <strong>Llama and Mistral dense models</strong> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43458";>#43458</a>) 
in addition to Qwen3. It gained a FlashInfer sampler (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42472";>#42472</a>), 
breakable CUDA graphs (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44050";>#44050</a>), 
pipeline-parallel bubble elimination (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42187";>#42187</a>), 
kernel block-size support for hybrid models (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38831";>#38831</a>), 
and Gemma 4 MTP (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43241";>#43241</a>).</li>
   <li><strong>Rust frontend grows up</strong>: The experimental Rust frontend 
added a streaming <code>generate</code> endpoint (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43779";>#43779</a>), 
dynamic LoRA endpoints (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43778";>#43778</a>), 
<code>/version</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43854";>#43854</a>) 
and <code>/server_info</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43942";>#43942</a>) 
endpoints, a server-router extension hook (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43774";>#43774</a>), 
request-ID headers (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43883";>#43883</a>), 
and many new tool parsers (InternLM2 <a 
href="https://redirect.github.com/vllm-project/vllm/issues/43481";>#43481</a>, 
hy_v3 <a 
href="https://redirect.github.com/vllm-project/vllm/issues/43872";>#43872</a>, 
Phi-4-mini <a 
href="https://redirect.github.com/vllm-project/vllm/issues/44213">#44213</a>, 
Gemma4 <a 
href="https://redirect.github.com/vllm-project/vllm/issues/43850";>#43850</a>).</li>
   <li><strong>Gemma 4</strong>: Added encoder-free <strong>Gemma 4 
Unified</strong> support (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44429";>#44429</a>) 
and Gemma 4 MTP (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43241";>#43241</a>), 
plus numerous accuracy and startup fixes.</li>
   <li><strong>Transformers v5 compatibility</strong>: vLLM now targets 
Transformers v5, with vendored MiniCPM-V/O processors (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44282";>#44282</a>) 
and compatibility fixes for Sarvam (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38804";>#38804</a>) 
and Voxtral (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44559";>#44559</a>).</li>
   <li><strong>Multi-tier KV cache offloading</strong>: The offloading 
framework gained an <strong>object-store secondary tier</strong> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41968";>#41968</a>), 
HMA enabled by default for capable connectors (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41847";>#41847</a>), 
tiering support for HMA models (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44287";>#44287</a>), 
and a per-request offloading policy via the <code>on_new_request</code> 
lifecycle hook (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43205";>#43205</a>).</li>
   <li><strong>Unified parser</strong>: Reasoning and tool-call parsing are now 
unified behind a single <code>Parser.parse()</code> interface (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44267";>#44267</a>), 
with the Responses parser migrated to it (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42977";>#42977</a>).</li>
   </ul>
   <h3>Model Support</h3>
   <ul>
   <li><strong>New models</strong>: Step-3.7-Flash (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43859";>#43859</a>), 
Cosmos3 Reasoner (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43356";>#43356</a>), 
Gemma 4 Unified encoder-free (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44429";>#44429</a>), 
JetBrains Mellum v2 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43992";>#43992</a>), 
Granite Speech Plus (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43519";>#43519</a>), 
Cohere Mini Code (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44707";>#44707</a>).</li>
   <li><strong>Gemma 4</strong>: Encoder-free Unified support (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44429";>#44429</a>), 
MTP (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43241";>#43241</a>), 
native ViT linear layers (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43798";>#43798</a>), 
vision-embedder excluded from quantization (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44571";>#44571</a>), 
and fixes for MTP under TP&gt;1 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43909";>#43909</a>), 
block-table mismatch under concurrency (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43982";>#43982</a>), 
transformers-processor startup crash (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44232";>#44232</a>), 
and CPU init (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44615";>#44615</a>).</li>
   <li><strong>Transformers v5</strong>: Vendor MiniCPM-V/O processors (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44282";>#44282</a>), 
Sarvam compat (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38804";>#38804</a>), 
Voxtral <code>fetch_audio</code> for transformers≥5.10 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44559";>#44559</a>).</li>
   <li><strong>Model fixes &amp; enhancements</strong>: 
Qwen3-VL/Qwen3-omni-thinker deepstack accuracy under <code>torch.compile</code> 
(<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43617";>#43617</a>), 
EVS for Qwen3-VL (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44205";>#44205</a>), 
GLM-5.1 PP loading (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42944";>#42944</a>), 
GLM-4.1V processor logits (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43575";>#43575</a>), 
GLM-4.6V video loader (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44417";>#44417</a>), 
OlmoHybrid init (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43846";>#43846</a>), 
HyperCLOVAX remote-code removal (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43860";>#43860</a>), 
Bailing-MoE rotary factor (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43770";>#43770</a>), 
Step3 PP residual KeyError (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37622">#37622</a>), 
MiniCPM-V-4.6 video (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44509";>#44509</a>), 
MiniCPM-O audio unpadding (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38053";>#38053</a>), 
MiniCPM-V batched preprocessing (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44609";>#44609</a>), 
FunASR-Nano init (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44215";>#44215</a>), 
Cohere routing method (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44021";>#44021</a>), 
Kimi-K2.5 FlashInfer ViT metadata (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44493";>#44493</a>).</li>
   <li><strong>Multimodal</strong>: Auto-select registered video loader for 
VLMs (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44126";>#44126</a>), 
O(log n) multimodal item handling per step (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44212";>#44212</a>), 
local image encoding in benchmarks (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43843";>#43843</a>), 
interleaved custom image benchmark datasets (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43636";>#43636</a>).</li>
   <li><strong>Pooling/Classification</strong>: Proper exceptions for pooling 
UX (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44593";>#44593</a>), 
<code>extra_repr()</code> for pooler classes (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44805";>#44805</a>), 
LoRA-adapter-name pooling fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44410";>#44410</a>), 
resettled generative scoring entrypoint (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44153";>#44153</a>), 
expanded pooler unit tests (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43818";>#43818</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44471";>#44471</a>).</li>
   <li><strong>Refactor</strong>: AutoWeightsLoader for InternLM2 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38278";>#38278</a>).</li>
   </ul>
   <h3>Engine Core</h3>
   <ul>
   <li><strong>Model Runner V2</strong>: Default for Llama and Mistral dense 
models (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43458";>#43458</a>), 
FlashInfer sampler (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42472";>#42472</a>), 
breakable CUDA graphs (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44050";>#44050</a>), 
removed Eagle's dedicated CUDA graph pool (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44078";>#44078</a>), 
pipeline-parallel bubble elimination (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42187";>#42187</a>), 
kernel block size for hybrid models (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38831";>#38831</a>), 
zeroing of freshly allocated KV blocks for hybrid + FP8 KV cache (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43990";>#43990</a>), 
actual batch <code>max_seq_len</code> for attention metadata (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43991">#43991</a>), 
rejection-sampling acceptance-rate fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40651";>#40651</a>), 
KVConnector + PP cleanup (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43732";>#43732</a>), 
speculator-prefill warmup/capture (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44253";>#44253</a>).</li>
   <li><strong>Speculative decoding (DFlash)</strong>: Causal DFlash (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43445";>#43445</a>), 
proper lookahead-slot allocation (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43733";>#43733</a>), 
prefix-cache corruption fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42971";>#42971</a>); 
independent drafter attention-backend selection (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39930";>#39930</a>), 
attention-group split by <code>num_heads_q</code> for drafts (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43543";>#43543</a>), 
EAGLE/MTP lookahead caching in the SWA prefix-cache mask (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44082";>#44082</a>).</li>
   <li><strong>Attention &amp; hybrid/Mamba</strong>: 
FlexAttention/FlashAttention num-blocks-first layouts (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42095";>#42095</a>), 
OOT MLA prefill backend registration (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43325";>#43325</a>), 
FlashAttention upstream sync (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44065";>#44065</a>), 
Mamba LINEAR attention-module refactor (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43556";>#43556</a>), 
corrupted MLA + linear attention fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43961";>#43961</a>), 
KDA conv-state unification (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44539";>#44539</a>) 
and gate/cumsum fusion (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43667";>#43667</a>), 
Mamba SSD <code>do_not_specialize</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43803">#43803</a>), 
Qwen3.5 mixed prefill+decode split routing (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44700";>#44700</a>), 
MiniMax-M2 gate kernel (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38445";>#38445</a>).</li>
   <li><strong>KV cache &amp; scheduler</strong>: Pluggable 
<code>KVCacheSpec</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37505";>#37505</a>), 
<code>scheduler_block_size</code> threaded into KVCacheManager/Coordinator (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44165";>#44165</a>), 
<code>max_concurrent_batches</code> moved to <code>VllmConfig</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44274";>#44274</a>), 
config validation rejecting 0/negative knobs (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43794";>#43794</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44057";>#44057</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44207";>#44207</a>), 
KV-cache scale boilerplate removed from weight loading (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43167";>#43167</a>).</li>
   <li><strong>Core</strong>: Freeze the garbage collector in workers after 
model init (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44363";>#44363</a>), 
sparse NCCL weight transfer for in-place updates (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40096";>#40096</a>), 
graceful spinloop ext-load failure handling (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43659";>#43659</a>), 
scheduled-function deprecations (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43358";>#43358</a>).</li>
   </ul>
   <h3>Large Scale Serving &amp; Distributed</h3>
   <ul>
   <li><strong>KV cache offloading</strong>: Object-store secondary tier (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41968";>#41968</a>), 
HMA on by default for capable connectors (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41847";>#41847</a>) 
and tiering (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44287";>#44287</a>), 
per-request offloading policy (<code>on_new_request</code>) (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43205";>#43205</a>) 
and <code>on_schedule_end()</code> hook (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44206";>#44206</a>), 
token-offset selective offload (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39983";>#39983</a>), 
skip decode-phase blocks in CPU offload (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43797";>#43797</a>), 
page-size block alignment (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43689";>#43689</a>), 
Triton fast-path for small CPU→GPU <code>swap_blocks_batch</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42212";>#42212</a>), 
stale sliding-window block fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42959";>#42959</a>).</li>
   <li><strong>KV connectors / disaggregated serving</strong>: PP-aware 
handshake aggregation and intermediate-PP output plumbing (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43720";>#43720</a>), 
multiple-async-KV-load deadlock fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44560";>#44560</a>), 
Nixl Mamba prefix-caching mode (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42554";>#42554</a>), 
NixlConnector <code>kv_both</code> role deprecation cycle (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43874";>#43874</a>), 
Mooncake fixes (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43742";>#43742</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44103";>#44103</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42694";>#42694</a>), 
LMCache <code>LMCacheMPConnector</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42865";>#42865</a>), 
EC connector shutdown API (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42423";>#42423</a>) 
and non-blocking lookup (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41627";>#41627</a>), 
KV-transfer tokens excluded from <code>iteration_tokens_total</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43346";>#43346</a>).</li>
   <li><strong>EPLB</strong>: Async EPLB by default (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43219";>#43219</a>), 
EPLB for DeepSeek-V4 Mega-MoE (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43339";>#43339</a>), 
Nixl zero-copy EPLB transfers (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41633";>#41633</a>).</li>
   <li><strong>Data parallel</strong>: DP Ray placement groups on specific 
nodes (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44669";>#44669</a>) 
and grouped-node allocation fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43998";>#43998</a>), 
SSL for the DP supervisor (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43688";>#43688</a>), 
DP-coordinator startup timeout raised to 120s (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42343";>#42343</a>), 
per-GPU-worker RDMA NIC selection (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42083";>#42083</a>).</li>
   </ul>
   <h3>Hardware &amp; Performance</h3>
   <ul>
   <li><strong>NVIDIA / kernels</strong>: FP8 FlashInfer attention for ViT (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38065";>#38065</a>), 
Triton MoE backend on Hopper by default (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44220";>#44220</a>), 
CUTLASS FP8 scaled-mm padding bypass (+20%) (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43706";>#43706</a>), 
MoE-permute buffer pre-allocation (+9–14%) (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43014";>#43014</a>), 
<code>Fp8BlockScaledMM</code> <code>new_empty()</code> optimization (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43677";>#43677</a>), 
TurboQuant shared dequant buffers (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40941";>#40941</a>), 
tuned <code>selective_state_update</code> for H200/RTX PRO (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44251";>#44251</a>), 
Inductor fast-path fallback for vLLM/AITER custom ops (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42129";>#42129</a>), 
Gemma RMS all-reduce fusion (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42646";>#42646</a>), 
NUMA auto-binding on DGX B300 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43270";>#43270</a>).</li>
   <li><strong>AMD ROCm</strong>: ROCm 7.2.3 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43136";>#43136</a>), 
AITER v0.1.13.post1 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44265";>#44265</a>), 
native W4A16 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41394";>#41394</a>) 
and fused-MoE W4A16 HIP (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44075";>#44075</a>) 
kernels for RDNA3 (gfx1100), AITER top-k/top-p sampler by default (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43331";>#43331</a>), 
attention-sink support in AITER FA (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43817";>#43817</a>), 
AITER hipBLASLt GEMM online tuning (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40426";>#40426</a>), 
<code>permute_cols</code> for ROCm (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44674";>#44674</a>), 
blocks-first KV layout for AMD (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43660">#43660</a>), 
N=5 wvSplitK for spec 
decode (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40687";>#40687</a>), 
MoRI connector improvements (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43303";>#43303</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41751";>#41751</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40344";>#40344</a>).</li>
   <li><strong>Intel XPU</strong>: vllm-xpu-kernel v0.1.7 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41019";>#41019</a>), 
<code>block_fp8_moe</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42139";>#42139</a>), 
block-scaled W8A8 FP8 path (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39968";>#39968</a>), 
WNA16 oracle for GPTQ sym-int4 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41426";>#41426</a>), 
rms_norm/act quant fusions (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43963";>#43963</a>), 
GDN-attention MTP (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43565";>#43565</a>), 
Triton selective-scan op (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43421";>#43421</a>), 
transparent sleep mode (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37149";>#37149</a>), 
CPU/tiering offloading on XPU (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/36423">#36423</a>), 
DeepSeek-V4 attention decode path (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42953";>#42953</a>).</li>
   <li><strong>CPU &amp; other architectures</strong>: zentorch-accelerated 
W8A8/W4A16 on AMD Zen CPUs (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41813";>#41813</a>), 
CPU top-k/top-p Triton sampling (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43633";>#43633</a>), 
non-divisible GQA decode in mixed batches (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43032";>#43032</a>), 
<code>cpu_awq</code> folded into <code>awq_marlin</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43841";>#43841</a>), 
RISC-V RVV WNA16 helpers (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42730";>#42730</a>), 
fused GDN gated-delta-rule kernels (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43534";>#43534</a>), 
PowerPC SHM communicator (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43754";>#43754</a>), 
arm64 CI image (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41303">#41303</a>).</li>
   <li><strong>TPU</strong>: tpu-inference upgraded to v0.20.0 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43394";>#43394</a>) 
then v0.21.0 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44621";>#44621</a>).</li>
   <li><strong>torch stable ABI</strong>: Continued migration of kernels to the 
libtorch stable ABI — merge_attn_states/mamba/sampler [8/n] (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43361";>#43361</a>), 
attention/cache kernels [9/n] (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43717";>#43717</a>), 
header files (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44013";>#44013</a>), 
cuda_view/silu_and_mul [10/n] (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44334";>#44334</a>), 
custom all-reduce/DeepSeek-V4 fused MLA/MXFP8 MoE [10b/n] (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44365";>#44365</a>); 
ROCm fallback to regular ABI (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44648";>#44648</a>), 
<code>_has_module</code> trial-import verification (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44035";>#44035</a>).</li>
   </ul>
   <h3>Quantization</h3>
   <ul>
   <li><strong>ModelOpt</strong>: LM-head quantization (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42124";>#42124</a>), 
MXFP8 non-gated MoE (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42958";>#42958</a>).</li>
   <li><strong>compressed-tensors</strong>: WNA8O8Int linears and WNInt 
embeddings (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44340";>#44340</a>), 
asymmetric MoE WNA16 Marlin (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44025";>#44025</a>), 
single-class NVFP4 linear refactor (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42443";>#42443</a>).</li>
   </ul>
   </blockquote>
   <p>... (truncated)</p>
   </details>
   <details>
   <summary>Commits</summary>
   <ul>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/0fc695fc6d1d82e9a5ac6835ac8e4e1c83703665";><code>0fc695f</code></a>
 [Bugfix][Frontend] Cap fastapi &lt; 0.137 to avoid 
prometheus-fastapi-instrument...</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/91df0fad4dc98a67c7659d9dbd915245d5c43d96";><code>91df0fa</code></a>
 [Bugfix][CPU] Don't build triton-cpu on arm64 release image (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/45401";>#45401</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/78743ab5bffd381e88f97e1c8ba20473b0ae6d75";><code>78743ab</code></a>
 [Docker] Fix CUTLASS DSL cu13 install order in Dockerfile (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/45204";>#45204</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/b2d7294b0f1f2f1527e2b4dd5a3cdc703c0a3440";><code>b2d7294</code></a>
 [ROCm][Bugfix] Make intermediate_pad TP-aware in rocm_aiter_fused_experts (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/4";>#4</a>...</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/741ba421d847b76225379bce2a82b04fe88cc2d4";><code>741ba42</code></a>
 [Bugfix] [DSV4] [ROCm] Pin apache-tvm-ffi version to <code>0.1.10</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/45169";>#45169</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/ac94893da37b35b19c8e72c1f91ccfcecfe37bb5";><code>ac94893</code></a>
 [ROCm][MLA][Bugfix] Reserve FP8 prefill workspace before lock for Kimi-K2.5 
(...</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/967c5c3bc38891f4465d3f4e99917ed837bb3833";><code>967c5c3</code></a>
 [ROCm][CI] Stage C mirrors (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42793";>#42793</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/54c660c3a6c30c79e6457a162a19f4ed68042743";><code>54c660c</code></a>
 [XPU][Minor] format moe kernel name and add in kernel list (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44771";>#44771</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/8fb0274415062912ec56c6629a04cbac60121fe3";><code>8fb0274</code></a>
 [MM][CG] Simplify ViT CUDA graph interfaces (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/44484";>#44484</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/eebce65756f0beb08547439728da98b5e1a7119c";><code>eebce65</code></a>
 [XPU]feat: add DeepSeek-V4 XPU attention decode path (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42953";>#42953</a>)</li>
   <li>Additional commits viewable in <a 
href="https://github.com/vllm-project/vllm/compare/v0.10.1.1...v0.23.0";>compare 
view</a></li>
   </ul>
   </details>
   <br />
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=vllm&package-manager=pip&previous-version=0.10.1.1&new-version=0.23.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   <details>
   <summary>Dependabot commands and options</summary>
   <br />
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   You can disable automated security fix PRs for this repo from the [Security 
Alerts page](https://github.com/apache/beam/network/alerts).
   
   </details>


[PR] Bump vllm from 0.10.1.1 to 0.23.0 in /sdks/python/container/ml/py312 [beam]
