[PR] Bump vllm from 0.10.1.1 to 0.22.0 in /sdks/python/container/ml/py311 [beam]

via GitHub Fri, 12 Jun 2026 06:39:31 -0700


dependabot[bot] opened a new pull request, #38944:
URL: https://github.com/apache/beam/pull/38944


   Bumps [vllm](https://github.com/vllm-project/vllm) from 0.10.1.1 to 0.22.0.
   <details>
   <summary>Release notes</summary>
   <p><em>Sourced from <a 
href="https://github.com/vllm-project/vllm/releases";>vllm's 
releases</a>.</em></p>
   <blockquote>
   <h2>v0.22.0</h2>
   <h2>Highlights</h2>
   <p>This release features 459 commits from 230 contributors (63 new)!</p>
   <ul>
   <li><strong>DeepSeek V4 maturity</strong>: DeepSeek V4 received a major 
hardening pass this cycle — the model was reorganized into a dedicated 
<code>vllm/models/deepseek_v4/</code> package (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43004";>#43004</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43039";>#43039</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43073";>#43073</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43077";>#43077</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43149";>#43149</a>), 
gained NVFP4 fused MoE support (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42209";>#42209</a>), 
full + piecewise CUDA graph (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42604";>#42604</a>), 
and MTP speculative decoding (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43385";>#43385</a>). 
A large set of fused kernels (MegaMoE, <code>mhc</code>, Q-norm, 
indexer, sparse MLA) and ROCm parity fixes landed alongside accuracy 
fixes (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42810";>#42810</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43710";>#43710</a>).</li>
   <li><strong>Model Runner V2 advances toward default</strong>: MRv2 is now 
default for Qwen3 dense models. vLLM will fall back to MRv1 for features that 
aren't yet supported in MRv2 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39337">#39337</a>). 
MRv2 also gained sleep-mode weight reload (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42673";>#42673</a>), 
<code>update_config</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42783";>#42783</a>), 
and shared KV-cache layers (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/35045";>#35045</a>), 
plus many correctness fixes.</li>
   <li><strong>Experimental Rust frontend</strong>: A new Rust front-end 
integration landed (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40848";>#40848</a>), 
with the implementation moved into the tree (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43283";>#43283</a>) 
and a DP Supervisor for data-parallel serving (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40841";>#40841</a>).</li>
   <li><strong>Batch invariance, faster</strong>: Batch-invariant inference 
gained Cutlass FP8 support for a <strong>28.9% end-to-end latency 
improvement</strong> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40408";>#40408</a>), 
compile-mode support on SM80 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42456";>#42456</a>), 
and an NVFP4 Cutlass linear path (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39912";>#39912</a>).</li>
   <li><strong>Multi-tier KV cache offloading</strong>: A new multi-tier KV 
cache offloading framework (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40020";>#40020</a>) 
with a Python filesystem secondary tier (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41735";>#41735</a>), 
DSv4 support (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43142";>#43142</a>), 
and Mooncake disk offloading (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42689";>#42689</a>) 
extends offloading beyond CPU memory.</li>
   </ul>
   <h3>Model Support</h3>
   <ul>
   <li>New architectures: MiniCPM-V 4.6 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41254";>#41254</a>), 
InternS2 Preview (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42705";>#42705</a>), 
OpenVLA (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42654";>#42654</a>), 
MolmoWeb <code>hf_overrides</code> docs (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42163";>#42163</a>); 
EXAONE-4.5 aligned with Transformers update (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42246";>#42246</a>).</li>
   <li>Speculative decoding: custom callable proposer backend (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39487";>#39487</a>), 
post-norm EAGLE-3 speculators (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42764";>#42764</a>), 
peagle speculators (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41826";>#41826</a>), 
hybrid-attention models in <code>extract_hidden_states</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39949";>#39949</a>), 
non-MTP speculation for NemotronH (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43130";>#43130</a>), 
shared MTP weights in MRv2 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42538";>#42538</a>).</li>
   <li>DeepSeek V4: NVFP4 MoE (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42209";>#42209</a>), 
CUDA graph full/piecewise (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42604";>#42604</a>), 
MTP (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43385";>#43385</a>), 
model package refactor (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43004";>#43004</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43039";>#43039</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43073";>#43073</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43077";>#43077</a>), 
sparse MLA + compressor refactor (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43149";>#43149</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43710";>#43710</a>), 
MegaMoE input-prep kernel move (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43632";>#43632</a>).</li>
   <li>Qwen3.5/3.6: GDN output-projection flatten (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42311";>#42311</a>), 
GatedDeltaNet Marlin TP≥2 fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/36329";>#36329</a>), 
ViT full CUDA graph (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42151";>#42151</a>), 
runai-streamer weight loading for Qwen3.5/MTP/Qwen3-VL (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42521";>#42521</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42716";>#42716</a>), 
KDA chunk-prefill exp2 semantics (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43195";>#43195</a>).</li>
   <li>Gemma3/Gemma4: mixed-resolution image co-batching crash fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42217";>#42217</a>), 
MoE routing closure fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42250";>#42250</a>), 
tool-parser float-corruption fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42128";>#42128</a>), 
batched vision encoder for image/video (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43169";>#43169</a>), 
multi-GPU fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42630";>#42630</a>).</li>
   <li>Kimi-K2.5: skip vision-tower dtype conversion under quantization (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42869";>#42869</a>), 
<code>mm_projector</code> dtype fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42081";>#42081</a>).</li>
   <li>Cohere: enable Cohere MoE (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43143";>#43143</a>), 
pipeline parallelism for Cohere vision (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42819";>#42819</a>).</li>
   <li>Tool calling: Apertus tool parser (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41154";>#41154</a>), 
Qwen3Coder <code>anyOf</code>/<code>oneOf</code>/<code>$ref</code> resolution 
re-land (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37831";>#37831</a>), 
shared <code>coerce_to_schema_type</code> across MiniMax-M2 / DeepSeek-V3.2 / 
Seed-OSS parsers (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43006";>#43006</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43019";>#43019</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43140";>#43140</a>).</li>
   <li>ViT CUDA graph: Qwen2-VL (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41736";>#41736</a>), 
Step3-VL encoder (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42224";>#42224</a>), 
Qwen3.5 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42151";>#42151</a>), 
FlashInfer metadata for Qwen2.5-VL vision attention (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42787";>#42787</a>).</li>
   </ul>
   <h3>Engine Core</h3>
   <ul>
   <li>Model Runner V2: Qwen3-dense-by-default oracle (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39337";>#39337</a>), 
sleep-mode reload weights (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42673";>#42673</a>), 
<code>update_config</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42783";>#42783</a>), 
shared KV-cache layers (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/35045";>#35045</a>), 
FP32 gumbel sampling (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41775";>#41775</a>), 
auto-fallback to MRv1 with connectors (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42955";>#42955</a>), 
<code>logprob_token_ids</code> correctness (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43125";>#43125</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41761";>#41761</a>), 
prompt-logprobs size fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42778">#42778</a>).</li>
   <li>KV offloading: multi-tier framework (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40020";>#40020</a>), 
Python filesystem secondary tier (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41735";>#41735</a>), 
DSv4 support (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43142";>#43142</a>), 
tier-offload follow-up (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42529";>#42529</a>), 
prefer HND layout (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41928";>#41928</a>), 
<code>reset_cache()</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41956";>#41956</a>), 
per-request tracking (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42507";>#42507</a>), 
store-deferral fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41945";>#41945</a>).</li>
   <li>MoE refactor: <code>ExpertMapManager</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41046";>#41046</a>), 
experts moved to <code>experts/</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42334";>#42334</a>), 
<code>RoutedExperts</code> alias for FusedMoE (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40735";>#40735</a>), 
EPLB refactoring for FusedMoE (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41055";>#41055</a>).</li>
   <li>Mamba: attention module refactor (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41126";>#41126</a>), 
Mamba2 SSD kernel warmup (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39822";>#39822</a>), 
bf16 SSM cache (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41680";>#41680</a>), 
GPU-side state postprocessing fused kernel (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40172";>#40172</a>), 
run single-token extends as decodes (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42430";>#42430</a>).</li>
   <li>KV events: emit KV cache metadata (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40984";>#40984</a>).</li>
   <li>Allocator: manual cumem allocator enable (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/33648";>#33648</a>), 
stream-aware free callback (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43020";>#43020</a>).</li>
   <li>elastic-EP: stage/commit MoE quant method on reconfigure (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40881";>#40881</a>).</li>
   </ul>
   <h3>Hardware &amp; Performance</h3>
   <ul>
   <li><strong>NVIDIA Blackwell / SM12x</strong>: FlashInfer b12x MoE + FP4 
GEMM for SM120/121 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40082";>#40082</a>), 
per-tensor FP8 CUTLASS on SM12.1 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41215";>#41215</a>), 
<code>head_dim=512</code> for FlashInfer TRTLLM attention (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38822";>#38822</a>), 
FlashInfer Blackwell GDN prefill (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40717";>#40717</a>), 
GDN prefill kernel for SM100 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43273";>#43273</a>).</li>
   <li><strong>Performance</strong>: batch-invariant Cutlass FP8 (+28.9% E2E) 
(<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40408";>#40408</a>), 
CutlassFP8 padding pre-processing (+13.5% TTFT) (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42651";>#42651</a>), 
padded NVFP4 quant kernel (+2.4–5.7% E2E) (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42774";>#42774</a>), 
GPU&lt;-&gt;CPU sync elimination 1/n (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41429";>#41429</a>) 
and 4/n (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42347";>#42347</a>), 
fused RoPE+KVCache+q_concat for MLA (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40392";>#40392</a>), 
MLA <code>compute_prefill_context</code> / <code>_v_up_proj</code> 
optimizations (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42460";>#42460</a>, 
<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42561";>#42561</a>), 
penalties Triton kernel (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40657";>#40657</a>), 
<code>do_not_specialize</code> in fused FP8 RoPE (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42849";>#42849</a>), 
FULL CUDA graph capture for TRITON_MLA decode (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42885";>#42885</a>).</li>
   <li><strong>AMD ROCm</strong>: DSV4 functionality + accuracy fixes (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42810";>#42810</a>, 
<a href="https://redirect.github.com/vllm-project/vllm/issues/43679";>#43679</a> 
Tilelang MHC), flash sparse MLA Triton kernels (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41812";>#41812</a>), 
gluon paged MQA logits on gfx950/MI355X (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42062";>#42062</a>), 
RMSNorm+Quant fusion for gfx950 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41825";>#41825</a>), 
AITER FA backend cleanup (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41942";>#41942</a>), 
XGMI backend for MoRI connector (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41753";>#41753</a>), 
QuickReduce min-size override (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41675";>#41675</a>), 
DSV4 MTP (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43385">#43385</a>).</li>
   <li><strong>CPU / RISC-V</strong>: RVV-optimized attention kernels for 
RISC-V Vector Extension (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40119";>#40119</a>) 
with VLEN=256 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42943";>#42943</a>), 
fused GDN for AMX CPU (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42707";>#42707</a>), 
MXFP4 W4A16 MoE (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41922";>#41922</a>), 
experimental Triton + MRv2 on CPU (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43225";>#43225</a>), 
improved CPU thread utilization (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42666";>#42666</a>), 
<code>--cpu-distributed-timeout-seconds</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42968";>#42968</a>).</li>
   <li><strong>Intel XPU</strong>: GPTQ int4 support (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37844";>#37844</a>), 
mxfp8 MoE (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41918";>#41918</a>), 
FP8 block-scaled quantization (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42952";>#42952</a>), 
custom-op collective behavior (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41354";>#41354</a>), 
multiple sparse-attention kernels (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/37888";>#37888</a>), 
MoE topk routing + MXFP4 fallback (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42951";>#42951</a>), 
CT W4A4 MXFP4 path (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/38896";>#38896</a>), 
reduced XPU MoE host overhead (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42915";>#42915</a>).</li>
   <li><strong>Kernel ABI</strong>: continued migration to libtorch stable ABI 
— 5/n (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42339";>#42339</a>), 
6/n (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42663";>#42663</a>), 
7/n (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43209";>#43209</a>).</li>
   <li><strong>Experimental</strong>: breakable CUDA graph (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42304";>#42304</a>).</li>
   </ul>
   <h3>Large Scale Serving</h3>
   <ul>
   <li>Disaggregated serving (NIXL): lease-renewal TTL for KV blocks on P (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41383";>#41383</a>), 
handshake-failure policy honoring (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40364";>#40364</a>), 
GDN support for PD with NIXL (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41869";>#41869</a>), 
multi-node TP&gt;8 fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39907";>#39907</a>), 
side-channel host-selection fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41806";>#41806</a>).</li>
   <li>Mooncake: disk offloading in MooncakeStoreConnector (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42689";>#42689</a>), 
HMA support for DSV4 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42828";>#42828</a>), 
operation metrics (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43392";>#43392</a>), 
load-failure propagation (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42788";>#42788</a>), 
block-aligned full hits (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43494";>#43494</a>), 
finish-after-preemption handling (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43281";>#43281</a>).</li>
   <li>Data parallel: DP Supervisor (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40841";>#40841</a>), 
publish request counts at engine-step start (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41626";>#41626</a>), 
forward <code>X-data-parallel-rank</code> header (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42330";>#42330</a>).</li>
   <li>EPLB: change default EPLB communicator (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43110";>#43110</a>), 
VLM-wrapper init fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39805";>#39805</a>), 
remove dead <code>torch.accelerator.synchronize()</code> (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40733";>#40733</a>).</li>
   <li>LoRA: one-shot Triton kernel for MoE LoRA (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42290";>#42290</a>), 
simultaneous 2D &amp; 3D MoE LoRA adapters (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42242";>#42242</a>), 
reduced 2D-weight memory under EP (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42737";>#42737</a>), 
MoE LoRA align-kernel grid fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/40131";>#40131</a>).</li>
   </ul>
   <h3>Quantization</h3>
   <ul>
   <li><strong>MXFP4</strong>: linear layers + compressed-tensors integration 
(<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41664";>#41664</a>), 
CPU W4A16 MoE (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41922";>#41922</a>), 
XPU mxfp8 MoE (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/41918";>#41918</a>).</li>
   <li><strong>NVFP4</strong>: DeepSeek V4 fused MoE (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42209";>#42209</a>), 
ModelOpt W4A16 NVFP4 fused MoE + mixed-precision dispatch (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/42566";>#42566</a>), 
batch-invariant NVFP4 Cutlass linear (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/39912";>#39912</a>), 
FlashInfer TRTLLM NvFP4 monolithic MoE routing fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43223";>#43223</a>), 
TRTLLM NVFP4 MoE chunking fix (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43599";>#43599</a>).</li>
   </ul>
   <!-- raw HTML omitted -->
   </blockquote>
   <p>... (truncated)</p>
   </details>
   <details>
   <summary>Commits</summary>
   <ul>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/0b3ba88f165976e77ca5e6a7a3f5bba4562b80af";><code>0b3ba88</code></a>
 Revert &quot;[CPU] Experimentally enable Triton and MRV2 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43225";>#43225</a>)&quot;</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/799c3afa5d5b17b676d04e0b58a5628943bb4003";><code>799c3af</code></a>
 [BugFix] Fix hard-coded timeout for multi-API-server startup (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43768";>#43768</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/64e25235c783bfccf88ae6e5164581cbceebc199";><code>64e2523</code></a>
 [Bugfix] Pass <code>routed_scaling_factor</code> to FlashInfer TRTLLM BF16 MoE 
(<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43769";>#43769</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/a147dd011505af2a266db04be622cf4979da12e0";><code>a147dd0</code></a>
 [ROCm][DSV4] Enable Tilelang MHC replacing torch/triton mhc (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43679";>#43679</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/075929351285f22530265d78fa56fa530f3fd7b3";><code>0759293</code></a>
 [Bugfix][Kernel] TRTLLM NVFP4 MoE chunking (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43599";>#43599</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/a930f5a58d101dcad3aa4a5945d3918ba350b9eb";><code>a930f5a</code></a>
 Fix RunAI streamer tensor buffer reuse during weight loading (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43464";>#43464</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/40cf0206bae58400bd20a16320ed90309eae4311";><code>40cf020</code></a>
 Fix early CUDA init (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43791";>#43791</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/8c4061336a405b84746e71f10f9cdc45e6573d3e";><code>8c40613</code></a>
 [misc] Bump cutedsl version to 4.5.2 (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43745";>#43745</a>)</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/5ebdf473c5ad4fee78f0728a75ec280dfb00482e";><code>5ebdf47</code></a>
 [Bugfix] Map reasoning_effort to enable_thinking in chat template kwargs (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43";>#43</a>...</li>
   <li><a 
href="https://github.com/vllm-project/vllm/commit/a94cd6d98fd1709f2ad3916274f0f3a88e01485a";><code>a94cd6d</code></a>
 [MRV2][BugFix] Fix KV connector handling in spec decode case (<a 
href="https://redirect.github.com/vllm-project/vllm/issues/43719";>#43719</a>)</li>
   <li>Additional commits viewable in <a 
href="https://github.com/vllm-project/vllm/compare/v0.10.1.1...v0.22.0";>compare 
view</a></li>
   </ul>
   </details>
   <br />
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=vllm&package-manager=pip&previous-version=0.10.1.1&new-version=0.22.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   <details>
   <summary>Dependabot commands and options</summary>
   <br />
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   You can disable automated security fix PRs for this repo from the [Security 
Alerts page](https://github.com/apache/beam/network/alerts).
   
   </details>
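
The `ignore this major version` and `ignore this minor version` commands listed above depend on which version component a bump changes. As a rough illustration only (this is not Dependabot's actual logic, and `parse_version` and `classify_bump` are hypothetical helper names), the sketch below classifies the bump in this PR by the most significant component that differs. It assumes purely numeric dotted versions such as `0.10.1.1`.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Split a dotted numeric version such as '0.10.1.1' into integers."""
    return tuple(int(part) for part in text.split("."))


def classify_bump(old: str, new: str) -> str:
    """Name the most significant component that differs between two versions."""
    old_parts, new_parts = parse_version(old), parse_version(new)
    # Pad the shorter tuple with zeros so '0.22.0' and '0.10.1.1' line up.
    width = max(len(old_parts), len(new_parts))
    old_parts += (0,) * (width - len(old_parts))
    new_parts += (0,) * (width - len(new_parts))
    labels = ["major", "minor", "patch"]
    for index, (a, b) in enumerate(zip(old_parts, new_parts)):
        if a != b:
            # Components beyond the third have no conventional name here.
            return labels[index] if index < len(labels) else "post-patch"
    return "none"


print(classify_bump("0.10.1.1", "0.22.0"))  # minor
```

Under this reading the bump changes the minor component (10 to 22) while the major component stays at 0, so `@dependabot ignore this minor version` would be the narrower of the two commands for a reviewer who wants to defer this update.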


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


[PR] Bump vllm from 0.10.1.1 to 0.22.0 in /sdks/python/container/ml/py311 [beam]
