(tvm) branch refactor/tmem-pool-zero-arg-warp-guard deleted (was 0ea33f5ca5)

bohan Sat, 25 Apr 2026 15:00:48 -0700

This is an automated email from the ASF dual-hosted git repository.

spectrometerHBH pushed a change to branch refactor/tmem-pool-zero-arg-warp-guard
in repository https://gitbox.apache.org/repos/asf/tvm.git



     was 0ea33f5ca5 refactor(tirx): drop TMEMPool warp/wg id parameters via 
thread_rank guard

This change permanently discards the following revisions:

 discard 0ea33f5ca5 refactor(tirx): drop TMEMPool warp/wg id parameters via 
thread_rank guard
 discard 9e4952fb5b refactor(exec-scope): merge TIRx lowering passes + RFC v3 
ExecContext + closed-enum cleanup (#594)
 discard 7bcba3c511 fix(intrinsics): value-init tcgen05/wgmma descriptor unions 
(#593)
 discard 3d5ff14b8b chore(infra): fix repo-wide pre-commit issues
 discard a52355e673 Fix post-rebase compatibility with upstream main
 discard bd58a0c2a2 feat(op-dispatch): overhaul CUDA TMA dispatch and accept 
OOB contract on s2g (#589)
 discard b0374e7ee9 FP8 Groupgemm Tile (#586)
 discard 8730458e15 refactor(op): declarative intrinsic schema + rename 
device_native_codegen→intrinsics (#592)
 discard ca9e014aab refactor(op): kwarg-only MMA family wrappers + MMA codegen 
rule tables (#591)
 discard 7e4e9ad652 refactor(op): self-documenting kwargs + enum validation 
across PTX intrinsics (#590)
 discard 97507d353e feat(pipeline): propagate MBarrier name to underlying 
buffer (#587)
 discard 68ff46b5e5 refactor(tirx): simplify TMEMPool/SMEMPool API and move to 
tirx/lang (#584)
 discard 65208206b9 feat(pipeline): add Pipe and PipeCursor barrier pair 
abstractions (#582)
 discard 85541fd21e fix(op-dispatch): add vec2 vectorized cast to local_view 
non-full paths (#583)
 discard 1db335f7f4 feat(tirx): add wg-local helper and local-view elementwise 
dispatch (#581)
 discard 63437f32bd refactor(tirx): simplify TMEMPool and add TMEM helpers 
(#579)
 discard 4b6ef8bdb2 feat(tirx): add alloc_mma for inferred MMA shared layouts 
(#577)
 discard 5d3663226f feat(op-dispatch): support cta_group=2 M=128 (Layout B / 
2x2 datapath) in gemm_async (#576)
 discard 05409ee2bd fix(op): add missing srcs/dsts properties to PermuteDims 
(#575)
 discard d25ca30b0b feat(op-dispatch): add 
cp.async.bulk.shared::cluster.shared::cta PTX instruction and DSMEM dispatch 
(#574)
 discard 658c4791b2 fix(codegen): add #pragma unroll 1 before while loops (#573)
 discard 23f934fb84 refactor(buffer): introduce 2x2 partition/subregion API and 
eliminate _source_region hack (#569)
 discard 95b76ec722 Support cp.async and tile::gather4 alternatives for 
GroupGEMM (#568)
 discard 991b2a0e97 refactor(op-dispatch): split dispatch files into 
per-variant folder structure (#567)
 discard 7dc301c059 refactor(tirx): reorganize module structure into operator/, 
lang/, and flatten bench
 discard 8eaea35536 refactor(tirx): complete infra cleanup and stabilize tirx 
test env (#564)
 discard addfa6e30c feat(op): add cuda_copy_bytes device op for typed N-byte 
load/store (#565)
 discard 617bf50029 refactor(layout): rename s_tir Layout to SLayout, tirx 
TLayout to Layout (#563)
 discard f0da4e945d refactor(tirx): remove LetStmt backward-compat, use leaf 
Bind directly (#562)
 discard 995e7b3b69 docs(txdev): add rebase feature reflections
 discard b9d7af0a67 fix(infra): rename hgemm to fp16_bf16_gemm in tir-bench 
command
 discard df96abdceb fix(tirx): restore LetStmt backward-compat function with 
body support
 discard 473109bcd6 chore(lint): fix pre-push hook formatting and lint issues
 discard edd6ac262f refactor: rebase tirx onto apache/tvm main (tir->tirx 
namespace)
 discard ce3f3f704f chore(lint): exclude .txdev and .claude from pre-commit 
hooks (#560)
 discard 55b8a6c1a3 feat(tirx): add cuda_warp_reduce and cuda_cta_reduce 
intrinsics (#558)
 discard fb2cb63285 fix(tvmscript): handle range() at AST level in visit_for 
and reject bare list/tuple assignment (#559)
 discard eea87dceeb feat(op-dispatch): deduplicate tensormaps and smem 
descriptors at dispatch time (#557)
 discard bc2ee71978 chore(infra): add /tir-build, /tir-test, /tir-bench slash 
commands (#556)
 discard cb3a8229fe feat(bench): add ProtonContext, bench_tk(), and remove 
do_bench (#555)
 discard ad9ec65900 refactor(test): replace TIRX_KERNELS_PATH sys.path hack 
with tirx_kernels imports (#554)
 discard 9f3ae2621e feat(tvmscript): add vector annotation syntax, dtype 
shorthands, and smem/tmem aliases (#553)
 discard dfa98c3435 feat(op-dispatch): add silu, fma, warp shuffle reduce, and 
binary op extensions (#552)
 discard 209246b690 refactor: update imports for tirx-kernels module rename 
(kernels -> tirx_kernels) (#551)
 discard 6faca8971a [Op-schedule] Extend reduction op (#539)
 discard 0b7f84a453 chore: add .txdev entries for bench-baseline feature 
reflections (#550)
 discard e91a72b454 refactor(kernel): replace 35 sm100a test files with unified 
registry-based test_kernels.py (#549)
 discard b7b333eb77 fix(pipeline): remove nested elect_sync from TCGen05Bar and 
fix TMABar fallback (#548)
 discard 7c588b1029 feat(tvmscript): syntax sugar and pipeline/barrier 
improvements (#546)
 discard 146ccf954d refactor(op-dispatch): rename op-schedule stack to 
op-dispatch (#544)
 discard eac1c490c9 feat(lower-tirx): update lower_tirx golden TVMScript to 
flattened exec-scope-free IR (#540)
 discard 485c64869d feat(tvmscript): print strict scalar alloc_buffer with 
lint-clean baseline (#543)
 discard 0ab47abbd1 Abstract FP8 Gemm Tile (#542)
 discard 72a0a325f6 test(tvmscript): clarify Tx default-prefix regression for 
non-main IRModule (#541)
 discard ce3173d8c0 [Refactor] Remove standalone sm100 kernels (moved to 
tirx-kernels) (#538)
 discard d84e3d390a fix(kernel): relax fp8 blockwise gemm TIR vs DeepGemm 
tolerance to 2e-3 (#537)
 discard d9f16be104 [Refactor] Move sm100a kernels to tirx-kernels submodule 
(#534)
 discard 649e3bf365 feat(op-schedule): support TMEM A operand in gemm_async 
tcgen05 (#536)
 discard 2799fd3a05 feat(op-schedule): support arbitrary tile sizes and 
transposed MMA in gemm_async (#535)
 discard 78b33ba93a docs(op-schedule): add dispatch documentation with 
before/after IR examples (#531)
 discard 7c447c213b docs(infra): require an up-to-date build before tests (#532)
 discard bb90ea1d68 refactor(op-schedule): remove maps module, inline 
registrations into unary/binary (#530)
 discard 74bd453b9c fix(infra): avoid runtime NameError in OpCall type hints
 discard 8f0a95b0bb chore(infra): fix remaining pre-commit ruff violations
 discard 53f12d28a3 chore(infra): apply pre-push clang-format update
 discard 20bb357dd2 fix(infra): finalize rebased branch integration fixes
 discard ff67739f1f fix(infra): reduce internal pytest warnings in TIRX test 
run (#528)
 discard cc89e5dd46 [Op-schedule]extend unary/binary impl (#526)
 discard cdec681455 fix(ir): traverse OpCall.config in StmtVisitor/StmtMutator 
(#B00004) (#527)
 discard 6716984339 Allow alloc_buffer for tmem scope and auto-register unknown 
axes (#524)
 discard 545c2e1c00 fix(op-schedule): update stale import paths for 
tma_shared_layout (#523)
 discard 63e1b2a9dd feat(op-schedule): implement op schedule for tcgen05 UTCCP 
copy (#521)
 discard 454a8bbf58 reflect: apply approved items (adr-0, kb-0, mistake-0) 
(#522)
 discard 7123858f60 chore(infra): fix pre-commit violations across repo (#519)
 discard dc1eae19ee docs(infra): update build and test instructions in 
CLAUDE.md (#520)
 discard ac4a05d708 [Kernel] Transcribe DeepGEMM kernels and evolve 
TK-transcribed kernels for SM100a (#514)
 discard 591a7764db refactor(logging): remove compatibility adapters and 
migrate to new logging macros (#517)
 discard 0162007893 reflect: apply approved items (adr-0, kb-0, mistake-0) 
(#516)
 discard 195619bf11 fix(rebase): restore tirx compatibility and tests after 
apache/main sync
 discard 76fde241c7 test(kernel): replace deprecated ptx fence proxy calls in 
flux tests (#513)
 discard b9932831ef fix(infra): align tirx pre-commit handling with apache/main 
(#512)
 discard c907c04bd4 reflect: add knowledge, mistakes, and decisions from 
refactor-tma-test (#511)
 discard 6b0d289947 test(op-schedule): refactor TMA copy_async tests into unit 
tests (#509)
 discard 228bbfaa3c [Kernel] Transcribe 5 TK warp-specialized kernels to TIRX 
on SM100a (#506)
 discard 6ff19f6ae4 fix(infra): add cfg to allowed file types for lint check
 discard 61629bd67f [Fix] Fix TMA store coordinates and cta_mask unicast 
codegen (#507)
 discard 23279b48f8 feat(tvmscript): DeclBuffer syntax sugar + 
Buffer.partition() (#503)
 discard 6dea12beb2 fix(infra): harden exec scope infrastructure and unify 
warp_id injection (#504)
 discard 296578cb21 refactor(op): redesign PTX fence API to match PTX ISA 
instruction families (#502)
 discard 621f2d4cc3 fix: address critical bugs across multiple subsystems (#501)
 discard b726708794 feat(op): add exp2 operator and fix exp/exp2 semantics 
(#500)
 discard e8aca8e6a2 fix(kernel): relax fp8 blockwise GEMM reference comparison 
thresholds (#499)
 discard 8ebc06ff17 chore(infra): run pre-commit fixes across codebase (#498)
 discard 298731687f chore(kernel): format fp8 blockwise GEMM and improve 
assertions (#497)
 discard b8a8c1abab feat(tvmscript): scalar annotation syntax and T.let for 
LetStmt (#492)
 discard 1d19f08c3b fix(kernel): replace undefined F16_SIZE with DTYPE_SIZE in 
fp16/bf16 GEMM (#496)
 discard ce55e249e9 feat(tvmscript): replace T.macro with T.inline (Python LEGB 
scoping) (#495)
 discard 5bce802510 docs: add rule prohibiting force push to tirx branch (#494)
 discard 767f7b3c04 feat(layout): explicit axis registration + S[shape:stride] 
combinator API (#493)
 discard 29625e0537 feat(tvmscript): alloc_buffer returns Buffer + @meta_class 
decorator (#489)
 discard 5ada3bfb17 fix(tvmscript): fix bugs and remove dead code in parser & 
printer (#491)
 discard 39942a47c5 feat(lower-tirx): support alloc_buffer with thread-axis 
layout (#490)
 discard 58e44a5f44 feat(op-schedule): auto-infer shared memory size from 
PoolAllocator (#484)
 discard bdbd96fc04 fix(tvmscript): add TYPE_CHECKING wildcard import and 
__getattr__ annotation for Pyright/IDE support (#487)
 discard 3f5031d3a4 [Refactor] Extend pipeline.py with remote_view and remove 
manual descI (#480)
 discard 90dcac1982 docs(infra): update CLAUDE.md with testing and workflow 
guidelines (#488)
 discard 59f9faf931 docs(infra): slim down axe-layout skill to grep-based index
 discard 04866b5ff7 docs(infra): add CLAUDE.md with project overview and dev 
guidelines (#483)
 discard 71e891bb4f [Skill] Add axe-layout skill for Claude Code (#482)
 discard b5b33b4ed6 [Fix] Restore flashinfer 3rdparty submodule with CUDA 13 
compatibility (#481)
 discard 3c6cfa0d90 Modify deepgemm function to include warmup and repeat
 discard 61f28c0a62 [Kernel] Refactor SM100 GEMM kernels: unify 
pipeline/barrier abstractions (#478)
 discard 8688ca707a Support slicing on BufferRegion (#477)
 discard a5b504eff6 Introduce Tx.hint (#476)
 discard 92c90d8824 [Op] Support representing generic ops (#475)
 discard 53dab95939 fix rebase (#474)
 discard 90599a1758 rebased
 discard 8ed442dac9 [Refactor] Unify T.attr dict syntax and remove scope_attr 
(#471)
 discard 1dc3a9cc67 fix: post-rebase cleanup for kernel tests (#472)
 discard 5032c51d39 clean 3rdparty
 discard 1e44fb34e8 [Refactor] Fully remove SBlock dependency and carry 
annotations on ExecScopeStmt (#470)
 discard 879f5d7518 [Refactor] Decouple exec_scope from SBlock into independent 
ExecScopeStmt node (#469)
 discard 0305b47059 rebased
 discard c63698982b op-dispatch(gemm async): support block scale (#468)
 discard 49544b178c [LowerTIRx] Generate warp-uniform warp_id when lowering 
scope id (#467)
 discard 068a10d1ff [TVMScript] Unify T into Tx, TIRX print as Tx, refactor 
tirx tests (#465)
 discard 72826b40af [Op-Schedule] Support cast local view and use Tx.cast in 
nvfp4 GEMM epilogue (#462)
 discard 8446fa3ccb [Kernel] Fix vectorization of FA4 correction stage  (#464)
 discard a0c87c8da3 [Kernel] Rewrite gemm part of FA4 to avoid inline ptx (#463)
 discard e1982d0db3 [Kernel] use LPT scheduler for causal attention (#461)
 discard 6f2fe85f67 [Op] Register PTX instructions used in FA4 into device op 
(#460)
 discard 7404786c68 [Kernel] Support causal attention (#459)
 discard 185e332374 Rename Block to SBlock throughout codebase
 discard e90d333d0a [Kernel] Enlarge TMA size in FA4 kernel (#458)
 discard d4d68d5afd kernel(fa4): reformat (#456)
 discard b10a604180 [Kernel] Support GQA for FA4 (#455)
 discard 54deeddf6e [Op-Schedule] support more general patterns when lowering 
Tx.copy_async(#454)
 discard 09ac1189e9 layout: slice extend to swizzle (#453)
 discard 70f294647a Fix test_group_gemm.py (#452)
 discard bac7410b29 Fix broken pytest due to megakernel folder refactor (#451)
 discard 4dbaaf8d99 [MegaKernel] Fix perf regression introduced by tirx op 
(#450)
 discard 77f301bf13 [Megakernel] Refactor the megakernel lib with tirx-op (#448)
 discard 98ef70ef19 kernel lib: nvfp4 gemm (#442)
 discard 990f168a84 [Kernel] Align perf with FA4 on MHA (#447)
 discard b14afc1de6 [Op Dispatch] Add optimized reduction op dispatch for 
sm100a (#446)
 discard d887e8f2e5 [Kernel] Add SM100 attention kernel  (#445)
 discard 42a00eb273 refactor: device op codegen (#443)
 discard efd08a03fb rename
 discard 4221888345 kernel (qk_norm): upd (#441)
 discard 488e8b0810 kernel (qk_norm): upd (#440)
 discard 97a01cf84a kernel: qk rms_norm (#438)
 discard cc5ba896ba [Megakernel] Remove extra_args from user-api (#437)
 discard a261dffccb [megakernel] add extra prim_func in dep/f_init to the 
ir_module (#436)
 discard 16d073ebaa [Megakernel] Introduce alloc_event_tensor (#435)
 discard e1ab041391 [Update] Add CuteDSL benchmark for FA4 kernel (#434)
 discard 52bbf68977 [Kernel] Support FlashAttention4 Kernel (#433)
 discard 8caa131ef1 kernel lib: fp8 groupwise gemm (#431)
 discard 86a5ae6bce precommit (#430)
 discard 67723fd7e8 rebased
 discard 58c5c470f5 doc: b200 half gemm tutorial (#196)
 discard dbee4d9c2b [fix] potential deadlock (#429)
 discard 5e1553180d hotfix (#428)
 discard 2749d2f094 kernel lib: fp16/bf16 gemm (#427)
 discard 418487f48a [fix] pytest (#426)
 discard 00cac42758 [Megakernel] Robustly optimize the mega pass (#424)
 discard d4ab1d8234 [Megakernel] Init etensor on GPU  (#425)
 discard 7108369374 layout: direct sum (#423)
 discard 85d45ded87 nvshmem path (#422)
 discard d75f9c275f layout: refactor (#421)
 discard 4b16c2ba88 [Megakernel] eliminate unnecessary reg allocation when 
handing dep (#420)
 discard b36654811f [dev] support profiler in mega pass (#419)
 discard d424c6ef72 layout: slice (#418)
 discard 62dbef9a71 layout: api refactor (#417)
 discard 160017bf63 [Megakernel] refactor the megakernel pass (#416)
 discard 2ca30e5928 [fix] minor (#415)
 discard 49c0493bbc fix import error (#414)
 discard 8e2f7ef11f [MegaKernel] skip shape check on e2e experiment / Reduce 
size of attention-related tensor / fix model test  (#413)
 discard b8781eb6fc [MegaKernel] simplify attn-related event to make it 
cacheable (#407)
 discard a14365b8fd [fix] minor bugs (#406)
 discard f0a26ae567 rebased
 discard 940b00bde4 [Megakernel] Using tma-reduce to eliminate reduce tile 
(#404)
 discard 0c6b111ee8 [Megakernel] Add lm_head megakernel  (#405)
 discard 9fd72c1230 FlashInfer top-p sampling (#402)
 discard d5047e52fa [KVCache] Reduce BeginForward time (#401)
 discard fa967bd284 [dev] add deepgemm baseline (#400)
 discard 3f0b3a69ee [MegaKernel] Add llama3-1b impl (#399)
 discard b9b3b99318 [Megakernel] Add searching script for gemm config (#398)
 discard 3ac108a560 [MegaKernel] Fix M_split division in SplitKReduceTile (#397)
 discard de2a67b16a [MegaKernel] test_model for llama3-1b (#396)
 discard d9e9ff7b36 [MegaKernel] Replace hardcoded qwen3-32b config numbers 
(#395)
 discard aa124b7547 [MegaKernel] Fuse gate up proj and silu in moe (#394)
 discard 7e7c7e4156 fix pytest (#393)
 discard 8d3f0f5b62 [megakernel] add llama3.2-1b layer (#391)
 discard 952cd629c5 [dev] pass static fusion pass (squash) (#392)
 discard 232cdf840f [MegaKernel] improve moe e2e kernel performance (#390)
 discard 946fe4a86d [MegaKernel] fix group gemm illegal memory access(#389)
 discard 5f6b25a326 [MegaKernel] Support qwen3 30b a3b e2e serving (#387)
 discard d0cc990005 Fix subprocess issue (#386)
 discard aed95c5f00 [MegaKernel] Enable cudagraph for sglang and flashinfer moe 
baseline (#385)
 discard 0d172a8af8 op schedule: tcgen05 async gemm (#383)
 discard 0116b43eef op schedule: add permute_dims (#375)
 discard 5427d851ad [MegaKernel] improve TP=4 perf (#384)
 discard eaab32961f op schedule: remove event (#382)
 discard 804f7133e7 [Megakernel] improve small-batch static schedule moe (#381)
 discard 78e2d7eb39 [Megakernel] support unfused moe as baseline (#380)
 discard 6aeedfc941 infra: trim whitespace in codes (#379)
 discard 534a4b4502 kernel: use copy from reg to smem (#378)
 discard 1ce9a4259c arch (cuda): warpgroup sync (#376)
 discard 7a7c0a4c5f op schedule: copy tmem <--> reg, tmem offset (#377)
 discard 25bcca51d6 op schedule: refactor (#374)
 discard d049ae41f9 op schedule: copy tmem<->r/s (#372)
 discard 7fbd6165e7 [MegaKernel] Improve MOE perf on large batch (#373)
 discard db533705c1 [MegaKernel] Bring a faster solution of EP (#367)
 discard f156001ac6 infra: tile scheduler utils (#371)
 discard 95d62fd84e codegen: cta_sync and cluster_sync (#370)
 discard 81b1e1369d infra: tmem tensor, alloc/decl buffer semantic refactor 
(#366)
 discard b8a459acad [Megakernel] fusion between gate_up_proj and silu_multifly 
(#357)
 discard 8a4898136c [Megakernel] Add sglang and flashinfer into MOE unittest 
(#365)
 discard 5487fc5f3e [MegaKernel] Support dynamic schedule for MOE megakernel 
(#364)
 discard 5f36aa3184 [MegaKernel] Fuse topk reduce into group gemm (#363)
 discard 2ae13a42ee [MegaKernel] Use TMA reduce in MOE gating (#362)
 discard 2428752ac0 feat: cuda profiler with internal profiler_on (#359)
 discard 468b72be49 [Megakernel] fix ceildiv lowering (#361)
 discard b7c237af42 kernel: use partitioned_loop macro to simplify coding (#358)
 discard fe13c7277c Add single-gpu megakernel for MOE (#360)
 discard 69aca11c6e codegen: introduce some commonly used cuda utils (#355)
 discard c7504887ed op schedule: tma copy (#354)
 discard 82dd22cceb fix: prevent warp divergence in shuffle sync (#356)
 discard 6daaf79f96 [Megakernel] pack the tile info into 32bit & add task 
pre-pushing for dynamic scheduler (#341)
 discard b01f8c82ec fix pytest warnings (#353)
 discard 300aeec2d8 Add dispatch and kwargs configs; unify TMA events; update 
CUDA TMA copy and printers (#352)
 discard 1b5c05ac76 [fix] fix the frame in dedup_tensormap & add tests (#351)
 discard 4c8dbc65c9 LowerTIRp: Deduplicate identical cuTensorMap 
initializations from Tp.copy_async + tests (#350)
 discard 307a862b36 hotfix (#346)
 discard 9537de157e feat(profiler): add CudaProfiler wrapper and refactor CUDA 
tests (#345)
 discard 93f012552b script/tir: Namespace‑aware TVMScript printer 
(T.cuda/T.ptx/T.nvshmem/T.nki) with unified op mapping and comprehensive tests 
(#344)
 discard 21090f2175 topk gating softmax (#339)
 discard 75d596bb8c op schedule: refactor  (#343)
 discard 16fb3f009f op schedule: refactor (#342)
 discard 0b2f31fc0f kernel: add CuTeDSL baseline (hgemm) (#340)
 discard 19f92cd33a [Megakernel] refactor the dynamic scheduler (#336)
 discard 6af802d94c Add group gemm kernel (#338)
 discard ec97b3254c add moe_align kernel (#337)
 discard 4780aaaf0a [Perf] optimize the smem-layout in batch-attn
 discard dede484e11 [Codegen] Add cp.reduce.async (#334)
 discard 27f26b77e5 [MegaKernel] Refactor duplicate profiler start/end (#333)
 discard 52543ceb9f [MegaKernel] Fix e2e profiler when using TP (#332)
 discard 51e8f646bc [MegaKernel] Fix hanging issue when using TP (#331)
 discard ff302d1318 [MegaKernel] support megakernel tp=4 (#330)
 discard 1a46f53962 Host-side GEMM tile size selection(#329)
 discard 35324fbbca Fuse QKV rms+rope+append in Megakernel(#328)
 discard 8ec64e5db3 [Perf] change register allocation site in megakernel (#327)
 discard 9fc248fb65 [fix] fix batch-attn prologue & add tp e2e profiler (#325)
 discard fe237a77f8 [dev] detach batch-attn prologue & change silu to pipeline 
(#324)
 discard 39a372abae [dev] add e2e profiler (#323)
 discard e94250563a [fix] pass static mega-model-tp (#322)
 discard 736969da54 kernel: a bit faster fused_add_rms_norm (#321)
 discard 6d962cb885 [dev] add smem prefetch (#319)
 discard 4b0d0d8789 refactor megakernel tile interface (#318)
 discard 0b76f5b6ab [dev] integrate batch attention kernel to layer (#317)
 discard 04b8e8594e alloc_buffer/event move out of block (#315)
 discard 00c9056cf6 refactor view/get API (#316)
 discard ed4361ba3e fix cell support (#314)
 discard a72ffda9e3 buffer: byte offset support (#313)
 discard b181dfd3bd let ptr_to use address_of; fix buffer_offset is not 
eliminated in LowerTIRp pass (#312)
 discard 9d97d8301c refactor: remove logical scope (#311)
 discard 0011a76f10 layout: change the default layout assignment behavior when 
defining buffers (#310)
 discard 00df6c8729 Lowering: enhance scope_id_def  (#309)
 discard 8f71c30b9e Add cp-async silu-multiply kernel (#308)
 discard c1f80b65ec transform: remove unnecessary passes (#307)
 discard 08b7370459 kernel: batch attention change per latest flashinfer API 
(#305)
 discard 0bcae2185d kernel: batch attention bx/tx configuration change (#304)
 discard 1422de2b6a fix pytest (#303)
 discard 97685c98ae [fix] unify the test script of the megakernels (#301)
 discard ba130116ac kernel: flashinfer batch attention (#286)
 discard c7fd75f4e7 use split k for gate up proj on TP=8 (#302)
 discard 54ef406a4a [fix] pass mega tests (#299)
 discard 0db6fc1489 fix correctness: ag+gemm
 discard 460ce5bea4 e2e dyn TP=1 (#296)
 discard 4ea0c44ce2 Fix TP=1 after down-proj update (#295)
 discard 0e3346da72 fix TP for split-kv (#294)
 discard a1996ea83a Revert "improve TP=8 perf (#292)" (#293)
 discard fd88d6bd26 improve TP=8 perf (#292)
 discard d5eae0c27f Fix down-proj event tensor init (#290)
 discard 972af30c9e [fix] fix support.py (#289)
 discard f477e92ba5 TP support for static scheduling (#285)
 discard 218b915a4e Prepare for megakernel TP (#288)
 discard 5e676ea3a7 [dev] optimize the performance of layer (#287)
 discard 03d6826958 Reduce event tensor init time (#284)
 discard 40c5682548 Add megakernel utils (#283)
 discard 41aae8fb98 Static execution queue cache (#282)
 discard 1c3cfc4172 Static event tensor allocation (#281)
 discard 77f151c457 codegen: cp.async, ldmatrix, stmatrix, mma (#277)
 discard 6beeed8344 [Refactor] Megakernel code reorg (#280)
 discard 1e9f8e35ee [dev] add the megakernel of qwen3 layer (#279)
 discard e5bab55b00 Enable TP for unfused kernels (#278)
 discard ed266ba056 batch decode bench (#273)
 discard 1a611724e1 fix
 discard 46dac045cf ag_gemm
 discard 0b4e67b8d9 init
 discard 1a6f56963f Update FlashInfer BatchPagedAttentionPlan (#275)
 discard 83a24b9372 Reuse FlashInfer attn planning (#274)
 discard 1d3b8e697f e2e: qwen3 kernel fix (#272)
 discard a6f7bd166a fix: rope calculation (#271)
 discard af4cf3d308 Fix batch 1 split-k-reduce kernel (#270)
 discard a572e8c498 fix: pytest (#269)
 discard 5149be8c25 e2e: add `attn_plan_info` (#268)
 discard f3d10c5ecf e2e: qwen3 rope dyn shape (#267)
 discard 15976c1039 e2e: qwen3 32b batch decode API change (#266)
 discard e1141ce770 e2e: qwen3 32b fix (#265)
 discard f595e130d5 e2e: qwen3 32b (#263)
 discard 16de2e1552 fix pytest (#262)
 discard a40aa85918 kernel: qwen3 layer megakernel unittest (#257)
 discard fe39df3b81 init
 discard f73f6648ac kernel: refactor rope (#260)
 discard cb634f8789 [Kernel] fix gemm rs kernel when K is irregular (#259)
 discard cdd0e9558d [Kernel] Fix hgemm_rs dynamic schedule kernel (#258)
 discard 755cf074ab [Kernel] fix minor bug in matmul and add_rmsnorm (#256)
 discard d4a0603f40 kernel: gemm-rs overlap, using static scheduler (#252)
 discard 8e81829a2b kernel: gemm-rs overlap, using dynamic scheduler (#251)
 discard 7db90fd2ad kernel: symbolic shape split silu mul (#255)
 discard 14df6fb3b6 kernel: symbolic shape append kv (#254)
 discard c2868e69fd [Kernel] support symbolic M in low batch gemm (#253)
 discard 5bbb5cd1ab kernel: rmsnorm (#250)
 discard cb6fb1f0aa [kernel] support GQA batch decoding & fix some bugs (#249)
 discard 644f54f0bd fix (#248)
 discard 2b8d85fa73 fix (#247)
 discard cb84c72a16 [kernel] add split-k gemm kernel (#246)
 discard 3961432e57 Finish other necessary kernels (#245)
 discard 7bc1dfd588 [kernel] add rope kernel & change to persistent kernel 
(#244)
 discard 651323c157 [kernel] finish fused_add_rmsnorm (#243)
 discard f201e65292 [kernel] refactor & finish split-kv
 discard ff334090e0 op schedule: copy_async (flashinfer batch decode) (#241)
 discard d1c3a333af op schedule: copy, cast (flashinfer batch decode) (#240)
 discard 68895c6609 [Kernel] Add batch decoding kernel (#239)
 discard 84ae5a5eae hw: smem allocator & reshape view sugar (#238)
 discard 15dc25f16b op schedule: cache hint of tma load (#237)
 discard b269e63c6d [kernel] refactor hgemm kernel  (#236)
 discard 3593d52972 event: rewrite tma store in sm100_a deepgemm kernel using 
event (#235)
 discard 84c102ea08 event: rewrite tma load in sm100_a deepgemm kernel using 
event (#232)
 discard 566d9e9769 [Kernel] Add DistGemm impl (#233)
 discard b8960a56da kernel: refactor sm100_a deepgemm (#231)
 discard a8b432d5b8 remove pipeline (#230)
 discard 6cdc23eb68 event: initial event infra (#228)
 discard 08eb84d0a7 [feat] finish deepgemm kernel with good performance (#229)
 discard 21557dd44a fix: fix tests, mainly nvshmem related (#227)
 discard f9f3db9671 [Fix] fix hgemm kernel for small K size (#225)
 discard 71bd4cdd33 add print op (#224)
 discard 73337f7bfa Introduce call_tir_device (#220)
 discard 418833c0a8 [Profiler] Enable profiler on blackwell kernels (#222)
 discard fe0834692e [Comm] Comm EP Kernel and Infra (#212)
 discard 2ce995c4f4 hw: fix list comp (#219)
 discard 371a041fa4 hw: aug assign of cells (#218)
 discard c97341c039 hw: @T.macro error reporting fix (#217)
 discard 2f17dd0a09 hw: allow define @T.macro while parsing (inside 
@T.prim_func or another @T.macro) (#216)
 discard 8b50592240 hw: buffer.ptr_to(indices) api (#215)
 discard 1fd8125532 hw: parser/printer minors (#214)
 discard c34f6c1dd8 hw: cell api (#213)
 discard f7afd76594 [Event] Add event operators (#210)
 discard e6dd7e618a hw: refactor codegen logic for wgmma ops (#211)
 discard e806999ec0 hw: refactor codegen logic for tma ops (#209)
 discard 81c115d67d hw: refactor codegen logic for mbarrier ops (#208)
 discard 26949053b8 hw: refactor codegen logic for some ops (#207)
 discard 7aa160eeb8 hw: remove cuda_barrier stuff (#206)
 discard 5ac97a8cb6 Event runtime dynamic launch kernel (#205)
 discard 764682f4fc hw: refactor codegen logic for some ops (#202)
 discard 706f4ea002 hw: rewrite op ptx_map_shared_rank logic (#201)
 discard 6e603db0bb hw: rewrite op ptx_map_shared_rank logic (#200)
 discard c4e510ab52 hw: cuda source func call (#198)
 discard 1fe8f69700 layout offset refactor (#184)
 discard 8550e1193b [Op] Add Print Support for Scalar and String (#197)
 discard f138ab1d1a [Kernel] Add Blackwell GEMM kernel (#195)
 discard 0a4abada6c doc: update intro of layout tutorial (#194)
 discard fca373aecc doc: overview (#193)
 discard 18d9a41d4f doc: layout (#192)
 discard 7b5aa4454c Sync 3rdparty with mainline-branch
 discard 25a2ff4c80 layout: enhance normalize, is_inner and is_outer (#190)
 discard e4627dd6ae layout: remove scope from data members (#189)
 discard 270a1ff4c1 layout: refactor (#187)
 discard b12d4f9eaa [Codegen] Initial Support for Blackwell Instructions (#175)
 discard 9788ed12d1 [Trn][Op] refactor instruction generation for the rest of 
op (#185)
 discard ab3a98253e [Trn][Op] refactor instruction generation (#183)
 discard 5573dbd280 op schedule: tma store (#182)
 discard ff67f4371c fix: fix test case (#181)
 discard cdf5eac764 [Fix] fix scope partition bug (#180)
 discard b47b04c7fe [Runtime] Add event runtime (#179)
 discard e0b0e3e781 fix: add b200 target, fix test cases (#178)
 discard 585bf24dd6 op schedule: tma load phase fix (#177)
 discard 8d6f2e2123 hot fix (#176)
 discard 5aee1d9060 [REFACTOR][Op][Codegen] Unify PTX Op Namespaces to Align 
with CUDA Standards (#168)
 discard 6c8591645d minor fix (#174)
 discard 383105337d [Fix] Consider BufferView and BufferGet in NaiveAllocator 
(#173)
 discard 261a8656be [Transform] Add naive allocator that does not reuse memory 
(#171)
 discard a8a060531f [Trn] Infer private alloc in attn (#170)
 discard f69844fa88 op schedule: default copy (gmem <-> smem, cta) (#169)
 discard 4e0832e086 op schedule: support other swizzle modes in tma (#155)
 discard 7d1ef05003 [Transform] [Op] PrivateTensorAlloc pass: generate private 
buffer allocation needed in op schedule (#167)
 discard ee605ed5ff parser: T.grid() accepts (PrimExpr, PrimExpr) as start and 
extent of some loop (#166)
 discard 695f6b4030 [Trn] workaround Neuron compiler bug (#165)
 discard 631071dfca [Op][IR] decouple BufferView, BufferGet & support softmax 
op schedule for logical tensor (#157)
 discard 1cbd41af0c python visitor/mutator (#164)
 discard 5c4de7dcd4 [Runtime][Infra] Added Typed Tuple Runtime Container (#161)
 discard 49cb5a803d make code more concise (#163)
 discard 2c2418bdce [Trn][Op] allow more op specifications (#162)
 discard c94db8e201 [Trn][Op] nki activation bias field must be a tensor (#160)
 discard 1f75e69773 [Trn][Layout] simplify TrainiumLayout definition (#159)
 discard 90048b076c [Trn][Kernel] add causal flash attn support (#158)
 discard a2be728134 [Op][Trn] Allow loop-var-dependent region as op 
src/dst(#156)
 discard 2235444d3d [Perf] Profiler visualizing warpgroup-level kernel 
performance (#146)
 discard 8903a1f808 coreir: add bf16 support and extend test cases (#154)
 discard 2e8c557068 [Op][Trn] Select op (#153)
 discard 38ee9735e9 coreir: fix alloc_buffer api (#152)
 discard 44287616b6 refactor: lint (#151)
 discard 41fe1b9ca8 [Op][Trn] refactor compose op (#150)
 discard cc44cf434a refactor: some code clean up (#149)
 discard daf77e1cba [Op] refactor op registry (#147)
 discard f61f5267b4 op schedule: tma test fix (#148)
 discard 7cc4b0ee5e refactor: some code clean up (#145)
 discard 98b6656752 op schedule: tma load (#136)
 discard fa522a11c1 [CoreIR] [Op] f_op_scheduler now takes OpCall as argument 
(#144)
 discard b187d1bd0d [CoreIR] Add schedule config to opcall (#143)
 discard b9a928522f [Trn][Op][Kernel] Fix op involved in flash attention and 
add flash attn kernel for Trainium (#142)
 discard bc3d0b7583 [Trn] [Op] Support more compose op (#141)
 discard 85de47ae3b [Op][Trn] Make gemm and transpose aware of workspace buffer 
(#139)
 discard 7a0e2851bb [Op] Add workspace (pre-allocated buffer) to OpCall (#138)
 discard dafee16357 [Trn] [Op] support transpose in copy operator (#137)
 discard 4aa8762cf8 [Trn][Op] Support bias and scale on activation operators 
(#135)
 discard 748d92ccbd [Trn][Op] compose binary and reduce (#134)
 discard 3a9ed08c20 [Trn] Fix gemm kernel codegen (#133)
 discard 3f94b07ef3 [Trn][Pipeline] Add Trn compilation pipeline (#132)
 discard dbfa4602e9 [Op][Trn] Introduce compose_op: activation reduce (#131)
 discard e2c95f732f [Op] compose_op parser/printer (#130)
 discard 7e80f07746 [Parser] Enable alloc_buffer under any scope (#129)
 discard 831810a819 coreir: is_tile_outer returns the inner layout (#128)
 discard 11a67df2a1 coreir: is_tile_inner returns the outer layout (#127)
 discard 4e056924b8 [Trn][Op] add reduction op (#125)
 discard 09b0fbbd30 [Trn][Kernel] improve gemm perf by specifying allocation 
and pipeline (#124)
 discard 6412f16c8c [Trn][Op] Fix edge case for binary broadcast (#123)
 discard 3080fbc0b0 [CoreIR] Support implicit slice (#122)
 discard fc336a82fd [Trn] Specify Buffer Allocation Address (#121)
 discard 05ebffee19 HGEMM kernel (#120)
 discard 1f8ef0e5c5 [Trn][Op] Impose instruction size limit in op schedule 
(#119)
 discard b3f911e4e8 [Trn][Codegen] Change codegen test into string matching 
(#118)
 discard 6fe1d06a57 [Trn][Layout][Op] Fix unary op schedule and layout 
validation (#117)
 discard 2aada17e2e [Trn][Layout] Introduce new PSUM layout (#116)
 discard 8181799302 [Trn] Fix Trn Codegen and add more nki function calls (#115)
 discard 5f8641ef6d test: add compose layout test (#114)
 discard a1ee9c6bf4 test: update hgemm kernel (#113)
 discard 0370d6064e test: update fa3 kernel, using break (#112)
 discard 3f25fbf98b test: update copy op test (#111)
 discard 8df3e6f588 [CoreIR] Break/Continue Stmt (#110)
 discard ef219d0ad2 coreir: compose layout (#108)
 discard 5d4ebbcee2 [Trn] Fix instruction selection for binary broadcast (#109)
 discard 8d8f4e5bc8 coreir: introduce tir.scope_parition to use if-then-else to 
connect exclusive scope slices (#107)
 discard a59941f031 coreir: use Variant in exec scope slice (#106)
 discard c0ab87756d [Trn] Binary and Unary Ops  (#105)
 discard 5c7cf38a9e infra: make pylint happy when use meta_var and class to do 
meta programming (#104)
 discard b9e2962de8 test: introduce scope slice in kernels (#103)
 discard a219e74756 [CoreIR] Add more scopes (#102)
 discard 97d5f01493 [CoreIR] Add test in scope slice lowering (#101)
 discard a068ed9a11 [CoreIR][Layout] Rename from_nested_tuple tp from_tuple 
(#100)
 discard 0eb5269178 [CoreIR] ExecScopeSlice lowering (#99)
 discard 9769ac5cb1 [Trn] [Op] Fix var_range_map use (#98)
 discard 7f2952bc6b [Trn] [Op] Gemm op dispatch for Trainium (#97)
 discard 94fe485fa4 [CoreIR] ExecScopeSlice supporting select_cond (#96)
 discard 282266dcc6 [Demo] Layernorm Kernel (#92)
 discard e21937262d [CoreIR][Printer&Parser] Simplify ExecScopeSlice syntax 
(#95)
 discard 788a3027ee [CoreIR] Verify ExecScopeSlice (#94)
 discard 8d0824976b [CoreIR][ExecScope] Refactor ExecScope (#93)
 discard de1a1fb769 [Op] Move CUDA op impl under a new folder (#91)
 discard d5c2740995 [Kernel] A100 4-pipeline GEMM Codegen with Tir+  (#89)
 discard 127bfa9989 [TEST] Refactor test (#90)
 discard fdefdda98a [TEST][Kernel] Improve FA3 perf (#88)
 discard fe710e19b2 [Trn] [Op] OP dispatch for copy for Trainium (#86)
 discard 9d818a595e [Trn] [Codegen] Trainuim Codegen (#85)
 discard 1d680583a6 [TEST] Add FlashInfer Benchmark (#84)
 discard 2f71d7cafe [TEST] Refactor kernel test (#83)
 discard 05bc5a3df4 [Op] Add Reduction Op (#82)
 discard 2b0a6abda9 [Op] Remove barrier op (#81)
 discard 87e5d0e81d [TEST] Skip proton under pytest (#80)
 discard 4539d9232d [Op] Add pipeline in op (#79)
 discard e1153424ee [Layout] [Trn] Add layout for Trainium (#78)
 discard 396eb0f57b [Layout] Fix layout tests (#77)
 discard 9807a223dd [Layout] Refactor Layout (#75)
 discard 9d2ac6b80b [Op] extend 2d to nd copy (#74)
 discard 5319822ac4 [Op] refactor op schedule registry (#73)
 discard a654a15157 [Kernel] H100 fp16 Flash Attn (#70)
 discard 99cad32bb2 [Kernel] H100 fp8 GEMM (#69)
 discard 27658cd5ac [Kernel] H100 GeMM no WS (#68)
 discard f7d0bbf841 [Kernel] H100 HGEMM: L2 cache opt (#67)
 discard e2729814ec [Pass] Let merge dyn SMEM pass be aware of buffer alignment 
annon (#66)
 discard b5e8e47fa9 [Kernel] H100 Gemm (#65)
 discard 20f061b6a2 [Test][Fix] Fix WGMMA test (#64)
 discard 678f35f6f1 [Codegen] Hopper codegen: WGMMA (#63)
 discard aff8258ef0 [Codegen] Hopper codegen: TMA store (#62)
 discard 206b2708ec [Refactor] Lift cuda builtins out (#61)
 discard d3dc00301e [Lowering] Fix scope id resolve (#60)
 discard 1cc32dba03 [Codegen] Hopper codegen: multicast TMA copy  (#59)
 discard 10e28dafd8 [Codegen] Hopper codegen: unicast TMA copy (#57)
 discard 6b20ed6835 [Layout] Improve layout normalization with generalization 
of partial fusion (#58)
 discard 8239064830 [Layout] Determining Optimal Vector Length for Efficient 
Copy  (#52)
 discard 10aeeaed12 [Layout] Check Layout as of a Tile of another Layout (#50)
 discard 0bbe943219 [E2E][GeMM][Ampere] rewrite kernel using copy/pipeline ops 
(#56)
 discard 4ea038cc4a [Lowering][Op Schedule] Copy op: SwizzleLayout support (#55)
 discard 26b770ce86 [Lowering][Op Schedule] copy op test (#54)
 discard 9e0050b13d [Lower][Op Schedule] Merge neighbour scopes (#53)
 discard a4944d0f81 [Async] Pipeline lowering: CUDA sm80 (#51)
 discard 80d30d4045 [Layout] Fuse layouts w/ or w/t device tree (#48)
 discard a9059db668 [Async] Barrier lowering (#49)
 discard ed30240d84 [Async] Async language structures init: Pipeline (#47)
 discard bf6a1eed75 [Async] Async language structures init: Barrier (#45)
 discard 214c8c16b2 New op print (#44)
 discard 0a25bf9746 [Op] Add tests (#42)
 discard 15ce9e767f [Op] Initial infra (#41)
 discard 8da44194d2 [Lowering][Layout] Refactor (#40)
 discard e885545c71 [Layout] Layout apply operator (#39)
 discard 21cd6aae49 [Layout] Swizzle (#38)
 discard 76e77b37ea [CoreIR][Verifier] Layout Verify (#37)
 discard 0b60407170 [CoreIR] BufferView completion (#36)
 discard 7be81d1cd4 [Layout] Layout API: Shard (#35)
 discard a30621296c [Layout] Layout API: Tile (#34)
 discard 3dc0bf4e02 [Layout] Normalize TileLayout (#33)
 discard 82ba2ff782 [Layout] Layout Printer & NestedTuple API completion (#32)
 discard bcdb17e537 [Layout] Refactor (#31)
 discard bef120ec38 [Layout] Move some codes to cpp (#30)
 discard a377ffd9da [Printer&Parser] Allow intemediate layout expr (#29)
 discard 23070df2d0 [Printer&Parser] Default logical scope for alloc_buffer 
(#28)
 discard 6d78ef9505 [Dev] Introduce from and to scope to TileLayout (#27)
 discard f797ce9a14 [Verifier] Invoke verifier in parser (#26)
 discard 4860af7b3e [Test] GeMM Ampere CuBLAS (#24)
 discard d5b16d2673 [Test] GeMM Ampere (#23)
 discard 510a8841a9 [Core IR] Lowering (#22)
 discard 338cf255f4 [Core IR] Lowering: remove logical tensor (#21)
 discard d035c9d165 [Core IR] fix view&get printer & parser (#20)
 discard 73a3e066cf [Core IR] Fix scopeId printer & parser (#19)
 discard fb22bec9e8 [Core IR] Add tests to verifier (#18)
 discard 2e701053e5 [Core IR] T.view and T.get (#17)
 discard 1627824d4e [ExecScope] add verifier tests (#16)
 discard 5ab1a5798a [ExecScope] printer & parser fix (#15)
 discard e08e0c5f26 [Layout] Introduce BufferGet (#14)
 discard de8b3795a2 Nested tuple (#13)
 discard d0cae89faa [Layout] Naive Printer and Parser (#12)
 discard 34143aa19c [Layout] Fix (#11)
 discard 3b866403dc [Layout] Introduce BufferView (#10)
 discard 5a76c85a67 [Layout] refactor (#9)
 discard b7930a333f [Layout] Init (#8)
 discard 9be64f9046 [ExecScope] Add cur scope name to ScopeIdDef (#7)
 discard 1d7aabb1cf [Layout] Introduce logical scope to buffers (#5)
 discard 6753148980 [ExecScope] refactor (#4)
 discard 55e5d29e71 [ExecScope] initialization
 discard 2084df6ba8 trigger ci
 discard c1ea97fcd2 Add missing backtick (#15968)
 discard a0463a1891 [cherry-pick][ARITH][BUGFIX] Fix a bug of iter map 
floormod(x,2) simplify (#14704)
 discard c48f60b2d7 [CI] Pin mypy version (#8329)
 discard 5e2230056f [µTVM] Fix paths in the reference VM tutorial and add 
vbguest recommendation (#7015)
 discard 71117b4153 [Runtime][Object] expose runtime::String to Python (#5212)
 discard 9225316465 Fix global var in prelude (#3405)
