kimm240 opened a new pull request, #18418:
URL: https://github.com/apache/tvm/pull/18418

   Currently it is not possible to fuse an epilogue operation (e.g., bias 
addition) into a reduction block's initialization statement. This limitation 
prevents leveraging hardware-specific instructions that support bias 
accumulation in vector ISAs, such as MACC (multiply-accumulate with bias) 
instructions.
   
   This commit implements a new schedule primitive 'fuse_reduction_epilogue' 
that addresses the problem described in:
   
https://discuss.tvm.apache.org/t/tir-problem-inlining-addition-into-matmul-block/18066
   
   The primitive transforms the following pattern:
   
     Before:
       for i, j, k in T.grid(M, N, K):
           with T.block("matmul"):
               vi, vj, vk = T.axis.remap("SSR", [i, j, k])
               with T.init():
                   temp[vi, vj] = 0
               temp[vi, vj] = temp[vi, vj] + A[vi, vk] * B[vj, vk]

       for i, j in T.grid(M, N):
           with T.block("bias_add"):
               vi, vj = T.axis.remap("SS", [i, j])
               D[vi, vj] = temp[vi, vj] + C[vi, vj]
   
     After:
       for i, j, k in T.grid(M, N, K):
           with T.block("matmul"):
               vi, vj, vk = T.axis.remap("SSR", [i, j, k])
               T.reads(C[vi, vj], A[vi, vk], B[vj, vk])
               T.writes(D[vi, vj])
               with T.init():
                   D[vi, vj] = C[vi, vj]  # Fused epilogue into init
               D[vi, vj] = D[vi, vj] + A[vi, vk] * B[vj, vk]
   
   The transformation removes the intermediate temp buffer and the separate 
epilogue block, enabling better tensorization opportunities for hardware with 
bias accumulation support.
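
   A minimal usage sketch of the new primitive follows; the argument order and
   names here are illustrative assumptions, and 'Matmul' stands for a TVMScript
   module containing the "Before" pattern above:

      import tvm
      from tvm import tir

      sch = tir.Schedule(Matmul)
      sch.fuse_reduction_epilogue(
          sch.get_block("matmul"),    # reduction block
          sch.get_block("bias_add"),  # epilogue block folded into T.init()
      )
      print(sch.mod.script())  # "matmul" now initializes D with C; temp is gone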
   
   Implementation:
   - ReductionEpilogueFuser class for pattern validation and IR transformation
     - BodyPatternAllowFusion: Validates epilogue can be fused
     - AnalyzeEpiloguePattern: Detects addition pattern (D = temp + C)
     - ExtractEpilogueInfo: Extracts buffer and region information
     - CreateFusedReductionBlock: Creates single block with modified T.init()
   - SingleBlockFusionReplacer: Replaces blocks and removes temp buffer
   - Variable mapping between epilogue and reduction block iter vars
   - Proper buffer and region updates with correct read/write ordering
   - FFI bindings and Python API following TVM conventions
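
   Conceptually, the fusibility check performed by AnalyzeEpiloguePattern can be
   pictured in Python as follows (the actual implementation is C++ in
   compute_inline.cc; 'epilogue_is_fusible' is a hypothetical helper used only
   for illustration):

      from tvm import tir

      def epilogue_is_fusible(store: tir.BufferStore, temp_buffer: tir.Buffer) -> bool:
          """True if the epilogue is a store of the form D[...] = temp[...] + C[...]."""
          value = store.value
          if not isinstance(value, tir.Add):
              return False
          operands = [value.a, value.b]
          if not all(isinstance(op, tir.BufferLoad) for op in operands):
              return False
          # Exactly one operand must read the reduction's temp buffer; the other
          # operand (the bias C) is what gets moved into the T.init() statement.
          return sum(op.buffer.same_as(temp_buffer) for op in operands) == 1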
   
   Changes:
   - src/tir/schedule/primitive/compute_inline.cc: Core implementation (~430 lines)
   - src/tir/schedule/primitive.h: Function declaration
   - include/tvm/tir/schedule/schedule.h: Virtual method in ScheduleNode
   - src/tir/schedule/concrete_schedule.{h,cc}: ConcreteScheduleNode implementation
   - src/tir/schedule/traced_schedule.{h,cc}: TracedScheduleNode implementation
   - src/tir/schedule/schedule.cc: FFI binding registration
   - python/tvm/tir/schedule/schedule.py: Python API with documentation
   - tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py:
     Comprehensive tests including basic fusion, float32 variant, numerical
     correctness verification, and trace roundtrip validation
   
   Run tests with:
     pytest tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py -v
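
   For example, the trace roundtrip validation can be sketched as follows (a
   minimal sketch assuming 'Matmul' is a TVMScript module with the "Before"
   pattern and the same illustrative argument convention as above):

      import tvm
      from tvm.tir.schedule.testing import verify_trace_roundtrip

      def test_fuse_reduction_epilogue_trace_roundtrip():
          sch = tvm.tir.Schedule(Matmul)
          sch.fuse_reduction_epilogue(sch.get_block("matmul"), sch.get_block("bias_add"))
          # Re-applying the recorded trace to a fresh schedule of Matmul must
          # reproduce sch.mod exactly.
          verify_trace_roundtrip(sch=sch, mod=Matmul)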

