cbalint13 commented on PR #18182:
URL: https://github.com/apache/tvm/pull/18182#issuecomment-3220162168

   Further, I investigated the correctness of the proposed tensorization kernels.
   The proposed `multivmul` performs multiple dot products, yielding the highest benefit inside RVV.
   
   All tests here require https://github.com/apache/tvm/pull/18232
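   
   For reference, the contract these `multivmul` kernels should satisfy (and which the "Output (numpy)" lines below compute) is one independent dot product per output lane. A minimal NumPy sketch, with shapes assumed from the `rvv_float32_multivmul_8_K_m8` test names:
   
   ```
   import numpy as np
   
   # Assumed shapes: A is a shared K-element vector, B holds 8 rows,
   # and each output lane is one independent dot product.
   K = 64
   A = np.random.randint(0, 10, size=(K,)).astype("float32")
   B = np.random.randint(0, 10, size=(8, K)).astype("float32")
   
   C = B @ A  # C[i] = dot(B[i, :], A) for i in 0..7
   assert C.shape == (8,)
   ```
   
   In the failing runs below, only lane 0 of the kernel output matches this reference; lanes 1..7 come back as zero.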
   
   ---
   
   * The proposed kernels look wrong; the implementation also produces bad numerical results:
   
[riscv64-rvv-kernels-pr18182.py.gz](https://github.com/user-attachments/files/21968923/riscv64-rvv-kernels-pr18182.py.gz)
   ```
   $ ./riscv64-rvv-kernels-pr18182.py 64
   Testing rvv_float32_multivmul_8_64_m8
   C (output): (8,) [float32]
   [1363.    0.    0.    0.    0.    0.    0.    0.]
   Output (kernel) [1363.    0.    0.    0.    0.    0.    0.    0.]
   Output (numpy) [1363. 1407. 1460. 1388. 1504. 1373. 1268. 1270.]
   
   $ ./riscv64-rvv-kernels-pr18182.py 32
   Testing rvv_float32_multivmul_8_32_m8
   C (output): (8,) [float32]
   [699.   0.   0.   0.   0.   0.   0.   0.]
   Output (kernel) [699.   0.   0.   0.   0.   0.   0.   0.]
   Output (numpy) [699. 493. 671. 707. 635. 639. 764. 611.]
   
   $ ./riscv64-rvv-kernels-pr18182.py 16
   Testing rvv_float32_multivmul_8_16_m8
   C (output): (8,) [float32]
   [425.   0.   0.   0.   0.   0.   0.   0.]
   Output (kernel) [425.   0.   0.   0.   0.   0.   0.   0.]
   Output (numpy) [425. 192. 382. 464. 465. 382. 438. 202.]
   {...}
   
   ```
   
   * Here is a working reference fp32 kernel leveraging one-hot reduction at [full RVV occupancy](https://github.com/cbalint13/rvv-kernels/blob/main/dot_fp32_kernel.ir).
   
[riscv64-rvv-full-fp32_kern.py.gz](https://github.com/user-attachments/files/21968922/riscv64-rvv-full-fp32_kern.py.gz)
   ```
   $ ./riscv64-rvv-full-fp32_kern.py
   # from tvm.script import ir as I
   # from tvm.script import tir as T
   
   @I.ir_module
   class Module:
       @T.prim_func
       def main(A_handle: T.handle, B_handle: T.handle, C_handle: T.handle):
           T.func_attr({"global_symbol": "rvv_dot_4f32_4x4f32_2f32"})
           A = T.match_buffer(A_handle, (4,), align=4, offset_factor=1)
           B = T.match_buffer(B_handle, (4, 4), strides=(4, 1), align=4, offset_factor=1)
           C = T.match_buffer(C_handle, (4,), align=4, offset_factor=1)
           with T.block("root"):
               T.reads(A[0:4], B[0:4, 0:4])
               T.writes(C[0:4])
               zero: T.float32xvscalex2 = T.call_llvm_intrin("float32xvscalex2", "llvm.riscv.vfmv.v.f", T.Broadcast(T.float32(0.0), T.vscale() * 2), C[0], T.uint64(1))
               vec_A: T.float32xvscalex4 = T.call_llvm_intrin("float32xvscalex4", "llvm.riscv.vle", T.Broadcast(T.float32(0.0), T.vscale() * 4), T.tvm_access_ptr(T.type_annotation("float32"), A.data, 0, 4, 1), T.int64(4))
               for i in range(4):
                   with T.block("reduction"):
                       vi = T.axis.spatial(4, i)
                       T.reads(B[0:4, 0:4])
                       T.writes(C[vi])
                       vec_B: T.float32xvscalex4 = T.call_llvm_intrin("float32xvscalex4", "llvm.riscv.vle", T.Broadcast(T.float32(0.0), T.vscale() * 4), T.tvm_access_ptr(T.type_annotation("float32"), B.data, vi * 4, 4, 1), T.int64(4))
                       product: T.float32xvscalex4 = T.call_llvm_intrin("float32xvscalex4", "llvm.riscv.vfmul", T.Broadcast(T.float32(0.0), T.vscale() * 4), vec_A, vec_B, T.uint64(7), T.uint64(4))
                       reduction_result_vec: T.float32xvscalex2 = T.call_llvm_intrin("float32xvscalex2", "llvm.riscv.vfredusum", T.Broadcast(T.float32(0.0), T.vscale() * 2), product, zero, T.uint64(7), T.uint64(4))
                       C[vi] = T.call_llvm_intrin("float32", "llvm.riscv.vfmv.f.s", reduction_result_vec)
   
   [6. 6. 9. 3.]
   [[3. 7. 7. 7.]
    [0. 2. 5. 7.]
    [3. 9. 5. 7.]
    [9. 3. 6. 1.]]
   Output (kernel) [144.  78. 138. 129.]
   Output (numpy) [144.  78. 138. 129.]
   ```
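   
   The printed sample above can be verified by hand; a minimal NumPy check reproducing the "Output (numpy)" reference line:
   
   ```
   import numpy as np
   
   # Test data exactly as printed by the script above.
   A = np.array([6.0, 6.0, 9.0, 3.0], dtype="float32")
   B = np.array([[3.0, 7.0, 7.0, 7.0],
                 [0.0, 2.0, 5.0, 7.0],
                 [3.0, 9.0, 5.0, 7.0],
                 [9.0, 3.0, 6.0, 1.0]], dtype="float32")
   
   print(B @ A)  # [144.  78. 138. 129.] -- matches the kernel output
   ```
   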
   For this working sample, the 4 x (4x4) -> 4-lane configuration (VLEN=256 @ fp32) is [the maximum](https://github.com/cbalint13/rvv-kernels/blob/f9ae5903b557c31df90eaafab644d9598a08e0cb/rvv-dot-kernel-gen.py#L37-L40) for a fully occupied RVV machine.
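   
   As a back-of-the-envelope check of that limit (my own arithmetic, not code from the PR or the generator), the fp32 lane capacity the occupancy claim derives from can be computed as:
   
   ```
   # fp32 lane capacity of one RVV register group, for a given VLEN and LMUL.
   def rvv_fp32_lanes(vlen_bits: int, lmul: int = 1) -> int:
       return vlen_bits * lmul // 32  # 32 bits per fp32 element
   
   print(rvv_fp32_lanes(256))     # 8 fp32 lanes per register at LMUL=1, VLEN=256
   print(rvv_fp32_lanes(256, 8))  # 64 fp32 lanes in an m8 register group
   ```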
   
   ---
   
   Now,
   
   besides the template-matching issues due to the relax flow (exemplified with a working dense/matmul test case), the numerical implementation of the kernels themselves is also wrong, and personally I don't see how they fully exploit the RVV machine (a working test case is provided above as well).
   

