csullivan opened a new pull request, #17082:
URL: https://github.com/apache/tvm/pull/17082

   Change to use 16x32 (spatial x reduction) thread extents regardless of workload size. This works around a lowering bug that I haven't tracked down yet.
   
   Currently, when the spatial dimension is larger than the reduction dimension, a 4x64 thread layout is used, which implies two warps along the reduction dimension (blockDim.x = 64). An illegal CUDA instruction is encountered in the second warp during the __shfl_down_sync for the remainder portion of the computation (as a result of the rfactor, I believe). AFAICT the mask calculation used for this remainder shuffle is incorrect and is causing the error. Specifically, it occurs on the first thread of the second warp (there are two warps along x since blockDim.x = 64).
   
   This is the relevant CUDA code causing the error:
   ```
   if (((int)threadIdx.x) < 2) {
     red_buf0[0] = red_buf_staging[((((int)threadIdx.y) * 2) + ((int)threadIdx.x))];
   }
   mask[0] = (__activemask() & ((uint)(3 << (((int)threadIdx.y) * 2))));  // <<< likely the problem
   t0[0] = __shfl_down_sync(mask[0], red_buf0[0], 1, 32);
   red_buf0[0] = (red_buf0[0] + t0[0]);
   if (((int)threadIdx.x) == 0) {
     ((volatile half*)red_result)[((int)threadIdx.y)] = red_buf0[0];
   }
   ```
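   To make the failure mode concrete, here is a minimal standalone sketch (the kernel name, launch shape, and printf diagnostic are mine, not from the generated code). Shuffle masks are expressed in lane ids of the *executing* warp, and with blockDim.x = 64 a thread's lane is threadIdx.x % 32, so a mask derived from threadIdx.y can exclude the very lane that passes it to __shfl_down_sync:
   ```
   #include <cstdio>

   // Hypothetical repro sketch, not the emitted kernel. With blockDim =
   // (64, 4), the linear thread id determines the warp, so the lane id is
   // threadIdx.x % 32 rather than anything derivable from threadIdx.y.
   __global__ void mask_mismatch_demo() {
     int tid  = threadIdx.y * blockDim.x + threadIdx.x;  // linear thread id
     int lane = tid % 32;                                // lane within its warp
     unsigned mask = 3u << (threadIdx.y * 2);            // mask from the code above
     // Example: threadIdx = (32, 1), the first thread of a second warp along
     // x. Its lane is 0, but mask == 0b1100, which excludes bit 0. The CUDA
     // docs leave __shfl_down_sync undefined when the executing thread is
     // not in mask, which the hardware can report as an illegal instruction.
     if (((mask >> lane) & 1u) == 0) {
       printf("tid %d (lane %d) is not covered by mask 0x%x\n", tid, lane, mask);
     }
   }

   int main() {
     mask_mismatch_demo<<<1, dim3(64, 4)>>>();  // 8 warps, two per y-row
     cudaDeviceSynchronize();
     return 0;
   }
   ```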
   The corresponding SASS where the illegal instruction occurs (the `VOTEU.ANY` / `SHF.L.U32` / `LOP3.LUT` sequence appears to compute `__activemask() & (3 << (threadIdx.y * 2))`, and the fault is reported at the `WARPSYNC R3` / `SHFL.DOWN` pair that consumes that mask):
   ```
      0x00007d9e97b92490 <+1936>:  WARPSYNC.ALL
      0x00007d9e97b924a0 <+1952>:  BAR.SYNC.DEFER_BLOCKING 0x0
      0x00007d9e97b924b0 <+1968>:  @!P1 VIADD R13, R5, 0x8
      0x00007d9e97b924c0 <+1984>:  @!P1 LEA R7, R17, R14, 0x1
      0x00007d9e97b924d0 <+2000>:  @!P1 PRMT R6, R2, 0x654, R13
      0x00007d9e97b924e0 <+2016>:  @!P1 LEA R7, R7, R6, 0x1
      0x00007d9e97b924f0 <+2032>:  @!P1 LDS.U16 R16, [R7]
      0x00007d9e97b92500 <+2048>:  IMAD.MOV.U32 R6, RZ, RZ, 0x3
      0x00007d9e97b92510 <+2064>:  SHF.L.U32 R17, R17, 0x1, RZ
      0x00007d9e97b92520 <+2080>:  VOTEU.ANY UR4, UPT, PT
      0x00007d9e97b92530 <+2096>:  SHF.L.U32 R3, R6, R17, RZ
      0x00007d9e97b92540 <+2112>:  LOP3.LUT R3, R3, UR4, RZ, 0xc0, !PT
      0x00007d9e97b92550 <+2128>:  ISETP.NE.AND P0, PT, R14, RZ, PT
      0x00007d9e97b92560 <+2144>:  PRMT R2, R2, 0x654, R5
      0x00007d9e97b92570 <+2160>:  PRMT R4, R16, 0x5410, R16
   *> 0x00007d9e97b92580 <+2176>:  WARPSYNC R3
   => 0x00007d9e97b92590 <+2192>:  SHFL.DOWN PT, R3, R4, 0x1, 0x1f
      0x00007d9e97b925a0 <+2208>:  IMAD.IADD R17, R17, 0x1, R2
      0x00007d9e97b925b0 <+2224>:  HADD2 R16, R16.H0_H0, R3.H0_H0
      0x00007d9e97b925c0 <+2240>:  @!P0 STS.U16 [R17], R16
      0x00007d9e97b925d0 <+2256>:  WARPSYNC.ALL
      0x00007d9e97b925e0 <+2272>:  BAR.SYNC.DEFER_BLOCKING 0x0
   ```
     
   Changing the thread extents to 16x32 (one warp along the reduction dimension) works around the issue. It also improves performance for the tested shapes by ~10%.
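   
   A plausible reading of why this sidesteps the bug (my interpretation, not verified against the lowering pass): with blockDim.x = 32, each threadIdx.y row is exactly one warp and the lane id equals threadIdx.x, so the row can be reduced entirely with warp shuffles under a full mask and the cross-warp staging plus remainder shuffle above is never needed. A sketch of that single-warp reduction (`warp_reduce_row` is an illustrative name, not the emitted code):
   ```
   #include <cuda_fp16.h>

   // Single-warp row reduction: every lane participates, so the shuffle
   // mask is simply all ones and cannot disagree with the executing
   // warp's lane ids the way the staged remainder shuffle above can.
   __device__ half warp_reduce_row(half v) {
     const unsigned full_mask = 0xffffffffu;        // all 32 lanes take part
     for (int offset = 16; offset > 0; offset >>= 1) {
       v = __hadd(v, __shfl_down_sync(full_mask, v, offset, 32));
     }
     return v;  // lane 0 (threadIdx.x == 0) ends up holding the row sum
   }
   ```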
   
   ```
   Utilizing (8, 2048, 4096) to avoid the error,

   # 4x64
    Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)             Name
    --------  ---------------  ---------  --------  --------  --------  --------  -----------  --------------------------
        81.5           612214        101    6061.5    6048.0      5920      7872        188.5  moe_dequantize_gemv_kernel

   # 16x32
    Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)             Name
    --------  ---------------  ---------  --------  --------  --------  --------  -----------  --------------------------
        79.9           555901        101    5504.0    5472.0      5439      6880        142.7  moe_dequantize_gemv_kernel
   ```

