roastduck edited a comment on issue #5307: [TIR] Make lower_warp_memory support 
extent(threadIdx.x) < warp_size
URL: https://github.com/apache/incubator-tvm/pull/5307#issuecomment-612567383
 
 
   One way to perform a half-warp shuffle while keeping `extent(threadIdx.x) == 
warp size` is like this:
   
   1. Suppose we have `extent(threadIdx.x) == 16`. We first split `threadIdx.y` 
with factor 2, and then fuse that `2` with `threadIdx.x`, so the extent of 
`threadIdx.x` becomes 32.
   2. Then we can perform any lowering procedure with the assumption of 
`extent(threadIdx.x) == 32`.
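The index arithmetic behind the split-then-fuse step can be sketched in plain Python. This is only an illustration, not TVM code: the helper names `fuse_thread_index` and `unfuse_thread_index` are hypothetical, and it assumes the inner part of the split `threadIdx.y` becomes the high bits of the fused index.

```python
WARP_SIZE = 32
ORIG_TX_EXTENT = 16                     # extent(threadIdx.x) before fusing
FACTOR = WARP_SIZE // ORIG_TX_EXTENT    # split factor applied to threadIdx.y (2)


def fuse_thread_index(tx, ty):
    """Map the original (threadIdx.x, threadIdx.y) to the fused pair.

    Assumption: threadIdx.y is split with factor 2 and its inner part is
    fused in front of threadIdx.x, so the fused extent is 2 * 16 == 32.
    """
    ty_outer, ty_inner = divmod(ty, FACTOR)
    fused_tx = ty_inner * ORIG_TX_EXTENT + tx
    return fused_tx, ty_outer


def unfuse_thread_index(fused_tx, ty_outer):
    """Recover the original (threadIdx.x, threadIdx.y) from the fused pair."""
    ty_inner, tx = divmod(fused_tx, ORIG_TX_EXTENT)
    return tx, ty_outer * FACTOR + ty_inner
```

Under these assumptions, every expression that used the original `threadIdx.x` has to be rewritten as `fused_tx % 16`, which is why the change touches every scope that reads the thread index.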
   
   Either we require users to write such a schedule, or we modify the 
schedule for them. I think neither way is intuitive enough for users. It may 
also hinder debugging, and debugging in TVM is already difficult. Note that we 
have to modify `threadIdx.x` in more than one scope in order to keep the 
thread index consistent. Here is a more complex example, simplified from the 
algorithm I am currently working on.
   
   ```
   // a is shaped (n)
   // b is shaped (16)
   // c is shaped (n, 16)
   // extent of threadIdx.x == 16
   for (i.outer = 0; i.outer < n; i.outer += 16) {
       if (i.outer + threadIdx.x < n) {
           a.warp[i.outer + threadIdx.x] = a[i.outer + threadIdx.x]; // (1)
       }
       for (i.inner = 0; i.inner < min(16, n - i.outer); i.inner++) {
           c[i.outer + i.inner, threadIdx.x] += a.warp[i.outer + i.inner] * b[threadIdx.x]; // (2)
       }
   }
   ```
   
   `threadIdx.x` in both statements `(1)` and `(2)` would have to be fused to 
an extent of 32, which is a major modification to the schedule. Users may have 
difficulty working with such a schedule.
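To make the data flow of the example concrete, here is a rough sequential simulation in Python. It is a sketch under stated assumptions, not the lowered code: the function name `simulate` is hypothetical, and it models `a.warp` as one register per lane, with the read in statement `(2)` treated as a shuffle that broadcasts lane `i.inner`'s register to all 16 lanes.

```python
HALF_WARP = 16  # extent(threadIdx.x) in the example above


def simulate(a, b):
    """Sequentially emulate the 16-lane kernel; c has shape (n, 16)."""
    n = len(a)
    c = [[0] * HALF_WARP for _ in range(n)]
    for i_outer in range(0, n, HALF_WARP):
        # Statement (1): lane tx keeps a[i_outer + tx] in its own register.
        regs = [
            a[i_outer + tx] if i_outer + tx < n else None
            for tx in range(HALF_WARP)
        ]
        for i_inner in range(min(HALF_WARP, n - i_outer)):
            # Statement (2): every lane reads lane i_inner's register,
            # which is what a half-warp shuffle would provide.
            broadcast = regs[i_inner]
            for tx in range(HALF_WARP):
                c[i_outer + i_inner][tx] += broadcast * b[tx]
    return c
```

In this model the source lane of the shuffle is always below 16, so the two half-warps never need to exchange data. If `threadIdx.x` were fused to 32, both the store in `(1)` and the lane index in `(2)` would have to be rewritten in terms of the fused index.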
