roastduck edited a comment on issue #5307: [TIR] Make lower_warp_memory support extent(threadIdx.x) < warp_size
URL: https://github.com/apache/incubator-tvm/pull/5307#issuecomment-612567383

One way to perform a half-warp shuffle while keeping `extent(threadIdx.x) == warp_size` is as follows:

1. Suppose we have `extent(threadIdx.x) == 16`. We first split `threadIdx.y` with factor 2, and then fuse that `2` with `threadIdx.x`, so the extent of `threadIdx.x` becomes 32.
2. Then we can perform any lowering procedure under the assumption that `extent(threadIdx.x) == 32`.

Either we require users to write such a schedule, or we modify the schedule for them. I think neither option is intuitive enough for users. It may also hinder debugging, and debugging in TVM is already difficult.

Note that we have to modify `threadIdx.x` in more than one scope in order to keep the thread index consistent. Here is a more complex example, simplified from the algorithm I am currently working on.

```
// a is shaped (n)
// b is shaped (16)
// c is shaped (n, 16)
// extent of threadIdx.x == 16
for (i.outer = 0; i.outer < n; i.outer += 16) {
  if (i.outer + threadIdx.x < n) {
    a.warp[i.outer + threadIdx.x] = a[i.outer + threadIdx.x]; // (1)
  }
  for (i.inner = 0; i.inner < min(16, n - i.outer); i.inner++) {
    c[i.outer + i.inner, threadIdx.x] += a.warp[i.outer + i.inner] * b[threadIdx.x]; // (2)
  }
}
```

The `threadIdx.x` in both statements `(1)` and `(2)` would have to be fused to an extent of 32, which is a major modification to the schedule. Users may have difficulty working with such a schedule.
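To make the index bookkeeping behind the split-then-fuse approach concrete, here is a minimal sketch in plain Python (no TVM calls). The helper names `fuse_half_warp` and `unfuse_half_warp` are hypothetical, and the fusion order (inner part of `threadIdx.y` as the high bits) is an assumption chosen so that each half-warp corresponds to one original `threadIdx.y` value.

```python
WARP_SIZE = 32  # assumed warp size, as on current NVIDIA GPUs


def fuse_half_warp(ty, tx, tx_extent=16):
    """Map the original (threadIdx.y, threadIdx.x) to the rewritten pair.

    threadIdx.y is split by factor = WARP_SIZE // tx_extent, and the inner
    part is fused in front of threadIdx.x, so the new threadIdx.x has
    extent WARP_SIZE.
    """
    factor = WARP_SIZE // tx_extent
    ty_outer, ty_inner = divmod(ty, factor)
    tx_fused = ty_inner * tx_extent + tx
    return ty_outer, tx_fused


def unfuse_half_warp(ty_outer, tx_fused, tx_extent=16):
    """Recover the original (threadIdx.y, threadIdx.x) from the fused pair.

    Every use of threadIdx.x in every scope would need this rewrite.
    """
    factor = WARP_SIZE // tx_extent
    ty_inner, tx = divmod(tx_fused, tx_extent)
    return ty_outer * factor + ty_inner, tx


if __name__ == "__main__":
    # Original thread (threadIdx.y=3, threadIdx.x=5) lands in the upper
    # half-warp of the second fused row.
    print(fuse_half_warp(3, 5))
    print(unfuse_half_warp(*fuse_half_warp(3, 5)))
```

Under this mapping, every occurrence of `threadIdx.x` in statements such as `(1)` and `(2)` becomes `tx_fused % 16`, and every occurrence of `threadIdx.y` becomes `ty_outer * 2 + tx_fused // 16`, which illustrates why the rewrite touches more than one scope.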
