roastduck commented on issue #5307: [TIR] Make lower_warp_memory support 
extent(threadIdx.x) < warp_size
URL: https://github.com/apache/incubator-tvm/pull/5307#issuecomment-612545516
 
 
   > Thanks @roastduck. I wonder if we can also discuss the alternative 
abstractions. Right now the abstraction seems to suggest that conceptually the 
size of the warp is reduced to half(as the shuffle size). However, another way 
to view it would be to keep the size of the warp to be fixed(32), but support 
the index access pattern of the subgroups, for example, the canonical form 
below describes a shuffle in the group of 4
   > 
   > ```c++
   > A[wi] = B[(wi/4)*4 + ((wi%4)+1)%4]
   > ```
   
   In the alternative approach, `__shfl(x, (threadIdx.x + 1) % 4, 4)` becomes 
`__shfl(x, threadIdx.y * 4 + (threadIdx.x + 1) % 4)` (and `threadIdx.z` might 
also be involved). Is my understanding right?
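
   To make the comparison concrete, here is a host-side sketch (plain C++, hypothetical helper names, not the actual TIR lowering) of the source lane a groups-of-4 shuffle reads from under each formulation:

   ```c++
   #include <cassert>

   // Alternative view: the warp stays 32 wide and the sub-group is encoded
   // in the index, as in B[(wi/4)*4 + ((wi%4)+1)%4].
   int src_lane_full_warp(int wi, int width) {
       return (wi / width) * width + ((wi % width) + 1) % width;
   }

   // This PR's view: threadIdx.y picks the partial warp, threadIdx.x indexes
   // within it, and the shuffle is __shfl(x, (tx+1)%width, width).
   // The result here is flattened back to a full-warp lane for comparison.
   int src_lane_partial_warp(int ty, int tx, int width) {
       return ty * width + (tx + 1) % width;
   }
   ```

   For every lane `wi` of a 32-thread warp, `src_lane_full_warp(wi, 4)` and `src_lane_partial_warp(wi/4, wi%4, 4)` name the same lane, which is why the two abstractions are interchangeable for this particular access pattern.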
   
   I think the upside is better compatibility with OpenCL: with this approach, we can keep supporting OpenCL through the old two-parameter shuffle intrinsic.
   
   And I think the downside comes with CUDA's new shuffle API. `__shfl` has 
actually been 
[deprecated](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions)
 by CUDA, so we will have to switch to the new `__shfl_sync` API, which 
requires an explicit `mask` argument specifying which threads are active 
during the shuffle. With the approach in this PR, we only need to compute the 
active mask within a partial (say 1/2 or 1/4) warp, which can be derived from 
the `if` nest given the thread indices of the current partial warp. With the 
alternative approach, we would need to compute the active mask across the 
whole warp, which means threads outside the current partial warp get involved. 
That could add a lot of complexity, and even runtime overhead when the 
conditions are dynamic.
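
   As a sketch of why the partial-warp view keeps the mask local (hypothetical host-side C++, assuming the whole partial warp is active; the real lowering must also account for the `if` nest):

   ```c++
   #include <cassert>
   #include <cstdint>

   // Hypothetical sketch: the active mask __shfl_sync would need for a
   // contiguous partial warp of `width` threads, where `gid` is the index
   // of the partial warp inside the full 32-thread warp (e.g. threadIdx.y).
   // The mask depends only on the partial warp's own position, not on the
   // state of threads outside it.
   uint32_t partial_warp_mask(int gid, int width) {
       uint32_t ones = (width == 32) ? 0xffffffffu : ((1u << width) - 1u);
       return ones << (gid * width);
   }
   ```

   For example, the second group of 4 in a warp gets the mask `0x000000f0`. Under the alternative approach, a correct mask would instead have to describe all 32 lanes at once.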
   
   To better support both CUDA and OpenCL, maybe we can use both approaches.
