tqchen commented on PR #104:
URL: https://github.com/apache/tvm-rfcs/pull/104#issuecomment-1699173201

   It might be useful to also bring some of these discussions to the forums. Here is a quick related sketch of the GPU programming model.
   
   ```python
   for y in range(64):
     for x in range(64):
         C[y, x] = A[y, x] * (B[y] + 1)
   ```
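   For concreteness, the sketch above can be written as a runnable reference (a NumPy stand-in for the buffers; the concrete shapes and values here are just illustrative):

```python
import numpy as np

# Buffers standing in for A, B, C in the sketch above.
A = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
B = np.ones(64, dtype=np.float32)
C = np.empty((64, 64), dtype=np.float32)

# The original scalar program: C[y, x] = A[y, x] * (B[y] + 1)
for y in range(64):
    for x in range(64):
        C[y, x] = A[y, x] * (B[y] + 1)

# Same computation expressed as a broadcasted array op.
assert np.allclose(C, A * (B + 1)[:, None])
```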
   Say we are interested in the original program. In standard GPU programming terminology, we would map the compute along `x` to "threads", where `tid` is the thread index. GPU programming also has different memory scopes (I am using CUDA terminology here):
   - local: the variable is local to each thread.
   - shared: the variable is "shared" across threads; concurrently writing different values to the same shared variable is somewhat undefined.
   - warp shuffle: sometimes we need to exchange data across threads (e.g. to take a sum), which is done through shuffle instructions (like `warp.all_reduce`).
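   As a toy illustration of the shuffle-based reduction (a Python stand-in, not actual CUDA; `shfl_xor`-style exchange here mimics `__shfl_xor_sync`):

```python
# Toy emulation of a warp-level all-reduce via butterfly shuffles.
# Each list element plays the role of one thread's register; indexing
# regs[i ^ offset] stands in for CUDA's __shfl_xor_sync exchange.
def warp_all_reduce_sum(regs):
    width = len(regs)  # "warp size"; assumed to be a power of two
    regs = list(regs)
    offset = width // 2
    while offset > 0:
        # every "thread" adds the register of its XOR partner
        regs = [regs[i] + regs[i ^ offset] for i in range(width)]
        offset //= 2
    return regs  # every lane now holds the full sum

print(warp_all_reduce_sum([1, 2, 3, 4]))  # every lane holds 10
```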
   
   ```python
   for y in range(64):
     for x in range(64 // n):
       for tid in T.scalable_vectorized_as_threads(n):
         a0: local = A[y, tid + n * x]
         b0: shared = B[y]
         b1: shared = b0 + 1
         c0: local = a0 * b1
         C[y, tid + n * x] = c0
   ```
   
   The above code is a rough sketch of what it might look like. Now, it might also be possible to produce a similar, more "vector-view" version using the following mapping:
   - local <=> vector<vscale>
   - shared <=> normal scalar register
   
   ```python
   # note vscale = n
   for y in range(64):
     for x in range(64 // n):
       with T.sve_scope(n):
         a0: vector<vscale> = A[y, n * x : n * (x + 1)]
         b0: scalar = B[y]
         b1: scalar = b0 + 1
         c0: vector<vscale> = a0 * b1
         C[y, n * x : n * (x + 1)] = c0
   ```
   
   The two views are not that different. But one thing holds in both: we do need to be able to identify the vector dtype separately from the scalar dtype (or, in the case of GPU programming, local from shared). Being able to mark a dtype as ScalableVectorMark seems to serve that purpose.
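   As a loose illustration of the strip-mined vector view (NumPy slices standing in for `vector<vscale>` registers, with a fixed `n` playing the role of vscale; in real SVE the vector length is not known at compile time):

```python
import numpy as np

n = 8  # stand-in for the hardware vscale
A = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
B = np.ones(64, dtype=np.float32)
C = np.empty((64, 64), dtype=np.float32)

for y in range(64):
    for x in range(64 // n):
        # a0 plays the role of a vector<vscale> register (one lane per "thread")
        a0 = A[y, n * x : n * (x + 1)]
        b0 = B[y]      # scalar / "shared": one value seen by all lanes
        b1 = b0 + 1    # still scalar; broadcast on use
        c0 = a0 * b1   # elementwise multiply, lane-parallel
        C[y, n * x : n * (x + 1)] = c0

# Matches the original scalar program C[y, x] = A[y, x] * (B[y] + 1)
assert np.allclose(C, A * (B + 1)[:, None])
```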
   
   
   

