roastduck opened a new issue #5303: [TIR] Buggy LoopPartition
URL: https://github.com/apache/incubator-tvm/issues/5303
 
 
   The `LoopPartition` pass may produce wrong code when loop bounds involve variables unknown at compile time.
   
   Here's a reproducing example.
   
   ```python
   import tvm
   import topi
   import numpy as np
   
   dtype = "float32"
   target = "cuda"
   m = 32
   n_max = 1024
   
   n = tvm.te.placeholder((), name="n", dtype="int32")
   a = tvm.te.placeholder((n_max,), name="a", dtype=dtype)
   def f(i):
       j = tvm.te.reduce_axis((0, n), name="j")
       return tvm.te.sum(i * a[j], axis=[j])
   b = tvm.te.compute((m,), f)
   
   with tvm.target.cuda():
       s = tvm.te.create_schedule(b.op)
       blk_x = tvm.te.thread_axis("blockIdx.x")
       th_x = tvm.te.thread_axis("threadIdx.x")
   
       i, = b.op.axis
       j, = b.op.reduce_axis
       i_outer, i_inner = s[b].split(i, nparts=1)
       j_outer, j_inner = s[b].split(j, factor=m)
       s[b].reorder(i_outer, j_outer, i_inner, j_inner)
       s[b].bind(i_outer, blk_x)
       s[b].bind(i_inner, th_x)
   
       a_cache = s.cache_read(a, "shared", [b])
       s[a_cache].compute_at(s[b], j_outer)
       a_axis, = a_cache.op.axis
       a_axis, _ = s[a_cache].split(a_axis, factor=1)  # Workaround TVM Discuss Question 4826
       s[a_cache].bind(a_axis, th_x)
   
   print(tvm.lower(s, [n, a, b], target, simple_mode=True))
   compute = tvm.build(s, [n, a, b], target, name="run")
   print(compute.imported_modules[0].get_source())
   ```
   
   In this example, we tile `a` with factor 32 and cache it in shared memory. The key characteristic of this example is that the reduction length `n` is unknown at compile time, so TVM has to emit boundary checks.
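
   The trip count of the outer reduction loop is a ceiling division of the dynamic length `n` by the tile factor. As a plain-Python illustration (not part of the repro above):
   
   ```python
   # Trip count of the outer reduction loop: ceiling division of the dynamic
   # length n by the tile factor 32 (illustrative sketch, not TVM code).
   def ceildiv(n, factor):
       return (n + factor - 1) // factor
   
   # For e.g. n = 1000 there are 32 tiles, but the last tile covers only
   # 1000 - 31*32 = 8 elements, which is why a boundary check is required.
   print(ceildiv(1000, 32))  # 32
   ```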
   
   We first print the IR, and then print the generated CUDA code. The output is 
as follows:
   
   ```
   produce compute {
     // attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1
     // attr [a.shared] storage_scope = "shared"
     allocate a.shared[float32 * 32]
     // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 32
     compute[threadIdx.x] = 0f
     for (j.outer, 0, floordiv((n[0] + 31), 32)) {
       produce a.shared {
         // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 32
         if (likely((((j.outer*32) + threadIdx.x) < 1024))) { // LOOK AT HERE
           a.shared[threadIdx.x] = a[((j.outer*32) + threadIdx.x)]
         }
       }
       // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 32
       for (j.inner, 0, 32) {
         if (likely((((j.outer*32) + j.inner) < n[0]))) {
          compute[threadIdx.x] = (compute[threadIdx.x] + (float32(threadIdx.x)*a.shared[j.inner]))
         }
       }
     }
   }
   
   [18:24:47] /home/rd/src/incubator-tvm/src/tir/pass/loop_partition.cc:533: Cannot prove:((((floordiv((n[0] + 31), 32) - 1) - 32) + 1) >= 0), when generating the post doubt loop
   extern "C" __global__ void run_kernel0(void* __restrict__ compute, void* __restrict__ n, void* __restrict__ a) {
     __shared__ float a_shared[32];
     (( float*)compute)[(((int)threadIdx.x))] = 0.000000e+00f;
     for (int j_outer = 0; j_outer < min(32, (((( int*)n)[(0)] + 31) >> 5)); ++j_outer) { // LOOK AT HERE
       __syncthreads();
       a_shared[(((int)threadIdx.x))] = (( float*)a)[(((j_outer * 32) + ((int)threadIdx.x)))]; // LOOK AT HERE
       __syncthreads();
       for (int j_inner = 0; j_inner < 32; ++j_inner) {
         if (((j_outer * 32) + j_inner) < (( int*)n)[(0)]) {
           (( float*)compute)[(((int)threadIdx.x))] = ((( float*)compute)[(((int)threadIdx.x))] + (((float)((int)threadIdx.x)) * a_shared[(j_inner)]));
         }
       }
     }
     for (int j_outer1 = 0; j_outer1 < max(((((( int*)n)[(0)] + 31) >> 5) - 32), 0); ++j_outer1) {
       __syncthreads();
       if ((((min(32, (((( int*)n)[(0)] + 31) >> 5)) * 32) + (j_outer1 * 32)) + ((int)threadIdx.x)) < 1024) {
         a_shared[(((int)threadIdx.x))] = (( float*)a)[((((min(32, (((( int*)n)[(0)] + 31) >> 5)) * 32) + (j_outer1 * 32)) + ((int)threadIdx.x)))];
       }
       __syncthreads();
       for (int j_inner1 = 0; j_inner1 < 32; ++j_inner1) {
         if ((((min(32, (((( int*)n)[(0)] + 31) >> 5)) * 32) + (j_outer1 * 32)) + j_inner1) < (( int*)n)[(0)]) {
           (( float*)compute)[(((int)threadIdx.x))] = ((( float*)compute)[(((int)threadIdx.x))] + (((float)((int)threadIdx.x)) * a_shared[(j_inner1)]));
         }
       }
     }
   }
   ```
   
   Note that there is an `if (likely((((j.outer*32) + threadIdx.x) < 1024)))` check in the IR that guards the boundary of `a.shared`. Because of the `LoopPartition` pass, this check splits the surrounding loop into two parts in the CUDA code.
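
   What the pass does can be sketched in plain Python (a schematic illustration, not TVM's actual implementation): it splits the loop at the largest trip count for which the `likely()` condition provably holds, emitting a guard-free main loop followed by a guarded "post doubt" loop.
   
   ```python
   # Schematic model of LoopPartition on a loop guarded by likely(i < bound):
   # the first min(extent, bound) iterations run with the guard removed,
   # the remaining "post doubt" iterations keep the guard.
   def partitioned_loop(extent, bound, body):
       split = min(extent, bound)
       for i in range(split):          # main loop: guard provably true, removed
           body(i)
       for i in range(split, extent):  # post doubt loop: guard kept
           if i < bound:
               body(i)
   ```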
   
   Here's the point. In the CUDA code, the first loop has length 32, and inside it the check has been removed. However, **`(j.outer*32) + threadIdx.x < 1024` is not always true, because `(j.outer*32) + threadIdx.x < 32 * 32 + threadIdx.x == 1024 + threadIdx.x`. The upper bound of `threadIdx.x` should be deduced from its thread extent, 32.** Either the loop should be shorter, or the check cannot be removed.
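
   The bound deduction involved can be made concrete with naive interval arithmetic (a hypothetical sketch, not TVM's arith analyzer; `None` stands for an unknown upper bound): with only `j.outer`'s extent known, the index has no finite upper bound, so discharging the guard requires also applying the thread extent `threadIdx.x < 32`.
   
   ```python
   # Naive interval arithmetic over inclusive (lo, hi) bounds; hi=None means
   # "unknown". Hypothetical sketch, not TVM's arith analyzer.
   def scale(iv, k):
       lo, hi = iv
       return (lo * k, None if hi is None else hi * k)
   
   def add(a, b):
       return (a[0] + b[0],
               None if a[1] is None or b[1] is None else a[1] + b[1])
   
   j_outer = (0, 31)        # first partitioned loop: j.outer < 32
   tx_unknown = (0, None)   # threadIdx.x with no extent information
   tx_bounded = (0, 31)     # threadIdx.x < 32 (its thread extent)
   
   # Without threadIdx.x's extent the index has no finite upper bound, so
   # `index < 1024` cannot be proven and the guard must stay:
   print(add(scale(j_outer, 32), tx_unknown))   # (0, None)
   
   # Only once threadIdx.x < 32 is deduced does the index stay below 1024:
   print(add(scale(j_outer, 32), tx_bounded))   # (0, 1023)
   ```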
