roastduck opened a new issue #5303: [TIR] Buggy LoopPartition
URL: https://github.com/apache/incubator-tvm/issues/5303

The `LoopPartition` pass may generate wrong code when some variables are unknown at compile time. Here is an example:

```python
import tvm
import topi
import numpy as np

dtype = "float32"
target = "cuda"
m = 32
n_max = 1024

n = tvm.te.placeholder((), name="n", dtype="int32")
a = tvm.te.placeholder((n_max,), name="a", dtype=dtype)

def f(i):
    j = tvm.te.reduce_axis((0, n), name="j")
    return tvm.te.sum(i * a[j], axis=[j])

b = tvm.te.compute((m,), f)

with tvm.target.cuda():
    s = tvm.te.create_schedule(b.op)
    blk_x = tvm.te.thread_axis("blockIdx.x")
    th_x = tvm.te.thread_axis("threadIdx.x")

    i, = b.op.axis
    j, = b.op.reduce_axis
    i_outer, i_inner = s[b].split(i, nparts=1)
    j_outer, j_inner = s[b].split(j, factor=m)
    s[b].reorder(i_outer, j_outer, i_inner, j_inner)
    s[b].bind(i_outer, blk_x)
    s[b].bind(i_inner, th_x)

    a_cache = s.cache_read(a, "shared", [b])
    s[a_cache].compute_at(s[b], j_outer)
    a_axis, = a_cache.op.axis
    a_axis, _ = s[a_cache].split(a_axis, factor=1)  # Workaround TVM Discuss Question 4826
    s[a_cache].bind(a_axis, th_x)

    print(tvm.lower(s, [n, a, b], target, simple_mode=True))
    compute = tvm.build(s, [n, a, b], target, name="run")
    print(compute.imported_modules[0].get_source())
```

In this example, we tile `a` with factor 32 and cache it in shared memory. The key characteristic of this example is that the reduce length `n` is unknown at compile time, so TVM has to emit some boundary checks. We first print the IR, and then the generated CUDA code.
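For background, `LoopPartition` takes a loop whose body is guarded by a `likely(...)` condition and splits it into a main loop where the guard provably holds (so the guard is removed) and a tail loop where the guard is kept. Here is a minimal plain-Python sketch of the idea; this is an illustration, not TVM code:

```python
# Conceptual sketch of loop partitioning (plain Python, not the TVM pass):
# a loop body guarded by `likely(i < m)` is split into a main loop where
# the guard provably holds (so it is removed) and a tail loop that keeps it.

def unpartitioned(n, m, body):
    for i in range(n):
        if i < m:              # guard checked on every iteration
            body(i)

def partitioned(n, m, body):
    split = min(n, m)
    for i in range(split):     # main loop: i < m always holds, guard removed
        body(i)
    for i in range(split, n):  # tail loop: the guard must stay
        if i < m:
            body(i)

out_a, out_b = [], []
unpartitioned(10, 7, out_a.append)
partitioned(10, 7, out_b.append)
print(out_a == out_b)  # True: both visit exactly i = 0..6
```

Dropping the guard in the main loop is only sound if its extent (`split` above) is derived from *all* variables in the condition; the report below is about a case where that derivation appears to go wrong.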
The output is as follows:

```
produce compute {
  // attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 1
  // attr [a.shared] storage_scope = "shared"
  allocate a.shared[float32 * 32]
  // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 32
  compute[threadIdx.x] = 0f
  for (j.outer, 0, floordiv((n[0] + 31), 32)) {
    produce a.shared {
      // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 32
      if (likely((((j.outer*32) + threadIdx.x) < 1024))) { // LOOK AT HERE
        a.shared[threadIdx.x] = a[((j.outer*32) + threadIdx.x)]
      }
    }
    // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 32
    for (j.inner, 0, 32) {
      if (likely((((j.outer*32) + j.inner) < n[0]))) {
        compute[threadIdx.x] = (compute[threadIdx.x] + (float32(threadIdx.x)*a.shared[j.inner]))
      }
    }
  }
}

[18:24:47] /home/rd/src/incubator-tvm/src/tir/pass/loop_partition.cc:533: Cannot prove:((((floordiv((n[0] + 31), 32) - 1) - 32) + 1) >= 0), when generating the post doubt loop

extern "C" __global__ void run_kernel0(void* __restrict__ compute, void* __restrict__ n, void* __restrict__ a) {
  __shared__ float a_shared[32];
  (( float*)compute)[(((int)threadIdx.x))] = 0.000000e+00f;
  for (int j_outer = 0; j_outer < min(32, (((( int*)n)[(0)] + 31) >> 5)); ++j_outer) { // LOOK AT HERE
    __syncthreads();
    a_shared[(((int)threadIdx.x))] = (( float*)a)[(((j_outer * 32) + ((int)threadIdx.x)))]; // LOOK AT HERE
    __syncthreads();
    for (int j_inner = 0; j_inner < 32; ++j_inner) {
      if (((j_outer * 32) + j_inner) < (( int*)n)[(0)]) {
        (( float*)compute)[(((int)threadIdx.x))] = ((( float*)compute)[(((int)threadIdx.x))] + (((float)((int)threadIdx.x)) * a_shared[(j_inner)]));
      }
    }
  }
  for (int j_outer1 = 0; j_outer1 < max(((((( int*)n)[(0)] + 31) >> 5) - 32), 0); ++j_outer1) {
    __syncthreads();
    if ((((min(32, (((( int*)n)[(0)] + 31) >> 5)) * 32) + (j_outer1 * 32)) + ((int)threadIdx.x)) < 1024) {
      a_shared[(((int)threadIdx.x))] = (( float*)a)[((((min(32, (((( int*)n)[(0)] + 31) >> 5)) * 32) + (j_outer1 * 32)) + ((int)threadIdx.x)))];
    }
    __syncthreads();
    for (int j_inner1 = 0; j_inner1 < 32; ++j_inner1) {
      if ((((min(32, (((( int*)n)[(0)] + 31) >> 5)) * 32) + (j_outer1 * 32)) + j_inner1) < (( int*)n)[(0)]) {
        (( float*)compute)[(((int)threadIdx.x))] = ((( float*)compute)[(((int)threadIdx.x))] + (((float)((int)threadIdx.x)) * a_shared[(j_inner1)]));
      }
    }
  }
}
```

Note that there is an `if (likely((((j.outer*32) + threadIdx.x) < 1024)))` check in the IR that guards the boundary of `a.shared`. Because of the `LoopPartition` pass, this check splits the surrounding loop into two loops in the CUDA code.

Here is the point. In the CUDA code, the first loop has extent 32, and the check has been removed from its body. However, **`((j.outer*32) + threadIdx.x) < 1024` is not always provable, because the loop bound alone only gives `(j.outer*32) + threadIdx.x < 32 * 32 + threadIdx.x == 1024 + threadIdx.x`. The upper bound of `threadIdx.x` (its extent of 32) must also be deduced and taken into account.** Either the loop should be shorter, or the check cannot be removed.
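To make the bound argument concrete, here is a small standalone check (plain Python, with the extents taken from the generated code) that enumerates the first loop's iteration space:

```python
# Extents from the generated CUDA code: the first (partitioned) loop runs
# j_outer over [0, 32), and each block runs threadIdx.x over [0, 32).
J_OUTER_EXTENT = 32   # min(32, (n + 31) >> 5) is at most 32
TID_EXTENT = 32       # threadIdx.x extent bound by the schedule

# With the thread extent taken into account, the largest index actually
# touched is 31*32 + 31 = 1023, which does stay below 1024.
max_index = max(jo * 32 + tid
                for jo in range(J_OUTER_EXTENT)
                for tid in range(TID_EXTENT))
print(max_index)  # 1023

# Using only j_outer < 32 and leaving threadIdx.x as a free symbol gives the
# loose bound  j_outer*32 + threadIdx.x < 1024 + threadIdx.x,
# which by itself does not prove the index stays below 1024.
loose_bound = J_OUTER_EXTENT * 32
print(loose_bound)  # 1024

# Sanity check of the codegen identity (n + 31) >> 5 == ceil(n / 32):
for n in range(1, 2049):
    assert (n + 31) >> 5 == -(-n // 32)
```

So the removal is only justified if the analysis uses `threadIdx.x < 32`; from the `Cannot prove` warning it is unclear whether `LoopPartition` actually does.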
