TE allows multiple IterVars on one attaching path (i.e. two nested loops) to bind to the same thread IterVar.  This seems necessary for correct bound inference in some schedules whose stages use different thread allocations, and such schedules certainly represent valid and useful use cases.  However, binding two IterVars on one path to one thread looks unnatural in a schedule tree.
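
For concreteness, here is a minimal 1-D sketch of the pattern (my own reduction, not the test discussed below; the sizes and variable names are made up): C's tx and A.shared's ta both bind to threadIdx.x and lie on one root-to-leaf path of the schedule tree.

    import tvm
    from tvm import te

    n, num_thread = 128, 8
    A = te.placeholder((n,), name='A')
    C = te.compute((n,), lambda i: A[i] + 1.0, name='C')

    s = te.create_schedule(C.op)
    AA = s.cache_read(A, "shared", [C])
    thread_x = te.thread_axis("threadIdx.x")

    bx, xi = s[C].split(C.op.axis[0], factor=num_thread * num_thread)
    tx, xi = s[C].split(xi, nparts=num_thread)
    s[C].bind(bx, te.thread_axis("blockIdx.x"))
    s[C].bind(tx, thread_x)

    # A.shared attaches under the thread-bound tx, then binds its own split
    # axis to threadIdx.x as well: two IterVars bound to threadIdx.x on one path.
    s[AA].compute_at(s[C], tx)
    ta, _ = s[AA].split(s[AA].op.axis[0], nparts=num_thread)
    s[AA].bind(ta, thread_x)

    print(tvm.lower(s, [A, C], simple_mode=True))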

A schedule tree is just another way to represent a loop nest.  Binding an IterVar to a threadIdx replaces a loop with one dimension of the thread launch bound and moves the loop up to the host/device boundary.  Binding two IterVars on two different attaching paths to the same threadIdx moves both loops up their hierarchies independently, merges them when they meet, and keeps moving up until they become part of the launch bound.  Both cases are well explained by loop nest manipulation.  However, binding two IterVars on one path to the same threadIdx is difficult to explain with loop nest manipulation: two nested loops become one via neither fusion nor unrolling before moving up the hierarchy to become a launch bound.
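
To make the loop-nest reading above concrete, here is a plain-Python picture (my own illustration, not TVM code; all names are made up) of binding loops to blockIdx.x/threadIdx.x: the bound loops disappear from the device-side body and reappear as the launch bound at the host/device boundary.

    # Before binding: an ordinary loop nest on the device side.
    def kernel_unbound(data):
        for bx in range(2):        # to be bound to blockIdx.x
            for tx in range(8):    # to be bound to threadIdx.x
                data[bx * 8 + tx] += 1

    # After binding: the body runs once per (blockIdx.x, threadIdx.x) pair.
    def kernel_bound(data, block_idx_x, thread_idx_x):
        data[block_idx_x * 8 + thread_idx_x] += 1

    # Host side: the hoisted loops have become the launch bound.
    def launch(data):
        for block_idx_x in range(2):       # gridDim.x  = 2
            for thread_idx_x in range(8):  # blockDim.x = 8
                kernel_bound(data, block_idx_x, thread_idx_x)

    data = [0] * 16
    launch(data)
    assert data == [1] * 16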

I feel I am missing something here.

The following example is from tests/python/unittest/test_te_schedule_bound_inference.py: a schedule for a 1024x1024 by 1024x1024 matrix multiplication in three stages, with 256 blocks and 64 threads per block.  Threads in a block are allocated in two ways (a plain-Python sketch of one block's work follows the list):
1. each thread loads one element from each input matrix into shared memory per iteration in the A.shared and B.shared stages, respectively;
2. each thread is responsible for an 8x8 tile of the output matrix, doing multiplication and accumulation in local memory in the CC.local stage and storing the result to global memory in the CC stage.
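
Here is the promised plain-Python sketch (illustrative only; block_work and its tile layout are my own rendering of the description above, not code from the test, and it is far too slow to actually run at n = 1024) of what one (blockIdx.y, blockIdx.x) block is meant to do:

    import numpy as np

    BLOCK_FACTOR, NUM_THREAD, SCALE = 64, 8, 8   # as in the schedule below

    def block_work(A, B, by, bx):
        # shared staging buffers: one element per thread per k iteration
        A_shared = np.empty(BLOCK_FACTOR, dtype=A.dtype)
        B_shared = np.empty(BLOCK_FACTOR, dtype=B.dtype)
        # per-thread local accumulators: an 8x8 tile for each of the 8x8 threads
        C_local = np.zeros((NUM_THREAD, NUM_THREAD, SCALE, SCALE), dtype=A.dtype)
        for k in range(A.shape[1]):
            for ty in range(NUM_THREAD):          # threadIdx.y
                for tx in range(NUM_THREAD):      # threadIdx.x
                    # A.shared / B.shared stages: one load per thread per input
                    A_shared[ty * SCALE + tx] = A[by * BLOCK_FACTOR + ty * SCALE + tx, k]
                    B_shared[ty * SCALE + tx] = B[bx * BLOCK_FACTOR + ty * SCALE + tx, k]
            for ty in range(NUM_THREAD):
                for tx in range(NUM_THREAD):
                    # CC.local stage: thread (ty, tx) updates its own 8x8 tile
                    for yo in range(SCALE):
                        for xo in range(SCALE):
                            C_local[ty, tx, yo, xo] += (A_shared[ty * SCALE + yo] *
                                                        B_shared[tx * SCALE + xo])
        # CC stage: C_local[ty, tx, yo, xo] is finally written back to
        # C[by*BLOCK_FACTOR + ty*SCALE + yo, bx*BLOCK_FACTOR + tx*SCALE + xo].
        return C_local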

In the TE schedule (shown in the tree below), the two highlighted IterVars on the same attaching path bind to threadIdx.x.  The same situation occurs on the path from B.shared with threadIdx.y.

![double-thread-binding|312x500](upload://kuSl30XbJfRkRWbi6zyGfLCDcrs.png) 

To explain it with loop manipulation, I need to modify the schedule to the tree below.  This schedule, however, is not supported by the current TE and generates wrong code.  The problem comes from bound inference.  To achieve the desired code, the four IterVars marked with green stars must be relaxed in both schedule trees.  In the current TE implementation, it seems the two IterVars in stage CC have to be on the attaching path in order to be relaxed when inferring A.shared's bounds, because InferRootBound checks IterVars for relaxation only in two cases: (i) IterVars in the consumer stages (CC is not A.shared's immediate consumer in this case), and (ii) IterVars on A.shared's attaching path (the two IterVars are not on it in the modified schedule tree).

![single-thread-binding|309x500](upload://fMYUrRsOcOz7GIGxzHTRmQDgiTP.png)
 
After I modify InferRootBound to check the IterVars of all downstream stages for relaxation, the modified schedule tree generates the same desired code as the one binding two IterVars on one path to the same threadIdx.
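
Schematically, the change only widens where relaxation candidates are collected from.  A toy model in Python (my own simplification, not the actual C++ of InferRootBound; the Stage fields here are made up):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Stage:
        name: str
        leaf_iter_vars: List[str] = field(default_factory=list)
        consumers: List["Stage"] = field(default_factory=list)   # immediate consumers only
        attach_path: List[str] = field(default_factory=list)     # IterVars above the attach point

    def relax_candidates_current(stage):
        # (i) IterVars of the immediate consumer stages ...
        ivs = [iv for c in stage.consumers for iv in c.leaf_iter_vars]
        # ... plus (ii) IterVars on the stage's attaching path
        return ivs + stage.attach_path

    def relax_candidates_modified(stage):
        # widen (i): collect IterVars of *all* downstream stages, not just
        # the immediate consumers, before applying the usual relax predicate
        ivs, todo, seen = [], list(stage.consumers), set()
        while todo:
            s = todo.pop()
            if s.name in seen:
                continue
            seen.add(s.name)
            ivs += s.leaf_iter_vars
            todo += s.consumers
        return ivs + stage.attach_path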

My question is whether allowing two IterVars on one attaching path to bind to the same threadIdx is necessary.  There might be other use cases that I am not aware of; please share them with me.  If it is not necessary but is nevertheless an intended language feature, I would also like to gain a better understanding of it.

The following code generates the above schedule trees.  Notice there are two 
lines to switch between them.

    import tvm
    from tvm import te
    import numpy as np
    import time
    
    
    nn = 1024
    n = tvm.runtime.convert(nn)
    A = te.placeholder((n, n), name='A')
    B = te.placeholder((n, n), name='B')
    k = te.reduce_axis((0, n), name='k')
    C = te.compute(
        (n, n),
        lambda ii, jj: te.sum(A[ii, k] * B[jj, k], axis=k),
        name='CC')
    # schedule
    s = te.create_schedule(C.op)
    xtile, ytile = 32, 32
    scale = 8
    num_thread = 8
    block_factor = scale * num_thread
    block_x = te.thread_axis("blockIdx.x")
    thread_x = te.thread_axis("threadIdx.x")
    block_y = te.thread_axis("blockIdx.y")
    thread_y = te.thread_axis("threadIdx.y")
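    # Cache stages: a local write-cache for the output tile and
    # shared-memory read-caches for both inputs.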
    CC = s.cache_write(C, "local")
    AA = s.cache_read(A, "shared", [CC])
    BB = s.cache_read(B, "shared", [CC])
    by, yi = s[C].split(C.op.axis[0], factor=block_factor)
    bx, xi = s[C].split(C.op.axis[1], factor=block_factor)
    s[C].reorder(by, bx, yi, xi)
    s[C].bind(by, block_y)
    s[C].bind(bx, block_x)
    ty, yi = s[C].split(yi, nparts=num_thread)
    tx, xi = s[C].split(xi, nparts=num_thread)
    s[C].reorder(ty, tx, yi, xi)
    s[C].bind(ty, thread_y)
    s[C].bind(tx, thread_x)
    yo, xo = CC.op.axis
    s[CC].reorder(k, yo, xo)
    
    # Switch for the two schedule trees starts here
    s[CC].compute_at(s[C], tx)  # generate the first schedule tree
    # s[CC].compute_at(s[C], bx)  # generate the second schedule tree
    # Switch for the two schedule trees ends here
    
    s[AA].compute_at(s[CC], k)
    s[BB].compute_at(s[CC], k)
    ty, xi = s[AA].split(s[AA].op.axis[0], nparts=num_thread)
    tx, xi = s[AA].split(xi, nparts=num_thread)
    s[AA].bind(ty, thread_y)
    s[AA].bind(tx, thread_x)
    ty, xi = s[BB].split(s[BB].op.axis[0], nparts=num_thread)
    tx, xi = s[BB].split(xi, nparts=num_thread)
    s[BB].bind(ty, thread_y)
    s[BB].bind(tx, thread_x)
    ctx = tvm.context("cuda", 0)
    func = tvm.build(s, [A, B, C], target="cuda", name='tid')
    assert func
    print(func.imported_modules[0].get_source())
    
    from tvm.contrib import tedd
    tedd.viz_dataflow_graph(s, False, '/tmp/dfg.dot')
    tedd.viz_schedule_tree(s, False, '/tmp/scheduletree.dot')
    tedd.viz_itervar_relationship_graph(s, False, '/tmp/itervar.dot')
        
    # Random generated tensor for testing
    dtype = "float32"
    a = tvm.nd.array(np.random.rand(A.shape[0].value, A.shape[1].value).astype(dtype), ctx)
    b = tvm.nd.array(np.random.rand(B.shape[0].value, B.shape[1].value).astype(dtype), ctx)
    c = tvm.nd.array(np.random.rand(C.shape[0].value, C.shape[1].value).astype(dtype), ctx)
    
    func(a, b, c)
    result = c.asnumpy()
    answer = np.matmul(a.asnumpy(), b.asnumpy().transpose())
    tvm.testing.assert_allclose(result, answer, rtol=1e-2)




