tqchen commented on PR #77:
URL: https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1170294348

   Thanks @Lunderberg for the update. I think we are moving in a positive direction on the overall IR design. Some additional feedback:
   
   ## Keep Schedule Decisions Local to PrimFunc then Compose
   
   On schedule primitives, to be pragmatic, it would be helpful to do some of the cross-PrimFunc re-flowing in two steps. Specifically, some of your `transform_layout` examples touch buffers that are function inputs. One approach is of course to trace up to the producers and rewrite the producer functions as well (or trace down to the consumer functions). However, the complications here are that:
   
   - There can be multiple consumer/producer TIR functions.
   - In certain cases the producers/consumers may not have consistent requirements.
   - The producers/consumers themselves can have their own local layout preferences that need to be consolidated.
   
   In general it is helpful to first keep the schedule decision local, e.g. by introducing caching stages (AC, BC in the example below), then compose with another reflowing pass that propagates the decision to consumers/producers. This mainly reduces the overall complexity of implementing such transformations and also makes things more modular.
   
    ```python
    @T.prim_func
    def grow(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
        AC = T.alloc_buffer([4, 4], "int32")
        BC = T.alloc_buffer([4, 4], "int32")

        for io, ii in T.grid(4, 4):
            with T.block():
                T.block_attr({"preproc": "pad"})
                AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)

        for i, j in T.grid(4, 4):
            BC[i, j] = 2 * AC[i, j]

        for i in T.serial(14):
            with T.block():
                T.block_attr({"postproc": ["crop", 0]})
                B[i] = BC[i // 4, i % 4]

    @T.prim_func
    def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
        for i in T.serial(14):
            B[i] = A[i] + 1

    @R.func
    def main(A: R.Tensor((14,), "int32")):
        lv0 = call_tir(grow, [A], (14,))
        # an intermediate stage to show non-local reflowing
        lv1 = call_tir(addone, [lv0], (14,))
        lv2 = call_tir(grow, [lv1], (14,))
        ...
    ```
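
   As a purely illustrative sketch (hypothetical, not part of this RFC), after such a reflowing pass the intermediate values could stay in the packed `[4, 4]` layout, with the crop at the end of one call and the pad at the start of the next cancelling. The names below are made up for illustration only:

    ```python
    # Hypothetical post-reflow form of `addone`: it consumes and produces the
    # packed [4, 4] layout directly, so no boundary condition is needed here.
    @T.prim_func
    def addone_packed(A: T.Buffer[(4, 4), "int32"], B: T.Buffer[(4, 4), "int32"]):
        for io, ii in T.grid(4, 4):
            B[io, ii] = A[io, ii] + 1

    @R.func
    def main(A: R.Tensor((14,), "int32")):
        # The crop of the first `grow` and the pad of the second `grow` cancel,
        # so lv0 and lv1 stay packed; the padded lanes are cropped only once at
        # the very end. `grow_packed_out` / `grow_packed_in` are illustrative names.
        lv0 = call_tir(grow_packed_out, [A], (4, 4))
        lv1 = call_tir(addone_packed, [lv0], (4, 4))
        lv2 = call_tir(grow_packed_in, [lv1], (14,))
        ...
    ```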
   
   ## Use an IfThenElse Expression for Padding
   
   While it is possible to express padding with one loop that copies the data and another loop that writes the padded value, it is harder to schedule the resulting blocks since the padded buffer then has more than one producer. Using a single loop with `T.if_then_else` expresses the pattern in a single shot and makes future rewriting easier.
   
   
   ```python
    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr({"preproc": "pad"})
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)
   ```
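
   For contrast, here is a sketch (illustrative only, not from the RFC) of the two-loop form described above, where `AC` ends up with more than one producer block:

    ```python
    # Illustrative only: padding written as two separate loops. AC now has two
    # producer blocks (interior copy plus boundary fill), which is harder to
    # schedule and to rewrite later than the single T.if_then_else loop above.
    for i in T.serial(14):
        AC[i // 4, i % 4] = A[i]
    for i in T.serial(2):
        AC[3, 2 + i] = 0  # fill the two padded tail elements AC[3, 2] and AC[3, 3]
    ```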
   
   ## Propagate Padding Decisions from the End.
    
   Some of the complications of duplicated conditions (and their simplification) stem from the fact that we do the layout transforms of the output and the input separately, with each transform introducing its own conditions that then need to be simplified. It might be helpful to do a global transformation, usually driven from the output, and then "backprop" the implications of that decision to the input. Doing such a transformation in a single shot will likely alleviate the need to generate extra conditions and then simplify them.
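
   A rough sketch of the difference (hypothetical, and assuming the padded lanes of the output are "don't care" values that a later stage crops away): transforming the input and output layouts independently leaves duplicated conditions that a later pass must prove redundant, while driving the decision from the output and backpropagating it to the input generates no extra condition in the first place.

    ```python
    # Input and output layouts transformed separately: each transform introduces
    # its own boundary condition, and a simplification pass must later prove the
    # inner condition redundant.
    @T.prim_func
    def addone_separate(A: T.Buffer[(4, 4), "int32"], B: T.Buffer[(4, 4), "int32"]):
        for io, ii in T.grid(4, 4):
            B[io, ii] = T.if_then_else(
                4 * io + ii < 14,
                T.if_then_else(4 * io + ii < 14, A[io, ii], 0) + 1,
                0,
            )

    # Decision driven from the output and backpropagated to the input: the padded
    # lanes of A are simply carried through and cropped later, so no condition is
    # generated at all.
    @T.prim_func
    def addone_backprop(A: T.Buffer[(4, 4), "int32"], B: T.Buffer[(4, 4), "int32"]):
        for io, ii in T.grid(4, 4):
            B[io, ii] = A[io, ii] + 1
    ```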
   
   

