csullivan commented on PR #77:
URL: https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1153227651

   Thanks for sharing the contextual pointers for the community @vinx13. Agreed, the approaches discussed are both valid. I would actually like to argue the stronger point that they are complementary and only appear to be contrary because we are considering too narrow a scope.
   
   It can be helpful to share an overview of how layout transformations are commonly handled in ML compilers. Most of my arguments against A0 (local cancellation in the graph only) being sufficient stem from prior experience with graph layout optimizations. The graph approaches for optimizing layouts that I've seen have evolved along the following trajectory:
    
   1) The first graph approach that is often taken is what has been argued so far in A0: local back-to-back cancellation. It works reasonably well when the data flow and the variety of operators in a model are simple.
   
   Local-only cancellation tends to fail in models that still have simple data flow but more variety in the sequence of operators, each with different valid implementations. Consider,
   
   ```
   -> transform_X -> conv2d -> (inv_transform_X) -> pool -> (transform_X) -> conv2d -> inv_transform_X
   ```
   In this case `pool` can be replaced by any sequence of operations that are layout agnostic, or for which multiple implementations exist, so their choice of layout is unconstrained; the convolutions, by contrast, are layout constrained. As you can see, even for a simple model the approach discussed in A0 already needs to be modified to support non-local layout analysis and folding.
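   
   To make this concrete, here is a minimal sketch of an A0-style pass over a toy op-chain representation (the list-of-names graph and the `is_inverse_pair` helper are purely illustrative assumptions, not TVM data structures):
   
   ```python
   # Toy graph: a linear chain of op names. transform_X / inv_transform_X
   # and is_inverse_pair are hypothetical, not TVM APIs.
   def is_inverse_pair(a, b):
       # e.g. NCHW -> NCHW4c followed by NCHW4c -> NCHW cancels out
       return {a, b} == {"transform_X", "inv_transform_X"}
   
   def cancel_back_to_back(ops):
       """Remove directly adjacent transform/inverse-transform pairs (A0)."""
       out = []
       for op in ops:
           if out and is_inverse_pair(out[-1], op):
               out.pop()          # the adjacent pair cancels locally
           else:
               out.append(op)
       return out
   
   chain = ["transform_X", "conv2d", "inv_transform_X",
            "pool", "transform_X", "conv2d", "inv_transform_X"]
   print(cancel_back_to_back(chain))
   # Nothing cancels: the inverse/forward pair is separated by `pool`,
   # so a purely local pass cannot see across it.
   ```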
   
   2) The typical second approach is to still utilize A0, but to first apply a pre-processing pass that sinks layout-transforming operations along the direction of data flow [[1](https://github.com/pytorch/glow/blob/56249602c9ec93fa586cea2ce8ab315003478eed/lib/Optimizer/GraphOptimizerPipeline/FunctionPassPipeline.cpp#L69), [2](https://github.com/NervanaSystems/ngraph/blob/f677a119765ca30636cf407009dabd118664951f/src/ngraph/pass/reshape_sinking.cpp#L542)]. The above case then becomes,
   
   ```
   -> transform_X -> conv2d -> pool -> (inv_transform_X -> transform_X) -> conv2d -> inv_transform_X
   ```
   Then the method discussed in A0 applies, and local cancellation removes the now-adjacent pair (see the sketch below).
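   
   A sketch of such a sinking sweep, under the same toy representation and the (assumed) fact that `pool` commutes with the layout transform:
   
   ```python
   # Toy op chain; op names are illustrative, not TVM APIs.
   LAYOUT_AGNOSTIC = {"pool"}   # assumption: pool commutes with transform_X
   
   def sink_transforms(ops):
       """Push inv_transform_X downward past layout-agnostic ops."""
       ops = list(ops)
       changed = True
       while changed:
           changed = False
           for i in range(len(ops) - 1):
               if ops[i] == "inv_transform_X" and ops[i + 1] in LAYOUT_AGNOSTIC:
                   ops[i], ops[i + 1] = ops[i + 1], ops[i]   # commute the pair
                   changed = True
       return ops
   
   chain = ["transform_X", "conv2d", "inv_transform_X",
            "pool", "transform_X", "conv2d", "inv_transform_X"]
   print(sink_transforms(chain))
   # ['transform_X', 'conv2d', 'pool', 'inv_transform_X', 'transform_X',
   #  'conv2d', 'inv_transform_X'] -- the inverse/forward pair is now
   # adjacent, so the local cancellation from the previous sketch removes it.
   ```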
   
   The above method works well for models with relatively simple data flow, but for models with more branching it has limitations. A simple consideration is sinking a transform through an operation with multiple inputs: doing so requires materializing the inverse transform on the other operands.
   
   For the sake of simplicity consider matrix multiplication: $A^{T}B = (B^{T}A)^{T}$. Here the final state of sinking the transpose on A is to materialize two transposes rather than one: one on B and one on the matmul output. Sinking alone isn't sufficient to guarantee a globally optimal layout because it still only treats the propagation of transforms locally/greedily.
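   
   As a quick sanity check of the identity and of the bookkeeping, a couple of lines of numpy (just illustrating the algebra, not a compiler pass):
   
   ```python
   import numpy as np
   
   A = np.random.rand(3, 4)   # A^T is the transpose we are trying to sink
   B = np.random.rand(3, 5)
   
   # Sinking the transpose on A through the matmul rewrites A^T @ B as
   # (B^T @ A)^T: the transpose on A is gone, but we now materialize a
   # transpose on B *and* one on the matmul output.
   assert np.allclose(A.T @ B, (B.T @ A).T)
   ```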
   
   3) A modification to sinking (downward along data flow) is to introduce upward flowing [[3](https://github.com/NervanaSystems/ngraph/blob/f677a119765ca30636cf407009dabd118664951f/src/ngraph/pass/reshape_sinking.cpp#L156)]. It can help by flowing transforms along the poisoned operands (e.g. B in the above matrix multiply), propagating the transform up as far as possible, ideally to a constant where it can be folded.
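   
   A rough sketch of that upward flow into a constant (the `Node` class and op names are hypothetical; a real pass must also verify that intermediate ops commute with the transform and handle fan-out):
   
   ```python
   import numpy as np
   
   # Toy dataflow node: either a constant tensor or an op with input nodes.
   class Node:
       def __init__(self, op, inputs=(), value=None):
           self.op, self.inputs, self.value = op, list(inputs), value
   
   def hoist_transpose_into_constant(node):
       """Flow a transpose upward; if its producer is a constant, fold it."""
       if node.op != "transpose":
           return node
       producer = node.inputs[0]
       if producer.op == "const":
           # Fold the layout transform into the weight at compile time.
           return Node("const", value=producer.value.T)
       return node   # give up: the transform stays materialized
   
   w = Node("const", value=np.arange(6).reshape(2, 3))
   folded = hoist_transpose_into_constant(Node("transpose", inputs=[w]))
   print(folded.op, folded.value.shape)   # const (3, 2)
   ```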
   
   For inference graphs I've seen this approach work well, but it is still greedy and suboptimal choices can occur. For training graphs it works less well due to the data flow complexity introduced by branching from the forward to the backward graph and by the optimizer's in-place updates of weights. I omit a specific example here for brevity, but encourage reviewing the graphs from @t-vi's application of TVM to PyTorch training for BERT, and the long chains of transposes and reshapes that occur within the forward and backward multi-head attention layers [[4](https://github.com/apache/tvm-site/blob/85c7e4ebf6d9ed221075e38e5e5e1a0052693acc/_posts/2020-07-14-bert-pytorch-tvm.md)].
 
   
   4) Finally, to arrive at a solution closer to the global optimum for layout, various constraint-based approaches can be considered. Constraints from layout-constrained operations are flowed across the unconstrained parts of the graph until an approximate global optimum is reached.
   
   One example implementation I have seen included layout sources (e.g. operators like conv2d on an NPU with distinct layout constraints) and layout sinks (e.g. operations which involve data movement by DMA engines or in-memory compute and which allow zero-cost data layout rearrangement during the store). A constraint solver in this case flows layout constraints from sources toward sinks that can absorb the aggregated/merged layout transform constraints.
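   
   A toy sketch of that source-to-sink constraint flow (the dict-based graph, op names, and layouts are assumptions for illustration, not the NPU implementation described above):
   
   ```python
   # op -> list of consumer ops
   graph = {
       "npu_conv2d": ["pool"],
       "pool":       ["relu"],
       "relu":       ["dma_store"],
       "dma_store":  [],
   }
   SOURCES = {"npu_conv2d": "NHWC16c"}   # ops that dictate a layout
   SINKS = {"dma_store"}                 # ops that can absorb a layout for free
   
   def flow_constraints(graph, sources, sinks):
       layout_of = dict(sources)
       absorbed = {}
       worklist = list(sources)
       while worklist:
           op = worklist.pop()
           for consumer in graph[op]:
               if consumer in sinks:
                   absorbed[consumer] = layout_of[op]   # zero-cost rearrangement
               elif consumer not in layout_of:          # layout-unconstrained op
                   layout_of[consumer] = layout_of[op]
                   worklist.append(consumer)
       return layout_of, absorbed
   
   print(flow_constraints(graph, SOURCES, SINKS))
   # The NHWC16c constraint flows through pool/relu and is absorbed by the
   # DMA store, so no explicit transform op has to be materialized.
   ```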
   ____
   
   Coming back to the present discussion, I believe our design should focus on ensuring that one or more of the non-local approaches discussed above in 2-4 are achievable. Any of these approaches requires the following components:
   
   C0) The ability to track constraints on a buffer.
   
   C1) The ability to roundtrip between an IR representation and the 
producer/consumer constraint representations.
   
   C2) The ability to merge/fold constraints; flowing is just merging a constraint with an unconstrained one (a small sketch follows below).
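   
   As a sketch of what C0-C2 might look like together (the `BufferConstraint` class and its merge rule are assumptions for illustration, not the proposed TVM API):
   
   ```python
   from dataclasses import dataclass
   from typing import Optional
   
   @dataclass
   class BufferConstraint:
       """C0: a layout constraint tracked per buffer (None = unconstrained)."""
       layout: Optional[str] = None
   
       def merge(self, other: "BufferConstraint") -> "BufferConstraint":
           """C2: merging with an unconstrained side is just flowing."""
           if self.layout is None:
               return other
           if other.layout is None or other.layout == self.layout:
               return self
           raise ValueError(
               f"conflict: {self.layout} vs {other.layout}; "
               "a transform must be materialized at this edge")
   
   producer = BufferConstraint("NCHW4c")   # e.g. inferred from the producer IR (C1)
   consumer = BufferConstraint()           # consumer is layout-agnostic
   print(producer.merge(consumer).layout)  # NCHW4c: the constraint flows through
   ```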
   
   Even for the purely local (back-to-back) case discussed in A0, components C1 and C2 are helpful, with the caveat that the constraints inferred from the IR only exist within the local context of a single producer/consumer pair in a pass.
   
   Thus both A0 and A1 can benefit from these components, and the delta between A0 and A1 becomes clearer:
   
   * Delta 1: In A1, buffer constraints are maintained globally (non-locally) for every buffer in the graph, and can therefore be optimized by any of the methods 2-4 discussed above.
   
   * Delta 2: In addition to inferring buffer constraints from the IR (one half of C1), A1 proposes that constraints expressed about memory during scheduling be maintained for some time.
   
   
   
   

