csullivan commented on PR #77:
URL: https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1153227651

   Thanks for sharing the contextual pointers for the community @vinx13. Agreed, the approaches discussed are both valid. I would actually like to argue the stronger point that they are complementary and only appear to be contrary because we are considering too narrow a scope.
   
   It can be helpful to share an overview of how layout transformations are commonly handled in ML compilers. Most of my arguments against A0 (local cancellation in the graph only) being sufficient stem from prior experience with graph layout optimizations. The graph approaches for optimizing layouts that I've seen have evolved along the following trajectory:
    
   1) The first graph approach that is often taken is what has been argued so far in A0: local back-to-back cancellation. It works reasonably well when the data flow and the variety of operators in a model are simple.
   
   Local-only cancellation tends to fail in models that still have simple data flow but more variety in the sequence of operators, each with different valid implementations. Consider,
   
   ```
   -> transform_X -> conv2d -> (inv_transform_X) -> pool -> (transform_X) -> conv2d -> inv_transform_X
   ```
   In this case `pool` can be replaced by any sequence of operations that are layout agnostic, or for which multiple implementations exist, so their choice of layout is unconstrained; the convolutions, by contrast, are layout constrained. As you can see, even for a simple model the approach discussed in A0 already needs to be modified to support non-local layout analysis and folding.
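   
   To make this concrete, here is a minimal sketch of an A0-style pass over a toy op-chain representation (the list-of-names graph and the `is_inverse_pair` helper are purely illustrative assumptions, not TVM data structures):
   
   ```python
   # Toy graph: a linear chain of op names. transform_X / inv_transform_X
   # and is_inverse_pair are hypothetical, not TVM APIs.
   def is_inverse_pair(a, b):
       # e.g. NCHW -> NCHW4c followed by NCHW4c -> NCHW cancels out
       return {a, b} == {"transform_X", "inv_transform_X"}
   
   def cancel_back_to_back(ops):
       """Remove directly adjacent transform/inverse-transform pairs (A0)."""
       out = []
       for op in ops:
           if out and is_inverse_pair(out[-1], op):
               out.pop()          # the adjacent pair cancels locally
           else:
               out.append(op)
       return out
   
   chain = ["transform_X", "conv2d", "inv_transform_X",
            "pool", "transform_X", "conv2d", "inv_transform_X"]
   print(cancel_back_to_back(chain))
   # Nothing cancels: the inverse/forward pair is separated by `pool`,
   # so a purely local pass cannot see across it.
   ```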
   
   2) The typical second approach is to still utilize A0, but to first apply a pre-processing pass that sinks layout-transforming operations along the direction of data flow [[1](https://github.com/pytorch/glow/blob/56249602c9ec93fa586cea2ce8ab315003478eed/lib/Optimizer/GraphOptimizerPipeline/FunctionPassPipeline.cpp#L69), [2](https://github.com/NervanaSystems/ngraph/blob/f677a119765ca30636cf407009dabd118664951f/src/ngraph/pass/reshape_sinking.cpp#L542)]. The above case then becomes,
   
   ```
   -> transform_X -> conv2d -> pool -> (inv_transform_X -> transform_X) -> conv2d -> inv_transform_X
   ```
   Then the method discussed in A0 applies, and local cancellation removes the now-adjacent pair (see the sketch below).
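   
   A sketch of such a sinking sweep, under the same toy representation and the (assumed) fact that `pool` commutes with the layout transform:
   
   ```python
   # Toy op chain; op names are illustrative, not TVM APIs.
   LAYOUT_AGNOSTIC = {"pool"}   # assumption: pool commutes with transform_X
   
   def sink_transforms(ops):
       """Push inv_transform_X downward past layout-agnostic ops."""
       ops = list(ops)
       changed = True
       while changed:
           changed = False
           for i in range(len(ops) - 1):
               if ops[i] == "inv_transform_X" and ops[i + 1] in LAYOUT_AGNOSTIC:
                   ops[i], ops[i + 1] = ops[i + 1], ops[i]   # commute the pair
                   changed = True
       return ops
   
   chain = ["transform_X", "conv2d", "inv_transform_X",
            "pool", "transform_X", "conv2d", "inv_transform_X"]
   print(sink_transforms(chain))
   # ['transform_X', 'conv2d', 'pool', 'inv_transform_X', 'transform_X',
   #  'conv2d', 'inv_transform_X'] -- the inverse/forward pair is now
   # adjacent, so the local cancellation from the previous sketch removes it.
   ```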
   
   The above method works well for models with relatively simple data flow, but for models with more branching it has limitations. A simple consideration is sinking a transform through an operation with multiple inputs: doing so requires materializing the inverse transform on the other operands.
   
   For the sake of simplicity consider matrix multiplication: $A^{T}B = (B^{T}A)^{T}$. Here the final state of sinking the transpose on A is to materialize two transposes rather than one: one on B and one on the matmul output. Sinking alone isn't sufficient to guarantee a globally optimal layout because it still only treats the propagation of transforms locally/greedily.
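   
   As a quick sanity check of the identity and of the bookkeeping, a couple of lines of numpy (just illustrating the algebra, not a compiler pass):
   
   ```python
   import numpy as np
   
   A = np.random.rand(3, 4)   # A^T is the transpose we are trying to sink
   B = np.random.rand(3, 5)
   
   # Sinking the transpose on A through the matmul rewrites A^T @ B as
   # (B^T @ A)^T: the transpose on A is gone, but we now materialize a
   # transpose on B *and* one on the matmul output.
   assert np.allclose(A.T @ B, (B.T @ A).T)
   ```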
   
   3) A modification to sinking (downward along data flow) is to introduce upward flowing [[3](https://github.com/NervanaSystems/ngraph/blob/f677a119765ca30636cf407009dabd118664951f/src/ngraph/pass/reshape_sinking.cpp#L156)]. It can help by flowing transforms along the poisoned operands (e.g. B in the above matrix multiply), propagating the transform up as far as possible, ideally to a constant where it can be folded.
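   
   A rough sketch of that upward flow into a constant (the `Node` class and op names are hypothetical; a real pass must also verify that intermediate ops commute with the transform and handle fan-out):
   
   ```python
   import numpy as np
   
   # Toy dataflow node: either a constant tensor or an op with input nodes.
   class Node:
       def __init__(self, op, inputs=(), value=None):
           self.op, self.inputs, self.value = op, list(inputs), value
   
   def hoist_transpose_into_constant(node):
       """Flow a transpose upward; if its producer is a constant, fold it."""
       if node.op != "transpose":
           return node
       producer = node.inputs[0]
       if producer.op == "const":
           # Fold the layout transform into the weight at compile time.
           return Node("const", value=producer.value.T)
       return node   # give up: the transform stays materialized
   
   w = Node("const", value=np.arange(6).reshape(2, 3))
   folded = hoist_transpose_into_constant(Node("transpose", inputs=[w]))
   print(folded.op, folded.value.shape)   # const (3, 2)
   ```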
   
   For inference graphs I've seen this approach work well, but it is still greedy and suboptimal choices can occur. For training graphs it works less well due to the data flow complexity introduced by branching from the forward to the backward graph and by the optimizer's in-place updates of weights. I omit a specific example here for brevity, but encourage reviewing the graphs from @t-vi's application of TVM to PyTorch training for BERT, and the long chains of transposes and reshapes that occur within the forward and backward multi-head attention layers [[4](https://github.com/apache/tvm-site/blob/85c7e4ebf6d9ed221075e38e5e5e1a0052693acc/_posts/2020-07-14-bert-pytorch-tvm.md)].
 
   
   4) Finally, to arrive at a solution closer to the global optimum for layout, various constraint-based approaches can be considered. Constraints from layout-constrained operations are flowed across the unconstrained parts of the graph until an approximate global optimum is reached.
   
   One example implementation I have seen included layout sources (e.g. operators like conv2d on an NPU with distinct layout constraints) and layout sinks (e.g. operations which involve data movement by DMA engines or in-memory compute and which allow zero-cost data layout rearrangement during the store). A constraint solver in this case flows layout constraints from sources toward sinks that can absorb the aggregated/merged layout transform constraints.
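   
   A toy sketch of that source-to-sink constraint flow (the dict-based graph, op names, and layouts are assumptions for illustration, not the NPU implementation described above):
   
   ```python
   # op -> list of consumer ops
   graph = {
       "npu_conv2d": ["pool"],
       "pool":       ["relu"],
       "relu":       ["dma_store"],
       "dma_store":  [],
   }
   SOURCES = {"npu_conv2d": "NHWC16c"}   # ops that dictate a layout
   SINKS = {"dma_store"}                 # ops that can absorb a layout for free
   
   def flow_constraints(graph, sources, sinks):
       layout_of = dict(sources)
       absorbed = {}
       worklist = list(sources)
       while worklist:
           op = worklist.pop()
           for consumer in graph[op]:
               if consumer in sinks:
                   absorbed[consumer] = layout_of[op]   # zero-cost rearrangement
               elif consumer not in layout_of:          # layout-unconstrained op
                   layout_of[consumer] = layout_of[op]
                   worklist.append(consumer)
       return layout_of, absorbed
   
   print(flow_constraints(graph, SOURCES, SINKS))
   # The NHWC16c constraint flows through pool/relu and is absorbed by the
   # DMA store, so no explicit transform op has to be materialized.
   ```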
   ____
   
   Coming back to the present discussion, I believe our design should focus on ensuring that one or more of the non-local approaches discussed above in 2-4 are achievable. Any of these approaches requires the following components:
   
   C0) The ability to track constraints on a buffer.
   
   C1) The ability to roundtrip between an IR representation and the 
producer/consumer constraint representations.
   
   C2) The ability to merge/fold constraints; flowing is just merging a constraint with an unconstrained one (a small sketch follows below).
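   
   As a sketch of what C0-C2 might look like together (the `BufferConstraint` class and its merge rule are assumptions for illustration, not the proposed TVM API):
   
   ```python
   from dataclasses import dataclass
   from typing import Optional
   
   @dataclass
   class BufferConstraint:
       """C0: a layout constraint tracked per buffer (None = unconstrained)."""
       layout: Optional[str] = None
   
       def merge(self, other: "BufferConstraint") -> "BufferConstraint":
           """C2: merging with an unconstrained side is just flowing."""
           if self.layout is None:
               return other
           if other.layout is None or other.layout == self.layout:
               return self
           raise ValueError(
               f"conflict: {self.layout} vs {other.layout}; "
               "a transform must be materialized at this edge")
   
   producer = BufferConstraint("NCHW4c")   # e.g. inferred from the producer IR (C1)
   consumer = BufferConstraint()           # consumer is layout-agnostic
   print(producer.merge(consumer).layout)  # NCHW4c: the constraint flows through
   ```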
   
   Even for the purely local (back-to-back) case discussed in A0, components C1 and C2 are helpful, with the caveat that the constraints inferred from the IR only exist within the local context of a single producer/consumer pair in a pass.
   
   Thus both A0 and A1 can benefit from these components, and the delta between A0 and A1 becomes clearer:
   
   * Delta 1: In A1, buffer constraints are maintained globally (non-locally) for every buffer in the graph, and can therefore be optimized by any of the methods 2-4 discussed above.
   
   * Delta 2: In addition to inferring buffer constraints from the IR (one half of C1), A1 proposes that constraints expressed about memory during scheduling be maintained for some time.
   
   
   
   

