[GitHub] [incubator-tvm] comaniac commented on a change in pull request #6097: [DOCS][REFACTOR] Organize Design and Architectures

GitBox Mon, 20 Jul 2020 18:30:24 -0700


comaniac commented on a change in pull request #6097:
URL: https://github.com/apache/incubator-tvm/pull/6097#discussion_r457773380




##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,327 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is intended for developers who want to understand the
+architecture of TVM and/or actively develop on the project.
+This page is organized as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section is a walk through of a 
typical compilation flow explaining each component used during compilation.
+- The `Logical Architecture Components`_ section describes the logical 
components.
+  The sections after are specific guides focused on each logical component, 
organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific 
development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we review a single end to end compilation flow and discuss the key data 
structures and the transformations.
+This runtime-based view focuses on the interactions of each components when 
running the compiler.
+Then we will review the logical modules of the codebase and their 
relationship. This part provides a static overarching view of the design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  
section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture 
Components`_.
+Each architecture component section contains a short introduction to the 
corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The 
figure below shows the flow. At a high-level, it contains several steps:
+
+- Import: The frontend component ingests a model into an IRModule, which 
contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally 
equivalent or approximately equivalent(e.g. in the case of quantization) 
IRModule.

Review comment:
       Worth to mention if this part is backend dependent or not.

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,327 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is intended for developers who want to understand the
+architecture of TVM and/or actively develop on the project.
+This page is organized as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section is a walk through of a 
typical compilation flow explaining each component used during compilation.
+- The `Logical Architecture Components`_ section describes the logical 
components.
+  The sections after are specific guides focused on each logical component, 
organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific 
development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we review a single end to end compilation flow and discuss the key data 
structures and the transformations.
+This runtime-based view focuses on the interactions of each components when 
running the compiler.
+Then we will review the logical modules of the codebase and their 
relationship. This part provides a static overarching view of the design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  
section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture 
Components`_.
+Each architecture component section contains a short introduction to the 
corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The 
figure below shows the flow. At a high-level, it contains several steps:
+
+- Import: The frontend component ingests a model into an IRModule, which 
contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally 
equivalent or approximately equivalent(e.g. in the case of quantization) 
IRModule.
+- Target Translation: The compiler translate(codegen) the IRModule to an 
executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can 
be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: the user loads back a `runtime.Module` and runs the 
compiled functions in the supported runtime environment.
+
+
+.. figure:: 
https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify 
the key data structures and APIs that transform these data structures. One we 
identified the key data structures, we can then de-couple a system into logical 
components that either define a collection of key data structures or 
transformations among the data structures.
+
+**IRModule** is the primary data structure used across the entire stack. An 
IRModule (intermediate representation module) contains a collection of 
functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A 
relay.Function usually corresponds to an end to end model. You can view a 
relay.Function as a computational graph with additional support for 
control-flow, recursion, and complex data structures, if you are familiar with 
the computational graph terminology in deep learning systems,

Review comment:
       - `if you are familiar with the computational graph terminology in deep 
learning systems` seems can be removed.
   - Add a period.

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,328 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is for developers who want to understand the
+architectures of the TVM stack and help to develop the project.
+We organize this page as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section contains a guide to walk 
you through the components used during a compilation.
+- The `Logical Architecture Components`_ section describes the logical 
components.

Review comment:
       This may be a bit off. Could we list some (e.g., 1-5) names in each 
component so that people could know who should be consulted for component 
specific issues. Of course, it's not necessary for the named person to solve 
all issues, but since the community is getting large, many issues/discuss 
topics are silent because the no one is tagged.
   
   Alternatively, we might specify authors in "Ho Tos" pages.
   

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,327 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is intended for developers who want to understand the
+architecture of TVM and/or actively develop on the project.
+This page is organized as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section is a walk through of a 
typical compilation flow explaining each component used during compilation.
+- The `Logical Architecture Components`_ section describes the logical 
components.
+  The sections after are specific guides focused on each logical component, 
organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific 
development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we review a single end to end compilation flow and discuss the key data 
structures and the transformations.
+This runtime-based view focuses on the interactions of each components when 
running the compiler.
+Then we will review the logical modules of the codebase and their 
relationship. This part provides a static overarching view of the design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  
section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture 
Components`_.
+Each architecture component section contains a short introduction to the 
corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The 
figure below shows the flow. At a high-level, it contains several steps:
+
+- Import: The frontend component ingests a model into an IRModule, which 
contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally 
equivalent or approximately equivalent(e.g. in the case of quantization) 
IRModule.
+- Target Translation: The compiler translate(codegen) the IRModule to an 
executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can 
be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: the user loads back a `runtime.Module` and runs the 
compiled functions in the supported runtime environment.
+
+
+.. figure:: 
https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify 
the key data structures and APIs that transform these data structures. One we 
identified the key data structures, we can then de-couple a system into logical 
components that either define a collection of key data structures or 
transformations among the data structures.

Review comment:
       ```suggestion
   One of the best ways to design and understand a complex system is to 
identify the key data structures and APIs that manipulate (transforms) these 
data structures. Once we identified the key data structures, we can then 
breakdown a system into logical components that either define a collection of 
key data structures or transformations among the data structures.
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,327 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is intended for developers who want to understand the
+architecture of TVM and/or actively develop on the project.
+This page is organized as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section is a walk through of a 
typical compilation flow explaining each component used during compilation.
+- The `Logical Architecture Components`_ section describes the logical 
components.
+  The sections after are specific guides focused on each logical component, 
organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific 
development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we review a single end to end compilation flow and discuss the key data 
structures and the transformations.
+This runtime-based view focuses on the interactions of each components when 
running the compiler.
+Then we will review the logical modules of the codebase and their 
relationship. This part provides a static overarching view of the design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  
section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture 
Components`_.
+Each architecture component section contains a short introduction to the 
corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The 
figure below shows the flow. At a high-level, it contains several steps:
+
+- Import: The frontend component ingests a model into an IRModule, which 
contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally 
equivalent or approximately equivalent(e.g. in the case of quantization) 
IRModule.
+- Target Translation: The compiler translate(codegen) the IRModule to an 
executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can 
be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: the user loads back a `runtime.Module` and runs the 
compiled functions in the supported runtime environment.
+
+
+.. figure:: 
https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify 
the key data structures and APIs that transform these data structures. One we 
identified the key data structures, we can then de-couple a system into logical 
components that either define a collection of key data structures or 
transformations among the data structures.
+
+**IRModule** is the primary data structure used across the entire stack. An 
IRModule (intermediate representation module) contains a collection of 
functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A 
relay.Function usually corresponds to an end to end model. You can view a 
relay.Function as a computational graph with additional support for 
control-flow, recursion, and complex data structures, if you are familiar with 
the computational graph terminology in deep learning systems,
+- **tir:PrimFunc** is a low-level program representation that contains 
elements including loop-nest choices, multi-dimensional load/store, threading, 
and vector/tensor instructions. It is usually used to represent an operator 
program that executes a (possibly-fused) layer in a model.
+
+Transformations
+~~~~~~~~~~~~~~~
+
+Now that we have covered the key data structures, let us talk about the 
transformations. Each transformation could serve one of the following purposes:
+
+- optimization: transform a program to an equivalent, possibly more optimized 
version.
+- lowering: transform a program to a lower-level representation that is closer 
to the target.
+
+**relay/transform** contains a collection of passes that optimize the model. 
The optimizations include common program optimizations such as constant folding 
and dead-code elimination, and tensor-computation specific passes such as 
layout transformation and scaling factor folding.
+
+Near the end of the relay optimization pipeline, we will run a pass(FuseOps) 
to break the end to end function(e.g. mobilenet) into sub-function(e.g. 
conv2d-relu) segments. We call these segments primitive functions. This process 
helps us to divide the original problem into two sub-problems:
+
+- Compilation and optimization for each primitive function.
+- Overall execution structure: we need to do a sequence of calls into the 
generated primitive functions to execute the whole model.
+
+We use the low-level tir phase to compile and optimize each sub-functions. For 
specific targets, we may also directly go to the target translation phase and 
use external code generators.
+
+There are a few different ways(in relay/backend) to handle the calls into the 
overall execution problem. For simple models with known shapes and no control 
flow, we can lower to a graph runtime that stores the execution structure in a 
graph. We also support a virtual machine backend for dynamic executions. 
Finally, we plan to support ahead of time compilation that compiles the 
high-level execution structure into the executable and generated primitive 
functions. All of these execution modes are encapsulated by a unified 
**runtime.Module** interface, which we will discuss in the latter part of the 
guide.
+
+**tir/transform** contains transformation passes for TIR level functions. Many 
tir passes serve the purpose of lowering. For example, there are passes to 
flatten multi-dimensional access to one-dimensional pointer access, to expand 
the intrinsics into target-specific ones, and to decorate the function entry to 
meet the runtime calling convention. Of course, there are also optimizations 
passes, such as access index simplification and dead code elimination.
+
+Many low-level optimizations can be handled in the target phase by the LLVM, 
CUDA C, and other target compilers. As a result, we leave low-level 
optimizations such as register allocation to the downstream compilers and only 
focus on optimizations that are not covered by them.
+
+Search-space and Learning-based Transformations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The transformation passes we described so far are deterministic and 
rule-based. One design goal of the TVM stack is to support high-performance 
code optimizations for different hardware platforms. To do so, we will need to 
investigate as many optimizations choices as possible, including but not 
limited to, multi-dimensional tensor access, loop tiling behavior, special 
accelerator memory hierarchy, and threading.
+
+It is hard to define a heuristic to make all of the choices. Instead, we will 
take a search and learning-based approach. We first define a collection of 
actions we can take to transform a program. Example actions include loop 
transformations, inlining, vectorization. We call these actions **scheduling 
primitives**. The collection of scheduling primitives defines a search space of 
possible optimizations we can make to a program. The system will use then 
searches over different possible scheduling combinations to pick the best one. 
The search procedure is usually guided by a machine learning algorithm.
+
+We can record the best schedule sequence for an operator once the search is 
completed. The compiler can then just lookup the best schedule sequence and 
apply it to the program. Notably, this schedule application phase **exactly 
like** the rule-based transformations, enabling us to share the same interface 
convention with tradition passes.

Review comment:
       - Be consistent with `schedule sequence` or `schedule combination`.
   - `the best schedule sequence for an operator`
   
   If we would like to focus on AutoTVM, then `schedule combination` with `for 
an operator` are precise. If we would like to include auto_scheduler, then 
`schedule sequence` with `for a subgraph` might be more general.
   

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,327 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is intended for developers who want to understand the
+architecture of TVM and/or actively develop on the project.
+This page is organized as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section is a walk through of a 
typical compilation flow explaining each component used during compilation.
+- The `Logical Architecture Components`_ section describes the logical 
components.
+  The sections after are specific guides focused on each logical component, 
organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific 
development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we review a single end to end compilation flow and discuss the key data 
structures and the transformations.
+This runtime-based view focuses on the interactions of each components when 
running the compiler.
+Then we will review the logical modules of the codebase and their 
relationship. This part provides a static overarching view of the design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  
section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture 
Components`_.
+Each architecture component section contains a short introduction to the 
corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The 
figure below shows the flow. At a high-level, it contains several steps:
+
+- Import: The frontend component ingests a model into an IRModule, which 
contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally 
equivalent or approximately equivalent(e.g. in the case of quantization) 
IRModule.
+- Target Translation: The compiler translate(codegen) the IRModule to an 
executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can 
be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: the user loads back a `runtime.Module` and runs the 
compiled functions in the supported runtime environment.
+
+
+.. figure:: 
https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify 
the key data structures and APIs that transform these data structures. One we 
identified the key data structures, we can then de-couple a system into logical 
components that either define a collection of key data structures or 
transformations among the data structures.
+
+**IRModule** is the primary data structure used across the entire stack. An 
IRModule (intermediate representation module) contains a collection of 
functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A 
relay.Function usually corresponds to an end to end model. You can view a 
relay.Function as a computational graph with additional support for 
control-flow, recursion, and complex data structures, if you are familiar with 
the computational graph terminology in deep learning systems,
+- **tir:PrimFunc** is a low-level program representation that contains 
elements including loop-nest choices, multi-dimensional load/store, threading, 
and vector/tensor instructions. It is usually used to represent an operator 
program that executes a (possibly-fused) layer in a model.
+
+Transformations
+~~~~~~~~~~~~~~~
+
+Now that we have covered the key data structures, let us talk about the 
transformations. Each transformation could serve one of the following purposes:
+
+- optimization: transform a program to an equivalent, possibly more optimized 
version.
+- lowering: transform a program to a lower-level representation that is closer 
to the target.
+
+**relay/transform** contains a collection of passes that optimize the model. 
The optimizations include common program optimizations such as constant folding 
and dead-code elimination, and tensor-computation specific passes such as 
layout transformation and scaling factor folding.
+
+Near the end of the relay optimization pipeline, we will run a pass(FuseOps) 
to break the end to end function(e.g. mobilenet) into sub-function(e.g. 
conv2d-relu) segments. We call these segments primitive functions. This process 
helps us to divide the original problem into two sub-problems:
+
+- Compilation and optimization for each primitive function.
+- Overall execution structure: we need to do a sequence of calls into the 
generated primitive functions to execute the whole model.
+
+We use the low-level tir phase to compile and optimize each sub-functions. For 
specific targets, we may also directly go to the target translation phase and 
use external code generators.
+
+There are a few different ways(in relay/backend) to handle the calls into the 
overall execution problem. For simple models with known shapes and no control 
flow, we can lower to a graph runtime that stores the execution structure in a 
graph. We also support a virtual machine backend for dynamic executions. 
Finally, we plan to support ahead of time compilation that compiles the 
high-level execution structure into the executable and generated primitive 
functions. All of these execution modes are encapsulated by a unified 
**runtime.Module** interface, which we will discuss in the latter part of the 
guide.
+
+**tir/transform** contains transformation passes for TIR level functions. Many 
tir passes serve the purpose of lowering. For example, there are passes to 
flatten multi-dimensional access to one-dimensional pointer access, to expand 
the intrinsics into target-specific ones, and to decorate the function entry to 
meet the runtime calling convention. Of course, there are also optimizations 
passes, such as access index simplification and dead code elimination.
+
+Many low-level optimizations can be handled in the target phase by the LLVM, 
CUDA C, and other target compilers. As a result, we leave low-level 
optimizations such as register allocation to the downstream compilers and only 
focus on optimizations that are not covered by them.
+
+Search-space and Learning-based Transformations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The transformation passes we described so far are deterministic and 
rule-based. One design goal of the TVM stack is to support high-performance 
code optimizations for different hardware platforms. To do so, we will need to 
investigate as many optimizations choices as possible, including but not 
limited to, multi-dimensional tensor access, loop tiling behavior, special 
accelerator memory hierarchy, and threading.
+
+It is hard to define a heuristic to make all of the choices. Instead, we will 
take a search and learning-based approach. We first define a collection of 
actions we can take to transform a program. Example actions include loop 
transformations, inlining, vectorization. We call these actions **scheduling 
primitives**. The collection of scheduling primitives defines a search space of 
possible optimizations we can make to a program. The system will use then 
searches over different possible scheduling combinations to pick the best one. 
The search procedure is usually guided by a machine learning algorithm.
+
+We can record the best schedule sequence for an operator once the search is 
completed. The compiler can then just lookup the best schedule sequence and 
apply it to the program. Notably, this schedule application phase **exactly 
like** the rule-based transformations, enabling us to share the same interface 
convention with tradition passes.
+
+We use search based optimizations to handle the initial tir function 
generation problem. This part of the module is called AutoTVM(auto_scheduler). 
We expect to expand the learning-based transformations to more areas as we 
continue to develop the TVM stack.
+
+Target Translation
+~~~~~~~~~~~~~~~~~~
+
+The target translation phase transforms an IRModule to the corresponding 
target executable format. For backends such as x86, ARM, we will use the LLVM 
IRBuilder to build in-memory LLVM IR. We can also generate source-level 
languages such as CUDA C and OpenCL. Finally, we support the direct translation 
of a Relay function (sub-graph) for external code generators. Importantly, the 
final code generation phase should be lightweight as possible with the vast 
majority of transformations and lowering performed before target translation.

Review comment:
       s/x86, ARM/x86 and ARM/

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,327 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is intended for developers who want to understand the
+architecture of TVM and/or actively develop on the project.
+This page is organized as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section is a walk through of a 
typical compilation flow explaining each component used during compilation.
+- The `Logical Architecture Components`_ section describes the logical 
components.
+  The sections after are specific guides focused on each logical component, 
organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific 
development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we review a single end to end compilation flow and discuss the key data 
structures and the transformations.
+This runtime-based view focuses on the interactions of each components when 
running the compiler.
+Then we will review the logical modules of the codebase and their 
relationship. This part provides a static overarching view of the design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  
section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture 
Components`_.
+Each architecture component section contains a short introduction to the 
corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The 
figure below shows the flow. At a high-level, it contains several steps:
+
+- Import: The frontend component ingests a model into an IRModule, which 
contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally 
equivalent or approximately equivalent(e.g. in the case of quantization) 
IRModule.
+- Target Translation: The compiler translate(codegen) the IRModule to an 
executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can 
be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: the user loads back a `runtime.Module` and runs the 
compiled functions in the supported runtime environment.
+
+
+.. figure:: 
https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify 
the key data structures and APIs that transform these data structures. One we 
identified the key data structures, we can then de-couple a system into logical 
components that either define a collection of key data structures or 
transformations among the data structures.
+
+**IRModule** is the primary data structure used across the entire stack. An 
IRModule (intermediate representation module) contains a collection of 
functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A 
relay.Function usually corresponds to an end to end model. You can view a 
relay.Function as a computational graph with additional support for 
control-flow, recursion, and complex data structures, if you are familiar with 
the computational graph terminology in deep learning systems,
+- **tir:PrimFunc** is a low-level program representation that contains 
elements including loop-nest choices, multi-dimensional load/store, threading, 
and vector/tensor instructions. It is usually used to represent an operator 
program that executes a (possibly-fused) layer in a model.
+
+Transformations
+~~~~~~~~~~~~~~~
+
+Now that we have covered the key data structures, let us talk about the 
transformations. Each transformation could serve one of the following purposes:
+
+- optimization: transform a program to an equivalent, possibly more optimized 
version.
+- lowering: transform a program to a lower-level representation that is closer 
to the target.
+
+**relay/transform** contains a collection of passes that optimize the model. 
The optimizations include common program optimizations such as constant folding 
and dead-code elimination, and tensor-computation specific passes such as 
layout transformation and scaling factor folding.
+
+Near the end of the relay optimization pipeline, we will run a pass(FuseOps) 
to break the end to end function(e.g. mobilenet) into sub-function(e.g. 
conv2d-relu) segments. We call these segments primitive functions. This process 
helps us to divide the original problem into two sub-problems:
+
+- Compilation and optimization for each primitive function.
+- Overall execution structure: we need to do a sequence of calls into the 
generated primitive functions to execute the whole model.
+
+We use the low-level tir phase to compile and optimize each sub-functions. For 
specific targets, we may also directly go to the target translation phase and 
use external code generators.
+
+There are a few different ways(in relay/backend) to handle the calls into the 
overall execution problem. For simple models with known shapes and no control 
flow, we can lower to a graph runtime that stores the execution structure in a 
graph. We also support a virtual machine backend for dynamic executions. 
Finally, we plan to support ahead of time compilation that compiles the 
high-level execution structure into the executable and generated primitive 
functions. All of these execution modes are encapsulated by a unified 
**runtime.Module** interface, which we will discuss in the latter part of the 
guide.
+
+**tir/transform** contains transformation passes for TIR level functions. Many 
tir passes serve the purpose of lowering. For example, there are passes to 
flatten multi-dimensional access to one-dimensional pointer access, to expand 
the intrinsics into target-specific ones, and to decorate the function entry to 
meet the runtime calling convention. Of course, there are also optimizations 
passes, such as access index simplification and dead code elimination.
+
+Many low-level optimizations can be handled in the target phase by the LLVM, 
CUDA C, and other target compilers. As a result, we leave low-level 
optimizations such as register allocation to the downstream compilers and only 
focus on optimizations that are not covered by them.
+
+Search-space and Learning-based Transformations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The transformation passes we described so far are deterministic and 
rule-based. One design goal of the TVM stack is to support high-performance 
code optimizations for different hardware platforms. To do so, we will need to 
investigate as many optimizations choices as possible, including but not 
limited to, multi-dimensional tensor access, loop tiling behavior, special 
accelerator memory hierarchy, and threading.
+
+It is hard to define a heuristic to make all of the choices. Instead, we will 
take a search and learning-based approach. We first define a collection of 
actions we can take to transform a program. Example actions include loop 
transformations, inlining, vectorization. We call these actions **scheduling 
primitives**. The collection of scheduling primitives defines a search space of 
possible optimizations we can make to a program. The system will use then 
searches over different possible scheduling combinations to pick the best one. 
The search procedure is usually guided by a machine learning algorithm.
+
+We can record the best schedule sequence for an operator once the search is 
completed. The compiler can then just lookup the best schedule sequence and 
apply it to the program. Notably, this schedule application phase **exactly 
like** the rule-based transformations, enabling us to share the same interface 
convention with tradition passes.
+
+We use search based optimizations to handle the initial tir function 
generation problem. This part of the module is called AutoTVM(auto_scheduler). 
We expect to expand the learning-based transformations to more areas as we 
continue to develop the TVM stack.
+
+Target Translation
+~~~~~~~~~~~~~~~~~~
+
+The target translation phase transforms an IRModule to the corresponding 
target executable format. For backends such as x86, ARM, we will use the LLVM 
IRBuilder to build in-memory LLVM IR. We can also generate source-level 
languages such as CUDA C and OpenCL. Finally, we support the direct translation 
of a Relay function (sub-graph) for external code generators. Importantly, the 
final code generation phase should be lightweight as possible with the vast 
majority of transformations and lowering performed before target translation.
+We also provide a Target structure to specify the compilation target. The 
transformations before the target translation phase can also be affected by the 
target — for example, a target's vector length would change the vectorization 
behavior.
+
+Runtime Execution
+~~~~~~~~~~~~~~~~~
+
+The main goal of TVM's runtime is to provide a minimal API for loading and 
executing the compiled artifact in a language of their choice, including 
Python, C++, Rust, Go, Java, and JavaScript. The code snippet below shows such 
an example in Python:
+
+.. code-block:: python
+
+    import tvm
+    # Example runtime execution program in python, with type annotated
+    mod: tvm.runtime.Module = tvm.runtime.load_module("compiled_artifact.so")
+    arr: tvm.runtime.NDArray = tvm.nd.array([1, 2, 3], ctx=tvm.gpu(0))
+    fun: tvm.runtime.PackedFunc = mod["addone"]
+    fun(a)
+    print(a.asnumpy())
+
+
+:py:class:`tvm.runtime.Module` encapsulates the result of compilation. A 
runtime.Module contains a GetFunction method to obtain PackedFuncs by name.
+
+:py:class:`tvm.runtime.PackedFunc` is a type-erased function interface for 
both the generated functions. A runtime.PackedFunc can take arguments and 
return values with the following types: POD types(int, float), string, 
runtime.PackedFunc, runtime.Module, runtime.NDArray, sub-classes of 
runtime.Object.
+
+:py:class:`tvm.runtime.Module` and :py:class:`tvm.runtime.PackedFunc` are 
powerful mechanisms to modularize the runtime. For example, to get the above 
`addone` function on CUDA, we can use LLVM to generate the host-side code to 
compute the launching parameters(e.g. size of the thread groups) and then call 
into another PackedFunc from a CUDAModule that is backed by the CUDA driver 
API. The same mechanism can be used for OpenCL kernels.
+
+The above example only deals with a simple `addone` function. The code snippet 
below gives an example of an end to end model execution using the same 
interface:
+
+.. code-block:: python
+
+   import tvm
+   # Example runtime execution program in python, with type annotated
+   factory: tvm.runtime.Module = tvm.runtime.load_module("resnet18.so")
+   # Create a stateful graph execution module for resnet18 on gpu(0)
+   gmod: tvm.runtime.Module = factory["resnet18"](tvm.gpu(0))
+   data: tvm.runtime.NDArray = get_input_data()
+   # set input
+   gmod["set_input"](0, data)
+   # execute the model
+   gmod["run"]()
+   # get the output
+   result = gmod["get_output"](0).asnumpy()
+
+The main take away is that the runtime.Module and runtime.PackedFunc are 
sufficient to encapsulate both operator level programs(such as addone), as well 
as the end to end models.
+
+Summary and Discussions
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In summary, the key data structures in the compilation flows are:
+
+- IRModule: contains relay.Function and tir.PrimFunc
+- runtime.Module: contains runtime.PackedFunc
+
+Most of part of the compilation are transformations among the key data 
structures.

Review comment:
       s/Most of part/Most parts/

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,327 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is intended for developers who want to understand the
+architecture of TVM and/or actively develop on the project.
+This page is organized as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section is a walk through of a 
typical compilation flow explaining each component used during compilation.
+- The `Logical Architecture Components`_ section describes the logical 
components.
+  The sections after are specific guides focused on each logical component, 
organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific 
development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we review a single end to end compilation flow and discuss the key data 
structures and the transformations.
+This runtime-based view focuses on the interactions of each components when 
running the compiler.
+Then we will review the logical modules of the codebase and their 
relationship. This part provides a static overarching view of the design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  
section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture 
Components`_.
+Each architecture component section contains a short introduction to the 
corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The 
figure below shows the flow. At a high-level, it contains several steps:
+
+- Import: The frontend component ingests a model into an IRModule, which 
contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally 
equivalent or approximately equivalent(e.g. in the case of quantization) 
IRModule.
+- Target Translation: The compiler translate(codegen) the IRModule to an 
executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can 
be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: the user loads back a `runtime.Module` and runs the 
compiled functions in the supported runtime environment.
+
+
+.. figure:: 
https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify 
the key data structures and APIs that transform these data structures. One we 
identified the key data structures, we can then de-couple a system into logical 
components that either define a collection of key data structures or 
transformations among the data structures.
+
+**IRModule** is the primary data structure used across the entire stack. An 
IRModule (intermediate representation module) contains a collection of 
functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A 
relay.Function usually corresponds to an end to end model. You can view a 
relay.Function as a computational graph with additional support for 
control-flow, recursion, and complex data structures, if you are familiar with 
the computational graph terminology in deep learning systems,
+- **tir:PrimFunc** is a low-level program representation that contains 
elements including loop-nest choices, multi-dimensional load/store, threading, 
and vector/tensor instructions. It is usually used to represent an operator 
program that executes a (possibly-fused) layer in a model.
+
+Transformations
+~~~~~~~~~~~~~~~
+
+Now that we have covered the key data structures, let us talk about the 
transformations. Each transformation could serve one of the following purposes:
+
+- optimization: transform a program to an equivalent, possibly more optimized 
version.
+- lowering: transform a program to a lower-level representation that is closer 
to the target.
+
+**relay/transform** contains a collection of passes that optimize the model. 
The optimizations include common program optimizations such as constant folding 
and dead-code elimination, and tensor-computation specific passes such as 
layout transformation and scaling factor folding.
+
+Near the end of the relay optimization pipeline, we will run a pass(FuseOps) 
to break the end to end function(e.g. mobilenet) into sub-function(e.g. 
conv2d-relu) segments. We call these segments primitive functions. This process 
helps us to divide the original problem into two sub-problems:

Review comment:
       Should we mentioned the primitive functions we are talking about here is 
exactly `tir::PrimFunc`?

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,327 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is intended for developers who want to understand the
+architecture of TVM and/or actively develop on the project.
+This page is organized as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section is a walk through of a 
typical compilation flow explaining each component used during compilation.
+- The `Logical Architecture Components`_ section describes the logical 
components.
+  The sections after are specific guides focused on each logical component, 
organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific 
development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we review a single end to end compilation flow and discuss the key data 
structures and the transformations.
+This runtime-based view focuses on the interactions of each components when 
running the compiler.
+Then we will review the logical modules of the codebase and their 
relationship. This part provides a static overarching view of the design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  
section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture 
Components`_.
+Each architecture component section contains a short introduction to the 
corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The 
figure below shows the flow. At a high-level, it contains several steps:
+
+- Import: The frontend component ingests a model into an IRModule, which 
contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally 
equivalent or approximately equivalent(e.g. in the case of quantization) 
IRModule.
+- Target Translation: The compiler translate(codegen) the IRModule to an 
executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can 
be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: the user loads back a `runtime.Module` and runs the 
compiled functions in the supported runtime environment.
+
+
+.. figure:: 
https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify 
the key data structures and APIs that transform these data structures. One we 
identified the key data structures, we can then de-couple a system into logical 
components that either define a collection of key data structures or 
transformations among the data structures.
+
+**IRModule** is the primary data structure used across the entire stack. An 
IRModule (intermediate representation module) contains a collection of 
functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A 
relay.Function usually corresponds to an end to end model. You can view a 
relay.Function as a computational graph with additional support for 
control-flow, recursion, and complex data structures, if you are familiar with 
the computational graph terminology in deep learning systems,
+- **tir:PrimFunc** is a low-level program representation that contains 
elements including loop-nest choices, multi-dimensional load/store, threading, 
and vector/tensor instructions. It is usually used to represent an operator 
program that executes a (possibly-fused) layer in a model.
+
+Transformations
+~~~~~~~~~~~~~~~
+
+Now that we have covered the key data structures, let us talk about the 
transformations. Each transformation could serve one of the following purposes:
+
+- optimization: transform a program to an equivalent, possibly more optimized 
version.
+- lowering: transform a program to a lower-level representation that is closer 
to the target.
+
+**relay/transform** contains a collection of passes that optimize the model. 
The optimizations include common program optimizations such as constant folding 
and dead-code elimination, and tensor-computation specific passes such as 
layout transformation and scaling factor folding.
+
+Near the end of the relay optimization pipeline, we will run a pass(FuseOps) 
to break the end to end function(e.g. mobilenet) into sub-function(e.g. 
conv2d-relu) segments. We call these segments primitive functions. This process 
helps us to divide the original problem into two sub-problems:
+
+- Compilation and optimization for each primitive function.
+- Overall execution structure: we need to do a sequence of calls into the 
generated primitive functions to execute the whole model.
+
+We use the low-level tir phase to compile and optimize each sub-functions. For 
specific targets, we may also directly go to the target translation phase and 
use external code generators.
+
+There are a few different ways(in relay/backend) to handle the calls into the 
overall execution problem. For simple models with known shapes and no control 
flow, we can lower to a graph runtime that stores the execution structure in a 
graph. We also support a virtual machine backend for dynamic executions. 
Finally, we plan to support ahead of time compilation that compiles the 
high-level execution structure into the executable and generated primitive 
functions. All of these execution modes are encapsulated by a unified 
**runtime.Module** interface, which we will discuss in the latter part of the 
guide.
+
+**tir/transform** contains transformation passes for TIR level functions. Many 
tir passes serve the purpose of lowering. For example, there are passes to 
flatten multi-dimensional access to one-dimensional pointer access, to expand 
the intrinsics into target-specific ones, and to decorate the function entry to 
meet the runtime calling convention. Of course, there are also optimizations 
passes, such as access index simplification and dead code elimination.
+
+Many low-level optimizations can be handled in the target phase by the LLVM, 
CUDA C, and other target compilers. As a result, we leave low-level 
optimizations such as register allocation to the downstream compilers and only 
focus on optimizations that are not covered by them.
+
+Search-space and Learning-based Transformations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The transformation passes we described so far are deterministic and 
rule-based. One design goal of the TVM stack is to support high-performance 
code optimizations for different hardware platforms. To do so, we will need to 
investigate as many optimizations choices as possible, including but not 
limited to, multi-dimensional tensor access, loop tiling behavior, special 
accelerator memory hierarchy, and threading.
+
+It is hard to define a heuristic to make all of the choices. Instead, we will 
take a search and learning-based approach. We first define a collection of 
actions we can take to transform a program. Example actions include loop 
transformations, inlining, vectorization. We call these actions **scheduling 
primitives**. The collection of scheduling primitives defines a search space of 
possible optimizations we can make to a program. The system will use then 
searches over different possible scheduling combinations to pick the best one. 
The search procedure is usually guided by a machine learning algorithm.

Review comment:
       ```suggestion
   It is hard to define a heuristic to make all of the choices. Instead, we 
will take a search and learning-based approach. We first define a collection of 
actions we can take to transform a program. Example actions include loop 
transformations, inlining, vectorization. We call these actions **scheduling 
primitives**. The collection of scheduling primitives defines a search space of 
possible optimizations we can make to a program. The system will then search 
the search space to find the best scheduling combination. The search procedure 
is usually guided by a machine learning algorithm.
   ```

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,327 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is intended for developers who want to understand the
+architecture of TVM and/or actively develop on the project.
+This page is organized as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section is a walk through of a 
typical compilation flow explaining each component used during compilation.
+- The `Logical Architecture Components`_ section describes the logical 
components.
+  The sections after are specific guides focused on each logical component, 
organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific 
development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we review a single end to end compilation flow and discuss the key data 
structures and the transformations.
+This runtime-based view focuses on the interactions of each components when 
running the compiler.
+Then we will review the logical modules of the codebase and their 
relationship. This part provides a static overarching view of the design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  
section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture 
Components`_.
+Each architecture component section contains a short introduction to the 
corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The 
figure below shows the flow. At a high-level, it contains several steps:
+
+- Import: The frontend component ingests a model into an IRModule, which 
contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally 
equivalent or approximately equivalent(e.g. in the case of quantization) 
IRModule.
+- Target Translation: The compiler translate(codegen) the IRModule to an 
executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can 
be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: the user loads back a `runtime.Module` and runs the 
compiled functions in the supported runtime environment.
+
+
+.. figure:: 
https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify 
the key data structures and APIs that transform these data structures. One we 
identified the key data structures, we can then de-couple a system into logical 
components that either define a collection of key data structures or 
transformations among the data structures.
+
+**IRModule** is the primary data structure used across the entire stack. An 
IRModule (intermediate representation module) contains a collection of 
functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A 
relay.Function usually corresponds to an end to end model. You can view a 
relay.Function as a computational graph with additional support for 
control-flow, recursion, and complex data structures, if you are familiar with 
the computational graph terminology in deep learning systems,
+- **tir:PrimFunc** is a low-level program representation that contains 
elements including loop-nest choices, multi-dimensional load/store, threading, 
and vector/tensor instructions. It is usually used to represent an operator 
program that executes a (possibly-fused) layer in a model.

Review comment:
       IMHO, a better way to describe these two functions is to emphasize their 
relationship. For example, a relay.Function may be lowered to multiple 
tir.PrimFunc functions.
   
   In addition, it would be good to have an example. We might have a Relay 
graph with `2 * conv2d->add->relu` (2 layers), and we show the corresponding 
TIR graph.
   

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,327 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is intended for developers who want to understand the
+architecture of TVM and/or actively develop on the project.
+This page is organized as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section is a walk through of a 
typical compilation flow explaining each component used during compilation.
+- The `Logical Architecture Components`_ section describes the logical 
components.
+  The sections after are specific guides focused on each logical component, 
organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific 
development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we review a single end to end compilation flow and discuss the key data 
structures and the transformations.
+This runtime-based view focuses on the interactions of each components when 
running the compiler.
+Then we will review the logical modules of the codebase and their 
relationship. This part provides a static overarching view of the design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  
section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture 
Components`_.
+Each architecture component section contains a short introduction to the 
corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The 
figure below shows the flow. At a high-level, it contains several steps:
+
+- Import: The frontend component ingests a model into an IRModule, which 
contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally 
equivalent or approximately equivalent(e.g. in the case of quantization) 
IRModule.
+- Target Translation: The compiler translate(codegen) the IRModule to an 
executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can 
be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: the user loads back a `runtime.Module` and runs the 
compiled functions in the supported runtime environment.
+
+
+.. figure:: 
https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify 
the key data structures and APIs that transform these data structures. One we 
identified the key data structures, we can then de-couple a system into logical 
components that either define a collection of key data structures or 
transformations among the data structures.
+
+**IRModule** is the primary data structure used across the entire stack. An 
IRModule (intermediate representation module) contains a collection of 
functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A 
relay.Function usually corresponds to an end to end model. You can view a 
relay.Function as a computational graph with additional support for 
control-flow, recursion, and complex data structures, if you are familiar with 
the computational graph terminology in deep learning systems,
+- **tir:PrimFunc** is a low-level program representation that contains 
elements including loop-nest choices, multi-dimensional load/store, threading, 
and vector/tensor instructions. It is usually used to represent an operator 
program that executes a (possibly-fused) layer in a model.
+
+Transformations
+~~~~~~~~~~~~~~~
+
+Now that we have covered the key data structures, let us talk about the 
transformations. Each transformation could serve one of the following purposes:
+
+- optimization: transform a program to an equivalent, possibly more optimized 
version.
+- lowering: transform a program to a lower-level representation that is closer 
to the target.
+
+**relay/transform** contains a collection of passes that optimize the model. 
The optimizations include common program optimizations such as constant folding 
and dead-code elimination, and tensor-computation specific passes such as 
layout transformation and scaling factor folding.
+
+Near the end of the relay optimization pipeline, we will run a pass(FuseOps) 
to break the end to end function(e.g. mobilenet) into sub-function(e.g. 
conv2d-relu) segments. We call these segments primitive functions. This process 
helps us to divide the original problem into two sub-problems:
+
+- Compilation and optimization for each primitive function.
+- Overall execution structure: we need to do a sequence of calls into the 
generated primitive functions to execute the whole model.
+
+We use the low-level tir phase to compile and optimize each sub-functions. For 
specific targets, we may also directly go to the target translation phase and 
use external code generators.
+
+There are a few different ways(in relay/backend) to handle the calls into the 
overall execution problem. For simple models with known shapes and no control 
flow, we can lower to a graph runtime that stores the execution structure in a 
graph. We also support a virtual machine backend for dynamic executions. 
Finally, we plan to support ahead of time compilation that compiles the 
high-level execution structure into the executable and generated primitive 
functions. All of these execution modes are encapsulated by a unified 
**runtime.Module** interface, which we will discuss in the latter part of the 
guide.
+
+**tir/transform** contains transformation passes for TIR level functions. Many 
tir passes serve the purpose of lowering. For example, there are passes to 
flatten multi-dimensional access to one-dimensional pointer access, to expand 
the intrinsics into target-specific ones, and to decorate the function entry to 
meet the runtime calling convention. Of course, there are also optimizations 
passes, such as access index simplification and dead code elimination.
+
+Many low-level optimizations can be handled in the target phase by the LLVM, 
CUDA C, and other target compilers. As a result, we leave low-level 
optimizations such as register allocation to the downstream compilers and only 
focus on optimizations that are not covered by them.
+
+Search-space and Learning-based Transformations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The transformation passes we described so far are deterministic and 
rule-based. One design goal of the TVM stack is to support high-performance 
code optimizations for different hardware platforms. To do so, we will need to 
investigate as many optimizations choices as possible, including but not 
limited to, multi-dimensional tensor access, loop tiling behavior, special 
accelerator memory hierarchy, and threading.
+
+It is hard to define a heuristic to make all of the choices. Instead, we will 
take a search and learning-based approach. We first define a collection of 
actions we can take to transform a program. Example actions include loop 
transformations, inlining, vectorization. We call these actions **scheduling 
primitives**. The collection of scheduling primitives defines a search space of 
possible optimizations we can make to a program. The system will use then 
searches over different possible scheduling combinations to pick the best one. 
The search procedure is usually guided by a machine learning algorithm.
+
+We can record the best schedule sequence for an operator once the search is 
completed. The compiler can then just lookup the best schedule sequence and 
apply it to the program. Notably, this schedule application phase **exactly 
like** the rule-based transformations, enabling us to share the same interface 
convention with tradition passes.
+
+We use search based optimizations to handle the initial tir function 
generation problem. This part of the module is called AutoTVM(auto_scheduler). 
We expect to expand the learning-based transformations to more areas as we 
continue to develop the TVM stack.
+
+Target Translation
+~~~~~~~~~~~~~~~~~~
+
+The target translation phase transforms an IRModule to the corresponding 
target executable format. For backends such as x86, ARM, we will use the LLVM 
IRBuilder to build in-memory LLVM IR. We can also generate source-level 
languages such as CUDA C and OpenCL. Finally, we support the direct translation 
of a Relay function (sub-graph) for external code generators. Importantly, the 
final code generation phase should be lightweight as possible with the vast 
majority of transformations and lowering performed before target translation.
+We also provide a Target structure to specify the compilation target. The 
transformations before the target translation phase can also be affected by the 
target — for example, a target's vector length would change the vectorization 
behavior.
+
+Runtime Execution
+~~~~~~~~~~~~~~~~~
+
+The main goal of TVM's runtime is to provide a minimal API for loading and 
executing the compiled artifact in a language of their choice, including 
Python, C++, Rust, Go, Java, and JavaScript. The code snippet below shows such 
an example in Python:
+
+.. code-block:: python
+
+    import tvm
+    # Example runtime execution program in python, with type annotated
+    mod: tvm.runtime.Module = tvm.runtime.load_module("compiled_artifact.so")
+    arr: tvm.runtime.NDArray = tvm.nd.array([1, 2, 3], ctx=tvm.gpu(0))
+    fun: tvm.runtime.PackedFunc = mod["addone"]
+    fun(a)
+    print(a.asnumpy())
+
+
+:py:class:`tvm.runtime.Module` encapsulates the result of compilation. A 
runtime.Module contains a GetFunction method to obtain PackedFuncs by name.
+
+:py:class:`tvm.runtime.PackedFunc` is a type-erased function interface for 
both the generated functions. A runtime.PackedFunc can take arguments and 
return values with the following types: POD types(int, float), string, 
runtime.PackedFunc, runtime.Module, runtime.NDArray, sub-classes of 
runtime.Object.
+
+:py:class:`tvm.runtime.Module` and :py:class:`tvm.runtime.PackedFunc` are 
powerful mechanisms to modularize the runtime. For example, to get the above 
`addone` function on CUDA, we can use LLVM to generate the host-side code to 
compute the launching parameters(e.g. size of the thread groups) and then call 
into another PackedFunc from a CUDAModule that is backed by the CUDA driver 
API. The same mechanism can be used for OpenCL kernels.
+
+The above example only deals with a simple `addone` function. The code snippet 
below gives an example of an end to end model execution using the same 
interface:
+
+.. code-block:: python
+
+   import tvm
+   # Example runtime execution program in python, with type annotated
+   factory: tvm.runtime.Module = tvm.runtime.load_module("resnet18.so")
+   # Create a stateful graph execution module for resnet18 on gpu(0)
+   gmod: tvm.runtime.Module = factory["resnet18"](tvm.gpu(0))
+   data: tvm.runtime.NDArray = get_input_data()
+   # set input
+   gmod["set_input"](0, data)
+   # execute the model
+   gmod["run"]()
+   # get the output
+   result = gmod["get_output"](0).asnumpy()
+
+The main take away is that the runtime.Module and runtime.PackedFunc are 
sufficient to encapsulate both operator level programs(such as addone), as well 
as the end to end models.
+
+Summary and Discussions
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In summary, the key data structures in the compilation flows are:
+
+- IRModule: contains relay.Function and tir.PrimFunc
+- runtime.Module: contains runtime.PackedFunc
+
+Most of part of the compilation are transformations among the key data 
structures.
+
+- relay/transform and tir/transform are determinstic rule-based transformations
+- auto_scheduler and autotvm contains the search-based transformations
+
+Finally, the compilation flow example is only a typical use-case of the TVM 
stack. We expose these key data structures and transformations to python and 
C++ APIs. As a result, you can use TVM just like the way you use numpy, except 
that the data structure of interest changes from the numpy.ndarray to 
tvm.IRModule. Here are some example use-cases:
+
+- Directly construct IRModule using hybrid script for compilations.

Review comment:
       Never mention hybrid script in this document. Might need a pointer or 
one sentance explanation.

##########
File path: docs/dev/index.rst
##########
@@ -15,28 +15,327 @@
     specific language governing permissions and limitations
     under the License.
 
-Design and Developer Guide
-==========================
+Design and Architecture
+=======================
+
+This document is intended for developers who want to understand the
+architecture of TVM and/or actively develop on the project.
+This page is organized as follows:
+
+- The `Example Compilation Flow Walkthrough`_ section is a walk through of a 
typical compilation flow explaining each component used during compilation.
+- The `Logical Architecture Components`_ section describes the logical 
components.
+  The sections after are specific guides focused on each logical component, 
organized
+  by the component's name.
+- The `How Tos`_ section contains useful tutorials to solve specific 
development problems.
+
+This guide provides a few complementary views of the architecture.
+First, we review a single end to end compilation flow and discuss the key data 
structures and the transformations.
+This runtime-based view focuses on the interactions of each components when 
running the compiler.
+Then we will review the logical modules of the codebase and their 
relationship. This part provides a static overarching view of the design.
+
+To get started, please read the `Example Compilation Flow Walkthrough`_  
section first for the runtime-based view.
+You can then refer to the architecture diagram in `Logical Architecture 
Components`_.
+Each architecture component section contains a short introduction to the 
corresponding component
+and links to detailed guides that you can dive into.
+Feel free to browse the `How Tos`_ to useful development tips.
+
+
+Example Compilation Flow Walkthrough
+------------------------------------
+
+In this guide, we will study an example compilation flow in the compiler. The 
figure below shows the flow. At a high-level, it contains several steps:
+
+- Import: The frontend component ingests a model into an IRModule, which 
contains a collection of functions that internally represent the model.
+- Transformation: The compiler transforms an IRModule to another functionally 
equivalent or approximately equivalent(e.g. in the case of quantization) 
IRModule.
+- Target Translation: The compiler translate(codegen) the IRModule to an 
executable format specified by the target.
+  The target translation result is encapsulated as a `runtime.Module` that can 
be exported, loaded, and executed on the target runtime environment.
+- Runtime Execution: the user loads back a `runtime.Module` and runs the 
compiled functions in the supported runtime environment.
+
+
+.. figure:: 
https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_dyn_workflow.svg
+   :align: center
+   :width: 85%
+
+
+Key data structures
+~~~~~~~~~~~~~~~~~~~
+
+One of the best ways to design and understand a complex system is to identify 
the key data structures and APIs that transform these data structures. One we 
identified the key data structures, we can then de-couple a system into logical 
components that either define a collection of key data structures or 
transformations among the data structures.
+
+**IRModule** is the primary data structure used across the entire stack. An 
IRModule (intermediate representation module) contains a collection of 
functions. Currently, we support two primary variants of functions.
+
+- **relay::Function** is a high-level functional program representation. A 
relay.Function usually corresponds to an end to end model. You can view a 
relay.Function as a computational graph with additional support for 
control-flow, recursion, and complex data structures, if you are familiar with 
the computational graph terminology in deep learning systems,
+- **tir:PrimFunc** is a low-level program representation that contains 
elements including loop-nest choices, multi-dimensional load/store, threading, 
and vector/tensor instructions. It is usually used to represent an operator 
program that executes a (possibly-fused) layer in a model.
+
+Transformations
+~~~~~~~~~~~~~~~
+
+Now that we have covered the key data structures, let us talk about the 
transformations. Each transformation could serve one of the following purposes:
+
+- optimization: transform a program to an equivalent, possibly more optimized 
version.
+- lowering: transform a program to a lower-level representation that is closer 
to the target.
+
+**relay/transform** contains a collection of passes that optimize the model. 
The optimizations include common program optimizations such as constant folding 
and dead-code elimination, and tensor-computation specific passes such as 
layout transformation and scaling factor folding.
+
+Near the end of the relay optimization pipeline, we will run a pass(FuseOps) 
to break the end to end function(e.g. mobilenet) into sub-function(e.g. 
conv2d-relu) segments. We call these segments primitive functions. This process 
helps us to divide the original problem into two sub-problems:
+
+- Compilation and optimization for each primitive function.
+- Overall execution structure: we need to do a sequence of calls into the 
generated primitive functions to execute the whole model.
+
+We use the low-level tir phase to compile and optimize each sub-functions. For 
specific targets, we may also directly go to the target translation phase and 
use external code generators.
+
+There are a few different ways(in relay/backend) to handle the calls into the 
overall execution problem. For simple models with known shapes and no control 
flow, we can lower to a graph runtime that stores the execution structure in a 
graph. We also support a virtual machine backend for dynamic executions. 
Finally, we plan to support ahead of time compilation that compiles the 
high-level execution structure into the executable and generated primitive 
functions. All of these execution modes are encapsulated by a unified 
**runtime.Module** interface, which we will discuss in the latter part of the 
guide.
+
+**tir/transform** contains transformation passes for TIR level functions. Many 
tir passes serve the purpose of lowering. For example, there are passes to 
flatten multi-dimensional access to one-dimensional pointer access, to expand 
the intrinsics into target-specific ones, and to decorate the function entry to 
meet the runtime calling convention. Of course, there are also optimizations 
passes, such as access index simplification and dead code elimination.
+
+Many low-level optimizations can be handled in the target phase by the LLVM, 
CUDA C, and other target compilers. As a result, we leave low-level 
optimizations such as register allocation to the downstream compilers and only 
focus on optimizations that are not covered by them.
+
+Search-space and Learning-based Transformations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The transformation passes we described so far are deterministic and 
rule-based. One design goal of the TVM stack is to support high-performance 
code optimizations for different hardware platforms. To do so, we will need to 
investigate as many optimizations choices as possible, including but not 
limited to, multi-dimensional tensor access, loop tiling behavior, special 
accelerator memory hierarchy, and threading.
+
+It is hard to define a heuristic to make all of the choices. Instead, we will 
take a search and learning-based approach. We first define a collection of 
actions we can take to transform a program. Example actions include loop 
transformations, inlining, vectorization. We call these actions **scheduling 
primitives**. The collection of scheduling primitives defines a search space of 
possible optimizations we can make to a program. The system will use then 
searches over different possible scheduling combinations to pick the best one. 
The search procedure is usually guided by a machine learning algorithm.
+
+We can record the best schedule sequence for an operator once the search is 
completed. The compiler can then just lookup the best schedule sequence and 
apply it to the program. Notably, this schedule application phase **exactly 
like** the rule-based transformations, enabling us to share the same interface 
convention with tradition passes.
+
+We use search based optimizations to handle the initial tir function 
generation problem. This part of the module is called AutoTVM(auto_scheduler). 
We expect to expand the learning-based transformations to more areas as we 
continue to develop the TVM stack.
+
+Target Translation
+~~~~~~~~~~~~~~~~~~
+
+The target translation phase transforms an IRModule to the corresponding 
target executable format. For backends such as x86, ARM, we will use the LLVM 
IRBuilder to build in-memory LLVM IR. We can also generate source-level 
languages such as CUDA C and OpenCL. Finally, we support the direct translation 
of a Relay function (sub-graph) for external code generators. Importantly, the 
final code generation phase should be lightweight as possible with the vast 
majority of transformations and lowering performed before target translation.
+We also provide a Target structure to specify the compilation target. The 
transformations before the target translation phase can also be affected by the 
target — for example, a target's vector length would change the vectorization 
behavior.
+
+Runtime Execution
+~~~~~~~~~~~~~~~~~
+
+The main goal of TVM's runtime is to provide a minimal API for loading and 
executing the compiled artifact in a language of their choice, including 
Python, C++, Rust, Go, Java, and JavaScript. The code snippet below shows such 
an example in Python:
+
+.. code-block:: python
+
+    import tvm
+    # Example runtime execution program in python, with type annotated
+    mod: tvm.runtime.Module = tvm.runtime.load_module("compiled_artifact.so")
+    arr: tvm.runtime.NDArray = tvm.nd.array([1, 2, 3], ctx=tvm.gpu(0))
+    fun: tvm.runtime.PackedFunc = mod["addone"]
+    fun(a)
+    print(a.asnumpy())
+
+
+:py:class:`tvm.runtime.Module` encapsulates the result of compilation. A 
runtime.Module contains a GetFunction method to obtain PackedFuncs by name.
+
+:py:class:`tvm.runtime.PackedFunc` is a type-erased function interface for 
both the generated functions. A runtime.PackedFunc can take arguments and 
return values with the following types: POD types(int, float), string, 
runtime.PackedFunc, runtime.Module, runtime.NDArray, sub-classes of 
runtime.Object.
+
+:py:class:`tvm.runtime.Module` and :py:class:`tvm.runtime.PackedFunc` are 
powerful mechanisms to modularize the runtime. For example, to get the above 
`addone` function on CUDA, we can use LLVM to generate the host-side code to 
compute the launching parameters(e.g. size of the thread groups) and then call 
into another PackedFunc from a CUDAModule that is backed by the CUDA driver 
API. The same mechanism can be used for OpenCL kernels.
+
+The above example only deals with a simple `addone` function. The code snippet 
below gives an example of an end to end model execution using the same 
interface:
+
+.. code-block:: python
+
+   import tvm
+   # Example runtime execution program in python, with type annotated
+   factory: tvm.runtime.Module = tvm.runtime.load_module("resnet18.so")
+   # Create a stateful graph execution module for resnet18 on gpu(0)
+   gmod: tvm.runtime.Module = factory["resnet18"](tvm.gpu(0))
+   data: tvm.runtime.NDArray = get_input_data()
+   # set input
+   gmod["set_input"](0, data)
+   # execute the model
+   gmod["run"]()
+   # get the output
+   result = gmod["get_output"](0).asnumpy()
+
+The main take away is that the runtime.Module and runtime.PackedFunc are 
sufficient to encapsulate both operator level programs(such as addone), as well 
as the end to end models.
+
+Summary and Discussions
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In summary, the key data structures in the compilation flows are:
+
+- IRModule: contains relay.Function and tir.PrimFunc
+- runtime.Module: contains runtime.PackedFunc
+
+Most of part of the compilation are transformations among the key data 
structures.
+
+- relay/transform and tir/transform are determinstic rule-based transformations
+- auto_scheduler and autotvm contains the search-based transformations
+
+Finally, the compilation flow example is only a typical use-case of the TVM 
stack. We expose these key data structures and transformations to python and 
C++ APIs. As a result, you can use TVM just like the way you use numpy, except 
that the data structure of interest changes from the numpy.ndarray to 
tvm.IRModule. Here are some example use-cases:
+
+- Directly construct IRModule using hybrid script for compilations.
+- Compose a custom set of transformations(e.g. customize quantization).
+- Manipulate the IR directly using tvm's python API.
+
+
+
+Logical Architecture Components
+-------------------------------
+
+.. figure:: 
https://raw.githubusercontent.com/tvmai/web-data/master/images/design/tvm_static_overview.svg
+   :align: center
+   :width: 85%
+
+   TVM Architecture Diagram
+
+tvm/support
+-----------
+The support module contains the most common utilities for the infrastructure, 
such as generic arena allocator, socket, and logging.
+
+
+tvm/runtime
+-----------
+
+The runtime serves as the foundation of the TVM stack. It provides the 
mechanism to load and execute compiled artifacts. The runtime defines a stable 
standard set of C API to interface with frontend languages such as python and 
rust.
+
+`runtime::Object` is one of the primary data structures in TVM runtime besides 
the `runtime::PackedFunc`. It is a reference-counted base class with type index 
to support runtime type checking and downcasting. The object system allows the 
developer to introduce new data structures to the runtime, such as Array, Map, 
and new IR data structures.
+
+Besides the deployment use-cases, the compiler itself also makes heavy use of 
the tvm's runtime mechanism. All of the IR data structures are subclasses of 
`runtime::Object`, as a result, they can be directly accessed and manipulated 
from the python frontend. We expose the PackedFunc to expose various APIs to 
the frontend.

Review comment:
       - s/tvm/TVM/
   - `All of the IR data structures are subclasses of runtime::Object`. 
(period) As a result, ...
   - s/python/Python/
   - Not clear about this sentence`We expose the PackedFunc to expose various 
APIs to the frontend.` Maybe like "We expose various PackedFunc as Python 
frontend APIs to access and manipulate IR data structures"?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [incubator-tvm] comaniac commented on a change in pull request #6097: [DOCS][REFACTOR] Organize Design and Architectures

Reply via email to