areusch commented on a change in pull request #7164:
URL: https://github.com/apache/tvm/pull/7164#discussion_r551656029



##########
File path: docs/dev/microtvm_design.rst
##########
@@ -0,0 +1,340 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+..    http://www.apache.org/licenses/LICENSE-2.0
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+**************************
+microTVM Design Document
+**************************
+
+.. contents:: Table of Contents
+    :depth: 3
+
+Background
+===========
+
+TVM is a model deployment framework that has demonstrated good performance 
across a wide range of
+models on traditional operating systems. Given TVM's layered approach to 
compilation, it is a
+natural extension to target bare metal devices. While most of the compilation 
flow does not need to
+change for a proof-of-concept implementation on such devices, the runtime 
cannot depend on:
+
+* **Virtual Memory**, and by extension any system-provided ``malloc``. 
Additionally, bare metal
+  devices typically have very limited memory (measured in KB). Because of 
this, libraries designed
+  for such platforms typically need to be more judicious in using memory, and 
need to release
+  memory when it is not in use.
+* Traditional OS abstractions, such as **files**, **libraries**, and **kernel 
functions**. Some
+  projects implement support for these, but they are by no means standard.
+* Support for programming languages other than **C**.
+
+Such changes require a different approach from the TVM C++ runtime typically 
used on traditional
+operating systems.
+
+Typical Use
+===========
+
+This section discusses our vision of the "typical" microTVM use case. Each 
component used to achieve
+this typical use case is intended to be designed for flexibility, but this 
unifying vision serves to
+motivate the inclusion of each part of the design.
+
+.. image:: microtvm_workflow.svg
+
+The parts of this process are described below:
+
+#. **Model Import**. The user imports an existing model or describes a new 
model to TVM, producing a
+   *Relay module*.
+
+#. **Model Transformations**. The user can apply transformations, such as 
quantization, to the
+   model. After each transformation, the user should still have a Relay module.
+
+#. **Compilation** (Scheduling and Code Generation). TVM implements each 
operator in Tensor IR by
+   assigning a schedule and schedule configuration to each Relay operator. 
Then, code (C source or
+   compiled object) is generated for each operator.
+
+#. **Integration**. The generated code is integrated along with the TVM C 
Runtime library into a
+   user-supplied binary project. In some cases (such as when the project is 
standardized across
+   multiple SoC/development boards), this process is handled automatically.
+
+#. **Deployment**. The project is built and the resulting firmware binary is 
flashed onto the device.
+   Model inference is driven either by TVM using an on-device RPC server, or 
on the device using the
+   on-device Graph Runtime.
+
+Design Goals
+============
+
+microTVM aims to achieve these design goals:
+
+1. **Portable Code**. microTVM can translate any Relay model into C code that 
can compile with only
+   a C standard library.
+2. **Minimal Overhead**. microTVM generates target-specific, highly optimized 
code. As much runtime
+   overhead as possible should be removed.
+3. **Accessible Code**. microTVM considers C source code as a first-class 
output mechanism so that
+   it is easier for a firmware engineer to understand and tweak.
+
+Overview
+========
+
+microTVM requires changes at all levels of the TVM compiler stack. The 
following sub-sections enumerate
+these changes at a high level, and follow-on sections discuss the specifics in 
more detail.
+
+Modeling Target Platforms
+-------------------------
+
+TVM's search-based optimization approach allows it to largely avoid 
system-level modeling of targets
+in favor of experimental results. However, some modeling is necessary in 
order to ensure TVM is
+comparing apples-to-apples search results, and to avoid wasting time during 
the search by attempting
+to compile invalid code for a target.
+
+microTVM models these parts of the target:
+
+* The CPU used, through the ``-mcpu`` and ``-march`` target flags.
+* The presence or absence of accelerators, through the device components of 
the target (currently,
+  only the absence of accelerators can be expressed, but this mechanism should 
extend well).
+
+microTVM aims to model these parts of the target in the future:
+
+* Memory, modeled as a set of disjoint memory spaces, each with a label and 
size and prefetch/flush
+  behavior. Some memory may be shared with accelerators.
+* Target runtime configuration (e.g. clock tree configuration, clock speed). 
This is intended
+  only to contribute to the AutoTVM schedule key and not for any other use.
+
+At this time, TVM does not intend to model:
+
+* Size, type, or relationship of caches, with the exception of prefetching or 
cache flushing.
+
+
+TVM Targets for microTVM
+-------------------------
+
+A central data structure in the compilation process is the 
``tvm::target::Target`` class. TVM uses
+Target to decide which TIR schedules to enable and how to configure the code 
generator. The Target
+class should also uniquely identify the generated code for a particular 
operator, as autotuning
+logs use it to rank measured performance (but see Future Work).
+
+Targets are currently represented as strings structured similarly to 
command-line arguments. An
+example target is shown below:
+
+    ``c -keys=arm_cpu -mcpu=cortex-m7 -link-params -model=stm32f746xx 
-runtime=c -system-lib=1``
+
+The parts relevant to microTVM are:
+
+ * Code generator (``llvm`` or ``c``)
+ * ``-mcpu=cortex-m7``: used by TOPI to enable Cortex-M schedules, and, when 
the C source code
+   generator is selected, included in the output as a comment to help identify 
the code and
+   configure the downstream C compiler.
+ * ``-link-params``: include parameters as global constants to load from flash.
+ * ``-runtime=c``: build glue code to allow operators to work with the C 
runtime.
+ * ``-system-lib=1``: emit a system library (i.e. one which can be loaded by 
calling the PackedFunc
+   ``runtime.SystemLib``).
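
As a rough illustration of how the example target string above breaks down, the following sketch splits it into a code generator name and a dictionary of flags. It is not TVM's own target parser: the function name ``parse_target`` and the returned dictionary layout are assumptions made for this example only.

```python
import shlex


def parse_target(target):
    """Split a TVM-style target string into its code generator and flags.

    Illustrative only: this is not TVM's own target parser.
    """
    tokens = shlex.split(target)
    if not tokens:
        raise ValueError("empty target string")
    # The first token names the code generator (e.g. "c" or "llvm").
    parsed = {"kind": tokens[0], "flags": {}}
    for token in tokens[1:]:
        if not token.startswith("-"):
            raise ValueError(f"unexpected token: {token!r}")
        name, sep, value = token[1:].partition("=")
        # A bare flag such as -link-params carries no value; record it as True.
        parsed["flags"][name] = value if sep else True
    return parsed


example = parse_target(
    "c -keys=arm_cpu -mcpu=cortex-m7 -link-params -model=stm32f746xx "
    "-runtime=c -system-lib=1"
)
print(example["kind"], example["flags"])
```

Keeping bare flags as ``True`` and valued flags as strings mirrors the command-line-like structure of the target string without interpreting any flag's meaning.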
+
+Writing Schedules for microTVM
+------------------------------
+
+For operations scheduled on the CPU, microTVM initially plans to make use of 
specialized
+instructions and extern (i.e. hand-optimized) functions to achieve good 
performance. In TVM, this
+approach is generally accomplished through tensorization, in which TVM breaks 
a computation into
+small pieces, and a TIR extern function accelerates each small piece.
+
+TVM currently accommodates both approaches using ``tir.call_extern``. First, a 
pragma is attached to
+the schedule defining the extern function in portable C.
+
+    ``sched[output].pragma(n, "import_c", "void call_asm(int32_t* a, int32_t* 
b) { /* ... */ }")``
+
+Next, ``tensorize`` is used to split the computation.
+
+    ``sched[output].tensorize(owi, gemm)``
+
+There are a few caveats to this approach, all of which could be resolved by 
linking generated
+code against external libraries:
+
+* Inline assembly is compiler-specific. While Clang and GCC have standardized 
on one syntax, this
+  may not be portable to other compilers. SDKs solve this by conditionally 
including a header file
+  depending on the compiler being used. However, taking this approach means 
that the generated code
+  needs additional compiler flags (i.e. ``-Isystempath/to/header``).
+* It may be helpful to reference helper functions from the generated code 
(e.g. to inline common
+  sequences of hand-optimized assembly).
+* Finally, the extern function invoked may be wholly written in an external 
library. If those
+  functions can be wholly inlined, this caveat is the same as the previous. If 
not, then additional
+  C code needs to be compiled and linked against the operator.
+
+At present, microTVM presumes that all eligible schedules can be compiled. 
This means that the
+user-supplied project (see next section) must include all libraries that are used 
by the generated code.
+When not using autotuning, TVM randomly chooses a fallback schedule, so all 
libraries would need to
+be supported. When using autotuning, TVM selects the best-performing schedule, 
so only that library
+is needed. There isn't currently a way to force TVM to pick a particular 
schedule outside of
+autotuning logs, but that would be a good addition.
+
+Finally, when using the ``llvm`` backend, the process is similar except that 
LLVM bitcode is included
+in the generated code (with an ``import_llvm`` pragma). LLVM bitcode provides 
a portable way to call
+inline assembly. However, it may be more complex to call external C functions, 
and helper functions
+are of course not easy to use from LLVM bitcode.
+
+Executing Models
+----------------
+
+The TVM compiler traditionally outputs 3 pieces:
+
+1. Model operator implementations, as discussed above.
+2. A model execution graph, encoded as JSON.
+3. Simplified parameters.
+
+To correctly execute the model, a Graph Runtime needs to reconstruct the graph 
in memory, load the
+parameters, and then invoke the operator implementations in the correct order.
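
The three responsibilities named above (reconstruct the graph, load parameters, invoke operators in order) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the TVM Graph Runtime: the JSON layout, the ``run_graph`` function, and the ``OPERATORS`` table are all hypothetical and far simpler than what TVM actually emits.

```python
import json

# Hypothetical, heavily simplified graph: each node is either an input
# ("op": "null") or an operator call naming a function and its input nodes.
GRAPH_JSON = json.dumps({
    "nodes": [
        {"op": "null", "name": "x", "inputs": []},
        {"op": "null", "name": "w", "inputs": []},
        {"op": "call", "name": "scale", "func": "multiply", "inputs": [0, 1]},
        {"op": "call", "name": "shift", "func": "add_one", "inputs": [2]},
    ],
    "output": 3,
})

# Stand-ins for the generated operator implementations.
OPERATORS = {
    "multiply": lambda a, b: a * b,
    "add_one": lambda a: a + 1,
}


def run_graph(graph_json, operators, params, inputs):
    # Step 1: reconstruct the graph in memory.
    graph = json.loads(graph_json)
    values = {}
    # Nodes are assumed to be listed in topological order.
    for node_id, node in enumerate(graph["nodes"]):
        if node["op"] == "null":
            # Step 2: bind each input node to a user input or a parameter.
            name = node["name"]
            values[node_id] = inputs[name] if name in inputs else params[name]
        else:
            # Step 3: invoke the operator on its already-computed inputs.
            args = [values[i] for i in node["inputs"]]
            values[node_id] = operators[node["func"]](*args)
    return values[graph["output"]]


result = run_graph(GRAPH_JSON, OPERATORS, params={"w": 3}, inputs={"x": 4})
print(result)
```

Because the node list is assumed to be topologically sorted, a single pass over it is enough to invoke the operators in a valid order.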
+
+microTVM supports two ways to do this:
+
+1. **Host-Driven**. The Graph Runtime can run on the host and carry out 
execution by issuing
+   commands to the device using an RPC link with a UART-like transport.
+2. **Standalone**. A C Graph Runtime is available to be compiled on-device, but 
it is not particularly

Review comment:
       fixed

##########
File path: docs/dev/microtvm_design.rst
##########
@@ -0,0 +1,340 @@
+microTVM supports two ways to do this:
+
+1. **Host-Driven**. The Graph Runtime can run on the host and carry out 
execution by issuing
+   commands to the device using an RPC link with a UART-like transport.
+2. **Standalone**. A C Graph Runtime is available to be compiled on-device, but 
it is not particularly
+   memory efficient. This approach enables standalone execution without any 
attached host.
+
+Host-Driven is designed for experimenting with models on-device and, like 
AutoTVM, uses the RPC server to
+drive computation on-device. Standalone is intended for deployment.
+
+Host-Driven Execution
+^^^^^^^^^^^^^^^^^^^^^
+
+In Host-Driven execution, the firmware binary contains the following:
+
+1. Generated operator implementations from TVM.
+2. The TVM C runtime.
+3. SoC-specific initialization.
+4. The TVM RPC server.
+5. (optional) Simplified Parameters.
+
+This firmware image is flashed onto the device and a GraphRuntime instance is 
created on the host.
+The GraphRuntime drives execution by sending RPC commands over a UART:
+
+.. image:: microtvm_host_driven.svg
+
+Standalone Execution
+^^^^^^^^^^^^^^^^^^^^
+
+In Standalone execution, the GraphRuntime is instantiated on device:
+
+.. image:: microtvm_standalone.svg
+
+microTVM Firmware
+------------------
+
+We can now discuss how microTVM firmware should behave. An important task 
common to both model
+execution strategies is configuring the SoC to match the way it performs in 
production. microTVM
+considers this task project- and SoC-dependent. Whether for AutoTVM, 
host-driven model inference, or
+in standalone deployment, the user is expected to supply a project whose 
main() does the following:
+
+1. Configure the SoC to match deployment performance.
+2. Initialize the TVM C Runtime.
+
+When configuring for host-driven inference or AutoTVM, the remaining tasks are 
well-defined:
+
+3. Initialize a transport (i.e. a UART) for use with the TVM RPC server.
+4. Launch the TVM RPC Server.
+
+When configuring for standalone deployment, the firmware needs to:
+
+1. Instantiate the system library by calling the ``runtime.SystemLib`` 
PackedFunc.
+2. Instantiate a GraphRuntime passing the system library module.
+3. Configure parameters and inputs as needed.
+4. Run the model.
+
+Parts of a microTVM Binary
+--------------------------
+
+To summarize, a microTVM firmware binary image must contain these parts:
+
+1. Operator implementations, produced by TVM.
+2. The TVM C runtime library, supplied by TVM as a static library.
+3. SoC Initialization, supplied by the user.
+
+For Host-driven model execution, firmware also needs:
+
+4. The TVM RPC Server library.
+
+For Standalone model execution, firmware also needs:
+
+4. The TVM C GraphRuntime library, supplied by TVM as a static library.
+5. The remaining compiler outputs (Simplified Parameters and Graph JSON).
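
The lists above can be restated as a small lookup, which may be handy as a checklist when assembling a project. This is only a sketch of the summary in this section: the names ``COMMON_PARTS``, ``EXTRA_PARTS``, and ``firmware_parts`` are hypothetical and do not correspond to any TVM API.

```python
# Parts every microTVM firmware image needs, per the summary above.
COMMON_PARTS = [
    "operator implementations",
    "TVM C runtime library",
    "SoC initialization",
]

# Additional parts that depend on the model execution strategy.
EXTRA_PARTS = {
    "host-driven": ["TVM RPC server library"],
    "standalone": [
        "TVM C GraphRuntime library",
        "simplified parameters",
        "graph JSON",
    ],
}


def firmware_parts(mode):
    """Return the parts a firmware image needs for the given execution mode."""
    if mode not in EXTRA_PARTS:
        raise ValueError(f"unknown execution mode: {mode!r}")
    return COMMON_PARTS + EXTRA_PARTS[mode]


for mode in ("host-driven", "standalone"):
    print(mode, "->", firmware_parts(mode))
```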
+
+The Automated Build Flow
+-------------------------
+
+Once code generation is complete, ``tvm.relay.build`` returns a 
``tvm.runtime.Module`` and the
+user can save the generated C source or binary object to a ``.c`` or ``.o`` 
file. From this point, TVM
+can theoretically step back and the user can compile and run the code 
separately.
+
+However, for AutoTVM, TVM needs some automated flow to handle the following 
tasks:
+
+1. Integrate operator implementations, the TVM C Runtime library, and the TVM 
RPC Server library into the
+   firmware project containing user-supplied SoC Initialization.
+2. Build the resulting project.
+3. Program the built firmware onto a (specific) attached device.
+4. Identify the serial port or other transport to be used by TVM to drive 
remote execution.
+
+At present, TVM expects the user to supply an implementation of the 
``tvm.micro.Compiler``,
+``tvm.micro.Flasher``, and ``tvm.micro.Transport`` interfaces. TVM then:
+
+1. Builds each piece separately as a library
+2. Builds the libraries into a binary firmware image.
+3. Programs the firmware image onto an attached device.
+4. Opens a serial port to serve as the RPC server transport.
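
The four steps above can be sketched as a short driver over user-supplied pieces. The classes below are hypothetical stand-ins written for this example; they do not reproduce the actual ``tvm.micro.Compiler``, ``tvm.micro.Flasher``, or ``tvm.micro.Transport`` interfaces, whose method names and signatures are not shown in this document.

```python
from abc import ABC, abstractmethod


class SketchCompiler(ABC):
    """Hypothetical stand-in for a user-supplied compiler implementation."""

    @abstractmethod
    def build_library(self, name):
        """Build one piece (operators, runtime, RPC server) as a library."""

    @abstractmethod
    def link_binary(self, libraries):
        """Link the libraries into a firmware image."""


class SketchFlasher(ABC):
    """Hypothetical stand-in for a user-supplied flasher implementation."""

    @abstractmethod
    def flash(self, binary):
        """Program the image onto a device and return its serial port name."""


def automated_build(compiler, flasher, open_transport, pieces):
    libraries = [compiler.build_library(p) for p in pieces]  # step 1
    binary = compiler.link_binary(libraries)                 # step 2
    port = flasher.flash(binary)                             # step 3
    return open_transport(port)                              # step 4


class FakeCompiler(SketchCompiler):
    """Dummy compiler that only produces descriptive strings."""

    def build_library(self, name):
        return f"lib{name}.a"

    def link_binary(self, libraries):
        return "firmware(" + ",".join(libraries) + ")"


class FakeFlasher(SketchFlasher):
    """Dummy flasher that records the image instead of programming a device."""

    def flash(self, binary):
        self.flashed = binary
        return "/dev/ttyFAKE0"


flasher = FakeFlasher()
transport = automated_build(
    FakeCompiler(),
    flasher,
    lambda port: f"open:{port}",
    ["operators", "runtime", "rpc_server"],
)
print(transport)
```

The dummy implementations only record what they were asked to do, which is enough to check that the steps run in the order listed above.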
+
+This design was chosen to reduce build times for microTVM (the common 
libraries need to be built
+only once per candidate operator implementation). In practice, these projects 
are extremely small
+and compile relatively quickly. Compared with the added complexity of this 
tighter build integration
+with TVM, the performance gains are likely not worth it. A future design will 
consolidate the build
+tasks into a single step and narrow the interface to provide a better 
integration.
+
+Measuring operator performance
+------------------------------
+
+The TVM C runtime depends on user-supplied functions to measure time 
on-device. Users should implement
+``TVMPlatformTimerStart`` and ``TVMPlatformTimerStop``. These functions should 
measure wall time, so there

Review comment:
       fixed



