masahi commented on a change in pull request #4602: [Docs] Bring Your Own Codegen Guide -- Part 1
URL: https://github.com/apache/incubator-tvm/pull/4602#discussion_r363173362
 
 

 ##########
 File path: docs/dev/relay_bring_your_own_codegen.rst
 ##########
 @@ -0,0 +1,514 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+=============================
+Bring Your Own Codegen To TVM
+=============================
+**Author**: `Zhi Chen <https://github.com/zhiics>`_, `Cody Hao Yu <https://github.com/comaniac>`_
+
+As the number of hardware devices targeted by deep learning workloads keeps increasing, the knowledge required for users to achieve high performance on these devices keeps increasing as well. To free data scientists from worrying about performance when developing a new model, hardware vendors either provide libraries such as MKLDNN or cuDNN with many commonly used deep learning operators, or provide frameworks such as TensorRT that let users describe their models in a certain way to achieve high performance. However, users have to learn a new programming interface whenever they attempt to work with a new library or device. As a result, a unified programming interface becomes more and more important: it 1) lets all users and hardware vendors stand on the same page, and 2) provides a feasible solution that allows specialized hardware or libraries to support only widely used operators with extremely high performance, while falling back to general devices like CPU/GPU for unsupported operators.
+
+In this developer guide, we demonstrate how you, as a hardware vendor, can easily implement your own codegen and register it as a Relay backend compiler to support your hardware device/library. This guide covers two types of codegen, based on the graph representation you need:
+
+**1. You want to generate C code.**
+
+If your hardware already has a well-optimized C/C++ library, such as Intel CBLAS/MKL for CPUs or NVIDIA cuBLAS for GPUs, then this is what you are looking for. Fortunately, C source code modules are fully compatible with TVM runtime modules, which means the generated code can be compiled by any C/C++ compiler with proper compilation flags. The only tasks you have are to implement a codegen that generates C code for subgraphs, and a C source module to integrate into the TVM runtime module. We will demonstrate how to implement a C code generator for your hardware in the following section.
+
+**2. You want to generate any other graph representations.**
+
+Your hardware may require other forms of graph representation, such as JSON. In this case, you need to implement not only a codegen but also a customized TVM runtime module to let the TVM runtime know how this graph representation should be executed. If you already have a complete graph execution engine for your hardware, such as TensorRT for GPUs, then this is the solution you can consider.
+
+After you finish the codegen and runtime, you can let your customers annotate their models with your customized tag to make use of them. The tutorial for end-users to annotate and launch a specific codegen is **here (TBA)**.
+
+*********************
+Implement a C Codegen
+*********************
+
+In this part, we demonstrate how to implement a codegen that generates C code with pre-implemented operator functions. To keep it simple, our example codegen does not depend on third-party libraries. Instead, we manually implement two function macros in C:
+
+.. code-block:: c++
+
+    #define CSOURCE_BINARY_OP_1D(p_ID_, p_OP_, p_DIM1_)         \
+        extern "C" void p_ID_(float* a, float* b, float* out) { \
+            for (int64_t i = 0; i < p_DIM1_; ++i) {             \
+                out[i] = a[i] p_OP_ b[i];                       \
+            }                                                   \
+        }
+
+    #define CSOURCE_BINARY_OP_2D(p_ID_, p_OP_, p_DIM1_, p_DIM2_)  \
+        extern "C" void p_ID_(float* a, float* b, float* out) {   \
+            for (int64_t i = 0; i < p_DIM1_; ++i) {               \
+                for (int64_t j = 0; j < p_DIM2_; ++j) {           \
+                    int64_t k = i * p_DIM2_ + j;                  \
+                    out[k] = a[k] p_OP_ b[k];                     \
+                }                                                 \
+            }                                                     \
+        }
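+
+As a quick sanity check, here is a minimal usage sketch of these macros; the function name ``test_add`` and the ``main`` driver are hypothetical, and the sketch assumes the two macros above are in scope:
+
+.. code-block:: c++
+
+    // Hypothetical standalone test of the macros above (not part of the codegen).
+    #include <cstdint>
+    #include <cstdio>
+
+    // Expands to: extern "C" void test_add(float* a, float* b, float* out) { ... }
+    CSOURCE_BINARY_OP_2D(test_add, +, 2, 2);
+
+    int main() {
+      float a[4] = {1, 2, 3, 4};
+      float b[4] = {5, 6, 7, 8};
+      float out[4];
+      test_add(a, b, out);
+      printf("%.1f\n", out[0]);  // prints 6.0
+      return 0;
+    }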
+
+With the two macros, we can generate binary operators for 1-D and 2-D tensors. For example, consider the following subgraph, assuming all inputs are 2-D tensors with shape (10, 10):
+
+::
+
+       gcc_input0
+           |
+          add <-- gcc_input1
+           |
+        subtract <-- gcc_input2
+           |
+        multiply <-- gcc_input3
+           |
+          out
+
+Our goal is to generate the following compilable code to execute the subgraph:
+
+.. code-block:: c++
+
+    #include <tvm/runtime/c_runtime_api.h>
+    #include <dlpack/dlpack.h>
+    #include <cstdint>
+    #include <cstdio>
+    #include <cstring>
+    #include <iostream>
+
+    #define GCC_BINARY_OP_1D(p_ID_, p_OP_, p_DIM1_)           \
+      extern "C" void p_ID_(float* a, float* b, float* out) { \
+        for (int64_t i = 0; i < p_DIM1_; ++i) {               \
+          out[i] = a[i] p_OP_ b[i];                           \
+        }                                                     \
+      }
+
+    #define GCC_BINARY_OP_2D(p_ID_, p_OP_, p_DIM1_, p_DIM2_)  \
+      extern "C" void p_ID_(float* a, float* b, float* out) { \
+        for (int64_t i = 0; i < p_DIM1_; ++i) {               \
+          for (int64_t j = 0; j < p_DIM2_; ++j) {             \
+            int64_t k = i * p_DIM2_ + j;                      \
+            out[k] = a[k] p_OP_ b[k];                         \
+          }                                                   \
+        }                                                     \
+      }
+
+    // Note 1
+    GCC_BINARY_OP_2D(gcc_0_0, *, 10, 10);
+    GCC_BINARY_OP_2D(gcc_0_1, -, 10, 10);
+    GCC_BINARY_OP_2D(gcc_0_2, +, 10, 10);
+
+    // Note 2
+    extern "C" void gcc_0_(float* gcc_input0, float* gcc_input1,
+                           float* gcc_input2, float* gcc_input3, float* out) {
+      float* buf_0 = (float*)malloc(4 * 100);
+      float* buf_1 = (float*)malloc(4 * 100);
+      gcc_0_2(gcc_input0, gcc_input1, buf_0);
+      gcc_0_1(buf_0, gcc_input2, buf_1);
+      gcc_0_0(buf_1, gcc_input3, out);
+      free(buf_0);
+      free(buf_1);
+    }
+
+    // Note 3
+    extern "C" int gcc_0(TVMValue* value, int* type_code, int nargs) {
+      if (nargs != 5) {
+        printf("Expect 5 args, but get %d", nargs);
+        return 1;
+      }
+      DLTensor* arg0 = static_cast<DLTensor*>(value[0].v_handle);
+      DLTensor* arg1 = static_cast<DLTensor*>(value[1].v_handle);
+      DLTensor* arg2 = static_cast<DLTensor*>(value[2].v_handle);
+      DLTensor* arg3 = static_cast<DLTensor*>(value[3].v_handle);
+      DLTensor* out = static_cast<DLTensor*>(value[4].v_handle);
+      gcc_0_(static_cast<float*>(arg0->data), static_cast<float*>(arg1->data),
+             static_cast<float*>(arg2->data), static_cast<float*>(arg3->data),
+             static_cast<float*>(out->data));
+      return 0;
+    }
+
+Here we highlight the notes marked in the above code:
+
+* **Note 1** contains the function implementations for the three nodes in the subgraph.
+
+* **Note 2** is a function that executes the subgraph by allocating intermediate buffers and invoking the corresponding functions.
+
+* **Note 3** is a TVM-runtime-compatible wrapper function with unified arguments. It accepts and unpacks the packed ``TVMValue`` arguments, and invokes the corresponding function to execute the subgraph. Thanks to the unified function arguments, the TVM runtime can directly invoke ``gcc_0`` to execute the subgraph without additional effort. With the above code generated, TVM is able to compile it along with the rest of the graph and export a single library for deployment.
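+
+To make **Note 3** concrete, here is a hedged sketch of how a caller could invoke ``gcc_0`` manually through the packed-argument convention. The ``main`` driver and buffer contents are hypothetical, and the sketch assumes the generated code above is compiled into the same binary:
+
+.. code-block:: c++
+
+    #include <tvm/runtime/c_runtime_api.h>
+    #include <dlpack/dlpack.h>
+    #include <cstdio>
+    #include <vector>
+
+    extern "C" int gcc_0(TVMValue* value, int* type_code, int nargs);
+
+    int main() {
+      // Four (10, 10) inputs and one output buffer, flattened to 100 floats.
+      std::vector<float> in0(100, 1), in1(100, 2), in2(100, 3), in3(100, 4), out(100);
+      float* data[5] = {in0.data(), in1.data(), in2.data(), in3.data(), out.data()};
+
+      DLTensor tensors[5] = {};
+      TVMValue values[5];
+      int type_codes[5] = {};  // the generated gcc_0 never inspects the type codes
+      for (int i = 0; i < 5; ++i) {
+        tensors[i].data = data[i];  // gcc_0 only reads the data field
+        values[i].v_handle = &tensors[i];
+      }
+      gcc_0(values, type_codes, 5);
+      printf("%.1f\n", out[0]);  // ((1 + 2) - 3) * 4 = 0.0
+      return 0;
+    }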
+
+In the rest of this section, we will implement a codegen step-by-step to 
generate the above code. Your own codegen has to be located at 
``src/relay/backend/contrib/<your-codegen-name>/``. In our example, we name our 
codegen "codegen_c" and put it under 
``src/relay/backend/contrib/codegen_c/codegen.cc``. Feel free to check this 
file for a complete implementation.
+
+Specifically, we are going to implement two classes in this file; here is their relationship:
+
+::
+
+                        subgraph                             subgraph
+  TVM backend ----------------------------> CSourceCodegen ------------> CodegenC
+      ^                                          |    ^                      |
+      |                                          |    |                      |
+      --------------------------------------------    ------------------------
+           generated C source runtime module              generated C code
+
+When the TVM backend finds that a function (subgraph) in a Relay graph is annotated with the registered compiler tag (``ccompiler`` in this example), it invokes ``CSourceCodegen`` and passes the subgraph. ``CSourceCodegen``'s member function ``CreateCSourceModule`` will 1) generate C code for the subgraph, and 2) wrap the generated C code into a C source runtime module for the TVM backend to compile and deploy. In particular, the C code generation is delegated to the ``CodegenC`` class, which provides many useful utilities to ease the code generation implementation. The following sections will implement these two classes in bottom-up order.
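+
+For reference, the glue that lets the TVM backend find your codegen for the ``ccompiler`` tag is a global function registered under the name ``relay.ext.<your-codegen-name>``. The snippet below is a hedged sketch of that entry point; the actual registration is covered in the complete ``codegen.cc``:
+
+.. code-block:: c++
+
+    // Hedged sketch (lives in codegen.cc; requires tvm/runtime/registry.h):
+    // the entry point the TVM backend invokes for subgraphs annotated with
+    // the "ccompiler" tag.
+    runtime::Module CCompiler(const ObjectRef& ref) {
+      CSourceCodegen csource;
+      return csource.CreateCSourceModule(ref);  // generate C code and wrap it
+    }
+
+    TVM_REGISTER_GLOBAL("relay.ext.ccompiler").set_body_typed(CCompiler);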
+
+Implement CodegenC
+==================
+
+In ``src/relay/backend/contrib/codegen_c/codegen.cc``, we first create a 
codegen class skeleton under the namespace of ``tvm.relay.contrib``:
+
+.. code-block:: c++
+
+    #include <tvm/relay/expr_functor.h>
+    #include <tvm/relay/transform.h>
+    #include <tvm/relay/type.h>
+    #include <tvm/runtime/module.h>
+    #include <tvm/runtime/object.h>
+
+    #include <fstream>
+    #include <sstream>
+
+    #include "codegen_c.h"
+
+    namespace tvm {
+    namespace relay {
+    namespace contrib {
+
+    class CodegenC : public ExprVisitor, public CodegenCBase {
+      public:
+        explicit CodegenC(const std::string& id) { this->ext_func_id_ = id; }
+
+        void VisitExpr_(const VarNode* node) { ; }         // to be filled in below
+        void VisitExpr_(const CallNode* call) final { ; }  // to be filled in below
+        std::string JIT() { return ""; }                   // to be filled in below
+
+      private:
+        /*! \brief The function id that represents a C source function. */
+        std::string ext_func_id_ = "";
+        /*! \brief The index of a wrapped C function. */
+        int func_idx = 0;
+        /*! \brief The index of allocated buffers. */
+        int buf_idx_ = 0;
+        /*! \brief The arguments of a C compiler compatible function. */
+        std::vector<std::string> ext_func_args_;
+        /*! \brief The statements of a C compiler compatible function. */
+        std::vector<std::string> ext_func_body;
+        /*! \brief The declaration statements of a C compiler compatible function. */
+        std::vector<std::string> func_decl_;
+        /*! \brief The declaration statements of buffers. */
+        std::vector<std::string> buf_decl_;
+        /*! \brief The name and index pairs for output. */
+        std::vector<std::pair<std::string, int>> out_;
+    };
+
+The ``CodegenC`` class inherits two classes: ``ExprVisitor`` provides the ability to traverse subgraphs and collect the required information to generate subgraph functions such as ``gcc_0_``; ``CodegenCBase`` provides the abilities and utilities to generate wrapper functions such as ``gcc_0`` in the above example. As can be seen, we only need to implement three functions in this codegen class to make it work.
+
+Code Generation for Operators
+-----------------------------
+
+We first implement ``VisitExpr_(const CallNode* call)``. This function visits all call nodes when traversing the subgraph. Each call node contains an operator that we want to offload to our hardware, so we need to generate the corresponding C code with the correct operators in topological order. We implement this function step-by-step as follows.
+
+**1. Generate the function declaration**
+
+Example Result: ``CSOURCE_BINARY_OP_2D(gcc_0_0, *, 10, 10);``
+
+To generate the function declaration, as shown above, we need 1) a function 
name (e.g., ``gcc_0_0``), 2) the type of operator (e.g., ``*``), and 3) the 
input tensor shape (e.g., ``(10, 10)``). Fortunately, this information can be 
obtained easily from ``CallNode``:
+
+.. code-block:: c++
+
+    std::ostringstream macro_stream;
+    std::ostringstream decl_stream;
+    std::ostringstream buf_stream;
+
+    // Generate a unique function name of your choice.
+    std::string func_name = ext_func_id_ + "_" + std::to_string(func_idx++);
+
+    // Make function declaration string.
+    macro_stream << "CSOURCE_BINARY_OP_" << call->args.size() << "D(" << func_name << ", ";
+
+    // Check the operator type.
+    if (IsOp(call, "add")) {
+      macro_stream << "+";
+    } else if (IsOp(call, "subtract")) {
+      macro_stream << "-";
+    } else if (IsOp(call, "multiply")) {
+      macro_stream << "*";
+    } else {
+      LOG(FATAL) << "Unrecognized op";
+    }
+
+    // Extract the input tensor shape.
+    auto in_shape = GetShape(call->args[0]->checked_type());
+    for (size_t i = 0; i < in_shape.size(); ++i) {
+      macro_stream << ", " << in_shape[i];
+    }
+    macro_stream << ");";
+    func_decl_.push_back(macro_stream.str());
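+
+As a standalone illustration, here is a hedged sketch that reproduces the stream assembly above for a hypothetical ``add`` call on (10, 10) tensors; the hard-coded values stand in for what ``CallNode``, ``IsOp``, and ``GetShape`` would provide:
+
+.. code-block:: c++
+
+    #include <cstdint>
+    #include <iostream>
+    #include <sstream>
+    #include <string>
+    #include <vector>
+
+    int main() {
+      std::ostringstream macro_stream;
+      std::string func_name = "gcc_0_0";         // stands in for ext_func_id_ + "_" + func_idx
+      size_t num_args = 2;                       // stands in for call->args.size()
+      std::vector<int64_t> in_shape = {10, 10};  // stands in for GetShape(...)
+
+      macro_stream << "CSOURCE_BINARY_OP_" << num_args << "D(" << func_name << ", ";
+      macro_stream << "+";  // the operator IsOp(call, "add") resolves to
+      for (size_t i = 0; i < in_shape.size(); ++i) {
+        macro_stream << ", " << in_shape[i];
+      }
+      macro_stream << ");";
+      std::cout << macro_stream.str() << std::endl;
+      // Prints: CSOURCE_BINARY_OP_2D(gcc_0_0, +, 10, 10);
+      return 0;
+    }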
+
+Back in ``VisitExpr_``, note that we push the generated code into the class member variable ``func_decl_``. This means that after we finish traversing the entire subgraph, we have collected all required function declarations, and the only thing we need to do is have them compiled by GCC. The rest of the implementation of ``VisitExpr_(const CallNode* call)`` also follows this concept.
 
 Review comment:
   the only thing we need to do is have them compiled by GCC.
   The rest of implementation
