This is an automated email from the ASF dual-hosted git repository.
tqchen pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm-ffi.git
The following commit(s) were added to refs/heads/main by this push:
new 6e448cf doc: Polish QuickStart Guide (#166)
6e448cf is described below
commit 6e448cffb1873afed0d626e7f92ddf2d9a2435bb
Author: Junru Shao <[email protected]>
AuthorDate: Fri Oct 17 15:54:41 2025 -0700
doc: Polish QuickStart Guide (#166)
This PR rewrote the quickstart guide, making it smoother to read.
---
docs/get_started/quick_start.md | 230 ----------------------------
docs/get_started/quickstart.rst | 329 ++++++++++++++++++++++++++++++++++++++++
docs/guides/python_guide.md | 2 +-
docs/index.rst | 2 +-
4 files changed, 331 insertions(+), 232 deletions(-)
diff --git a/docs/get_started/quick_start.md b/docs/get_started/quick_start.md
deleted file mode 100644
index 59995f9..0000000
--- a/docs/get_started/quick_start.md
+++ /dev/null
@@ -1,230 +0,0 @@
-<!--- Licensed to the Apache Software Foundation (ASF) under one -->
-<!--- or more contributor license agreements. See the NOTICE file -->
-<!--- distributed with this work for additional information -->
-<!--- regarding copyright ownership. The ASF licenses this file -->
-<!--- to you under the Apache License, Version 2.0 (the -->
-<!--- "License"); you may not use this file except in compliance -->
-<!--- with the License. You may obtain a copy of the License at -->
-
-<!--- http://www.apache.org/licenses/LICENSE-2.0 -->
-
-<!--- Unless required by applicable law or agreed to in writing, -->
-<!--- software distributed under the License is distributed on an -->
-<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
-<!--- KIND, either express or implied. See the License for the -->
-<!--- specific language governing permissions and limitations -->
-<!--- under the License. -->
-# Quick Start
-
-This is a quick start guide explaining the basic features and usage of tvm-ffi.
-The source code can be found at `examples/quick_start` in the project source.
-
-## Build and Run the Example
-
-Let us first get started by building and running the example. The example will show us:
-
-- How to expose C++ functions as TVM FFI ABI functions
-- How to load and run a tvm-ffi based library from Python
-- How to load and run a tvm-ffi based library from C++
-
-Before starting, ensure you have:
-
-- TVM FFI installed
-- C++ compiler with C++17 support
-- CMake 3.18 or later
-- (Optional) Ninja build system (the quick-start uses Ninja for fast incremental builds)
-- (Optional) CUDA toolkit for GPU examples
-- (Optional) PyTorch for checking torch integrations
-
-Then obtain a copy of the tvm-ffi source code.
-
-```bash
-git clone https://github.com/apache/tvm-ffi --recursive
-cd tvm-ffi
-```
-
-The examples live in the `examples` folder; you can quickly build
-the example using the following commands.
-
-```bash
-cd examples/quick_start
-
-# use Ninja, or omit -G Ninja to use the default generator
-cmake --fresh -G Ninja -B build -S .
-cmake --build build --parallel
-```
-
-After the build finishes, you can run the Python example with
-
-```bash
-python run_example.py
-```
-
-You can also run the C++ example:
-
-```bash
-./build/run_example
-```
-
-If the CUDA toolkit is available, the GPU demo binary is built alongside the CPU sample:
-
-```bash
-./build/run_example_cuda
-```
-
-## Walk through the Example
-
-Now that we have quickly tried things out, let us walk through the details of the example.
-Specifically, in this example, we create a simple "add one" operation that adds 1 to each element of an input
-tensor and expose that function as a TVM FFI compatible function. The key file structure is as follows:
-
-```text
-examples/quick_start/
-├── src/
-│ ├── add_one_cpu.cc # CPU implementation
-│ ├── add_one_c.c # A low-level C based implementation
-│ ├── add_one_cuda.cu # CUDA implementation
-│ ├── run_example.cc # C++ usage example
-│ └── run_example_cuda.cc # C++ with CUDA kernel usage example
-├── run_example.py # Python usage example
-├── run_example.sh # Build and run script
-└── CMakeLists.txt # Build configuration
-```
-
-### CPU Implementation
-
-```cpp
-#include <tvm/ffi/dtype.h>
-#include <tvm/ffi/error.h>
-#include <tvm/ffi/function.h>
-#include <tvm/ffi/container/tensor.h>
-
-namespace tvm_ffi_example {
-
-namespace ffi = tvm::ffi;
-
-void AddOne(ffi::TensorView x, ffi::TensorView y) {
- // Validate inputs
- TVM_FFI_ICHECK(x.ndim() == 1) << "x must be a 1D tensor";
- DLDataType f32_dtype{kDLFloat, 32, 1};
- TVM_FFI_ICHECK(x.dtype() == f32_dtype) << "x must be a float tensor";
- TVM_FFI_ICHECK(y.ndim() == 1) << "y must be a 1D tensor";
- TVM_FFI_ICHECK(y.dtype() == f32_dtype) << "y must be a float tensor";
- TVM_FFI_ICHECK(x.size(0) == y.size(0)) << "x and y must have the same shape";
-
- // Perform the computation
- for (int i = 0; i < x.size(0); ++i) {
-    static_cast<float*>(y.data_ptr())[i] = static_cast<float*>(x.data_ptr())[i] + 1;
- }
-}
-
-// Expose the function through TVM FFI
-TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one_cpu, tvm_ffi_example::AddOne);
-}
-```
-
-**Key Points:**
-
-- Functions take `tvm::ffi::TensorView` parameters for cross-language compatibility
-- The `TVM_FFI_DLL_EXPORT_TYPED_FUNC` macro exposes the function with a given name
-
-### CUDA Implementation
-
-```cpp
-void AddOneCUDA(ffi::TensorView x, ffi::TensorView y) {
- // Validation (same as CPU version)
- // ...
-
- int64_t n = x.size(0);
- int64_t nthread_per_block = 256;
- int64_t nblock = (n + nthread_per_block - 1) / nthread_per_block;
-
- // Get current CUDA stream from environment
- cudaStream_t stream = static_cast<cudaStream_t>(
- TVMFFIEnvGetStream(x.device().device_type, x.device().device_id));
-
- // Launch kernel
- AddOneKernel<<<nblock, nthread_per_block, 0, stream>>>(
- static_cast<float*>(x.data_ptr()), static_cast<float*>(y.data_ptr()), n);
-}
-
-TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one_cuda, tvm_ffi_example::AddOneCUDA);
-```
-
-**Key Points:**
-
-- We use `TVMFFIEnvGetStream` to obtain the current stream from the environment
-- When invoking an ffi Function from the Python side with a PyTorch tensor as argument,
-  the stream will be populated with torch's current stream.
-
-### Working with PyTorch
-
-After the build, we get libraries such as `build/add_one_cuda.so`, which can be loaded
-with the api {py:func}`tvm_ffi.load_module` that returns a {py:class}`tvm_ffi.Module`.
-The function then becomes available as a property of the loaded module.
-The tensor arguments of the ffi functions automatically accept `torch.Tensor`. The following code shows how
-to use the function with torch.
-
-```python
-import torch
-import tvm_ffi
-
-if torch.cuda.is_available():
- mod = tvm_ffi.load_module("build/add_one_cuda.so")
-
- x = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32, device="cuda")
- y = torch.empty_like(x)
-
- # TVM FFI automatically handles CUDA streams
- stream = torch.cuda.Stream()
- with torch.cuda.stream(stream):
- mod.add_one_cuda(x, y)
- stream.synchronize()
-```
-
-### Working with Python Data Arrays
-
-TVM FFI functions work automatically with Python data arrays that are compatible with DLPack.
-The following example shows how to use the function with NumPy.
-
-```python
-import tvm_ffi
-import numpy as np
-
-# Load the compiled module
-mod = tvm_ffi.load_module("build/add_one_cpu.so")
-
-# Create input and output arrays
-x = np.array([1, 2, 3, 4, 5], dtype=np.float32)
-y = np.empty_like(x)
-
-# Call the function
-mod.add_one_cpu(x, y)
-print("Result:", y) # [2, 3, 4, 5, 6]
-```
-
-### Working with C++
-
-One important design goal of tvm-ffi is to be universally portable.
-As a result, the resulting libraries have no explicit dependency on Python
-and can be loaded in other language environments, such as C++. The following code
-shows how to run the exported example function in C++.
-
-```cpp
-#include <tvm/ffi/container/tensor.h>
-#include <tvm/ffi/extra/module.h>
-
-namespace ffi = tvm::ffi;
-
-void CallAddOne(ffi::TensorView x, ffi::TensorView y) {
- ffi::Module mod = ffi::Module::LoadFromFile("build/add_one_cpu.so");
- ffi::Function add_one_cpu = mod->GetFunction("add_one_cpu").value();
- add_one_cpu(x, y);
-}
-```
-
-## Summary: Key Concepts
-
-- **TVM_FFI_DLL_EXPORT_TYPED_FUNC** exposes a C++ function through the tvm-ffi C ABI
-- **ffi::Tensor** is a universal tensor structure that enables zero-copy exchange of array data
-- **Module loading** is provided by tvm ffi APIs in multiple languages.
diff --git a/docs/get_started/quickstart.rst b/docs/get_started/quickstart.rst
new file mode 100644
index 0000000..8a03328
--- /dev/null
+++ b/docs/get_started/quickstart.rst
@@ -0,0 +1,329 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+..
+.. http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Quick Start
+===========
+
+This guide walks through shipping a minimal ``add_one`` function that computes
+``y = x + 1`` in C++ and CUDA.
+
+TVM-FFI's Open ABI and FFI make it possible to **build once, ship everywhere**. That is,
+a single shared library works across:
+
+- **ML frameworks**, e.g. PyTorch, JAX, NumPy, CuPy, etc., and
+- **languages**, e.g. C++, Python, Rust, etc.
+
+.. admonition:: Prerequisite
+ :class: hint
+ :name: prerequisite
+
+ - Python: 3.9 or newer
+ - Compiler: C++17-capable toolchain (GCC/Clang/MSVC)
+ - Optional ML frameworks for testing: NumPy, PyTorch, JAX, CuPy
+ - CUDA: Any modern version if you want to try the CUDA part
+ - TVM-FFI installed via
+
+ .. code-block:: bash
+
+ pip install --force-reinstall --upgrade apache-tvm-ffi
+
+
+Write a Simple ``add_one``
+--------------------------
+
+.. _sec-cpp-source-code:
+
+Source Code
+~~~~~~~~~~~
+
+Suppose we implement a C++ function ``AddOne`` that performs elementwise ``y = x + 1``
+for a 1-D ``float32`` vector. The source code (C++, CUDA) is:
+
+.. tabs::
+
+ .. group-tab:: C++
+
+ .. code-block:: cpp
+ :emphasize-lines: 8, 17
+
+ // File: main.cc
+ #include <tvm/ffi/container/tensor.h>
+ #include <tvm/ffi/function.h>
+
+ namespace tvm_ffi_example_cpp {
+
+ /*! \brief Perform vector add one: y = x + 1 (1-D float32) */
+ void AddOne(tvm::ffi::TensorView x, tvm::ffi::TensorView y) {
+ int64_t n = x.shape()[0];
+ float* x_data = static_cast<float *>(x.data_ptr());
+ float* y_data = static_cast<float *>(y.data_ptr());
+ for (int64_t i = 0; i < n; ++i) {
+ y_data[i] = x_data[i] + 1;
+ }
+ }
+
+ TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one, tvm_ffi_example_cpp::AddOne);
+ }
+
+
+ .. group-tab:: CUDA
+
+ .. code-block:: cpp
+ :emphasize-lines: 15, 22, 26
+
+ // File: main.cu
+ #include <tvm/ffi/container/tensor.h>
+ #include <tvm/ffi/extra/c_env_api.h>
+ #include <tvm/ffi/function.h>
+
+ namespace tvm_ffi_example_cuda {
+
+ __global__ void AddOneKernel(float* x, float* y, int n) {
+ int idx = blockIdx.x * blockDim.x + threadIdx.x;
+ if (idx < n) {
+ y[idx] = x[idx] + 1;
+ }
+ }
+
+ void AddOne(tvm::ffi::TensorView x, tvm::ffi::TensorView y) {
+ int64_t n = x.shape()[0];
+ float* x_data = static_cast<float *>(x.data_ptr());
+ float* y_data = static_cast<float *>(y.data_ptr());
+ int64_t threads = 256;
+ int64_t blocks = (n + threads - 1) / threads;
+ cudaStream_t stream = static_cast<cudaStream_t>(
+ TVMFFIEnvGetStream(x.device().device_type, x.device().device_id));
+ AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);
+ }
+
+ TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one, tvm_ffi_example_cuda::AddOne);
+ }
+
+
+
+Macro :c:macro:`TVM_FFI_DLL_EXPORT_TYPED_FUNC` exports the C++ function ``AddOne``
+with the public name ``add_one`` in the resulting library.
+TVM-FFI looks it up at runtime to make the function available across languages.
+
+Class :cpp:class:`tvm::ffi::TensorView` allows zero-copy interop with tensors from
+different ML frameworks:
+
+- NumPy, CuPy,
+- PyTorch, JAX, or
+- any array type that supports the standard `DLPack protocol
+  <https://data-apis.org/array-api/2024.12/design_topics/data_interchange.html>`_.
+
+Finally, :cpp:func:`TVMFFIEnvGetStream`, used in the CUDA code, makes it possible to
+launch a kernel on the caller's stream.
+
+.. _sec-cpp-compile-with-tvm-ffi:
+
+Compile with TVM-FFI
+~~~~~~~~~~~~~~~~~~~~
+
+**Raw command.** The basic compile command can be as concise as the following:
+
+.. tabs::
+
+ .. group-tab:: C++
+
+ .. code-block:: bash
+
+ g++ -shared -O3 main.cc \
+ -fPIC -fvisibility=hidden \
+ `tvm-ffi-config --cxxflags` \
+ `tvm-ffi-config --ldflags` \
+ `tvm-ffi-config --libs` \
+ -o libmain.so
+
+ .. group-tab:: CUDA
+
+ .. code-block:: bash
+
+ nvcc -shared -O3 main.cu \
+ --compiler-options -fPIC \
+ --compiler-options -fvisibility=hidden \
+ `tvm-ffi-config --cxxflags` \
+ `tvm-ffi-config --ldflags` \
+ `tvm-ffi-config --libs` \
+ -o libmain.so
+
+This produces a shared library ``libmain.so``. TVM-FFI automatically embeds the
+metadata needed to call the function across language and framework boundaries.
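+
+To sanity-check the artifact, you can list its exported symbols. Assuming a Linux
+toolchain with ``nm`` available, the typed-function export shows up under a prefixed
+symbol name such as ``__tvm_ffi_add_one`` (see Troubleshooting below):
+
+.. code-block:: bash
+
+   # expect an entry mentioning __tvm_ffi_add_one in the dynamic symbol table
+   nm -D libmain.so | grep add_one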
+
+**CMake.** The preferred approach for building across platforms is CMake, which relies on
+the CMake package ``tvm_ffi``; its location can be found via ``tvm-ffi-config --cmakedir``.
+
+.. tabs::
+
+ .. group-tab:: C++
+
+ .. code-block:: cmake
+
+ # Run `tvm-ffi-config --cmakedir` to find tvm-ffi targets
+ find_package(Python COMPONENTS Interpreter REQUIRED)
+ execute_process(
+ COMMAND "${Python_EXECUTABLE}" -m tvm-ffi-config --cmakedir
+ OUTPUT_STRIP_TRAILING_WHITESPACE
+ OUTPUT_VARIABLE tvm_ffi_ROOT
+ )
+ find_package(tvm_ffi CONFIG REQUIRED)
+ # Create C++ target `add_one_cpp`
+ add_library(add_one_cpp SHARED main.cc)
+ target_link_libraries(add_one_cpp PRIVATE tvm_ffi_header)
+ target_link_libraries(add_one_cpp PRIVATE tvm_ffi_shared)
+
+ .. group-tab:: CUDA
+
+ .. code-block:: cmake
+
+ # Run `tvm-ffi-config --cmakedir` to find tvm-ffi targets
+ find_package(Python COMPONENTS Interpreter REQUIRED)
+ execute_process(
+ COMMAND "${Python_EXECUTABLE}" -m tvm-ffi-config --cmakedir
+ OUTPUT_STRIP_TRAILING_WHITESPACE
+ OUTPUT_VARIABLE tvm_ffi_ROOT
+ )
+ find_package(tvm_ffi CONFIG REQUIRED)
+ # Create C++ target `add_one_cuda`
+ enable_language(CUDA)
+ add_library(add_one_cuda SHARED main.cu)
+ target_link_libraries(add_one_cuda PRIVATE tvm_ffi_header)
+ target_link_libraries(add_one_cuda PRIVATE tvm_ffi_shared)
+
+.. hint::
+
+   For a single-file C++/CUDA source, a convenient helper :py:func:`tvm_ffi.cpp.load_inline`
+   is provided to minimize boilerplate in compilation, linking, and loading.
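+
+   A minimal sketch of how this could look, assuming a torch-style ``load_inline``
+   signature (the argument names below are illustrative; check the
+   :py:func:`tvm_ffi.cpp.load_inline` API reference for the exact ones):
+
+   .. code-block:: python
+
+      import tvm_ffi.cpp
+
+      # Hypothetical arguments: compile and load the AddOne source in one call
+      mod = tvm_ffi.cpp.load_inline(
+          name="add_one_inline",
+          cpp_sources=open("main.cc").read(),
+      )
+      mod.add_one  # the exported function, ready to call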
+
+Note that ``libmain.so`` is neutral and agnostic to:
+
+- Python version/ABI, because it is pure C++ and not compiled or linked against Python
+- C++ ABI, because TVM-FFI interacts with the artifact only via stable C APIs
+- Frontend languages, which can be C++, Rust, Python, TypeScript, etc.
+
+.. _sec-use-across-framework:
+
+Ship Across ML Frameworks
+-------------------------
+
+TVM FFI's Python package provides :py:func:`tvm_ffi.load_module`, which loads either the
+C++ or the CUDA ``libmain.so`` into a :py:class:`tvm_ffi.Module`.
+
+.. code-block:: python
+
+ import tvm_ffi
+ mod : tvm_ffi.Module = tvm_ffi.load_module("libmain.so")
+ func : tvm_ffi.Function = mod.add_one
+
+``mod["add_one"]`` retrieves a callable :py:class:`tvm_ffi.Function` that accepts tensors
+from host frameworks directly and can be incorporated zero-copy into all popular ML
+frameworks. This works seamlessly, without boilerplate code and with very low latency.
+
+.. tab-set::
+
+ .. tab-item:: PyTorch (C++/CUDA)
+
+ .. code-block:: python
+
+ import torch
+ device = "cpu" # or "cuda"
+ x = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32, device=device)
+ y = torch.empty_like(x)
+ func(x, y)
+ print(y)
+
+ .. tab-item:: JAX (C++/CUDA)
+
+ Upcoming. See `jax-tvm-ffi <https://github.com/nvidia/jax-tvm-ffi>`_ for a preview.
+
+ .. tab-item:: NumPy (C++)
+
+ .. code-block:: python
+
+ import numpy as np
+ x = np.array([1, 2, 3, 4, 5], dtype=np.float32)
+ y = np.empty_like(x)
+ func(x, y)
+ print(y)
+
+ .. tab-item:: CuPy (CUDA)
+
+ .. code-block:: python
+
+ import cupy as cp
+ x = cp.array([1, 2, 3, 4, 5], dtype=cp.float32)
+ y = cp.empty_like(x)
+ func(x, y)
+ print(y)
+
+
+Ship Across Languages
+---------------------
+
+TVM-FFI's core loading mechanism is ABI stable and works across language boundaries.
+That is, a single artifact can be loaded in every language TVM-FFI supports,
+without having to recompile different artifacts targeting different ABIs or languages.
+
+
+Python
+~~~~~~
+
+As shown in the :ref:`previous section<sec-use-across-framework>`, :py:func:`tvm_ffi.load_module`
+loads a language- and framework-neutral ``libmain.so`` and supports incorporating it into all
+Python frameworks that implement the standard `DLPack protocol
+<https://data-apis.org/array-api/2024.12/design_topics/data_interchange.html>`_.
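+
+For example, a minimal end-to-end call with NumPy, mirroring the framework examples
+in the previous section:
+
+.. code-block:: python
+
+   import numpy as np
+   import tvm_ffi
+
+   # Load the compiled artifact and fetch the exported function
+   mod = tvm_ffi.load_module("libmain.so")
+   func = mod.add_one
+
+   # Any DLPack-compatible array works; here, float32 NumPy arrays on CPU
+   x = np.array([1, 2, 3, 4, 5], dtype=np.float32)
+   y = np.empty_like(x)
+   func(x, y)
+   print(y)  # [2. 3. 4. 5. 6.]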
+
+C++
+~~~
+
+TVM-FFI's C++ API :cpp:func:`tvm::ffi::Module::LoadFromFile` loads ``libmain.so`` and can be
+used directly in C/C++ with no Python dependency. It is also ABI stable, so it can be used
+without worrying about C++ compiler versions and their ABIs.
+
+.. code-block:: cpp
+
+ // File: test_load.cc
+ #include <tvm/ffi/extra/module.h>
+
+ int main() {
+ namespace ffi = tvm::ffi;
+ ffi::Module mod = ffi::Module::LoadFromFile("libmain.so");
+ ffi::Function func = mod->GetFunction("add_one").value();
+ return 0;
+ }
+
+Compile it with:
+
+.. code-block:: bash
+
+ g++ -fvisibility=hidden -O3 \
+ test_load.cc \
+ `tvm-ffi-config --cxxflags` \
+ `tvm-ffi-config --ldflags` \
+ `tvm-ffi-config --libs` \
+ -Wl,-rpath,`tvm-ffi-config --libdir` \
+ -o test_load
+
+ ./test_load
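+
+To invoke the loaded function, pass it tensor arguments like any other
+:cpp:class:`tvm::ffi::Function`. A minimal sketch, adapted from the earlier CPU
+example (the ``CallAddOne`` helper name here is illustrative):
+
+.. code-block:: cpp
+
+   #include <tvm/ffi/container/tensor.h>
+   #include <tvm/ffi/extra/module.h>
+
+   namespace ffi = tvm::ffi;
+
+   // Runs the exported add_one on two caller-provided 1-D float32 tensors
+   void CallAddOne(ffi::TensorView x, ffi::TensorView y) {
+     ffi::Module mod = ffi::Module::LoadFromFile("libmain.so");
+     ffi::Function add_one = mod->GetFunction("add_one").value();
+     add_one(x, y);  // y = x + 1
+   }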
+
+
+Rust
+~~~~
+
+TVM-FFI's Rust API ``tvm_ffi::Module::load_from_file`` loads ``libmain.so``, after which
+the function ``add_one`` can be retrieved from it. The procedure mirrors the C++ and Python flows:
+
+.. code-block:: rust
+
+ fn load_add_one() -> Result<tvm_ffi::Function> {
+ let module: tvm_ffi::Module = tvm_ffi::Module::load_from_file("libmain.so")?;
+ let result: tvm_ffi::Function = module.get_function("add_one")?;
+ Ok(result)
+ }
+
+
+Troubleshooting
+---------------
+
+- ``OSError: cannot open shared object file``: Add an rpath (Linux/macOS) or ensure the DLL is on ``PATH`` (Windows). Example run-path: ``-Wl,-rpath,`tvm-ffi-config --libdir```.
+- ``undefined symbol: __tvm_ffi_add_one``: Ensure you used ``TVM_FFI_DLL_EXPORT_TYPED_FUNC`` and compiled with default symbol visibility (``-fvisibility=hidden`` is fine; the macro ensures export).
+- ``CUDA error: invalid device function``: Rebuild with the right ``-arch=sm_XX`` for your GPU, or include multiple ``-gencode`` entries.
diff --git a/docs/guides/python_guide.md b/docs/guides/python_guide.md
index 434f3ce..7086576 100644
--- a/docs/guides/python_guide.md
+++ b/docs/guides/python_guide.md
@@ -29,7 +29,7 @@ If so, we will also briefly copy snippets that show the corresponding C++ behavi
## Load and Run Module
The most common use case of TVM FFI is to load a runnable module and run the corresponding function.
-You can follow the [quick start guide](../get_started/quick_start.md) for details on building the
+You can follow the [quickstart guide](../get_started/quickstart.rst) for details on building the
library `build/add_one_cpu.so`. Let's walk through the load and run example again for NumPy
```python
diff --git a/docs/index.rst b/docs/index.rst
index ecea8e6..5e8fd0c 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -39,7 +39,7 @@ Table of Contents
:maxdepth: 1
:caption: Get Started
- get_started/quick_start.md
+ get_started/quickstart.rst
.. toctree::
:maxdepth: 1