This is an automated email from the ASF dual-hosted git repository.
tqchen pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tvm-ffi.git
The following commit(s) were added to refs/heads/main by this push:
new 6e448cf doc: Polish QuickStart Guide (#166)
6e448cf is described below
commit 6e448cffb1873afed0d626e7f92ddf2d9a2435bb
Author: Junru Shao <[email protected]>
AuthorDate: Fri Oct 17 15:54:41 2025 -0700
doc: Polish QuickStart Guide (#166)
This PR rewrote the quickstart guide, making it smoother to read.
---
docs/get_started/quick_start.md | 230 ----------------------------
docs/get_started/quickstart.rst | 329 ++++++++++++++++++++++++++++++++++++++++
docs/guides/python_guide.md | 2 +-
docs/index.rst | 2 +-
4 files changed, 331 insertions(+), 232 deletions(-)
diff --git a/docs/get_started/quick_start.md b/docs/get_started/quick_start.md
deleted file mode 100644
index 59995f9..0000000
--- a/docs/get_started/quick_start.md
+++ /dev/null
@@ -1,230 +0,0 @@
-<!--- Licensed to the Apache Software Foundation (ASF) under one -->
-<!--- or more contributor license agreements. See the NOTICE file -->
-<!--- distributed with this work for additional information -->
-<!--- regarding copyright ownership. The ASF licenses this file -->
-<!--- to you under the Apache License, Version 2.0 (the -->
-<!--- "License"); you may not use this file except in compliance -->
-<!--- with the License. You may obtain a copy of the License at -->
-
-<!--- http://www.apache.org/licenses/LICENSE-2.0 -->
-
-<!--- Unless required by applicable law or agreed to in writing, -->
-<!--- software distributed under the License is distributed on an -->
-<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
-<!--- KIND, either express or implied. See the License for the -->
-<!--- specific language governing permissions and limitations -->
-<!--- under the License. -->
-# Quick Start
-
-This is a quick start guide explaining the basic features and usage of tvm-ffi.
-The source code can be found at `examples/quick_start` in the project source.
-
-## Build and Run the Example
-
-Let us first get started by building and running the example. The example will show us:
-
-- How to expose C++ functions as TVM FFI ABI functions
-- How to load and run a tvm-ffi based library from Python
-- How to load and run a tvm-ffi based library from C++
-
-Before starting, ensure you have:
-
-- TVM FFI installed
-- C++ compiler with C++17 support
-- CMake 3.18 or later
-- (Optional) Ninja build system (the quick-start uses Ninja for fast incremental builds)
-- (Optional) CUDA toolkit for GPU examples
-- (Optional) PyTorch for checking torch integrations
-
-Then obtain a copy of the tvm-ffi source code.
-
-```bash
-git clone https://github.com/apache/tvm-ffi --recursive
-cd tvm-ffi
-```
-
-The examples live in the `examples` folder; you can quickly build
-the example using the following commands.
-
-```bash
-cd examples/quick_start
-
-# use Ninja, or omit -G Ninja to use the default generator
-cmake --fresh -G Ninja -B build -S .
-cmake --build build --parallel
-```
-
-After the build finishes, you can run the Python example with
-
-```bash
-python run_example.py
-```
-
-You can also run the C++ example:
-
-```bash
-./build/run_example
-```
-
-If the CUDA toolkit is available, the GPU demo binary is built alongside the CPU sample:
-
-```bash
-./build/run_example_cuda
-```
-
-## Walk through the Example
-
-Now that we have quickly tried things out, let us walk through the details of the example.
-Specifically, in this example, we create a simple "add one" operation that adds 1 to each element of an input
-tensor and expose that function as a TVM FFI compatible function. The key file structure is as follows:
-
-```text
-examples/quick_start/
-├── src/
-│ ├── add_one_cpu.cc # CPU implementation
-│ ├── add_one_c.c # A low-level C based implementation
-│ ├── add_one_cuda.cu # CUDA implementation
-│ ├── run_example.cc # C++ usage example
-│ └── run_example_cuda.cc # C++ with CUDA kernel usage example
-├── run_example.py # Python usage example
-├── run_example.sh # Build and run script
-└── CMakeLists.txt # Build configuration
-```
-
-### CPU Implementation
-
-```cpp
-#include <tvm/ffi/dtype.h>
-#include <tvm/ffi/error.h>
-#include <tvm/ffi/function.h>
-#include <tvm/ffi/container/tensor.h>
-
-namespace tvm_ffi_example {
-
-namespace ffi = tvm::ffi;
-
-void AddOne(ffi::TensorView x, ffi::TensorView y) {
- // Validate inputs
- TVM_FFI_ICHECK(x.ndim() == 1) << "x must be a 1D tensor";
- DLDataType f32_dtype{kDLFloat, 32, 1};
- TVM_FFI_ICHECK(x.dtype() == f32_dtype) << "x must be a float tensor";
- TVM_FFI_ICHECK(y.ndim() == 1) << "y must be a 1D tensor";
- TVM_FFI_ICHECK(y.dtype() == f32_dtype) << "y must be a float tensor";
- TVM_FFI_ICHECK(x.size(0) == y.size(0)) << "x and y must have the same shape";
-
- // Perform the computation
- for (int i = 0; i < x.size(0); ++i) {
-    static_cast<float*>(y.data_ptr())[i] = static_cast<float*>(x.data_ptr())[i] + 1;
- }
-}
-
-// Expose the function through TVM FFI
-TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one_cpu, tvm_ffi_example::AddOne);
-}
-```
-
-**Key Points:**
-
-- Functions take `tvm::ffi::TensorView` parameters for cross-language compatibility
-- The `TVM_FFI_DLL_EXPORT_TYPED_FUNC` macro exposes the function with a given name
-
-### CUDA Implementation
-
-```cpp
-void AddOneCUDA(ffi::TensorView x, ffi::TensorView y) {
- // Validation (same as CPU version)
- // ...
-
- int64_t n = x.size(0);
- int64_t nthread_per_block = 256;
- int64_t nblock = (n + nthread_per_block - 1) / nthread_per_block;
-
- // Get current CUDA stream from environment
- cudaStream_t stream = static_cast<cudaStream_t>(
- TVMFFIEnvGetStream(x.device().device_type, x.device().device_id));
-
- // Launch kernel
- AddOneKernel<<<nblock, nthread_per_block, 0, stream>>>(
- static_cast<float*>(x.data_ptr()), static_cast<float*>(y.data_ptr()), n);
-}
-
-TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one_cuda, tvm_ffi_example::AddOneCUDA);
-```
-
-**Key Points:**
-
-- We use `TVMFFIEnvGetStream` to obtain the current stream from the environment
-- When invoking an ffi Function from the Python side with a PyTorch tensor as argument,
-  the stream will be populated with torch's current stream.
-
-### Working with PyTorch
-
-After the build, we get libraries such as `build/add_one_cuda.so`, which can be loaded
-with the api {py:func}`tvm_ffi.load_module` that returns a {py:class}`tvm_ffi.Module`.
-The function then becomes available as a property of the loaded module.
-The tensor arguments of the ffi functions automatically accept `torch.Tensor`. The following code shows how
-to use the function with torch.
-
-```python
-import torch
-import tvm_ffi
-
-if torch.cuda.is_available():
- mod = tvm_ffi.load_module("build/add_one_cuda.so")
-
- x = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32, device="cuda")
- y = torch.empty_like(x)
-
- # TVM FFI automatically handles CUDA streams
- stream = torch.cuda.Stream()
- with torch.cuda.stream(stream):
- mod.add_one_cuda(x, y)
- stream.synchronize()
-```
-
-### Working with Python Data Arrays
-
-TVM FFI functions work automatically with Python data arrays that are compatible with DLPack.
-The following example shows how to use the function with NumPy.
-
-```python
-import tvm_ffi
-import numpy as np
-
-# Load the compiled module
-mod = tvm_ffi.load_module("build/add_one_cpu.so")
-
-# Create input and output arrays
-x = np.array([1, 2, 3, 4, 5], dtype=np.float32)
-y = np.empty_like(x)
-
-# Call the function
-mod.add_one_cpu(x, y)
-print("Result:", y) # [2, 3, 4, 5, 6]
-```
-
-### Working with C++
-
-One important design goal of tvm-ffi is to be universally portable.
-As a result, the resulting libraries have no explicit dependency on Python
-and can be loaded in other language environments, such as C++. The following code
-shows how to run the exported example function in C++.
-
-```cpp
-#include <tvm/ffi/container/tensor.h>
-#include <tvm/ffi/extra/module.h>
-
-namespace ffi = tvm::ffi;
-
-void CallAddOne(ffi::TensorView x, ffi::TensorView y) {
- ffi::Module mod = ffi::Module::LoadFromFile("build/add_one_cpu.so");
- ffi::Function add_one_cpu = mod->GetFunction("add_one_cpu").value();
- add_one_cpu(x, y);
-}
-```
-
-## Summary: Key Concepts
-
-- **TVM_FFI_DLL_EXPORT_TYPED_FUNC** exposes a C++ function through the tvm-ffi C ABI
-- **ffi::Tensor** is a universal tensor structure that enables zero-copy exchange of array data
-- **Module loading** is provided by tvm ffi APIs in multiple languages.
diff --git a/docs/get_started/quickstart.rst b/docs/get_started/quickstart.rst
new file mode 100644
index 0000000..8a03328
--- /dev/null
+++ b/docs/get_started/quickstart.rst
@@ -0,0 +1,329 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+..
+.. http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Quick Start
+===========
+
+This guide walks through shipping a minimal ``add_one`` function that computes
+``y = x + 1`` in C++ and CUDA.
+
+TVM-FFI's Open ABI and FFI make it possible to **build once, ship everywhere**. That is,
+a single shared library works across:
+
+- **ML frameworks**, e.g. PyTorch, JAX, NumPy, CuPy, etc., and
+- **languages**, e.g. C++, Python, Rust, etc.
+
+.. admonition:: Prerequisite
+ :class: hint
+ :name: prerequisite
+
+ - Python: 3.9 or newer
+ - Compiler: C++17-capable toolchain (GCC/Clang/MSVC)
+ - Optional ML frameworks for testing: NumPy, PyTorch, JAX, CuPy
+ - CUDA: Any modern version if you want to try the CUDA part
+ - TVM-FFI installed via
+
+ .. code-block:: bash
+
+ pip install --force-reinstall --upgrade apache-tvm-ffi
+
+
+Write a Simple ``add_one``
+--------------------------
+
+.. _sec-cpp-source-code:
+
+Source Code
+~~~~~~~~~~~
+
+Suppose we implement a C++ function ``AddOne`` that performs elementwise ``y = x + 1``
+for a 1-D ``float32`` vector. The source code (C++, CUDA) is:
+
+.. tabs::
+
+ .. group-tab:: C++
+
+ .. code-block:: cpp
+ :emphasize-lines: 8, 17
+
+ // File: main.cc
+ #include <tvm/ffi/container/tensor.h>
+ #include <tvm/ffi/function.h>
+
+ namespace tvm_ffi_example_cpp {
+
+ /*! \brief Perform vector add one: y = x + 1 (1-D float32) */
+ void AddOne(tvm::ffi::TensorView x, tvm::ffi::TensorView y) {
+ int64_t n = x.shape()[0];
+ float* x_data = static_cast<float *>(x.data_ptr());
+ float* y_data = static_cast<float *>(y.data_ptr());
+ for (int64_t i = 0; i < n; ++i) {
+ y_data[i] = x_data[i] + 1;
+ }
+ }
+
+ TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one, tvm_ffi_example_cpp::AddOne);
+ }
+
+
+ .. group-tab:: CUDA
+
+ .. code-block:: cpp
+ :emphasize-lines: 15, 22, 26
+
+ // File: main.cu
+ #include <tvm/ffi/container/tensor.h>
+ #include <tvm/ffi/extra/c_env_api.h>
+ #include <tvm/ffi/function.h>
+
+ namespace tvm_ffi_example_cuda {
+
+ __global__ void AddOneKernel(float* x, float* y, int n) {
+ int idx = blockIdx.x * blockDim.x + threadIdx.x;
+ if (idx < n) {
+ y[idx] = x[idx] + 1;
+ }
+ }
+
+ void AddOne(tvm::ffi::TensorView x, tvm::ffi::TensorView y) {
+ int64_t n = x.shape()[0];
+ float* x_data = static_cast<float *>(x.data_ptr());
+ float* y_data = static_cast<float *>(y.data_ptr());
+ int64_t threads = 256;
+ int64_t blocks = (n + threads - 1) / threads;
+ cudaStream_t stream = static_cast<cudaStream_t>(
+ TVMFFIEnvGetStream(x.device().device_type, x.device().device_id));
+ AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);
+ }
+
+ TVM_FFI_DLL_EXPORT_TYPED_FUNC(add_one, tvm_ffi_example_cuda::AddOne);
+ }
+
+
+
+Macro :c:macro:`TVM_FFI_DLL_EXPORT_TYPED_FUNC` exports the C++ function ``AddOne``
+with the public name ``add_one`` in the resulting library.
+TVM-FFI looks it up at runtime to make the function available across languages.
+
+Class :cpp:class:`tvm::ffi::TensorView` allows zero-copy interop with tensors from
+different ML frameworks:
+
+- NumPy, CuPy,
+- PyTorch, JAX, or
+- any array type that supports the standard `DLPack protocol
+  <https://data-apis.org/array-api/2024.12/design_topics/data_interchange.html>`_.
+
+Finally, :cpp:func:`TVMFFIEnvGetStream`, used in the CUDA code, makes it possible to
+launch a kernel on the caller's stream.
+
+.. _sec-cpp-compile-with-tvm-ffi:
+
+Compile with TVM-FFI
+~~~~~~~~~~~~~~~~~~~~
+
+**Raw command.** The basic compile command can be as concise as the following:
+
+.. tabs::
+
+ .. group-tab:: C++
+
+ .. code-block:: bash
+
+ g++ -shared -O3 main.cc \
+ -fPIC -fvisibility=hidden \
+ `tvm-ffi-config --cxxflags` \
+ `tvm-ffi-config --ldflags` \
+ `tvm-ffi-config --libs` \
+ -o libmain.so
+
+ .. group-tab:: CUDA
+
+ .. code-block:: bash
+
+ nvcc -shared -O3 main.cu \
+ --compiler-options -fPIC \
+ --compiler-options -fvisibility=hidden \
+ `tvm-ffi-config --cxxflags` \
+ `tvm-ffi-config --ldflags` \
+ `tvm-ffi-config --libs` \
+ -o libmain.so
+
+This produces a shared library ``libmain.so``. TVM-FFI automatically embeds the
+metadata needed to call the function across language and framework boundaries.
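+
+To sanity-check the artifact, you can list its exported symbols. Assuming a Linux
+toolchain with ``nm`` available, the typed-function export shows up under a prefixed
+symbol name such as ``__tvm_ffi_add_one`` (see Troubleshooting below):
+
+.. code-block:: bash
+
+   # expect an entry mentioning __tvm_ffi_add_one in the dynamic symbol table
+   nm -D libmain.so | grep add_one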
+
+**CMake.** The preferred approach for building across platforms is CMake, which relies on
+the CMake package ``tvm_ffi``; its location can be found via ``tvm-ffi-config --cmakedir``.
+
+.. tabs::
+
+ .. group-tab:: C++
+
+ .. code-block:: cmake
+
+ # Run `tvm-ffi-config --cmakedir` to find tvm-ffi targets
+ find_package(Python COMPONENTS Interpreter REQUIRED)
+ execute_process(
+ COMMAND "${Python_EXECUTABLE}" -m tvm-ffi-config --cmakedir
+ OUTPUT_STRIP_TRAILING_WHITESPACE
+ OUTPUT_VARIABLE tvm_ffi_ROOT
+ )
+ find_package(tvm_ffi CONFIG REQUIRED)
+ # Create C++ target `add_one_cpp`
+ add_library(add_one_cpp SHARED main.cc)
+ target_link_libraries(add_one_cpp PRIVATE tvm_ffi_header)
+ target_link_libraries(add_one_cpp PRIVATE tvm_ffi_shared)
+
+ .. group-tab:: CUDA
+
+ .. code-block:: cmake
+
+ # Run `tvm-ffi-config --cmakedir` to find tvm-ffi targets
+ find_package(Python COMPONENTS Interpreter REQUIRED)
+ execute_process(
+ COMMAND "${Python_EXECUTABLE}" -m tvm-ffi-config --cmakedir
+ OUTPUT_STRIP_TRAILING_WHITESPACE
+ OUTPUT_VARIABLE tvm_ffi_ROOT
+ )
+ find_package(tvm_ffi CONFIG REQUIRED)
+ # Create C++ target `add_one_cuda`
+ enable_language(CUDA)
+ add_library(add_one_cuda SHARED main.cu)
+ target_link_libraries(add_one_cuda PRIVATE tvm_ffi_header)
+ target_link_libraries(add_one_cuda PRIVATE tvm_ffi_shared)
+
+.. hint::
+
+   For a single-file C++/CUDA source, a convenient helper :py:func:`tvm_ffi.cpp.load_inline`
+   is provided to minimize boilerplate in compilation, linking, and loading.
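+
+   A minimal sketch of how this could look, assuming a torch-style ``load_inline``
+   signature (the argument names below are illustrative; check the
+   :py:func:`tvm_ffi.cpp.load_inline` API reference for the exact ones):
+
+   .. code-block:: python
+
+      import tvm_ffi.cpp
+
+      # Hypothetical arguments: compile and load the AddOne source in one call
+      mod = tvm_ffi.cpp.load_inline(
+          name="add_one_inline",
+          cpp_sources=open("main.cc").read(),
+      )
+      mod.add_one  # the exported function, ready to call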
+
+Note that ``libmain.so`` is neutral and agnostic to:
+
+- Python version/ABI, because it is pure C++ and not compiled or linked against Python
+- C++ ABI, because TVM-FFI interacts with the artifact only via stable C APIs
+- Frontend languages, which can be C++, Rust, Python, TypeScript, etc.
+
+.. _sec-use-across-framework:
+
+Ship Across ML Frameworks
+-------------------------
+
+TVM FFI's Python package provides :py:func:`tvm_ffi.load_module`, which loads either the
+C++ or the CUDA ``libmain.so`` into a :py:class:`tvm_ffi.Module`.
+
+.. code-block:: python
+
+ import tvm_ffi
+ mod : tvm_ffi.Module = tvm_ffi.load_module("libmain.so")
+ func : tvm_ffi.Function = mod.add_one
+
+``mod["add_one"]`` retrieves a callable :py:class:`tvm_ffi.Function` that accepts tensors
+from host frameworks directly and can be incorporated zero-copy into all popular ML
+frameworks. This works seamlessly, without boilerplate code and with very low latency.
+
+.. tab-set::
+
+ .. tab-item:: PyTorch (C++/CUDA)
+
+ .. code-block:: python
+
+ import torch
+ device = "cpu" # or "cuda"
+ x = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float32, device=device)
+ y = torch.empty_like(x)
+ func(x, y)
+ print(y)
+
+ .. tab-item:: JAX (C++/CUDA)
+
+ Upcoming. See `jax-tvm-ffi <https://github.com/nvidia/jax-tvm-ffi>`_ for a preview.
+
+ .. tab-item:: NumPy (C++)
+
+ .. code-block:: python
+
+ import numpy as np
+ x = np.array([1, 2, 3, 4, 5], dtype=np.float32)
+ y = np.empty_like(x)
+ func(x, y)
+ print(y)
+
+ .. tab-item:: CuPy (CUDA)
+
+ .. code-block:: python
+
+ import cupy as cp
+ x = cp.array([1, 2, 3, 4, 5], dtype=cp.float32)
+ y = cp.empty_like(x)
+ func(x, y)
+ print(y)
+
+
+Ship Across Languages
+---------------------
+
+TVM-FFI's core loading mechanism is ABI stable and works across language boundaries.
+That is, a single artifact can be loaded in every language TVM-FFI supports,
+without having to recompile different artifacts targeting different ABIs or languages.
+
+
+Python
+~~~~~~
+
+As shown in the :ref:`previous section<sec-use-across-framework>`, :py:func:`tvm_ffi.load_module`
+loads a language- and framework-neutral ``libmain.so`` and supports incorporating it into all
+Python frameworks that implement the standard `DLPack protocol
+<https://data-apis.org/array-api/2024.12/design_topics/data_interchange.html>`_.
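+
+For example, a minimal end-to-end call with NumPy, mirroring the framework examples
+in the previous section:
+
+.. code-block:: python
+
+   import numpy as np
+   import tvm_ffi
+
+   # Load the compiled artifact and fetch the exported function
+   mod = tvm_ffi.load_module("libmain.so")
+   func = mod.add_one
+
+   # Any DLPack-compatible array works; here, float32 NumPy arrays on CPU
+   x = np.array([1, 2, 3, 4, 5], dtype=np.float32)
+   y = np.empty_like(x)
+   func(x, y)
+   print(y)  # [2. 3. 4. 5. 6.]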
+
+C++
+~~~
+
+TVM-FFI's C++ API :cpp:func:`tvm::ffi::Module::LoadFromFile` loads ``libmain.so`` and can be
+used directly in C/C++ with no Python dependency. It is also ABI stable, so it can be used
+without worrying about C++ compiler versions and their ABIs.
+
+.. code-block:: cpp
+
+ // File: test_load.cc
+ #include <tvm/ffi/extra/module.h>
+
+ int main() {
+ namespace ffi = tvm::ffi;
+ ffi::Module mod = ffi::Module::LoadFromFile("libmain.so");
+ ffi::Function func = mod->GetFunction("add_one").value();
+ return 0;
+ }
+
+Compile it with:
+
+.. code-block:: bash
+
+ g++ -fvisibility=hidden -O3 \
+ test_load.cc \
+ `tvm-ffi-config --cxxflags` \
+ `tvm-ffi-config --ldflags` \
+ `tvm-ffi-config --libs` \
+ -Wl,-rpath,`tvm-ffi-config --libdir` \
+ -o test_load
+
+ ./test_load
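+
+To invoke the loaded function, pass it tensor arguments like any other
+:cpp:class:`tvm::ffi::Function`. A minimal sketch, adapted from the earlier CPU
+example (the ``CallAddOne`` helper name here is illustrative):
+
+.. code-block:: cpp
+
+   #include <tvm/ffi/container/tensor.h>
+   #include <tvm/ffi/extra/module.h>
+
+   namespace ffi = tvm::ffi;
+
+   // Runs the exported add_one on two caller-provided 1-D float32 tensors
+   void CallAddOne(ffi::TensorView x, ffi::TensorView y) {
+     ffi::Module mod = ffi::Module::LoadFromFile("libmain.so");
+     ffi::Function add_one = mod->GetFunction("add_one").value();
+     add_one(x, y);  // y = x + 1
+   }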
+
+
+Rust
+~~~~
+
+TVM-FFI's Rust API ``tvm_ffi::Module::load_from_file`` loads ``libmain.so``, after which
+the function ``add_one`` can be retrieved from it. The procedure mirrors the C++ and Python flows:
+
+.. code-block:: rust
+
+ fn load_add_one() -> Result<tvm_ffi::Function> {
+ let module: tvm_ffi::Module = tvm_ffi::Module::load_from_file("libmain.so")?;
+ let result: tvm_ffi::Function = module.get_function("add_one")?;
+ Ok(result)
+ }
+
+
+Troubleshooting
+---------------
+
+- ``OSError: cannot open shared object file``: Add an rpath (Linux/macOS) or ensure the DLL is on ``PATH`` (Windows). Example run-path: ``-Wl,-rpath,`tvm-ffi-config --libdir```.
+- ``undefined symbol: __tvm_ffi_add_one``: Ensure you used ``TVM_FFI_DLL_EXPORT_TYPED_FUNC`` and compiled with default symbol visibility (``-fvisibility=hidden`` is fine; the macro ensures export).
+- ``CUDA error: invalid device function``: Rebuild with the right ``-arch=sm_XX`` for your GPU, or include multiple ``-gencode`` entries.
diff --git a/docs/guides/python_guide.md b/docs/guides/python_guide.md
index 434f3ce..7086576 100644
--- a/docs/guides/python_guide.md
+++ b/docs/guides/python_guide.md
@@ -29,7 +29,7 @@ If so, we will also briefly copy snippets that show the corresponding C++ behavi
## Load and Run Module
The most common use case of TVM FFI is to load a runnable module and run the corresponding function.
-You can follow the [quick start guide](../get_started/quick_start.md) for details on building the
+You can follow the [quickstart guide](../get_started/quickstart.rst) for details on building the
library `build/add_one_cpu.so`. Let's walk through the load and run example again for NumPy
```python
diff --git a/docs/index.rst b/docs/index.rst
index ecea8e6..5e8fd0c 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -39,7 +39,7 @@ Table of Contents
:maxdepth: 1
:caption: Get Started
- get_started/quick_start.md
+ get_started/quickstart.rst
.. toctree::
:maxdepth: 1