gemini-code-assist[bot] commented on code in PR #390:
URL: https://github.com/apache/tvm-ffi/pull/390#discussion_r2671732342
##########
docs/concepts/tensor.rst:
##########
@@ -63,6 +63,180 @@ Managed object (owning)
As a loose analogy, think of **view** vs. **managed** as similar to
``T*`` (raw pointer) vs. ``std::shared_ptr<T>`` (reference-counted pointer)
in C++.
+Common Usage
+------------
+
+This section introduces the most important APIs you will use in C++ and Python. It intentionally
+focuses on introductory, day-to-day methods.
+
+Kernel Signatures
+~~~~~~~~~~~~~~~~~
+
+A typical kernel implementation accepts :cpp:class:`TensorView <tvm::ffi::TensorView>` parameters,
+validates their metadata (dtype, shape, device), and then accesses their data pointers for computation.
+
+.. code-block:: cpp
+
+   #include <tvm/ffi/tvm_ffi.h>
+
+   void MyKernel(tvm::ffi::TensorView input, tvm::ffi::TensorView output) {
+     // Validate dtype & device
+     if (input.dtype() != DLDataType{kDLFloat, 32, 1})
+       TVM_FFI_THROW(TypeError) << "Expect float32 input, but got " << input.dtype();
+     if (input.device() != DLDevice{kDLCUDA, 0})
+       TVM_FFI_THROW(ValueError) << "Expect input on CUDA:0, but got " << input.device();
+     // Access data pointer
+     float* input_data_ptr = static_cast<float*>(input.data_ptr());
+     float* output_data_ptr = static_cast<float*>(output.data_ptr());
+     Kernel<<<...>>>(..., input_data_ptr, output_data_ptr, ...);
+   }
+
+On the C++ side, the following APIs are available to query a tensor's metadata (a short usage sketch follows this list):
+
+ :cpp:func:`TensorView::shape() <tvm::ffi::TensorView::shape>` and :cpp:func:`Tensor::shape() <tvm::ffi::Tensor::shape>`
+   shape array
+
+ :cpp:func:`TensorView::dtype() <tvm::ffi::TensorView::dtype>` and :cpp:func:`Tensor::dtype() <tvm::ffi::Tensor::dtype>`
+   element data type
+
+ :cpp:func:`TensorView::data_ptr() <tvm::ffi::TensorView::data_ptr>` and :cpp:func:`Tensor::data_ptr() <tvm::ffi::Tensor::data_ptr>`
+   base pointer to the tensor's data
+
+ :cpp:func:`TensorView::device() <tvm::ffi::TensorView::device>` and :cpp:func:`Tensor::device() <tvm::ffi::Tensor::device>`
+   device type and id
+
+ :cpp:func:`TensorView::byte_offset() <tvm::ffi::TensorView::byte_offset>` and :cpp:func:`Tensor::byte_offset() <tvm::ffi::Tensor::byte_offset>`
+   byte offset to the first element
+
+ :cpp:func:`TensorView::ndim() <tvm::ffi::TensorView::ndim>` and :cpp:func:`Tensor::ndim() <tvm::ffi::Tensor::ndim>`
+   number of dimensions (:cpp:func:`ShapeView::size <tvm::ffi::ShapeView::size>`)
+
+ :cpp:func:`TensorView::numel() <tvm::ffi::TensorView::numel>` and :cpp:func:`Tensor::numel() <tvm::ffi::Tensor::numel>`
+   total number of elements (:cpp:func:`ShapeView::Product <tvm::ffi::ShapeView::Product>`)
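+
+A minimal usage sketch of these accessors (illustrative only; the function name and the specific checks are made up for this example):
+
+.. code-block:: cpp
+
+   void MyValidatedKernel(tvm::ffi::TensorView input, tvm::ffi::TensorView output) {
+     // Metadata checks built from the accessors listed above.
+     if (input.ndim() != 2)
+       TVM_FFI_THROW(ValueError) << "Expect a 2-D tensor, but got ndim=" << input.ndim();
+     if (input.dtype() != DLDataType{kDLFloat, 32, 1})
+       TVM_FFI_THROW(TypeError) << "Expect float32, but got " << input.dtype();
+     if (output.numel() != input.numel())
+       TVM_FFI_THROW(ValueError) << "Expect input and output to have the same numel";
+     // Raw pointers for the actual computation.
+     const float* in_ptr = static_cast<const float*>(input.data_ptr());
+     float* out_ptr = static_cast<float*>(output.data_ptr());
+     // ... launch the real kernel here, selecting the implementation based on input.device() ...
+   }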
+
+
+
+PyTorch Interop
+~~~~~~~~~~~~~~~
+
+The Python-facing :py:class:`tvm_ffi.Tensor` is a managed n-dimensional array that:
+
+* Can be created via :py:func:`tvm_ffi.from_dlpack(ext_tensor, ...) <tvm_ffi.from_dlpack>` to import tensors from external frameworks, e.g. :ref:`PyTorch <ship-to-pytorch>`, :ref:`JAX <ship-to-jax>`, :ref:`NumPy/CuPy <ship-to-numpy>`.
+* Implements the DLPack protocol so it can be passed back to frameworks without copying, e.g. :py:func:`torch.from_dlpack`.
+
+Typical import pattern:
+
+.. code-block:: python
+
+   import tvm_ffi
+   import torch
+
+   x_torch = torch.randn(1024, device="cuda")
+   x_tvm_ffi = tvm_ffi.from_dlpack(x_torch, require_contiguous=True)
+   x_torch_again = torch.from_dlpack(x_tvm_ffi)
+
+
+In this example, :py:meth:`tvm_ffi.from_dlpack` creates a ``x_tvm_ffi`` that views the same memory as ``x_torch``.
Review Comment:

`from_dlpack` is a function within the `tvm_ffi` module, not a method. Using `:py:func:` will ensure the link in the generated documentation is correct.
```suggestion
In this example, :py:func:`tvm_ffi.from_dlpack` creates a ``x_tvm_ffi`` that views the same memory as ``x_torch``.
```
##########
docs/concepts/tensor.rst:
##########
@@ -63,6 +63,180 @@ Managed object (owning)
As a loose analogy, think of **view** vs. **managed** as similar to
``T*`` (raw pointer) vs. ``std::shared_ptr<T>`` (reference-counted pointer)
in C++.
+Common Usage
+------------
+
+This section introduces the most important APIs you will use in C++ and Python. It intentionally
+focuses on introductory, day-to-day methods.
+
+Kernel Signatures
+~~~~~~~~~~~~~~~~~
+
+A typical kernel implementation accepts :cpp:class:`TensorView <tvm::ffi::TensorView>` parameters,
+validates their metadata (dtype, shape, device), and then accesses their data pointers for computation.
+
+.. code-block:: cpp
+
+   #include <tvm/ffi/tvm_ffi.h>
+
+   void MyKernel(tvm::ffi::TensorView input, tvm::ffi::TensorView output) {
+     // Validate dtype & device
+     if (input.dtype() != DLDataType{kDLFloat, 32, 1})
+       TVM_FFI_THROW(TypeError) << "Expect float32 input, but got " << input.dtype();
+     if (input.device() != DLDevice{kDLCUDA, 0})
+       TVM_FFI_THROW(ValueError) << "Expect input on CUDA:0, but got " << input.device();
+     // Access data pointer
+     float* input_data_ptr = static_cast<float*>(input.data_ptr());
+     float* output_data_ptr = static_cast<float*>(output.data_ptr());
+     Kernel<<<...>>>(..., input_data_ptr, output_data_ptr, ...);
+   }
+
+On the C++ side, the following APIs are available to query a tensor's metadata:
+
+ :cpp:func:`TensorView::shape() <tvm::ffi::TensorView::shape>` and :cpp:func:`Tensor::shape() <tvm::ffi::Tensor::shape>`
+   shape array
+
+ :cpp:func:`TensorView::dtype() <tvm::ffi::TensorView::dtype>` and :cpp:func:`Tensor::dtype() <tvm::ffi::Tensor::dtype>`
+   element data type
+
+ :cpp:func:`TensorView::data_ptr() <tvm::ffi::TensorView::data_ptr>` and :cpp:func:`Tensor::data_ptr() <tvm::ffi::Tensor::data_ptr>`
+   base pointer to the tensor's data
+
+ :cpp:func:`TensorView::device() <tvm::ffi::TensorView::device>` and :cpp:func:`Tensor::device() <tvm::ffi::Tensor::device>`
+   device type and id
+
+ :cpp:func:`TensorView::byte_offset() <tvm::ffi::TensorView::byte_offset>` and :cpp:func:`Tensor::byte_offset() <tvm::ffi::Tensor::byte_offset>`
+   byte offset to the first element
+
+ :cpp:func:`TensorView::ndim() <tvm::ffi::TensorView::ndim>` and :cpp:func:`Tensor::ndim() <tvm::ffi::Tensor::ndim>`
+   number of dimensions (:cpp:func:`ShapeView::size <tvm::ffi::ShapeView::size>`)
+
+ :cpp:func:`TensorView::numel() <tvm::ffi::TensorView::numel>` and :cpp:func:`Tensor::numel() <tvm::ffi::Tensor::numel>`
+   total number of elements (:cpp:func:`ShapeView::Product <tvm::ffi::ShapeView::Product>`)
+
+
+
+PyTorch Interop
+~~~~~~~~~~~~~~~
+
+The Python-facing :py:class:`tvm_ffi.Tensor` is a managed n-dimensional array that:
+
+* Can be created via :py:func:`tvm_ffi.from_dlpack(ext_tensor, ...) <tvm_ffi.from_dlpack>` to import tensors from external frameworks, e.g. :ref:`PyTorch <ship-to-pytorch>`, :ref:`JAX <ship-to-jax>`, :ref:`NumPy/CuPy <ship-to-numpy>`.
+* Implements the DLPack protocol so it can be passed back to frameworks without copying, e.g. :py:func:`torch.from_dlpack`.
+
+Typical import pattern:
+
+.. code-block:: python
+
+   import tvm_ffi
+   import torch
+
+   x_torch = torch.randn(1024, device="cuda")
+   x_tvm_ffi = tvm_ffi.from_dlpack(x_torch, require_contiguous=True)
+   x_torch_again = torch.from_dlpack(x_tvm_ffi)
+
+
+In this example, :py:meth:`tvm_ffi.from_dlpack` creates a ``x_tvm_ffi`` that views the same memory as ``x_torch``.
+Similarly, :py:func:`torch.from_dlpack` creates ``x_torch_again``, which views the same memory as ``x_tvm_ffi`` and ``x_torch``.
+
+
+C++ Allocation
+~~~~~~~~~~~~~~
+
+TVM-FFI is not a kernel library per se and is not linked to any specific device memory allocator or runtime.
+However, for kernel library developers, it provides standardized allocation entry points by
+interfacing with the surrounding framework's allocator. For example, it uses PyTorch's allocator when running inside
+a PyTorch environment.
+
+**Env Allocator.** Use :cpp:func:`Tensor::FromEnvAlloc() <tvm::ffi::Tensor::FromEnvAlloc>` along with the C API
+:cpp:func:`TVMFFIEnvTensorAlloc` to allocate a tensor using the framework's allocator.
+
+.. code-block:: cpp
+
+   Tensor tensor = Tensor::FromEnvAlloc(
+       TVMFFIEnvTensorAlloc,
+       /*shape=*/{1, 2, 3},
+       /*dtype=*/DLDataType({kDLFloat, 32, 1}),
+       /*device=*/DLDevice({kDLCPU, 0})
+   );
+
+In a PyTorch environment, this is equivalent to :py:func:`torch.empty`.
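+
+For intuition, a rough PyTorch-side counterpart (illustrative only; the shape, dtype, and device simply mirror the call above):
+
+.. code-block:: python
+
+   import torch
+
+   # Roughly what the FromEnvAlloc call above requests from the framework allocator.
+   out = torch.empty((1, 2, 3), dtype=torch.float32, device="cpu")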
+
+.. warning::
+
+   While allocation APIs are available, it is generally **recommended** to avoid allocating tensors inside kernels.
+   Instead, prefer pre-allocating outputs and passing them in as :cpp:class:`tvm::ffi::TensorView` parameters.
+   Reasons include:
+
+   - Avoiding fragmentation and performance pitfalls;
+   - Avoiding CUDA graph incompatibilities on GPU;
+   - Allowing the outer framework to control allocation policy (pools, device strategies, etc.).
+
+**Custom Allocator.** Use :cpp:func:`Tensor::FromNDAlloc(custom_alloc, ...) <tvm::ffi::Tensor::FromNDAlloc>`,
+or its advanced variant :cpp:func:`Tensor::FromNDAllocStrided(custom_alloc, ...) <tvm::ffi::Tensor::FromNDAllocStrided>`,
+to allocate a tensor with a user-provided allocation callback.
+
+Below is an example that uses ``cudaMalloc``/``cudaFree`` as a custom allocator for GPU tensors.
+
+.. code-block:: cpp
+
+   struct CUDANDAlloc {
+     void AllocData(DLTensor* tensor) {
+       size_t data_size = ffi::GetDataSize(*tensor);
+       void* ptr = nullptr;
+       cudaError_t err = cudaMalloc(&ptr, data_size);
+       TVM_FFI_ICHECK_EQ(err, cudaSuccess) << "cudaMalloc failed: " << cudaGetErrorString(err);
+       tensor->data = ptr;
+     }
+
+     void FreeData(DLTensor* tensor) {
+       if (tensor->data != nullptr) {
+         cudaError_t err = cudaFree(tensor->data);
+         TVM_FFI_ICHECK_EQ(err, cudaSuccess) << "cudaFree failed: " << cudaGetErrorString(err);
+         tensor->data = nullptr;
+       }
+     }
+   };
+
+   ffi::Tensor cuda_tensor = ffi::Tensor::FromNDAlloc(
+       CUDANDAlloc(),
+       /*shape=*/{3, 4, 5},
+       /*dtype=*/DLDataType({kDLFloat, 32, 1}),
+       /*device=*/DLDevice({kDLCUDA, 0})
+   );
+
+C++ Stream Handling
+~~~~~~~~~~~~~~~~~~~
+
+Besides tensors, stream context is another key concept in a kernel library, especially for kernel execution. While CUDA itself does not expose a global "current stream" context, frameworks like PyTorch maintain a current stream per device (:py:func:`torch.cuda.current_stream`), and kernel libraries must read that stream from the surrounding environment.
+
+As a hardware-agnostic abstraction layer, TVM-FFI is not linked to any specific stream-management library. To ensure GPU kernels still launch on the correct stream, it provides standardized APIs to obtain the stream context from the surrounding framework (e.g. PyTorch).
+
+**Obtain Stream Context.** Use the C API :cpp:func:`TVMFFIEnvGetStream` to obtain the current stream for a given device.
+
+.. code-block:: c++
+
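+   // The returned stream is what the surrounding framework (e.g. PyTorch) considers current for this device;
+   // launch kernels on it (e.g. as the last <<<...>>> launch argument) so they follow the framework's ordering. (illustrative comment)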
+   void func(ffi::TensorView input, ...) {
+     ffi::DLDevice device = input.device();
+     cudaStream_t stream = reinterpret_cast<cudaStream_t>(TVMFFIEnvGetStream(device.device_type, device.device_id));
+   }
+
+which is equivalent to:
+
+.. code-block:: c++
Review Comment:

For consistency with other C++ code blocks in this file, please use `cpp` instead of `c++` as the language identifier.
```suggestion
.. code-block:: cpp
```
##########
docs/concepts/tensor.rst:
##########
@@ -63,6 +63,180 @@ Managed object (owning)
As a loose analogy, think of **view** vs. **managed** as similar to
``T*`` (raw pointer) vs. ``std::shared_ptr<T>`` (reference-counted pointer)
in C++.
+Common Usage
+------------
+
+This section introduces the most important APIs you will use in C++ and Python. It intentionally
+focuses on introductory, day-to-day methods.
+
+Kernel Signatures
+~~~~~~~~~~~~~~~~~
+
+A typical kernel implementation accepts :cpp:class:`TensorView <tvm::ffi::TensorView>` parameters,
+validates their metadata (dtype, shape, device), and then accesses their data pointers for computation.
+
+.. code-block:: cpp
+
+   #include <tvm/ffi/tvm_ffi.h>
+
+   void MyKernel(tvm::ffi::TensorView input, tvm::ffi::TensorView output) {
+     // Validate dtype & device
+     if (input.dtype() != DLDataType{kDLFloat, 32, 1})
+       TVM_FFI_THROW(TypeError) << "Expect float32 input, but got " << input.dtype();
+     if (input.device() != DLDevice{kDLCUDA, 0})
+       TVM_FFI_THROW(ValueError) << "Expect input on CUDA:0, but got " << input.device();
+     // Access data pointer
+     float* input_data_ptr = static_cast<float*>(input.data_ptr());
+     float* output_data_ptr = static_cast<float*>(output.data_ptr());
+     Kernel<<<...>>>(..., input_data_ptr, output_data_ptr, ...);
+   }
+
+On the C++ side, the following APIs are available to query a tensor's metadata:
+
+ :cpp:func:`TensorView::shape() <tvm::ffi::TensorView::shape>` and :cpp:func:`Tensor::shape() <tvm::ffi::Tensor::shape>`
+   shape array
+
+ :cpp:func:`TensorView::dtype() <tvm::ffi::TensorView::dtype>` and :cpp:func:`Tensor::dtype() <tvm::ffi::Tensor::dtype>`
+   element data type
+
+ :cpp:func:`TensorView::data_ptr() <tvm::ffi::TensorView::data_ptr>` and :cpp:func:`Tensor::data_ptr() <tvm::ffi::Tensor::data_ptr>`
+   base pointer to the tensor's data
+
+ :cpp:func:`TensorView::device() <tvm::ffi::TensorView::device>` and :cpp:func:`Tensor::device() <tvm::ffi::Tensor::device>`
+   device type and id
+
+ :cpp:func:`TensorView::byte_offset() <tvm::ffi::TensorView::byte_offset>` and :cpp:func:`Tensor::byte_offset() <tvm::ffi::Tensor::byte_offset>`
+   byte offset to the first element
+
+ :cpp:func:`TensorView::ndim() <tvm::ffi::TensorView::ndim>` and :cpp:func:`Tensor::ndim() <tvm::ffi::Tensor::ndim>`
+   number of dimensions (:cpp:func:`ShapeView::size <tvm::ffi::ShapeView::size>`)
+
+ :cpp:func:`TensorView::numel() <tvm::ffi::TensorView::numel>` and :cpp:func:`Tensor::numel() <tvm::ffi::Tensor::numel>`
+   total number of elements (:cpp:func:`ShapeView::Product <tvm::ffi::ShapeView::Product>`)
+
+
+
+PyTorch Interop
+~~~~~~~~~~~~~~~
+
+The Python-facing :py:class:`tvm_ffi.Tensor` is a managed n-dimensional array that:
+
+* Can be created via :py:func:`tvm_ffi.from_dlpack(ext_tensor, ...) <tvm_ffi.from_dlpack>` to import tensors from external frameworks, e.g. :ref:`PyTorch <ship-to-pytorch>`, :ref:`JAX <ship-to-jax>`, :ref:`NumPy/CuPy <ship-to-numpy>`.
+* Implements the DLPack protocol so it can be passed back to frameworks without copying, e.g. :py:func:`torch.from_dlpack`.
+
+Typical import pattern:
+
+.. code-block:: python
+
+   import tvm_ffi
+   import torch
+
+   x_torch = torch.randn(1024, device="cuda")
+   x_tvm_ffi = tvm_ffi.from_dlpack(x_torch, require_contiguous=True)
+   x_torch_again = torch.from_dlpack(x_tvm_ffi)
+
+
+In this example, :py:meth:`tvm_ffi.from_dlpack` creates a ``x_tvm_ffi`` that views the same memory as ``x_torch``.
+Similarly, :py:func:`torch.from_dlpack` creates ``x_torch_again``, which views the same memory as ``x_tvm_ffi`` and ``x_torch``.
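+
+As a quick sanity check that all three handles share one buffer (an illustrative continuation of the snippet above):
+
+.. code-block:: python
+
+   x_torch_again.add_(1.0)                     # in-place update through the re-imported tensor
+   assert torch.equal(x_torch, x_torch_again)  # the change is visible through the original tensor too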
+
+
+C++ Allocation
+~~~~~~~~~~~~~~
+
+TVM-FFI is not a kernel library per se and is not linked to any specific device memory allocator or runtime.
+However, for kernel library developers, it provides standardized allocation entry points by
+interfacing with the surrounding framework's allocator. For example, it uses PyTorch's allocator when running inside
+a PyTorch environment.
+
+**Env Allocator.** Use :cpp:func:`Tensor::FromEnvAlloc() <tvm::ffi::Tensor::FromEnvAlloc>` along with the C API
+:cpp:func:`TVMFFIEnvTensorAlloc` to allocate a tensor using the framework's allocator.
+
+.. code-block:: cpp
+
+   Tensor tensor = Tensor::FromEnvAlloc(
+       TVMFFIEnvTensorAlloc,
+       /*shape=*/{1, 2, 3},
+       /*dtype=*/DLDataType({kDLFloat, 32, 1}),
+       /*device=*/DLDevice({kDLCPU, 0})
+   );
+
+In a PyTorch environment, this is equivalent to :py:func:`torch.empty`.
+
+.. warning::
+
+   While allocation APIs are available, it is generally **recommended** to avoid allocating tensors inside kernels.
+   Instead, prefer pre-allocating outputs and passing them in as :cpp:class:`tvm::ffi::TensorView` parameters.
+   Reasons include:
+
+   - Avoiding fragmentation and performance pitfalls;
+   - Avoiding CUDA graph incompatibilities on GPU;
+   - Allowing the outer framework to control allocation policy (pools, device strategies, etc.).
+
+**Custom Allocator.** Use :cpp:func:`Tensor::FromNDAlloc(custom_alloc, ...) <tvm::ffi::Tensor::FromNDAlloc>`,
+or its advanced variant :cpp:func:`Tensor::FromNDAllocStrided(custom_alloc, ...) <tvm::ffi::Tensor::FromNDAllocStrided>`,
+to allocate a tensor with a user-provided allocation callback.
+
+Below is an example that uses ``cudaMalloc``/``cudaFree`` as a custom allocator for GPU tensors.
+
+.. code-block:: cpp
+
+   struct CUDANDAlloc {
+     void AllocData(DLTensor* tensor) {
+       size_t data_size = ffi::GetDataSize(*tensor);
+       void* ptr = nullptr;
+       cudaError_t err = cudaMalloc(&ptr, data_size);
+       TVM_FFI_ICHECK_EQ(err, cudaSuccess) << "cudaMalloc failed: " << cudaGetErrorString(err);
+       tensor->data = ptr;
+     }
+
+     void FreeData(DLTensor* tensor) {
+       if (tensor->data != nullptr) {
+         cudaError_t err = cudaFree(tensor->data);
+         TVM_FFI_ICHECK_EQ(err, cudaSuccess) << "cudaFree failed: " << cudaGetErrorString(err);
+         tensor->data = nullptr;
+       }
+     }
+   };
+
+   ffi::Tensor cuda_tensor = ffi::Tensor::FromNDAlloc(
+       CUDANDAlloc(),
+       /*shape=*/{3, 4, 5},
+       /*dtype=*/DLDataType({kDLFloat, 32, 1}),
+       /*device=*/DLDevice({kDLCUDA, 0})
+   );
+
+C++ Stream Handling
+~~~~~~~~~~~~~~~~~~~
+
+Besides tensors, stream context is another key concept in a kernel library, especially for kernel execution. While CUDA itself does not expose a global "current stream" context, frameworks like PyTorch maintain a current stream per device (:py:func:`torch.cuda.current_stream`), and kernel libraries must read that stream from the surrounding environment.
+
+As a hardware-agnostic abstraction layer, TVM-FFI is not linked to any specific stream-management library. To ensure GPU kernels still launch on the correct stream, it provides standardized APIs to obtain the stream context from the surrounding framework (e.g. PyTorch).
+
+**Obtain Stream Context.** Use the C API :cpp:func:`TVMFFIEnvGetStream` to obtain the current stream for a given device.
+
+.. code-block:: c++
Review Comment:

For consistency with other C++ code blocks in this file, please use `cpp` instead of `c++` as the language identifier.
```suggestion
.. code-block:: cpp
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]