Thank you for proposing this! I left a comment on the PR as well, and
I'm excited for this to standardize a few concepts that I have run
into whilst working on ADBC and GeoArrow:

- Properly returning an array with >1 dimension from the PostgreSQL ADBC driver
- As the basis for encoding raster tiles as rows in a table (e.g.,
http://www.geopackage.org/spec/#_tile_matrix_introduction )

Excited to see the PR progress!

-dewey

On Thu, Aug 17, 2023 at 9:54 AM Rok Mihevc <rok.mih...@gmail.com> wrote:
>
> Hey all!
>
>
> Besides the recently added FixedShapeTensor [1] canonical extension type,
> there appears to be a need for an already proposed VariableShapeTensor
> [2]. VariableShapeTensor would store tensors of variable shapes but with
> a uniform number of dimensions, dimension names and dimension
> permutations.
>
> There are examples of such types: Ray implements
> ArrowVariableShapedTensorType [3] and pytorch implements torch.nested [4].
>
> I propose we discuss adding the below text to
> format/CanonicalExtensions.rst to read as [5] and a C++/Python
> implementation as proposed in [6]. A vote can be called after a discussion
> here.
>
> Variable shape tensor
> =====================
>
> * Extension name: `arrow.variable_shape_tensor`.
>
> * The storage type of the extension is: ``StructArray`` where struct
>
>   is composed of **data** and **shape** fields describing a single
>
>   tensor per row:
>
>   * **data** is a ``List`` holding tensor elements of a single tensor.
>
>     Data type of the list elements is uniform across the entire column
>
>     and also provided in metadata.
>
>   * **shape** is a ``FixedSizeList`` of the tensor shape where
>
>     the size of the list is equal to the number of dimensions of the
>
>     tensor.
>
> * Extension type parameters:
>
>   * **value_type** = the Arrow data type of individual tensor elements.
>
>   * **ndim** = the number of dimensions of the tensor.
>
>   Optional parameters describing the logical layout:
>
>   * **dim_names** = explicit names of the tensor dimensions
>
>     as an array. Its length should be equal to the shape
>
>     length, i.e. to the number of dimensions.
>
>     ``dim_names`` can be used if the dimensions have well-known
>
>     names and they map to the physical layout (row-major).
>
>   * **permutation**  = indices of the desired ordering of the
>
>     original dimensions, defined as an array.
>
>     The indices contain a permutation of the values [0, 1, .., N-1] where
>
>     N is the number of dimensions. The permutation indicates which
>
>     dimension of the logical layout corresponds to which dimension of the
>
>     physical tensor (the i-th dimension of the logical view corresponds
>
>     to the dimension with number ``permutation[i]`` of the physical
>
>     tensor).
>
>     Permutation can be useful in case the logical order of
>
>     the tensor is a permutation of the physical order (row-major).
>
>     When logical and physical layout are equal, the permutation will always
>
>     be ([0, 1, .., N-1]) and can therefore be left out.
>
> * Description of the serialization:
>
>   The metadata must be a valid JSON object including the number of
>
>   dimensions of the contained tensors as an integer with key **"ndim"**
>
>   plus optional dimension names with key **"dim_names"** and ordering of
>
>   the dimensions with key **"permutation"**.
>
>   - Example: ``{ "ndim": 2}``
>
>   - Example with ``dim_names`` metadata for NCHW ordered data, where N
>
>     is the row index in the column and each row holds one CHW tensor:
>
>     ``{ "ndim": 3, "dim_names": ["C", "H", "W"]}``
>
>   - Example of permuted 3-dimensional tensor:
>
>     ``{ "ndim": 3, "permutation": [2, 0, 1]}``
>
>     For an individual tensor whose physical layout shape is
>
>     ``[100, 200, 500]``, the shape of the logical layout would
>
>     be ``[500, 100, 200]``.
>
> .. note::
>
>   Elements in a variable shape tensor extension array are stored
>
>   in row-major/C-contiguous order.
>
>
> [1] https://github.com/apache/arrow/issues/33924
>
> [2] https://github.com/apache/arrow/issues/24868
>
> [3]
> https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L528-L809
>
> [4] https://pytorch.org/docs/stable/nested.html
>
> [5]
> https://github.com/apache/arrow/blob/db8d764ac3e47fa22df13b32fa77b3ad53166d58/docs/source/format/CanonicalExtensions.rst#variable-shape-tensor
>
> [6] https://github.com/apache/arrow/pull/37166
>
>
>
> Best,
>
> Rok
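
To make the quoted layout concrete, here is a minimal sketch in plain Python. It is not the Arrow C++/Python implementation from the PR: the helper names `make_metadata`, `logical_shape` and `make_row` are hypothetical, and rows are modelled as ordinary dicts rather than a real ``StructArray``. It only illustrates the rules stated above: the JSON metadata keys, one **data**/**shape** pair per row, and the permutation rule where the i-th logical dimension is physical dimension ``permutation[i]``.

```python
import json
from math import prod


def make_metadata(ndim, dim_names=None, permutation=None):
    """Serialize the extension metadata as JSON (hypothetical helper)."""
    if dim_names is not None and len(dim_names) != ndim:
        raise ValueError("dim_names must have exactly ndim entries")
    if permutation is not None and sorted(permutation) != list(range(ndim)):
        raise ValueError("permutation must be a permutation of 0..ndim-1")
    meta = {"ndim": ndim}
    if dim_names is not None:
        meta["dim_names"] = list(dim_names)
    if permutation is not None:
        meta["permutation"] = list(permutation)
    return json.dumps(meta)


def logical_shape(physical_shape, permutation=None):
    """Logical dimension i is physical dimension permutation[i]."""
    if permutation is None:
        # Identity permutation: logical and physical layouts are equal.
        return list(physical_shape)
    return [physical_shape[p] for p in permutation]


def make_row(values, shape):
    """Model one row: row-major elements plus the tensor's shape."""
    values = list(values)
    if len(values) != prod(shape):
        raise ValueError("number of elements must equal the product of shape")
    return {"data": values, "shape": list(shape)}


# Two rows with different shapes but the same number of dimensions (ndim = 2).
rows = [make_row(range(6), [2, 3]), make_row(range(4), [1, 4])]
metadata = make_metadata(3, permutation=[2, 0, 1])
permuted = logical_shape([100, 200, 500], [2, 0, 1])
```

Under this reading, `permuted` is the logical shape for the permuted 3-dimensional example in the proposal, and every row in `rows` carries a shape list of the same length even though the shapes themselves differ.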
