Thank you for proposing this! I left a comment on the PR as well, but I'm excited for this to standardize a few concepts that I have run into whilst working on ADBC and GeoArrow:
- Properly returning an array with >1 dimension from the PostgreSQL ADBC driver
- As the basis for encoding raster tiles as rows in a table (e.g.,
  http://www.geopackage.org/spec/#_tile_matrix_introduction )

Excited to see the PR progress!

-dewey

On Thu, Aug 17, 2023 at 9:54 AM Rok Mihevc <rok.mih...@gmail.com> wrote:
>
> Hey all!
>
> Besides the recently added FixedShapeTensor [1] canonical extension type
> there appears to be a need for an already proposed VariableShapeTensor
> [2]. VariableShapeTensor would store tensors of variable shapes but
> uniform number of dimensions, dimension names and dimension permutations.
>
> There are examples of such types: Ray implements
> ArrowVariableShapedTensorType [3] and pytorch implements torch.nested [4].
>
> I propose we discuss adding the below text to
> format/CanonicalExtensions.rst to read as [5] and a C++/Python
> implementation as proposed in [6]. A vote can be called after a discussion
> here.
>
> Variable shape tensor
> =====================
>
> * Extension name: `arrow.variable_shape_tensor`.
>
> * The storage type of the extension is: ``StructArray`` where struct
>   is composed of **data** and **shape** fields describing a single
>   tensor per row:
>
>   * **data** is a ``List`` holding tensor elements of a single tensor.
>     Data type of the list elements is uniform across the entire column
>     and also provided in metadata.
>   * **shape** is a ``FixedSizeList`` of the tensor shape where
>     the size of the list is equal to the number of dimensions of the
>     tensor.
>
> * Extension type parameters:
>
>   * **value_type** = the Arrow data type of individual tensor elements.
>   * **ndim** = the number of dimensions of the tensor.
>
>   Optional parameters describing the logical layout:
>
>   * **dim_names** = explicit names to tensor dimensions
>     as an array. The length of it should be equal to the shape
>     length and equal to the number of dimensions.
>
>     ``dim_names`` can be used if the dimensions have well-known
>     names and they map to the physical layout (row-major).
>
>   * **permutation** = indices of the desired ordering of the
>     original dimensions, defined as an array.
>
>     The indices contain a permutation of the values [0, 1, .., N-1] where
>     N is the number of dimensions. The permutation indicates which
>     dimension of the logical layout corresponds to which dimension of the
>     physical tensor (the i-th dimension of the logical view corresponds
>     to the dimension with number ``permutations[i]`` of the physical
>     tensor).
>
>     Permutation can be useful in case the logical order of
>     the tensor is a permutation of the physical order (row-major).
>
>     When logical and physical layout are equal, the permutation will always
>     be ([0, 1, .., N-1]) and can therefore be left out.
>
> * Description of the serialization:
>
>   The metadata must be a valid JSON object including number of
>   dimensions of the contained tensors as an integer with key **"ndim"**
>   plus optional dimension names with keys **"dim_names"** and ordering of
>   the dimensions with key **"permutation"**.
>
>   - Example: ``{ "ndim": 2}``
>
>   - Example with ``dim_names`` metadata for NCHW ordered data:
>
>     ``{ "ndim": 3, "dim_names": ["C", "H", "W"]}``
>
>   - Example of permuted 3-dimensional tensor:
>
>     ``{ "ndim": 3, "permutation": [2, 0, 1]}``
>
>     This is the physical layout shape and the shape of the logical
>     layout would, given an individual tensor of shape [100, 200, 500],
>     be ``[500, 100, 200]``.
>
> .. note::
>
>    Elements in a variable shape tensor extension array are stored
>    in row-major/C-contiguous order.
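
To make the permutation rule and the JSON metadata in the quoted text concrete, here is a minimal pure-Python sketch. It is only an illustration of the text as written, not the implementation from the PR; the helper names ``logical_shape`` and ``serialize_metadata`` are hypothetical and are not part of any Arrow API.

```python
import json


def logical_shape(physical_shape, permutation=None):
    """Logical shape of one tensor, given its physical (row-major) shape.

    Follows the quoted rule: the i-th logical dimension corresponds to
    physical dimension permutation[i]. A missing permutation means identity.
    """
    ndim = len(physical_shape)
    if permutation is None:
        permutation = list(range(ndim))
    if sorted(permutation) != list(range(ndim)):
        raise ValueError("permutation must be a permutation of [0, 1, .., N-1]")
    return [physical_shape[p] for p in permutation]


def serialize_metadata(ndim, dim_names=None, permutation=None):
    """Build the JSON metadata string with "ndim" and the optional keys."""
    meta = {"ndim": ndim}
    if dim_names is not None:
        if len(dim_names) != ndim:
            raise ValueError("dim_names must have exactly ndim entries")
        meta["dim_names"] = list(dim_names)
    if permutation is not None:
        if sorted(permutation) != list(range(ndim)):
            raise ValueError("permutation must be a permutation of [0, 1, .., N-1]")
        meta["permutation"] = list(permutation)
    return json.dumps(meta)


# The example from the proposal: physical shape [100, 200, 500], permutation [2, 0, 1].
print(logical_shape([100, 200, 500], [2, 0, 1]))    # [500, 100, 200]
print(serialize_metadata(3, permutation=[2, 0, 1]))  # {"ndim": 3, "permutation": [2, 0, 1]}
```

The sketch validates the lengths of ``dim_names`` and ``permutation`` against ``ndim`` because the quoted text requires both to match the number of dimensions.
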
>
> [1] https://github.com/apache/arrow/issues/33924
> [2] https://github.com/apache/arrow/issues/24868
> [3] https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L528-L809
> [4] https://pytorch.org/docs/stable/nested.html
> [5] https://github.com/apache/arrow/blob/db8d764ac3e47fa22df13b32fa77b3ad53166d58/docs/source/format/CanonicalExtensions.rst#variable-shape-tensor
> [6] https://github.com/apache/arrow/pull/37166
>
> Best,
> Rok
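
For readers following the thread, the storage layout described in the quoted proposal (one struct per row with a **data** list and a fixed-length **shape** list, elements in row-major order) can be sketched in plain Python as below. This models the layout with ordinary dicts and lists rather than Arrow arrays; ``NDIM``, ``to_row`` and ``element`` are hypothetical names chosen for illustration.

```python
from math import prod

NDIM = 2  # assumed number of dimensions shared by every tensor in the column


def _flatten(nested, depth):
    """Flatten a nested list of the given depth in row-major order."""
    if depth == 0:
        return [nested]
    out = []
    for item in nested:
        out.extend(_flatten(item, depth - 1))
    return out


def to_row(nested, shape):
    """Pack one tensor as a {"data", "shape"} struct, one struct per row."""
    if len(shape) != NDIM:
        raise ValueError("every tensor in the column must have NDIM dimensions")
    flat = _flatten(nested, len(shape))
    if len(flat) != prod(shape):
        raise ValueError("number of elements does not match the shape")
    return {"data": flat, "shape": list(shape)}


def element(row, index):
    """Read one element by multi-dimensional index using row-major offsets."""
    offset = 0
    for i, dim in zip(index, row["shape"]):
        if not 0 <= i < dim:
            raise IndexError("index out of bounds for tensor shape")
        offset = offset * dim + i
    return row["data"][offset]


# Two tensors with different shapes but the same number of dimensions.
column = [
    to_row([[1, 2, 3], [4, 5, 6]], (2, 3)),
    to_row([[7], [8]], (2, 1)),
]
print(column[0])                   # {'data': [1, 2, 3, 4, 5, 6], 'shape': [2, 3]}
print(element(column[0], (1, 2)))  # 6
```

The "data" lists vary in length from row to row while every "shape" list has exactly ``NDIM`` entries, which mirrors the ``List`` and ``FixedSizeList`` fields in the proposal.
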