Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

Rok Mihevc Tue, 12 Sep 2023 17:38:27 -0700

After some discussion on the PR [https://github.com/apache/arrow/pull/37166],
we've altered the proposed type by removing the ndim parameter and
adding a ragged_dimensions one.
If there is no further feedback, I'd like to call for a vote early next
week. The proposed language now reads:


Variable shape tensor
=====================

* Extension name: ``arrow.variable_shape_tensor``.

* The storage type of the extension is: ``StructArray`` where struct
  is composed of **data** and **shape** fields describing a single
  tensor per row:

  * **data** is a ``List`` holding tensor elements of a single tensor.
    Data type of the list elements is uniform across the entire column
    and also provided in metadata.
  * **shape** is a ``FixedSizeList<uint32>[ndim]`` of the tensor shape where
    the size of the list ``ndim`` is equal to the number of dimensions of
    the tensor.

* Extension type parameters:

  * **value_type** = the Arrow data type of individual tensor elements.

  Optional parameters describing the logical layout:

  * **dim_names** = explicit names of the tensor dimensions
    as an array. Its length should be equal to the length of the
    shape, i.e. to the number of dimensions.

    ``dim_names`` can be used if the dimensions have well-known
    names and they map to the physical layout (row-major).

  * **permutation**  = indices of the desired ordering of the
    original dimensions, defined as an array.

    The indices contain a permutation of the values [0, 1, .., N-1] where
    N is the number of dimensions. The permutation indicates which
    dimension of the logical layout corresponds to which dimension of the
    physical tensor (the i-th dimension of the logical view corresponds
    to the dimension with number ``permutation[i]`` of the physical
    tensor).

    Permutation can be useful in case the logical order of
    the tensor is a permutation of the physical order (row-major).

    When logical and physical layout are equal, the permutation will always
    be ([0, 1, .., N-1]) and can therefore be left out.

  * **ragged_dimensions** = indices of ragged dimensions whose sizes may
    differ. Dimensions where all elements have the same size are called
    uniform dimensions. Indices are a subset of all possible dimension
    indices ([0, 1, .., N-1]).
    The list of ragged dimensions can be left out, in which case all
    dimensions are assumed to be ragged.

* Description of the serialization:

  The metadata must be a valid JSON object, optionally including
  dimension names with key **"dim_names"**, ordering of the dimensions
  with key **"permutation"**, and indices of ragged dimensions with key
  **"ragged_dimensions"**.

  - Example with ``dim_names`` metadata for NCHW ordered data:

    ``{ "dim_names": ["C", "H", "W"] }``

  - Example with ``ragged_dimensions`` metadata for a set of color images
    with variable width:

    ``{ "dim_names": ["H", "W", "C"], "ragged_dimensions": [1] }``

  - Example of permuted 3-dimensional tensor:

    ``{ "permutation": [2, 0, 1] }``

    The stored shape is that of the physical layout. Given an
    individual tensor with physical shape ``[100, 200, 500]``, the
    shape of the logical layout would be ``[500, 100, 200]``.
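
To make the metadata examples above concrete, here is a minimal sketch in plain Python. It is not part of the proposal or of any Arrow API. The helper names ``logical_shape`` and ``uniform_dims_consistent`` are hypothetical, chosen only for illustration. The first applies a ``permutation`` to a physical shape. The second checks that every dimension not listed in ``ragged_dimensions`` has the same size in all rows.

```python
import json


def logical_shape(physical_shape, permutation=None):
    """Logical dimension i corresponds to physical dimension permutation[i]."""
    if permutation is None:
        # No permutation given: logical and physical layouts coincide.
        return list(physical_shape)
    if sorted(permutation) != list(range(len(physical_shape))):
        raise ValueError("permutation must contain each of 0..N-1 exactly once")
    return [physical_shape[p] for p in permutation]


def uniform_dims_consistent(shapes, ragged_dimensions=None):
    """True if every non-ragged dimension has one size across all shapes."""
    ndim = len(shapes[0])
    # A missing list means all dimensions are assumed ragged.
    ragged = set(range(ndim)) if ragged_dimensions is None else set(ragged_dimensions)
    return all(
        len({shape[d] for shape in shapes}) == 1
        for d in range(ndim)
        if d not in ragged
    )


# Permuted 3-dimensional tensor from the example above.
meta = json.loads('{ "permutation": [2, 0, 1] }')
print(logical_shape([100, 200, 500], meta["permutation"]))  # [500, 100, 200]

# Color images with variable width: only dimension 1 ("W") may differ.
meta = json.loads('{ "dim_names": ["H", "W", "C"], "ragged_dimensions": [1] }')
shapes = [[480, 640, 3], [480, 800, 3]]
print(uniform_dims_consistent(shapes, meta["ragged_dimensions"]))  # True
```
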

.. note::

  Elements in a variable shape tensor extension array are stored
  in row-major/C-contiguous order.
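
As an illustration of the row-major storage described above, the following sketch models one row of the struct as a plain Python dict with **data** and **shape** keys. It uses no Arrow library. ``to_row`` and ``element`` are hypothetical helper names, and tensors are assumed to be rectangular nested lists with at least one dimension.

```python
def to_row(tensor):
    """Encode a rectangular nested list as data (row-major) plus shape."""
    shape = []
    level = tensor
    while isinstance(level, list):
        shape.append(len(level))
        level = level[0] if level else None
    data = tensor
    for _ in shape[1:]:
        # Flatten one nesting level at a time, outermost first,
        # which yields row-major (C-contiguous) element order.
        data = [x for sub in data for x in sub]
    return {"data": list(data), "shape": shape}


def element(row, index):
    """Look up one element by multi-dimensional index via its row-major offset."""
    offset = 0
    for size, i in zip(row["shape"], index):
        offset = offset * size + i
    return row["data"][offset]


# Two rows of one column; each row holds a tensor of a different shape.
column = [
    to_row([[1, 2, 3], [4, 5, 6]]),        # shape [2, 3]
    to_row([[7, 8], [9, 10], [11, 12]]),   # shape [3, 2]
]
print(column[0])                   # {'data': [1, 2, 3, 4, 5, 6], 'shape': [2, 3]}
print(element(column[1], (2, 0)))  # 11
```
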


Rok
