Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

Rok Mihevc Fri, 15 Sep 2023 11:32:36 -0700

First, thanks for all the input!

On Wed, Sep 13, 2023 at 6:27 AM Alenka Frim
<ale...@voltrondata.com.invalid> wrote:
> In the PR you mention that "this [ragged dimensions] would be purely
> metadata that would help converting arrow <-> jagged/ragged". Are there any
> examples available to better understand this metadata and how it would be
> used in the conversion you mention?

After checking tf [1] / pytorch [2] I now believe List[type] would
strictly be enough to express ragged/jagged. However, storing
ragged/uniform dimensions would probably still be valuable for other
purposes.

[1] 
https://github.com/tensorflow/tensorflow/blob/v2.13.0/tensorflow/python/ops/ragged/ragged_tensor.py#L65-L305
[2] 
https://github.com/pytorch/torchrec/blob/cdd9f20cc6cf090c0d5fc91a01d45723905525e1/torchrec/sparse/jagged_tensor.py#L185-L225

On Wed, Sep 13, 2023 at 11:25 AM Antoine Pitrou <anto...@python.org> wrote:
> It's a bit confusing that an empty list means "no ragged dimensions" but
> a missing entry means "all dimensions are ragged". This seems
> error-prone to me.

Good point, I like Jeremy's proposal to address this.

> Also, to be clear, "ragged_dimensions" is only useful for data validation?

As the proposal stands, yes. I would like to amend it, see below.

On Wed, Sep 13, 2023 at 11:41 PM Jeremy Leibs <jer...@rerun.io> wrote:
> I would propose instead:
>
> **uniform_dimenions** = Indices of dimensions whose sizes are guaranteed to
> remain constant.
> Indices are a subset of all possible dimension indices ([0, 1, .., N-1]).
> The uniform dimensions must
> still be represented in the `shape` field, and must always be the same
> value for all tensors in the
> array -- this allows code to interpret the tensor correctly without
> accounting for uniform dimensions
> while still permitting optional optimizations that take advantage of the
> uniformity. Uniform_dimensions
> can be left out, in which case it is assumed that all dimensions might be
> variable.

I prefer this to the current proposal!

> Please consider adding some wording and an example such as:
> [..]

Will do.

How about also changing shape and adding uniform_shape like so:
"""
**shape** is a ``FixedSizeList<uint32>[ndim_ragged]`` of ragged shape
of each tensor contained in ``data`` where the size of the list
``ndim_ragged`` is equal to the number of dimensions of tensor
subtracted by the number of ragged dimensions.
[..]
**uniform_shape**
Sizes of all contained tensors in their uniform dimensions.
"""

This would make shape array smaller (in width) if more uniform
dimensions were provided. However it would increase the complexity of
the extension type a little bit.

Best,
Rok

Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

Reply via email to