Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

Jeremy Leibs Fri, 15 Sep 2023 19:34:10 -0700

On Fri, Sep 15, 2023 at 8:32 PM Rok Mihevc <rok.mih...@gmail.com> wrote:


>
> How about also changing shape and adding uniform_shape like so:
> """
> **shape** is a ``FixedSizeList<uint32>[ndim_ragged]`` of ragged shape
> of each tensor contained in ``data`` where the size of the list
> ``ndim_ragged`` is equal to the number of dimensions of tensor
> subtracted by the number of ragged dimensions.
> [..]
> **uniform_shape**
> Sizes of all contained tensors in their uniform dimensions.
> """
>
> This would make shape array smaller (in width) if more uniform
> dimensions were provided. However it would increase the complexity of
> the extension type a little bit.
>
>
This trade-off doesn't seem worthwhile to me.
 - Shape array will almost always be dramatically smaller than the tensor
data itself, so the space savings are unlikely to be meaningful in practice.
 - On the other hand, coding up the index offset math for a sparsely
represented shape with implicitly interleaved uniform dimensions is much
more error-prone (and less efficient).
 - Even just consider answering a simple question like "What is the size of
dimension N":

If `shape` always contains all the dimensions, this is trivially `shape[N]`
(or `shape[permutations[N]]` if permutations was specified).

On the other hand, if `shape` only contains the ragged/variable dimensions
this lookup instead becomes something like:
```
offset = count(uniform_dimensions < N)
shape[N - offset]
```

Maybe this doesn't seem too bad at first, but does everyone implement this
as count()? Does someone implement it as
`find_lower_bound(uniform_dimensions, N)`? Did they validate that
`uniform_dimensions` was specified as a sorted list?
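
To make the comparison concrete, here is a rough Python sketch of both
lookups. The function names are hypothetical, and I am assuming that
`uniform_dimensions` is a sorted list of dimension indices with
`uniform_shape` holding the matching sizes; the proposal may end up laying
these out differently.

```python
from bisect import bisect_left

def dim_size_full(shape, n):
    # `shape` stores every dimension, so the lookup is a plain index.
    return shape[n]

def dim_size_sparse(ragged_shape, uniform_dimensions, uniform_shape, n):
    # Assumed layout: `uniform_dimensions` is sorted, `uniform_shape[i]` is
    # the size of dimension `uniform_dimensions[i]`, and `ragged_shape`
    # holds the remaining dimensions in their original order.
    if list(uniform_dimensions) != sorted(uniform_dimensions):
        raise ValueError("uniform_dimensions must be sorted")
    # Number of uniform dimensions strictly below n (the count() above).
    offset = bisect_left(uniform_dimensions, n)
    if offset < len(uniform_dimensions) and uniform_dimensions[offset] == n:
        # n is itself uniform, so its size lives in `uniform_shape`.
        return uniform_shape[offset]
    return ragged_shape[n - offset]

# One tensor of shape (2, 3, 5, 7) where dimensions 1 and 3 are uniform.
full = [2, 3, 5, 7]
sizes = [dim_size_sparse([2, 5], [1, 3], [3, 7], n) for n in range(4)]
```

Note that the sparse version also needs the extra branch for "N is itself a
uniform dimension", which the two-line pseudocode above does not cover.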

Now for added risk of errors, consider how this interacts with the
`permutation`... In my opinion there is way too much thinking required to
figure out if the correct value is: `shape[permutations[N] - offset]` or
`shape[permutations[N - offset]]`.
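
A small Python sketch of how easily this goes wrong. The setup is an
assumption on my part: `permutations[N]` gives the physical dimension for
logical dimension N (as in the `shape[permutations[N]]` case above),
`uniform_dimensions` holds physical indices, and all helper names are
hypothetical.

```python
full_shape = [2, 3, 5, 7, 11]   # physical sizes of one tensor
uniform_dimensions = [1]        # assumed: physical dimension 1 is uniform
# Sparse shape: the physical sizes with the uniform dimensions removed.
ragged_shape = [s for i, s in enumerate(full_shape)
                if i not in uniform_dimensions]
permutations = [2, 0, 1, 3, 4]  # logical dimension -> physical dimension

def offset(n):
    # Number of uniform dimensions strictly below index n.
    return sum(1 for d in uniform_dimensions if d < n)

def reference(n):
    # Ground truth from the full shape, for ragged logical dimensions.
    return full_shape[permutations[n]]

def candidate_a(n):
    return ragged_shape[permutations[n] - offset(n)]

def candidate_b(n):
    return ragged_shape[permutations[n - offset(n)]]

def offset_after_permuting(n):
    # Compute the offset from the physical index, not the logical one.
    p = permutations[n]
    return ragged_shape[p - offset(p)]
```

Under these assumptions the two candidates disagree for N = 3 (7 versus 5),
and for N = 0 both return 7 while the full shape says 5. Only the variant
that computes the offset from the permuted index matches the reference for
every ragged dimension here, and that is a third expression that neither of
the two obvious spellings captures.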

Arrow design guidance typically skews heavily in favor of efficient
deterministic access over maximally space-efficient representations.

Best,
Jeremy
