I agree, the increased complexity is probably not worth the savings
from keeping only shapes of ragged dimensions.
However I still think it could be valuable to optionally store uniform
dimension shape as metadata. So:

**uniform_shape** = Sizes of all contained tensors in their uniform dimensions.

Given we have a series of 3 dimensional image tensors  [H, W, C] with
variable width (uniform_dimenions = [0, 2]):

Notation option 1:
**uniform_shape** = [400,3]

Notation option 2:
**uniform_shape** = [400,0,3]

Best,
Rok

On Sat, Sep 16, 2023 at 4:34 AM Jeremy Leibs <jer...@rerun.io> wrote:
>
> On Fri, Sep 15, 2023 at 8:32 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
>
> >
> > How about also changing shape and adding uniform_shape like so:
> > """
> > **shape** is a ``FixedSizeList<uint32>[ndim_ragged]`` of ragged shape
> > of each tensor contained in ``data`` where the size of the list
> > ``ndim_ragged`` is equal to the number of dimensions of tensor
> > subtracted by the number of ragged dimensions.
> > [..]
> > **uniform_shape**
> > Sizes of all contained tensors in their uniform dimensions.
> > """
> >
> > This would make shape array smaller (in width) if more uniform
> > dimensions were provided. However it would increase the complexity of
> > the extension type a little bit.
> >
> >
> This trade-off doesn't seem  worthwhile to me.
>  - Shape array will almost always be dramatically smaller than the tensor
> data itself, so the space savings are unlikely to be meaningful in practice.
>  - On the other hand, coding up the index offset math for a sparsely
> represented shape with implicitly interleaved uniform dimensions is much
> more error prone (and less efficient).
>  - Even just consider answering a simple question like "What is the size of
> dimension N":
>
> If `shape` always contains all the dimensions, this is trivially `shape[N]`
> (or `shape[permuations[N]]` if permutations was specified.)
>
> On the other hand, if `shape` only contains the ragged/variable dimensions
> this lookup instead becomes something like:
> ```
> offset = count(uniform_dimensions < N)
> shape[N - offset]
> ```
>
> Maybe this doesn't seem too bad at first, but does everyone implement this
> as count()? Does someone implement it as
> `find_lower_bound(uniform_dimension, N)`? Did they validate that
> `uniform_dimensions` was specified as a sorted list?
>
> Now for added risk of errors, consider how this interacts with the
> `permuation`... in my opinion there is way too much thinking required to
> figure out if the correct value is: `shape[permutations[N] - offset]` or
> `shape[permutations[N - offset]]`.
>
> Arrow design guidance typically skews heavily in favor of efficient
> deterministic access over maximally space-efficient representations.
>
> Best,
> Jeremy

Reply via email to