Thanks for all the input and feedback so far!
I went ahead and added the uniform_shape parameter, using the second
notation option, to the proposal [1].

More discussion could be valuable here, so I'd like to call the vote on
Wednesday.

[1]
https://github.com/apache/arrow/pull/37166/files/9c827a0ba54280f4695202e17e32902986c4f12f#diff-b54425cb176b53e51925c13a4d4e85cf7d03d4e1226e6d5bf4d7ae09923db8b3

Best,
Rok

On Sat, Sep 16, 2023 at 3:11 PM Rok Mihevc <rok.mih...@gmail.com> wrote:

> I agree, the increased complexity is probably not worth the savings
> from keeping only shapes of ragged dimensions.
> However I still think it could be valuable to optionally store uniform
> dimension shape as metadata. So:
>
> **uniform_shape** = Sizes of all contained tensors in their uniform
> dimensions.
>
> Given we have a series of 3-dimensional image tensors [H, W, C] with
> variable width (uniform_dimensions = [0, 2]):
>
> Notation option 1:
> **uniform_shape** = [400,3]
>
> Notation option 2:
> **uniform_shape** = [400,0,3]
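For concreteness, here is a minimal sketch (plain Python; the helper name and its signature are illustrative, not part of the proposal) of how a reader could reconstruct a tensor's full shape under notation option 2, where zeros in uniform_shape mark the ragged dimensions:

```python
# Illustrative sketch, not the extension implementation: under notation
# option 2, uniform_shape stores the uniform sizes and 0 marks each
# ragged dimension, so a tensor's full shape is recovered by filling the
# zeros with that tensor's per-dimension ragged sizes in order.
def full_shape(uniform_shape, ragged_sizes):
    """Merge one tensor's ragged sizes into the uniform_shape template."""
    it = iter(ragged_sizes)
    return [next(it) if s == 0 else s for s in uniform_shape]

# [H, W, C] images with uniform height 400 and 3 channels; one tensor
# with width 640:
print(full_shape([400, 0, 3], [640]))  # -> [400, 640, 3]
```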
>
> Best,
> Rok
>
> On Sat, Sep 16, 2023 at 4:34 AM Jeremy Leibs <jer...@rerun.io> wrote:
> >
> > On Fri, Sep 15, 2023 at 8:32 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
> >
> > >
> > > How about also changing shape and adding uniform_shape like so:
> > > """
> > > **shape** is a ``FixedSizeList<uint32>[ndim_ragged]`` holding the ragged
> > > shape of each tensor contained in ``data``, where the size of the list
> > > ``ndim_ragged`` equals the number of dimensions of the tensor minus the
> > > number of uniform dimensions.
> > > [..]
> > > **uniform_shape**
> > > Sizes of all contained tensors in their uniform dimensions.
> > > """
> > >
> > > This would make the shape array smaller (in width) if more uniform
> > > dimensions were provided. However, it would slightly increase the
> > > complexity of the extension type.
> > >
> > >
> > This trade-off doesn't seem worthwhile to me.
> >  - The shape array will almost always be dramatically smaller than the
> > tensor data itself, so the space savings are unlikely to be meaningful in
> > practice.
> >  - On the other hand, coding up the index offset math for a sparsely
> > represented shape with implicitly interleaved uniform dimensions is much
> > more error prone (and less efficient).
> >  - Even just consider answering a simple question like "What is the size
> > of dimension N?":
> >
> > If `shape` always contains all the dimensions, this is trivially
> > `shape[N]` (or `shape[permutations[N]]` if a permutation was specified).
> >
> > On the other hand, if `shape` only contains the ragged/variable
> > dimensions, this lookup instead becomes something like:
> > ```
> > offset = count(uniform_dimensions < N)
> > shape[N - offset]
> > ```
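To make the point above concrete, here is a small runnable sketch (plain Python; all names are illustrative assumptions, not the extension's API) of the sparse lookup, assuming `uniform_dimensions` is sorted:

```python
# Illustrative sketch of the sparse lookup described above; the names
# dim_size / shape / uniform_dimensions / uniform_shape are assumptions,
# not the extension type's actual API.
def dim_size(N, shape, uniform_dimensions, uniform_shape):
    """Size of logical dimension N when `shape` stores only ragged sizes."""
    if N in uniform_dimensions:
        # Uniform dimension: read from the uniform_shape metadata.
        return uniform_shape[uniform_dimensions.index(N)]
    # Ragged dimension: skip the uniform dimensions preceding N.
    offset = sum(1 for d in uniform_dimensions if d < N)  # the count() step
    return shape[N - offset]

# [H, W, C] images with uniform dims [0, 2] (uniform_shape = [400, 3])
# and a per-tensor ragged shape of [640] for the width:
print(dim_size(1, [640], [0, 2], [400, 3]))  # -> 640 (ragged width)
print(dim_size(0, [640], [0, 2], [400, 3]))  # -> 400 (uniform height)
```

With a full `shape` array, this whole function collapses to `shape[N]`.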
> >
> > Maybe this doesn't seem too bad at first, but does everyone implement
> > this as `count()`? Does someone implement it as
> > `find_lower_bound(uniform_dimensions, N)`? Did they validate that
> > `uniform_dimensions` was specified as a sorted list?
> >
> > Now for added risk of errors, consider how this interacts with the
> > `permutation`... in my opinion there is way too much thinking required to
> > figure out if the correct value is: `shape[permutations[N] - offset]` or
> > `shape[permutations[N - offset]]`.
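The ambiguity can be shown numerically with hypothetical values (all names assumed for illustration): the two candidate expressions produce different indices, so implementations could silently disagree.

```python
# Hypothetical example values; the point is only that the two plausible
# readings index different elements of a ragged-only `shape`.
permutations = [0, 2, 1]   # example permutation of 3 dimensions
uniform_dimensions = [0]   # dim 0 is uniform, dims 1 and 2 are ragged

def offset(N):
    # number of uniform dimensions preceding logical dimension N
    return sum(1 for d in uniform_dimensions if d < N)

N = 2
index_a = permutations[N] - offset(N)   # reading 1: shape[permutations[N] - offset]
index_b = permutations[N - offset(N)]   # reading 2: shape[permutations[N - offset]]
print(index_a, index_b)  # -> 0 2 : the two readings disagree
```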
> >
> > Arrow design guidance typically skews heavily in favor of efficient
> > deterministic access over maximally space-efficient representations.
> >
> > Best,
> > Jeremy
>
