I agree, the increased complexity is probably not worth the savings from keeping only shapes of ragged dimensions. However I still think it could be valuable to optionally store uniform dimension shape as metadata. So:
**uniform_shape** = Sizes of all contained tensors in their uniform dimensions. Given we have a series of 3 dimensional image tensors [H, W, C] with variable width (uniform_dimenions = [0, 2]): Notation option 1: **uniform_shape** = [400,3] Notation option 2: **uniform_shape** = [400,0,3] Best, Rok On Sat, Sep 16, 2023 at 4:34 AM Jeremy Leibs <jer...@rerun.io> wrote: > > On Fri, Sep 15, 2023 at 8:32 PM Rok Mihevc <rok.mih...@gmail.com> wrote: > > > > > How about also changing shape and adding uniform_shape like so: > > """ > > **shape** is a ``FixedSizeList<uint32>[ndim_ragged]`` of ragged shape > > of each tensor contained in ``data`` where the size of the list > > ``ndim_ragged`` is equal to the number of dimensions of tensor > > subtracted by the number of ragged dimensions. > > [..] > > **uniform_shape** > > Sizes of all contained tensors in their uniform dimensions. > > """ > > > > This would make shape array smaller (in width) if more uniform > > dimensions were provided. However it would increase the complexity of > > the extension type a little bit. > > > > > This trade-off doesn't seem worthwhile to me. > - Shape array will almost always be dramatically smaller than the tensor > data itself, so the space savings are unlikely to be meaningful in practice. > - On the other hand, coding up the index offset math for a sparsely > represented shape with implicitly interleaved uniform dimensions is much > more error prone (and less efficient). > - Even just consider answering a simple question like "What is the size of > dimension N": > > If `shape` always contains all the dimensions, this is trivially `shape[N]` > (or `shape[permuations[N]]` if permutations was specified.) > > On the other hand, if `shape` only contains the ragged/variable dimensions > this lookup instead becomes something like: > ``` > offset = count(uniform_dimensions < N) > shape[N - offset] > ``` > > Maybe this doesn't seem too bad at first, but does everyone implement this > as count()? Does someone implement it as > `find_lower_bound(uniform_dimension, N)`? Did they validate that > `uniform_dimensions` was specified as a sorted list? > > Now for added risk of errors, consider how this interacts with the > `permuation`... in my opinion there is way too much thinking required to > figure out if the correct value is: `shape[permutations[N] - offset]` or > `shape[permutations[N - offset]]`. > > Arrow design guidance typically skews heavily in favor of efficient > deterministic access over maximally space-efficient representations. > > Best, > Jeremy