On Fri, Sep 15, 2023 at 8:32 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
> > How about also changing shape and adding uniform_shape like so: > """ > **shape** is a ``FixedSizeList<uint32>[ndim_ragged]`` of ragged shape > of each tensor contained in ``data`` where the size of the list > ``ndim_ragged`` is equal to the number of dimensions of tensor > subtracted by the number of ragged dimensions. > [..] > **uniform_shape** > Sizes of all contained tensors in their uniform dimensions. > """ > > This would make shape array smaller (in width) if more uniform > dimensions were provided. However it would increase the complexity of > the extension type a little bit. > > This trade-off doesn't seem worthwhile to me. - Shape array will almost always be dramatically smaller than the tensor data itself, so the space savings are unlikely to be meaningful in practice. - On the other hand, coding up the index offset math for a sparsely represented shape with implicitly interleaved uniform dimensions is much more error prone (and less efficient). - Even just consider answering a simple question like "What is the size of dimension N": If `shape` always contains all the dimensions, this is trivially `shape[N]` (or `shape[permuations[N]]` if permutations was specified.) On the other hand, if `shape` only contains the ragged/variable dimensions this lookup instead becomes something like: ``` offset = count(uniform_dimensions < N) shape[N - offset] ``` Maybe this doesn't seem too bad at first, but does everyone implement this as count()? Does someone implement it as `find_lower_bound(uniform_dimension, N)`? Did they validate that `uniform_dimensions` was specified as a sorted list? Now for added risk of errors, consider how this interacts with the `permuation`... in my opinion there is way too much thinking required to figure out if the correct value is: `shape[permutations[N] - offset]` or `shape[permutations[N - offset]]`. Arrow design guidance typically skews heavily in favor of efficient deterministic access over maximally space-efficient representations. Best, Jeremy