Would it be an option to use a StructArray for that? One array with the
values, and one with the repetitions:

Int32([1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 1, 2]) ->

StructArray([
    "values": Int32([1, 2, 3, 1, 2]),
    "repetitions": UInt32([1, 3, 5, 1, 1]),
])

It does not have the same API, but I think that the physical operations
would be different, anyways: ("array + 2" would only operate on "values").
I think that a small struct / object with some operator overloading would
address this, and writing something on the metadata would allow others to
consume it, a-la extension type?

On a related note, such encoding would address DataFusion's issue of
representing scalars / constant arrays: a constant array would be
represented as a repetition. Currently we just unpack (i.e. allocate) a
constant array when we want to transfer through a RecordBatch.

Best,
Jorge




On Thu, Mar 25, 2021, 10:03 Kirill Lykov <lykov.kir...@gmail.com> wrote:

> Thanks for the answer.
> I asked about it because we need it and I was about writing a summer intern
> proposal for a student to work on it.
> Looks like it could work fine.
>
> On Wed, Mar 24, 2021 at 3:49 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > The SparseTensor stuff is something else entirely (that's matrices
> > where the entries are mostly 0)
> >
> > There isn't anything to help you right now aside from dictionary
> > encoding — if your dictionary has 256 elements or less, you can use
> > uint8 index type and thus have 1 byte per value. We've discussed
> > implementing RLE in the project and so if we do that in the future
> > then a random access data structure could be built on top of RLE (in
> > principle)
> >
> > On Wed, Mar 24, 2021 at 8:53 AM Niranda Perera <niranda.per...@gmail.com
> >
> > wrote:
> > >
> > > Hi Lykov,
> > >
> > > I believe there's an arrow sparse tensor abstraction.
> > >
> > > On Wed, Mar 24, 2021, 05:05 Kirill Lykov <lykov.kir...@gmail.com>
> wrote:
> > >
> > > > Hi,
> > > >
> > > > I wonder if there is an existing way to store floats/ints with many
> > > > repetitions in some container (not sure about terminology).
> > > > For example, I might have data like A=[1, 2, 2, 2, 3, 3, 3, 3, 3, 3,
> > 1, 2]
> > > > and i would like to store only B=[1, 2, 3, 1, 2] but from user
> > > > perspective it behaves like container A. I know I can use dictionary
> > but as
> > > > far I understand it will store internally indices of the chosen
> > elements so
> > > > it makes sense more for binary data or structures.
> > > >
> > > > --
> > > > Best regards,
> > > > Kirill Lykov
> > > >
> >
>
>
> --
> Best regards,
> Kirill Lykov
>

Reply via email to