Re: [VOTE] Add RLE Arrays to Arrow Format

Andrew Lamb Fri, 16 Dec 2022 13:42:11 -0800

+1 on the proposal as written

I think it makes sense and offers exciting opportunities for faster
computation (especially for cases where parquet files can be decoded
directly into such an array and avoid unpacking. RLE encoded dictionary are
quite compelling)


I would prefer to use the term Run-End-Encoding (which would also follow
the naming of the internal fields) but I don't view that as a deal blocker.

Thank you for all your work in this matter,
Andrew

On Wed, Dec 14, 2022 at 5:08 PM Matt Topol <[email protected]> wrote:

> I'm not at all opposed to renaming it as `Run-End-Encoding` if that would
> be preferable. Hopefully others will chime in with their feedback.
>
> --Matt
>
> On Wed, Dec 14, 2022 at 12:09 PM Ian Cook <[email protected]> wrote:
>
> > Thank you Matt, Tobias, and others for the great work on this.
> >
> > I am -0.5 on this proposal in its current form because (pardon the
> > pedantry) what we have implemented here is not run-length encoding; it
> > is run-end encoding. Based on community input, the choice was made to
> > store run ends instead of run lengths because this enables O(log(N))
> > random access as opposed to O(N). This is a sensible choice, but it
> > comes with some trade-offs including limitations in array length
> > (which maybe not really a problem in practice) and lack of bit-for-bit
> > equivalence with RLE encodings that use run lengths like Velox's
> > SequenceVector encoding (which I think is a more serious problem in
> > practice).
> >
> > I believe that we should either:
> > (a) rename this to "run-end encoding"
> > (b) change this to a parameterized type called "run encoding" that
> > takes a Boolean parameter specifying whether run lengths or run ends
> > are stored.
> >
> > Ian
> >
> > On Wed, Dec 14, 2022 at 11:27 AM Matt Topol <[email protected]>
> > wrote:
> > >
> > > Hello,
> > >
> > > I'd like to propose adding the RLE type based on earlier
> > discussions[1][2]
> > > to the Arrow format:
> > > - Columnar Format description:
> > >
> >
> https://github.com/apache/arrow/pull/13333/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60
> > > - Flatbuffers changes:
> > >
> >
> https://github.com/apache/arrow/pull/14176/files#diff-e54b4f5d2d279acc5d1df5df9a7636f0142a8041fe02f07034e0d8be48444b07
> > >
> > > There is a proposed implementation available in both C++ (written by
> > Tobias
> > > Zagorni) and Go[3][4]. Both implementations have mostly the same tests
> > > implemented and were tested to be compatible over IPC with an archery
> > test.
> > > In both cases, the implementations are split out among several Draft
> PRs
> > so
> > > that they can be easily reviewed piecemeal if the vote is approved,
> with
> > > each Draft PR including the changes of the one before it. The links
> > > provided are the Draft PRs with the entirety of the changes included.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 add the proposed RLE type to the Apache Arrow format
> > > [ ] -1 do not add the proposed RLE type to the Apache Arrow format
> > > because...
> > >
> > > Thanks much, and please let me know if any more information or links
> are
> > > needed (I've never proposed a vote before on here!)
> > >
> > > --Matt
> > >
> > > [1] https://lists.apache.org/thread/bfz3m5nyf7flq7n6q9b1bx3jhcn4wq29
> > > [2] https://lists.apache.org/thread/xb7c723csrtwt0md3m4p56bt0193n7jq
> > > [3] https://github.com/apache/arrow/pull/14179
> > > [4] https://github.com/apache/arrow/pull/14223
> >
>

Re: [VOTE] Add RLE Arrays to Arrow Format

Reply via email to