Re: [VOTE] Add RLE Arrays to Arrow Format

Weston Pace Fri, 16 Dec 2022 17:24:08 -0800

+1

I agree that run-end encoding makes more sense but also don't see it
as a deal breaker.
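
For readers skimming the thread, the naming distinction can be illustrated with a minimal sketch. It assumes each run end is the exclusive logical index at which a run stops, so run lengths [3, 2, 4] become run ends [3, 5, 9]. The function names here are hypothetical and are not part of any Arrow library. Because run ends are sorted, a binary search finds the run containing a given index in O(log(N)), whereas plain run lengths would need a linear scan.

```python
from bisect import bisect_right


def run_lengths_to_run_ends(lengths):
    """Turn per-run lengths into cumulative (exclusive) run ends."""
    ends, total = [], 0
    for n in lengths:
        total += n
        ends.append(total)
    return ends


def lookup(run_ends, values, i):
    """Return the logical element at index i using binary search.

    Hypothetical helper: run_ends[k] is assumed to be the exclusive
    end of run k, and values[k] is the value repeated in that run.
    """
    length = run_ends[-1] if run_ends else 0
    if not 0 <= i < length:
        raise IndexError(i)
    # The first run whose end is strictly greater than i contains i.
    return values[bisect_right(run_ends, i)]


run_ends = run_lengths_to_run_ends([3, 2, 4])
decoded = [lookup(run_ends, ["a", "b", "c"], i) for i in range(9)]
```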


The most compelling counter-argument I've seen for new types is to
avoid a schism where some implementations do not support the newer
types.  However, for the type proposed here I think the risk is low
because data can be losslessly converted to existing formats for
compatibility with any system that doesn't support the type.
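
As a rough illustration of that lossless conversion, here is a minimal sketch that expands a run-end encoded pair into a plain list and re-encodes it. The names `decode` and `encode` are hypothetical, plain Python lists stand in for Arrow buffers, and run ends are assumed to be exclusive cumulative indices. It is not the proposed C++ or Go implementation.

```python
def decode(run_ends, values):
    """Expand (run_ends, values) into a plain list of logical elements."""
    out, start = [], 0
    for end, value in zip(run_ends, values):
        out.extend([value] * (end - start))  # repeat value for the run
        start = end
    return out


def encode(plain):
    """Collapse consecutive equal elements into (run_ends, values).

    Sketch only: equality is Python's ==, so for example 1 and 1.0
    would be merged into one run; a typed array would not have this issue.
    """
    run_ends, values = [], []
    for i, value in enumerate(plain):
        if values and values[-1] == value:
            run_ends[-1] = i + 1  # extend the current run
        else:
            values.append(value)  # start a new run
            run_ends.append(i + 1)
    return run_ends, values


plain = decode([4, 6, 8], [1, None, 2])
round_trip = encode(plain)
```

A system without support for the new type would only ever see the output of something like `decode`.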

Another argument I've seen is that we should introduce a more formal
distinction between "layouts" and "types" (with dictionary and
run-end-encoding being layouts).  However, this seems like an
impractical change at this point.  In addition, given that we already
have dictionary as an array type, the cat is already out of the bag.
Furthermore, systems and implementations are still welcome to make
this distinction themselves.  The spec only needs to specify what the
buffer layouts should be.  If a particular library chooses to group
those layouts into two different categories I think that would still
be feasible.

-Weston

On Fri, Dec 16, 2022 at 1:42 PM Andrew Lamb <[email protected]> wrote:
>
> +1 on the proposal as written
>
> I think it makes sense and offers exciting opportunities for faster
> computation (especially for cases where Parquet files can be decoded
> directly into such an array without unpacking; RLE-encoded dictionaries
> are quite compelling).
>
> I would prefer to use the term Run-End-Encoding (which would also follow
> the naming of the internal fields) but I don't view that as a deal blocker.
>
> Thank you for all your work in this matter,
> Andrew
>
> On Wed, Dec 14, 2022 at 5:08 PM Matt Topol <[email protected]> wrote:
>
> > I'm not at all opposed to renaming it as `Run-End-Encoding` if that would
> > be preferable. Hopefully others will chime in with their feedback.
> >
> > --Matt
> >
> > On Wed, Dec 14, 2022 at 12:09 PM Ian Cook <[email protected]> wrote:
> >
> > > Thank you Matt, Tobias, and others for the great work on this.
> > >
> > > I am -0.5 on this proposal in its current form because (pardon the
> > > pedantry) what we have implemented here is not run-length encoding; it
> > > is run-end encoding. Based on community input, the choice was made to
> > > store run ends instead of run lengths because this enables O(log(N))
> > > random access as opposed to O(N). This is a sensible choice, but it
> > > comes with some trade-offs including limitations in array length
> > > > (which may not really be a problem in practice) and lack of bit-for-bit
> > > equivalence with RLE encodings that use run lengths like Velox's
> > > SequenceVector encoding (which I think is a more serious problem in
> > > practice).
> > >
> > > I believe that we should either:
> > > (a) rename this to "run-end encoding"
> > > (b) change this to a parameterized type called "run encoding" that
> > > takes a Boolean parameter specifying whether run lengths or run ends
> > > are stored.
> > >
> > > Ian
> > >
> > > On Wed, Dec 14, 2022 at 11:27 AM Matt Topol <[email protected]>
> > > wrote:
> > > >
> > > > Hello,
> > > >
> > > > > I'd like to propose adding the RLE type based on earlier discussions[1][2]
> > > > > to the Arrow format:
> > > > > - Columnar Format description:
> > > > >   https://github.com/apache/arrow/pull/13333/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60
> > > > > - Flatbuffers changes:
> > > > >   https://github.com/apache/arrow/pull/14176/files#diff-e54b4f5d2d279acc5d1df5df9a7636f0142a8041fe02f07034e0d8be48444b07
> > > >
> > > > > There is a proposed implementation available in both C++ (written by Tobias
> > > > > Zagorni) and Go[3][4]. Both implementations have mostly the same tests
> > > > > implemented and were tested to be compatible over IPC with an archery test.
> > > > > In both cases, the implementations are split out among several Draft PRs so
> > > > > that they can be easily reviewed piecemeal if the vote is approved, with
> > > > > each Draft PR including the changes of the one before it. The links
> > > > > provided are the Draft PRs with the entirety of the changes included.
> > > >
> > > > The vote will be open for at least 72 hours.
> > > >
> > > > [ ] +1 add the proposed RLE type to the Apache Arrow format
> > > > [ ] -1 do not add the proposed RLE type to the Apache Arrow format
> > > > because...
> > > >
> > > > > Thanks much, and please let me know if any more information or links are
> > > > > needed (I've never proposed a vote before on here!)
> > > >
> > > > --Matt
> > > >
> > > > [1] https://lists.apache.org/thread/bfz3m5nyf7flq7n6q9b1bx3jhcn4wq29
> > > > [2] https://lists.apache.org/thread/xb7c723csrtwt0md3m4p56bt0193n7jq
> > > > [3] https://github.com/apache/arrow/pull/14179
> > > > [4] https://github.com/apache/arrow/pull/14223
> > >
> >
