Re: PRs for RLE support

Weston Pace Wed, 14 Sep 2022 11:18:49 -0700

I will clarify the offset problem.  It essentially boils down to "if
you don't have constant access to elements then an array length offset
does not give you constant access to buffer offsets".


We start with an RLE<int64> array of length 200.  We slice it with
(start=10, length=100) to get an RLE<int64> array of length 100 and an
offset of 10.

Now we want to write an IPC file (or access the values for whatever
reason).  The values buffer has 400 bytes and the run ends buffer has
200 bytes (these numbers could be anything less than 1600/800 so I'm
picking these at random).  We need to copy a portion of the "run ends"
buffer into the file.  What bytes are these?  The only way to tell
would be to do a binary search on the 200 bytes run ends buffer.

On the other hand, if there were two child arrays then an
implementation, when slicing, could choose to always keep the offset
of the parent array at 0 and instead put the offsets in the child
arrays.  Now you have a parent array with offset 0, a run ends (int32)
array with offset 74 and length 5 and a values (int64) array with
offset 74 and length 5.  We can clearly say that we want to grab bytes
296-316 from the run ends buffer and bytes 592-632 from the values
buffer.  Of course, other implementations would always be free to use
offsets in the parent array.  So I think the log(N) approach would
still be needed as a fallback.

Other options:

 * Just eat the log(n) cost, it's not that expensive and any
application doing excessive slicing could cache the offsets
themselves.
 * Add an optional buffer offset to the spec that can be populated in
cases where random array access is not possible.

On Wed, Sep 14, 2022 at 10:53 AM Dewey Dunnington
<[email protected]> wrote:
>
> >  * Should we encode "run lengths" or "run ends"?
>
> In addition to the points mentioned above, this seems the most consistent
> with the variable-length binary/list layouts
>
> > encoding the run ends as a buffer (similar to list array for example)
> makes it difficult to calculate offsets
>
> I don't have a strong opinion about this, but I also don't understand the
> logic. Surely the implementation is just generating/reading a buffer of
> integers and there's some overhead/indirection to wrapping it in an Array
> (that must then be validated).
>
> As a matter of curiosity, was a dictionary approach ever considered? If one
> new layout was added (one buffer containing the run ends of a RLE 0:N int32
> array), the dictionary member could be the values array and perhaps make it
> easier for implementations that already handle dictionaries.
>
>
>
>
>
> On Wed, Sep 14, 2022 at 2:04 PM Matthew Topol <[email protected]>
> wrote:
>
> > Just wanted to chime in here that I also have several draft PRs for
> > implementing the RLE arrays in Go as the second implementation (since
> > we use two implementations as a requirement to vote on
> > changes/additions to the format).
> >
> > They can be found here:
> >
> > <https://github.com/apache/arrow/pull/14111>
> > <https://github.com/apache/arrow/pull/14114>
> > <https://github.com/apache/arrow/pull/14126>
> >
> > --Matt
> >
> > On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield
> > <[email protected]> wrote:
> > >>
> > >>   * Should we encode "run lengths" or "run ends"?
> > >
> > >
> > > I think the project has leaned towards sublinear access, so run ends
> > > make
> > > sense.  The downside is that we run into similar issues with
> > > List/LargeList
> > > where the total number of elements is limited by bit-width (which can
> > > also
> > > cause space wastage, e.g. with run ends it might be reasonable to
> > > limit
> > > bit-width to 16).
> > >
> > > The values are definitely a child array.  However, encoding the run
> > >>  ends as a buffer (similar to list array for example) makes it
> > >>  difficult to calculate offsets.  Translating an array offset to a
> > >>  buffer offset takes O(log(N)) time.  If the run ends are encoded as
> > >> a
> > >>  child array (so the RLE array has no buffers and two child arrays)
> > >>  then this problem goes away.
> > >
> > >
> > > I'm not sure I understand this, could you provide an example of the
> > > problem
> > > that the child array solves?
> > >
> > >
> > >
> > >
> > > On Wed, Sep 14, 2022 at 9:36 AM Weston Pace <[email protected]
> > > <mailto:[email protected]>> wrote:
> > >
> > >>  I'm going to bump this because it would be good to get feedback.  In
> > >>  particular it would be nice to get feedback on the suggested format
> > >>  change[1].  We are currently moving forward on coming up with an IPC
> > >>  format proposal which we will share when ready.
> > >>
> > >>  The two interesting points that jump out to me are:
> > >>
> > >>   * Should we encode "run lengths" or "run ends"?
> > >>
> > >>  For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run lengths"
> > >> 3,
> > >>  2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is preferred
> > >>  as that leads to O(log(N)) random access (instead of O(N)) and it's
> > >>  not clear there are any disadvantages.
> > >>
> > >>   * Should the run ends be encoded as a buffer or a child array?
> > >>
> > >>  The values are definitely a child array.  However, encoding the run
> > >>  ends as a buffer (similar to list array for example) makes it
> > >>  difficult to calculate offsets.  Translating an array offset to a
> > >>  buffer offset takes O(log(N)) time.  If the run ends are encoded as
> > >> a
> > >>  child array (so the RLE array has no buffers and two child arrays)
> > >>  then this problem goes away.
> > >>
> > >>  [1] <https://github.com/apache/arrow/pull/13333/files>
> > >>
> > >>  On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
> > >>  <[email protected] <mailto:[email protected]>>
> > >> wrote:
> > >>  >
> > >>  > Hello Everyone,
> > >>  >
> > >>  > Recently, I have implemented support for run-length encoding in
> > >> Arrow
> > >>  > C++. So far my implementation is split into different subtasks of
> > >>  > ARROW-16771 (<https://issues.apache.org/jira/browse/ARROW-16771>).
> > >>  >
> > >>  > I have (draft) PRs available for:
> > >>  > - general handling of RLE in arrow C++, Type, Arrow, Builder
> > >>  > subclasses, etc.
> > >>  >   (subtasks 1-9)
> > >>  > - encode, decode kernels (fixed size only):
> > >>  >   (<https://issues.apache.org/jira/browse/ARROW-16772>)
> > >>  > - filter kernel (fixed size only):
> > >>  >   (<https://issues.apache.org/jira/browse/ARROW-16774>)
> > >>  > - simple benchmark for the RLE kernels
> > >>  >   (<https://issues.apache.org/jira/browse/ARROW-17026>)
> > >>  > - adding RLE to Arrow Columnar format document
> > >>  >   (<https://issues.apache.org/jira/browse/ARROW-16773>)
> > >>  >
> > >>  > What is not yet implemented:
> > >>  > - converting RLE to formats like Parquet, JSON, IPC.
> > >>  > - caching of physical offsets when working with sliced arrays -
> > >> finding
> > >>  > these offsets is an  O(log(n)) binary search which could be
> > >> avoided in
> > >>  > a lot of cases
> > >>  >
> > >>  > I'm interested in any feedback on the code and I'm wondering what
> > >> would
> > >>  > be the best way to get this merged.
> > >>  >
> > >>  > A lot of the PRs depend on earlier ones. I ordered the subtasks
> > >> in a
> > >>  > way they could be merged. The first would be:
> > >>  > 1. Handling of array-only types using VisitTypeInline:
> > >>  >    <https://issues.apache.org/jira/browse/ARROW-17258>
> > >>  > 2. Adding RLE type / array class (only builds on #1):
> > >>  >    <https://issues.apache.org/jira/browse/ARROW-17261>
> > >>  > -  also, since it has no dependencies: adding RLE to Arrow
> > >> Columnar
> > >>  > format document
> > >>  >    <https://issues.apache.org/jira/browse/ARROW-16773>
> > >>  >
> > >>  > Best,
> > >>  > Tobias
> > >>
> >
> >

Re: PRs for RLE support

Reply via email to