Thanks for the clarification! The (probably very common) slicing case makes
a lot of sense.

On Wed, Sep 14, 2022 at 3:19 PM Weston Pace <weston.p...@gmail.com> wrote:

> I will clarify the offset problem.  It essentially boils down to "if
> you don't have constant access to elements then an array length offset
> does not give you constant access to buffer offsets".
>
> We start with an RLE<int64> array of length 200.  We slice it with
> (start=10, length=100) to get an RLE<int64> array of length 100 and an
> offset of 10.
>
> Now we want to write an IPC file (or access the values for whatever
> reason).  The values buffer has 400 bytes and the run ends buffer has
> 200 bytes (these numbers could be anything less than 1600/800 so I'm
> picking these at random).  We need to copy a portion of the "run ends"
> buffer into the file.  What bytes are these?  The only way to tell
> would be to do a binary search on the 200 bytes run ends buffer.
>
> On the other hand, if there were two child arrays then an
> implementation, when slicing, could choose to always keep the offset
> of the parent array at 0 and instead put the offsets in the child
> arrays.  Now you have a parent array with offset 0, a run ends (int32)
> array with offset 74 and length 5 and a values (int64) array with
> offset 74 and length 5.  We can clearly say that we want to grab bytes
> 296-316 from the run ends buffer and bytes 592-632 from the values
> buffer.  Of course, other implementations would always be free to use
> offsets in the parent array.  So I think the log(N) approach would
> still be needed as a fallback.
>
> Other options:
>
>  * Just eat the log(n) cost, it's not that expensive and any
> application doing excessive slicing could cache the offsets
> themselves.
>  * Add an optional buffer offset to the spec that can be populated in
> cases where random array access is not possible.
>
> On Wed, Sep 14, 2022 at 10:53 AM Dewey Dunnington
> <de...@voltrondata.com.invalid> wrote:
> >
> > >  * Should we encode "run lengths" or "run ends"?
> >
> > In addition to the points mentioned above, this seems the most consistent
> > with the variable-length binary/list layouts
> >
> > > encoding the run ends as a buffer (similar to list array for example)
> > makes it difficult to calculate offsets
> >
> > I don't have a strong opinion about this, but I also don't understand the
> > logic. Surely the implementation is just generating/reading a buffer of
> > integers and there's some overhead/indirection to wrapping it in an Array
> > (that must then be validated).
> >
> > As a matter of curiosity, was a dictionary approach ever considered? If
> one
> > new layout was added (one buffer containing the run ends of a RLE 0:N
> int32
> > array), the dictionary member could be the values array and perhaps make
> it
> > easier for implementations that already handle dictionaries.
> >
> >
> >
> >
> >
> > On Wed, Sep 14, 2022 at 2:04 PM Matthew Topol
> <m...@voltrondata.com.invalid>
> > wrote:
> >
> > > Just wanted to chime in here that I also have several draft PRs for
> > > implementing the RLE arrays in Go as the second implementation (since
> > > we use two implementations as a requirement to vote on
> > > changes/additions to the format).
> > >
> > > They can be found here:
> > >
> > > <https://github.com/apache/arrow/pull/14111>
> > > <https://github.com/apache/arrow/pull/14114>
> > > <https://github.com/apache/arrow/pull/14126>
> > >
> > > --Matt
> > >
> > > On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield
> > > <emkornfi...@gmail.com> wrote:
> > > >>
> > > >>   * Should we encode "run lengths" or "run ends"?
> > > >
> > > >
> > > > I think the project has leaned towards sublinear access, so run ends
> > > > make
> > > > sense.  The downside is that we run into similar issues with
> > > > List/LargeList
> > > > where the total number of elements is limited by bit-width (which can
> > > > also
> > > > cause space wastage, e.g. with run ends it might be reasonable to
> > > > limit
> > > > bit-width to 16).
> > > >
> > > > The values are definitely a child array.  However, encoding the run
> > > >>  ends as a buffer (similar to list array for example) makes it
> > > >>  difficult to calculate offsets.  Translating an array offset to a
> > > >>  buffer offset takes O(log(N)) time.  If the run ends are encoded as
> > > >> a
> > > >>  child array (so the RLE array has no buffers and two child arrays)
> > > >>  then this problem goes away.
> > > >
> > > >
> > > > I'm not sure I understand this, could you provide an example of the
> > > > problem
> > > > that the child array solves?
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Sep 14, 2022 at 9:36 AM Weston Pace <weston.p...@gmail.com
> > > > <mailto:weston.p...@gmail.com>> wrote:
> > > >
> > > >>  I'm going to bump this because it would be good to get feedback.
> In
> > > >>  particular it would be nice to get feedback on the suggested format
> > > >>  change[1].  We are currently moving forward on coming up with an
> IPC
> > > >>  format proposal which we will share when ready.
> > > >>
> > > >>  The two interesting points that jump out to me are:
> > > >>
> > > >>   * Should we encode "run lengths" or "run ends"?
> > > >>
> > > >>  For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run lengths"
> > > >> 3,
> > > >>  2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is
> preferred
> > > >>  as that leads to O(log(N)) random access (instead of O(N)) and it's
> > > >>  not clear there are any disadvantages.
> > > >>
> > > >>   * Should the run ends be encoded as a buffer or a child array?
> > > >>
> > > >>  The values are definitely a child array.  However, encoding the run
> > > >>  ends as a buffer (similar to list array for example) makes it
> > > >>  difficult to calculate offsets.  Translating an array offset to a
> > > >>  buffer offset takes O(log(N)) time.  If the run ends are encoded as
> > > >> a
> > > >>  child array (so the RLE array has no buffers and two child arrays)
> > > >>  then this problem goes away.
> > > >>
> > > >>  [1] <https://github.com/apache/arrow/pull/13333/files>
> > > >>
> > > >>  On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
> > > >>  <tob...@zagorni.eu.invalid <mailto:tob...@zagorni.eu.invalid>>
> > > >> wrote:
> > > >>  >
> > > >>  > Hello Everyone,
> > > >>  >
> > > >>  > Recently, I have implemented support for run-length encoding in
> > > >> Arrow
> > > >>  > C++. So far my implementation is split into different subtasks of
> > > >>  > ARROW-16771 (<https://issues.apache.org/jira/browse/ARROW-16771
> >).
> > > >>  >
> > > >>  > I have (draft) PRs available for:
> > > >>  > - general handling of RLE in arrow C++, Type, Arrow, Builder
> > > >>  > subclasses, etc.
> > > >>  >   (subtasks 1-9)
> > > >>  > - encode, decode kernels (fixed size only):
> > > >>  >   (<https://issues.apache.org/jira/browse/ARROW-16772>)
> > > >>  > - filter kernel (fixed size only):
> > > >>  >   (<https://issues.apache.org/jira/browse/ARROW-16774>)
> > > >>  > - simple benchmark for the RLE kernels
> > > >>  >   (<https://issues.apache.org/jira/browse/ARROW-17026>)
> > > >>  > - adding RLE to Arrow Columnar format document
> > > >>  >   (<https://issues.apache.org/jira/browse/ARROW-16773>)
> > > >>  >
> > > >>  > What is not yet implemented:
> > > >>  > - converting RLE to formats like Parquet, JSON, IPC.
> > > >>  > - caching of physical offsets when working with sliced arrays -
> > > >> finding
> > > >>  > these offsets is an  O(log(n)) binary search which could be
> > > >> avoided in
> > > >>  > a lot of cases
> > > >>  >
> > > >>  > I'm interested in any feedback on the code and I'm wondering what
> > > >> would
> > > >>  > be the best way to get this merged.
> > > >>  >
> > > >>  > A lot of the PRs depend on earlier ones. I ordered the subtasks
> > > >> in a
> > > >>  > way they could be merged. The first would be:
> > > >>  > 1. Handling of array-only types using VisitTypeInline:
> > > >>  >    <https://issues.apache.org/jira/browse/ARROW-17258>
> > > >>  > 2. Adding RLE type / array class (only builds on #1):
> > > >>  >    <https://issues.apache.org/jira/browse/ARROW-17261>
> > > >>  > -  also, since it has no dependencies: adding RLE to Arrow
> > > >> Columnar
> > > >>  > format document
> > > >>  >    <https://issues.apache.org/jira/browse/ARROW-16773>
> > > >>  >
> > > >>  > Best,
> > > >>  > Tobias
> > > >>
> > >
> > >
>

Reply via email to