Re: PRs for RLE support

Tobias Zagorni Wed, 14 Sep 2022 13:44:39 -0700

Am Mittwoch, dem 14.09.2022 um 14:33 -0400 schrieb Matthew Topol:
> Doesn't this explanation conflate the Logical Offset (the parent's 
> offset) and the Physical Offset (the offset of the run ends / values 
> children)? Or am I missing something? If you have an RLE<int64> array
> of length 200, and you slice it with {start: 10, length: 100}, I
> don't 
> understand how you can represent that while keeping the parent's
> offset 
> as "0" because the offset of "10" in the slice could represent
> anywhere 
> from 0 - 10 physical values you have to offset by. You'd still keep
> an 
> offset of 10 in the parent, and potentially cache the *physical*
> offset 
> too, right?


I could see this working if we require that the whole Buffer of the run
ends array (and not just the Array) holds valid run end values. We
could look into run_ends_array[-1] to find the logical offset, and
assume 0 if run_ends_array.offset is 0. This actually seems like an
interesting property to me, even if it makes logic more complex because
another offset to consider. 

What I like about the custom Slicing approach is that it works without
a cache and for that reason also across library boundaries without
losing cache information.

With the current format, adding an efficient Slice function with
physical length inputs would still be possible, but rely on the cache
to store the phyiscal start and end of the slice.

> The benefit I see for having the child arrays over the run-ends being
> a 
> buffer in the RLE array directly is that you can easily tell how
> large 
> the buffers *should* be as the length and offset of the run ends 
> (int32) array and the values array will always represent *physical* 
> values while the parent's offset and length can represent the
> *logical* 
> values of the RLE. If the run ends were a buffer in the RLE array 
> directly, it would be much more difficult to maintain the separation
> of 
> the "physical" and "logical" offset/lengths.
> 
> Does that make sense?
> 
> --Matt

This makes perfect sense to me. The primary reason for me was that
Buffers do not have a length field in the Arrow C interface so thier
length is not really known, while in Arrow C++ they have a size. For
all existing data types this can be easiliy resolved from either the
length field or offsets buffer. For RLE the the run ends buffer is
physical size of the array (or larger) which cannot be easily
determined without iterating over the whole buffer.

But we need a valid buffer size, so we can resolve logical to physical
offsets using binary search. Also I'm not sure if it is required to be
known for each buffer in Arrow C++ anyways.

- Tobias

> 
> On Wed, Sep 14 2022 at 11:18:27 AM -0700, Weston Pace 
> <[email protected]> wrote:
> > I will clarify the offset problem.  It essentially boils down to
> > "if
> > you don't have constant access to elements then an array length
> > offset
> > does not give you constant access to buffer offsets".
> > 
> > We start with an RLE<int64> array of length 200.  We slice it with
> > (start=10, length=100) to get an RLE<int64> array of length 100 and
> > an
> > offset of 10.
> > 
> > Now we want to write an IPC file (or access the values for whatever
> > reason).  The values buffer has 400 bytes and the run ends buffer
> > has
> > 200 bytes (these numbers could be anything less than 1600/800 so
> > I'm
> > picking these at random).  We need to copy a portion of the "run
> > ends"
> > buffer into the file.  What bytes are these?  The only way to tell
> > would be to do a binary search on the 200 bytes run ends buffer.
> > 
> > On the other hand, if there were two child arrays then an
> > implementation, when slicing, could choose to always keep the
> > offset
> > of the parent array at 0 and instead put the offsets in the child
> > arrays.  Now you have a parent array with offset 0, a run ends
> > (int32)
> > array with offset 74 and length 5 and a values (int64) array with
> > offset 74 and length 5.  We can clearly say that we want to grab
> > bytes
> > 296-316 from the run ends buffer and bytes 592-632 from the values
> > buffer.  Of course, other implementations would always be free to
> > use
> > offsets in the parent array.  So I think the log(N) approach would
> > still be needed as a fallback.
> > 
> > Other options:
> > 
> >  * Just eat the log(n) cost, it's not that expensive and any
> > application doing excessive slicing could cache the offsets
> > themselves.
> >  * Add an optional buffer offset to the spec that can be populated
> > in
> > cases where random array access is not possible.
> > 
> > On Wed, Sep 14, 2022 at 10:53 AM Dewey Dunnington
> > <[email protected] 
> > <mailto:[email protected]>> wrote:
> > > 
> > >  >  * Should we encode "run lengths" or "run ends"?
> > > 
> > >  In addition to the points mentioned above, this seems the most 
> > > consistent
> > >  with the variable-length binary/list layouts
> > > 
> > >  > encoding the run ends as a buffer (similar to list array for 
> > > example)
> > >  makes it difficult to calculate offsets
> > > 
> > >  I don't have a strong opinion about this, but I also don't 
> > > understand the
> > >  logic. Surely the implementation is just generating/reading a 
> > > buffer of
> > >  integers and there's some overhead/indirection to wrapping it in
> > > an 
> > > Array
> > >  (that must then be validated).
> > > 
> > >  As a matter of curiosity, was a dictionary approach ever 
> > > considered? If one
> > >  new layout was added (one buffer containing the run ends of a
> > > RLE 
> > > 0:N int32
> > >  array), the dictionary member could be the values array and
> > > perhaps 
> > > make it
> > >  easier for implementations that already handle dictionaries.
> > > 
> > > 
> > > 
> > > 
> > > 
> > >  On Wed, Sep 14, 2022 at 2:04 PM Matthew Topol 
> > > <
> > > [email protected] <mailto:[email protected]>
> > > >
> > >  wrote:
> > > 
> > >  > Just wanted to chime in here that I also have several draft
> > > PRs 
> > > for
> > >  > implementing the RLE arrays in Go as the second implementation
> > > (since
> > >  > we use two implementations as a requirement to vote on
> > >  > changes/additions to the format).
> > >  >
> > >  > They can be found here:
> > >  >
> > >  > <<https://github.com/apache/arrow/pull/14111>>
> > >  > <<https://github.com/apache/arrow/pull/14114>>
> > >  > <<https://github.com/apache/arrow/pull/14126>>
> > >  >
> > >  > --Matt
> > >  >
> > >  > On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield
> > >  > <[email protected] <mailto:[email protected]>> wrote:
> > >  > >>
> > >  > >>   * Should we encode "run lengths" or "run ends"?
> > >  > >
> > >  > >
> > >  > > I think the project has leaned towards sublinear access, so
> > > run 
> > > ends
> > >  > > make
> > >  > > sense.  The downside is that we run into similar issues with
> > >  > > List/LargeList
> > >  > > where the total number of elements is limited by bit-width 
> > > (which can
> > >  > > also
> > >  > > cause space wastage, e.g. with run ends it might be
> > > reasonable 
> > > to
> > >  > > limit
> > >  > > bit-width to 16).
> > >  > >
> > >  > > The values are definitely a child array.  However, encoding
> > > the 
> > > run
> > >  > >>  ends as a buffer (similar to list array for example) makes
> > > it
> > >  > >>  difficult to calculate offsets.  Translating an array
> > > offset 
> > > to a
> > >  > >>  buffer offset takes O(log(N)) time.  If the run ends are 
> > > encoded as
> > >  > >> a
> > >  > >>  child array (so the RLE array has no buffers and two child
> > > arrays)
> > >  > >>  then this problem goes away.
> > >  > >
> > >  > >
> > >  > > I'm not sure I understand this, could you provide an example
> > > of 
> > > the
> > >  > > problem
> > >  > > that the child array solves?
> > >  > >
> > >  > >
> > >  > >
> > >  > >
> > >  > > On Wed, Sep 14, 2022 at 9:36 AM Weston Pace 
> > > <[email protected] <mailto:[email protected]>
> > >  > > <<mailto:[email protected]>>> wrote:
> > >  > >
> > >  > >>  I'm going to bump this because it would be good to get 
> > > feedback.  In
> > >  > >>  particular it would be nice to get feedback on the
> > > suggested 
> > > format
> > >  > >>  change[1].  We are currently moving forward on coming up
> > > with 
> > > an IPC
> > >  > >>  format proposal which we will share when ready.
> > >  > >>
> > >  > >>  The two interesting points that jump out to me are:
> > >  > >>
> > >  > >>   * Should we encode "run lengths" or "run ends"?
> > >  > >>
> > >  > >>  For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run
> > > lengths"
> > >  > >> 3,
> > >  > >>  2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is
> > > preferred
> > >  > >>  as that leads to O(log(N)) random access (instead of O(N))
> > > and it's
> > >  > >>  not clear there are any disadvantages.
> > >  > >>
> > >  > >>   * Should the run ends be encoded as a buffer or a child 
> > > array?
> > >  > >>
> > >  > >>  The values are definitely a child array.  However,
> > > encoding 
> > > the run
> > >  > >>  ends as a buffer (similar to list array for example) makes
> > > it
> > >  > >>  difficult to calculate offsets.  Translating an array
> > > offset 
> > > to a
> > >  > >>  buffer offset takes O(log(N)) time.  If the run ends are 
> > > encoded as
> > >  > >> a
> > >  > >>  child array (so the RLE array has no buffers and two child
> > > arrays)
> > >  > >>  then this problem goes away.
> > >  > >>
> > >  > >>  [1] <<https://github.com/apache/arrow/pull/13333/files>>
> > >  > >>
> > >  > >>  On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
> > >  > >> 
> > > <[email protected] <mailto:[email protected]> 
> > > <<mailto:[email protected]>>>
> > >  > >> wrote:
> > >  > >>  >
> > >  > >>  > Hello Everyone,
> > >  > >>  >
> > >  > >>  > Recently, I have implemented support for run-length 
> > > encoding in
> > >  > >> Arrow
> > >  > >>  > C++. So far my implementation is split into different 
> > > subtasks of
> > >  > >>  > ARROW-16771 
> > > (<<https://issues.apache.org/jira/browse/ARROW-16771>>).
> > >  > >>  >
> > >  > >>  > I have (draft) PRs available for:
> > >  > >>  > - general handling of RLE in arrow C++, Type, Arrow,
> > > Builder
> > >  > >>  > subclasses, etc.
> > >  > >>  >   (subtasks 1-9)
> > >  > >>  > - encode, decode kernels (fixed size only):
> > >  > >>  >  
> > > (<<https://issues.apache.org/jira/browse/ARROW-16772>>)
> > >  > >>  > - filter kernel (fixed size only):
> > >  > >>  >  
> > > (<<https://issues.apache.org/jira/browse/ARROW-16774>>)
> > >  > >>  > - simple benchmark for the RLE kernels
> > >  > >>  >  
> > > (<<https://issues.apache.org/jira/browse/ARROW-17026>>)
> > >  > >>  > - adding RLE to Arrow Columnar format document
> > >  > >>  >  
> > > (<<https://issues.apache.org/jira/browse/ARROW-16773>>)
> > >  > >>  >
> > >  > >>  > What is not yet implemented:
> > >  > >>  > - converting RLE to formats like Parquet, JSON, IPC.
> > >  > >>  > - caching of physical offsets when working with sliced 
> > > arrays -
> > >  > >> finding
> > >  > >>  > these offsets is an  O(log(n)) binary search which could
> > > be
> > >  > >> avoided in
> > >  > >>  > a lot of cases
> > >  > >>  >
> > >  > >>  > I'm interested in any feedback on the code and I'm 
> > > wondering what
> > >  > >> would
> > >  > >>  > be the best way to get this merged.
> > >  > >>  >
> > >  > >>  > A lot of the PRs depend on earlier ones. I ordered the 
> > > subtasks
> > >  > >> in a
> > >  > >>  > way they could be merged. The first would be:
> > >  > >>  > 1. Handling of array-only types using VisitTypeInline:
> > >  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-17258>>
> > >  > >>  > 2. Adding RLE type / array class (only builds on #1):
> > >  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-17261>>
> > >  > >>  > -  also, since it has no dependencies: adding RLE to
> > > Arrow
> > >  > >> Columnar
> > >  > >>  > format document
> > >  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-16773>>
> > >  > >>  >
> > >  > >>  > Best,
> > >  > >>  > Tobias
> > >  > >>
> > >  >
> > >  >
>

Re: PRs for RLE support

Reply via email to