Re: PRs for RLE support

Weston Pace Wed, 14 Sep 2022 23:32:59 -0700

> The downside is that we run into similar issues with List/LargeList
> where the total number of elements is limited by bit-width (which can also
> cause space wastage, e.g. with run ends it might be reasonable to limit
> bit-width to 16).


True.  I guess another question would be whether we want to parameterize
the type for run ends.  The 32-bit/64-bit case isn't terribly interesting
to me personally but I suppose we have LargeList for a reason.  The
16-bit/32-bit case is more compelling I think.

On Wed, Sep 14, 2022 at 1:44 PM Tobias Zagorni <tob...@zagorni.eu.invalid>
wrote:

> Am Mittwoch, dem 14.09.2022 um 14:33 -0400 schrieb Matthew Topol:
> > Doesn't this explanation conflate the Logical Offset (the parent's
> > offset) and the Physical Offset (the offset of the run ends / values
> > children)? Or am I missing something? If you have an RLE<int64> array
> > of length 200, and you slice it with {start: 10, length: 100}, I
> > don't
> > understand how you can represent that while keeping the parent's
> > offset
> > as "0" because the offset of "10" in the slice could represent
> > anywhere
> > from 0 - 10 physical values you have to offset by. You'd still keep
> > an
> > offset of 10 in the parent, and potentially cache the *physical*
> > offset
> > too, right?
>
> I could see this working if we require that the whole Buffer of the run
> ends array (and not just the Array) holds valid run end values. We
> could look into run_ends_array[-1] to find the logical offset, and
> assume 0 if run_ends_array.offset is 0. This actually seems like an
> interesting property to me, even if it makes logic more complex because
> another offset to consider.
>
> What I like about the custom Slicing approach is that it works without
> a cache and for that reason also across library boundaries without
> losing cache information.
>
> With the current format, adding an efficient Slice function with
> physical length inputs would still be possible, but rely on the cache
> to store the phyiscal start and end of the slice.
>
> > The benefit I see for having the child arrays over the run-ends being
> > a
> > buffer in the RLE array directly is that you can easily tell how
> > large
> > the buffers *should* be as the length and offset of the run ends
> > (int32) array and the values array will always represent *physical*
> > values while the parent's offset and length can represent the
> > *logical*
> > values of the RLE. If the run ends were a buffer in the RLE array
> > directly, it would be much more difficult to maintain the separation
> > of
> > the "physical" and "logical" offset/lengths.
> >
> > Does that make sense?
> >
> > --Matt
>
> This makes perfect sense to me. The primary reason for me was that
> Buffers do not have a length field in the Arrow C interface so thier
> length is not really known, while in Arrow C++ they have a size. For
> all existing data types this can be easiliy resolved from either the
> length field or offsets buffer. For RLE the the run ends buffer is
> physical size of the array (or larger) which cannot be easily
> determined without iterating over the whole buffer.
>
> But we need a valid buffer size, so we can resolve logical to physical
> offsets using binary search. Also I'm not sure if it is required to be
> known for each buffer in Arrow C++ anyways.
>
> - Tobias
>
> >
> > On Wed, Sep 14 2022 at 11:18:27 AM -0700, Weston Pace
> > <weston.p...@gmail.com> wrote:
> > > I will clarify the offset problem.  It essentially boils down to
> > > "if
> > > you don't have constant access to elements then an array length
> > > offset
> > > does not give you constant access to buffer offsets".
> > >
> > > We start with an RLE<int64> array of length 200.  We slice it with
> > > (start=10, length=100) to get an RLE<int64> array of length 100 and
> > > an
> > > offset of 10.
> > >
> > > Now we want to write an IPC file (or access the values for whatever
> > > reason).  The values buffer has 400 bytes and the run ends buffer
> > > has
> > > 200 bytes (these numbers could be anything less than 1600/800 so
> > > I'm
> > > picking these at random).  We need to copy a portion of the "run
> > > ends"
> > > buffer into the file.  What bytes are these?  The only way to tell
> > > would be to do a binary search on the 200 bytes run ends buffer.
> > >
> > > On the other hand, if there were two child arrays then an
> > > implementation, when slicing, could choose to always keep the
> > > offset
> > > of the parent array at 0 and instead put the offsets in the child
> > > arrays.  Now you have a parent array with offset 0, a run ends
> > > (int32)
> > > array with offset 74 and length 5 and a values (int64) array with
> > > offset 74 and length 5.  We can clearly say that we want to grab
> > > bytes
> > > 296-316 from the run ends buffer and bytes 592-632 from the values
> > > buffer.  Of course, other implementations would always be free to
> > > use
> > > offsets in the parent array.  So I think the log(N) approach would
> > > still be needed as a fallback.
> > >
> > > Other options:
> > >
> > >  * Just eat the log(n) cost, it's not that expensive and any
> > > application doing excessive slicing could cache the offsets
> > > themselves.
> > >  * Add an optional buffer offset to the spec that can be populated
> > > in
> > > cases where random array access is not possible.
> > >
> > > On Wed, Sep 14, 2022 at 10:53 AM Dewey Dunnington
> > > <de...@voltrondata.com.invalid
> > > <mailto:de...@voltrondata.com.invalid>> wrote:
> > > >
> > > >  >  * Should we encode "run lengths" or "run ends"?
> > > >
> > > >  In addition to the points mentioned above, this seems the most
> > > > consistent
> > > >  with the variable-length binary/list layouts
> > > >
> > > >  > encoding the run ends as a buffer (similar to list array for
> > > > example)
> > > >  makes it difficult to calculate offsets
> > > >
> > > >  I don't have a strong opinion about this, but I also don't
> > > > understand the
> > > >  logic. Surely the implementation is just generating/reading a
> > > > buffer of
> > > >  integers and there's some overhead/indirection to wrapping it in
> > > > an
> > > > Array
> > > >  (that must then be validated).
> > > >
> > > >  As a matter of curiosity, was a dictionary approach ever
> > > > considered? If one
> > > >  new layout was added (one buffer containing the run ends of a
> > > > RLE
> > > > 0:N int32
> > > >  array), the dictionary member could be the values array and
> > > > perhaps
> > > > make it
> > > >  easier for implementations that already handle dictionaries.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >  On Wed, Sep 14, 2022 at 2:04 PM Matthew Topol
> > > > <
> > > > m...@voltrondata.com.invalid <mailto:m...@voltrondata.com.invalid>
> > > > >
> > > >  wrote:
> > > >
> > > >  > Just wanted to chime in here that I also have several draft
> > > > PRs
> > > > for
> > > >  > implementing the RLE arrays in Go as the second implementation
> > > > (since
> > > >  > we use two implementations as a requirement to vote on
> > > >  > changes/additions to the format).
> > > >  >
> > > >  > They can be found here:
> > > >  >
> > > >  > <<https://github.com/apache/arrow/pull/14111>>
> > > >  > <<https://github.com/apache/arrow/pull/14114>>
> > > >  > <<https://github.com/apache/arrow/pull/14126>>
> > > >  >
> > > >  > --Matt
> > > >  >
> > > >  > On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield
> > > >  > <emkornfi...@gmail.com <mailto:emkornfi...@gmail.com>> wrote:
> > > >  > >>
> > > >  > >>   * Should we encode "run lengths" or "run ends"?
> > > >  > >
> > > >  > >
> > > >  > > I think the project has leaned towards sublinear access, so
> > > > run
> > > > ends
> > > >  > > make
> > > >  > > sense.  The downside is that we run into similar issues with
> > > >  > > List/LargeList
> > > >  > > where the total number of elements is limited by bit-width
> > > > (which can
> > > >  > > also
> > > >  > > cause space wastage, e.g. with run ends it might be
> > > > reasonable
> > > > to
> > > >  > > limit
> > > >  > > bit-width to 16).
> > > >  > >
> > > >  > > The values are definitely a child array.  However, encoding
> > > > the
> > > > run
> > > >  > >>  ends as a buffer (similar to list array for example) makes
> > > > it
> > > >  > >>  difficult to calculate offsets.  Translating an array
> > > > offset
> > > > to a
> > > >  > >>  buffer offset takes O(log(N)) time.  If the run ends are
> > > > encoded as
> > > >  > >> a
> > > >  > >>  child array (so the RLE array has no buffers and two child
> > > > arrays)
> > > >  > >>  then this problem goes away.
> > > >  > >
> > > >  > >
> > > >  > > I'm not sure I understand this, could you provide an example
> > > > of
> > > > the
> > > >  > > problem
> > > >  > > that the child array solves?
> > > >  > >
> > > >  > >
> > > >  > >
> > > >  > >
> > > >  > > On Wed, Sep 14, 2022 at 9:36 AM Weston Pace
> > > > <weston.p...@gmail.com <mailto:weston.p...@gmail.com>
> > > >  > > <<mailto:weston.p...@gmail.com>>> wrote:
> > > >  > >
> > > >  > >>  I'm going to bump this because it would be good to get
> > > > feedback.  In
> > > >  > >>  particular it would be nice to get feedback on the
> > > > suggested
> > > > format
> > > >  > >>  change[1].  We are currently moving forward on coming up
> > > > with
> > > > an IPC
> > > >  > >>  format proposal which we will share when ready.
> > > >  > >>
> > > >  > >>  The two interesting points that jump out to me are:
> > > >  > >>
> > > >  > >>   * Should we encode "run lengths" or "run ends"?
> > > >  > >>
> > > >  > >>  For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run
> > > > lengths"
> > > >  > >> 3,
> > > >  > >>  2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is
> > > > preferred
> > > >  > >>  as that leads to O(log(N)) random access (instead of O(N))
> > > > and it's
> > > >  > >>  not clear there are any disadvantages.
> > > >  > >>
> > > >  > >>   * Should the run ends be encoded as a buffer or a child
> > > > array?
> > > >  > >>
> > > >  > >>  The values are definitely a child array.  However,
> > > > encoding
> > > > the run
> > > >  > >>  ends as a buffer (similar to list array for example) makes
> > > > it
> > > >  > >>  difficult to calculate offsets.  Translating an array
> > > > offset
> > > > to a
> > > >  > >>  buffer offset takes O(log(N)) time.  If the run ends are
> > > > encoded as
> > > >  > >> a
> > > >  > >>  child array (so the RLE array has no buffers and two child
> > > > arrays)
> > > >  > >>  then this problem goes away.
> > > >  > >>
> > > >  > >>  [1] <<https://github.com/apache/arrow/pull/13333/files>>
> > > >  > >>
> > > >  > >>  On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
> > > >  > >>
> > > > <tob...@zagorni.eu.invalid <mailto:tob...@zagorni.eu.invalid>
> > > > <<mailto:tob...@zagorni.eu.invalid>>>
> > > >  > >> wrote:
> > > >  > >>  >
> > > >  > >>  > Hello Everyone,
> > > >  > >>  >
> > > >  > >>  > Recently, I have implemented support for run-length
> > > > encoding in
> > > >  > >> Arrow
> > > >  > >>  > C++. So far my implementation is split into different
> > > > subtasks of
> > > >  > >>  > ARROW-16771
> > > > (<<https://issues.apache.org/jira/browse/ARROW-16771>>).
> > > >  > >>  >
> > > >  > >>  > I have (draft) PRs available for:
> > > >  > >>  > - general handling of RLE in arrow C++, Type, Arrow,
> > > > Builder
> > > >  > >>  > subclasses, etc.
> > > >  > >>  >   (subtasks 1-9)
> > > >  > >>  > - encode, decode kernels (fixed size only):
> > > >  > >>  >
> > > > (<<https://issues.apache.org/jira/browse/ARROW-16772>>)
> > > >  > >>  > - filter kernel (fixed size only):
> > > >  > >>  >
> > > > (<<https://issues.apache.org/jira/browse/ARROW-16774>>)
> > > >  > >>  > - simple benchmark for the RLE kernels
> > > >  > >>  >
> > > > (<<https://issues.apache.org/jira/browse/ARROW-17026>>)
> > > >  > >>  > - adding RLE to Arrow Columnar format document
> > > >  > >>  >
> > > > (<<https://issues.apache.org/jira/browse/ARROW-16773>>)
> > > >  > >>  >
> > > >  > >>  > What is not yet implemented:
> > > >  > >>  > - converting RLE to formats like Parquet, JSON, IPC.
> > > >  > >>  > - caching of physical offsets when working with sliced
> > > > arrays -
> > > >  > >> finding
> > > >  > >>  > these offsets is an  O(log(n)) binary search which could
> > > > be
> > > >  > >> avoided in
> > > >  > >>  > a lot of cases
> > > >  > >>  >
> > > >  > >>  > I'm interested in any feedback on the code and I'm
> > > > wondering what
> > > >  > >> would
> > > >  > >>  > be the best way to get this merged.
> > > >  > >>  >
> > > >  > >>  > A lot of the PRs depend on earlier ones. I ordered the
> > > > subtasks
> > > >  > >> in a
> > > >  > >>  > way they could be merged. The first would be:
> > > >  > >>  > 1. Handling of array-only types using VisitTypeInline:
> > > >  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-17258>>
> > > >  > >>  > 2. Adding RLE type / array class (only builds on #1):
> > > >  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-17261>>
> > > >  > >>  > -  also, since it has no dependencies: adding RLE to
> > > > Arrow
> > > >  > >> Columnar
> > > >  > >>  > format document
> > > >  > >>  >    <<https://issues.apache.org/jira/browse/ARROW-16773>>
> > > >  > >>  >
> > > >  > >>  > Best,
> > > >  > >>  > Tobias
> > > >  > >>
> > > >  >
> > > >  >
> >
>

Re: PRs for RLE support

Reply via email to