Re: [C++] Adding Run-Length Encoding to Arrow

Wes McKinney Fri, 08 Jul 2022 06:20:56 -0700

hi all,

Just catching up on this e-mail thread from last month. Since I've
been neck deep refactoring the kernels code the last few weeks I have
a few thoughts about this:

* How we implement and use RLE in the C++ library and Acero is
separate from how RLE will be represented in the Arrow IPC format and
C ABI. I think both the IPC representation and the different library
implementations should be given separate but thorough consideration.
There are some other projects like DuckDB and Velox which may already
have some RLE stuff, and given that they use the Arrow C interface, we
should strive if possible to create a C ABI representation that
everyone is happy about zero-copying. I will try to help with some of
this format alignment across projects that have an interest in
Arrow-compatibility if that sounds good.

* I believe that having a Type::RLE is the right approach in C++ and
it makes dynamic dispatch everywhere in the library pretty
straightforward.

* It's unclear to me whether it is a good idea for RLE to replace the
use of Scalar values in the compute kernels sublibrary, especially
after the refactoring that I've just done to drop the scalar-specific
code paths from most kernels (and about to remove the rest once I
finish refactoring the aggregate kernels). It feels complicated (as
opposed to letting scalars and RLE co-exist peacefully) but I could be
convinced otherwise -- it may be hard to arrive at a conclusion
without some prototyping / exploration

When encountering RLE in the kernels, there will be a few cases:

a. Kernels with RLE-optimized code paths (e.g. aggregation, filtering)
b. Kernels which immediately must up-promote from RLE to non-RLE, or
iterate on a per-run basis, invoking the (array, scalar) code paths
for each run
c. Kernels which pass through to execute on the (usually much smaller)
values portion of the RLE array

Otherwise I am excited to see some progress on this (and relatedly,
being able to represent Constant arrays — possibly using the RLE
representation — in the IPC format would be a huge space-saving
benefit for some applications)

Thanks
Wes

On Thu, Jun 9, 2022 at 2:07 PM Sasha Krassovsky
<krassovskysa...@gmail.com> wrote:
>
> A format where run lengths and values are interleaved would almost certainly 
> be worse than having them separate. For example, unary scalar kernel 
> evaluation is exactly the same as on raw arrays when they are not 
> interleaved. Further, in the context of vectorization, a vectorized load into 
> the array would be a mix of run lengths and values - not good as you have to 
> de-interleave them somehow anyway. Am I missing something - did you have an 
> optimization in mind?
>
> Regarding a benchmark, I agree it could be useful. We need to decide what’s 
> “reasonable”. There are pathological cases either way: you can be 2x slower 
> than a raw column if you have a million runs of length 1 or you can be 
> 1000000x faster if you have one run of length 1000000. Where I think RLE can 
> shine most is if a filter column is RLE or if a group-by column is RLE. These 
> are the two benchmarks that I would suggest writing.
>
> Sasha Krassovsky
>
> > On Jun 8, 2022, at 3:59 AM, Alessandro Molina <alessan...@voltrondata.com> 
> > wrote:
> >
> > RLE would probably have some benefits that it makes sense to evaluate, I
> > would personally go in the direction of having a minimal benchmarking suite
> > for some of the cases where we expect to seem most benefit (IE: filtering)
> > so we can discuss with real numbers.
> >
> > Also, the currently proposed format divides run lengths and values, maybe a
> > format where run lengths and values are stored interleaved in the same
> > buffer might be able to allow more optimisations in the contest of
> > vectorised operations. Even though it might be harder to work with for
> > things that are not fixed width.
> >
> > On Tue, Jun 7, 2022 at 7:56 PM Tobias Zagorni <tob...@zagorni.eu.invalid>
> > wrote:
> >
> >> I created a Jira for adding RLE as ARROW-16771, and draft PRs:
> >>
> >> - https://github.com/apache/arrow/pull/13330
> >>  Encode/Decode functions for (currently fixed width types only)
> >>
> >> - https://github.com/apache/arrow/pull/13333
> >>  For updating docs
> >>
> >> Best,
> >> Tobias
> >>
> >> Am Dienstag, dem 31.05.2022 um 17:13 -0500 schrieb Wes McKinney:
> >>> I haven't had a chance to look at the branch in detail, but if you
> >>> can
> >>> provide a pointer to a specification or other details about the
> >>> proposed memory format for RLE (basically: what would be added to the
> >>> columnar documentation as well as the Flatbuffers schema files), it
> >>> would be helpful so it can be circulated to some other interested
> >>> parties working primarily outside of Arrow (e.g. DuckDB) who might
> >>> like to converge on a standard especially given that it would be
> >>> exported across the C data interface. Thanks!
> >>
> >>
>

Re: [C++] Adding Run-Length Encoding to Arrow

Reply via email to