Hi, I'm currently working on adding Run-Length encoding to arrow. I
created a function to dictionary-encode arrays here (currently only for
fixed length types):
https://github.com/apache/arrow/compare/master...zagto:rle?expand=1

The general idea is that RLE data will be a nested data type, with a
single child holding a regular ArrayData of the type of the values, but
with the duplicate values removed. The parent contains a single buffer
of uint64 representing the run lengths by holding the run length of all
runs from the first to the current one

I'm interested to hear what you think about this.

Here are some points that came up during internal discussions:

What are the intended use cases for this:
- external engines want to provide run-length encoded data to work on
using arrow?
- more efficient ACERO (and possibly somewhat simplified, since we can
now use RLE arrays to replace Scalar Datums)

Automatic kernel dispatch:
- Scalar kernels would likely just work on RLE data
- How should it be implemented? The current place for logic like that
seems to be the DispatchBest methods. These only allow to replace the
input types, but for this RLE scheme we would need to make the kernel
work on a child array of the input. This mechanism would likely need to
be extended a lot.
- For kernels which don't work on RLE data, should we automatically
call Encode and Decode kernels?
- Should RLE really be a data type? Dictionary encoding set an example
for this, but just like it, RLE is actually a different encoding for
arrays of the same data type.
- There could be data where RLE is only beneficial on some batches,
while others are smaller without RLE. Supporting this would require to
dispatch different kernels per batch. Should we support this
eventually?
- Kernel dispatch of encodings is probably more about handling
encodings in general, and somewhat separate from adding RLE. We a
similar issue for here: ARROW-11508

Format:
- What data type should we use for the run-length values? int32 would
save memory, but force us to encode very long arrays as multiple
arrays, compared to int64. Or should we support different types?
Especially for external systems working with arrow, this would make
things more flexible, at the cost of additional complexity in arrow.
- I made the length field of the ArrayData hold the physical number of
elements in the encoded array. The logical number can already be fould
at the end of the accumulated run-lengths buffer. Also this is likely
less confusing for code working on the physical data unaware of RLE. Is
there more to consider for this decision?
- Should we allow multiple runs of the same value following each other?
Otherwise we would either need a pass to correct this after a lot of
operations, or make RLE-aware versions of thier kernels.

Original RFC for adding new types (including RLE):
https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq

Best,
Tobias

Reply via email to