Hi, I'm currently working on adding Run-Length encoding to arrow. I created a function to dictionary-encode arrays here (currently only for fixed length types): https://github.com/apache/arrow/compare/master...zagto:rle?expand=1
The general idea is that RLE data will be a nested data type, with a single child holding a regular ArrayData of the type of the values, but with the duplicate values removed. The parent contains a single buffer of uint64 representing the run lengths by holding the run length of all runs from the first to the current one I'm interested to hear what you think about this. Here are some points that came up during internal discussions: What are the intended use cases for this: - external engines want to provide run-length encoded data to work on using arrow? - more efficient ACERO (and possibly somewhat simplified, since we can now use RLE arrays to replace Scalar Datums) Automatic kernel dispatch: - Scalar kernels would likely just work on RLE data - How should it be implemented? The current place for logic like that seems to be the DispatchBest methods. These only allow to replace the input types, but for this RLE scheme we would need to make the kernel work on a child array of the input. This mechanism would likely need to be extended a lot. - For kernels which don't work on RLE data, should we automatically call Encode and Decode kernels? - Should RLE really be a data type? Dictionary encoding set an example for this, but just like it, RLE is actually a different encoding for arrays of the same data type. - There could be data where RLE is only beneficial on some batches, while others are smaller without RLE. Supporting this would require to dispatch different kernels per batch. Should we support this eventually? - Kernel dispatch of encodings is probably more about handling encodings in general, and somewhat separate from adding RLE. We a similar issue for here: ARROW-11508 Format: - What data type should we use for the run-length values? int32 would save memory, but force us to encode very long arrays as multiple arrays, compared to int64. Or should we support different types? Especially for external systems working with arrow, this would make things more flexible, at the cost of additional complexity in arrow. - I made the length field of the ArrayData hold the physical number of elements in the encoded array. The logical number can already be fould at the end of the accumulated run-lengths buffer. Also this is likely less confusing for code working on the physical data unaware of RLE. Is there more to consider for this decision? - Should we allow multiple runs of the same value following each other? Otherwise we would either need a pass to correct this after a lot of operations, or make RLE-aware versions of thier kernels. Original RFC for adding new types (including RLE): https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq Best, Tobias