Re: [C++] Adding Run-Length Encoding to Arrow

2022-07-19 Thread Antoine Pitrou
On 08/07/2022 at 15:19, Wes McKinney wrote: * I believe that having a Type::RLE is the right approach in C++ and it makes dynamic dispatch everywhere in the library pretty straightforward. +1 on this, as it will raise a nice NotImplemented error for existing code rather than crash or corr

Re: [C++] Adding Run-Length Encoding to Arrow

2022-07-08 Thread Wes McKinney
hi all, Just catching up on this e-mail thread from last month. Since I've been neck deep refactoring the kernels code the last few weeks I have a few thoughts about this: * How we implement and use RLE in the C++ library and Acero is separate from how RLE will be represented in the Arrow IPC for

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-09 Thread Sasha Krassovsky
A format where run lengths and values are interleaved would almost certainly be worse than having them separate. For example, unary scalar kernel evaluation is exactly the same as on raw arrays when they are not interleaved. Further, in the context of vectorization, a vectorized load into the ar
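As a rough illustration of the point about unary scalar kernels, here is a minimal sketch in plain C++ (no Arrow APIs). The `RleInt32` struct and the `HalveValues`/`HalveRle` functions are hypothetical names invented for this example, and the int32 run-length type is an assumption. With run lengths and values in separate buffers, the kernel loop over the values is the same loop one would run over a raw array, and the run lengths are passed through untouched.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container (not an Arrow type): run lengths and values live in
// two separate, equally long buffers instead of being interleaved.
struct RleInt32 {
  std::vector<int32_t> run_lengths;  // run_lengths[i] >= 1
  std::vector<int32_t> values;       // values[i] repeats run_lengths[i] times
};

// An ordinary unary kernel over a contiguous array; it knows nothing about RLE.
void HalveValues(const int32_t* in, int32_t* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = in[i] / 2;
  }
}

// Applying the kernel to RLE data: run the unchanged array kernel once per
// run (over the values buffer) and copy the run lengths as they are.
RleInt32 HalveRle(const RleInt32& in) {
  RleInt32 out;
  out.run_lengths = in.run_lengths;
  out.values.resize(in.values.size());
  HalveValues(in.values.data(), out.values.data(), in.values.size());
  return out;
}
```

One consequence of this approach is that the output can contain neighbouring runs with equal values (for example 10 / 2 and 11 / 2 both give 5), since the kernel never looks at the run boundaries.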

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-08 Thread Alessandro Molina
RLE would probably have some benefits that it makes sense to evaluate; I would personally go in the direction of having a minimal benchmarking suite for some of the cases where we expect to see the most benefit (i.e. filtering) so we can discuss with real numbers. Also, the currently proposed format d

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-07 Thread Tobias Zagorni
I created a Jira for adding RLE as ARROW-16771, and draft PRs:
- https://github.com/apache/arrow/pull/13330 Encode/Decode functions (currently for fixed-width types only)
- https://github.com/apache/arrow/pull/1 For updating docs
Best, Tobias
On Tuesday, 31.05.2022 at 17:13 -0500 s

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-04 Thread Andrew Lamb
I think the biggest benefit of RLE is not on-the-wire compression, as that can be done via more general purpose compression schemes as Antoine mentions. The biggest benefit of RLE is that it allows operating directly and very efficiently on the "encoded" form -- for example, you can apply filters
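To make the idea of operating directly on the encoded form concrete, here is a minimal sketch in plain C++ (no Arrow APIs). `RleInt64`, `FilterGreaterThan` and `LogicalLength` are hypothetical names for this example, and the separate run-length/value layout with int32 run lengths is an assumption. The predicate is evaluated once per run rather than once per row, and the result stays run-length encoded.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container (not an Arrow type).
struct RleInt64 {
  std::vector<int32_t> run_lengths;  // run_lengths[i] >= 1
  std::vector<int64_t> values;       // values[i] repeats run_lengths[i] times
};

// Keep only the rows whose value is greater than `threshold`. Each run is
// tested once, so the work is proportional to the number of runs, not rows.
RleInt64 FilterGreaterThan(const RleInt64& in, int64_t threshold) {
  RleInt64 out;
  for (std::size_t i = 0; i < in.values.size(); ++i) {
    if (in.values[i] > threshold) {
      out.values.push_back(in.values[i]);
      out.run_lengths.push_back(in.run_lengths[i]);
    }
  }
  return out;
}

// Number of rows the encoded data represents once decoded.
int64_t LogicalLength(const RleInt64& rle) {
  int64_t total = 0;
  for (int32_t len : rle.run_lengths) {
    total += len;
  }
  return total;
}
```

For an input of three runs covering 1502 rows, the filter above performs three comparisons instead of 1502.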

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-03 Thread Tobias Zagorni
On Friday, 03.06.2022 at 09:32 -0700, Micah Kornfield wrote:
> > Thinking about compatibility with existing software, RLE could possibly even be made an Extension Type that follows the layout of a struct of int32 and the encoded value type. I'm wondering whether this would be

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-03 Thread Micah Kornfield
> > Thinking about compatibility with existing software, RLE could possibly even be made an Extension Type that follows the layout of a struct of int32 and the encoded value type. I'm wondering whether this would be better for compatibility.
I might be misunderstanding this proposal, but I don'

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-03 Thread Tobias Zagorni
> Well, Arrow C++ does not have a notion of encoding distinct from the data type. Adding such a notion would risk breaking compatibility for all existing software that hasn't been upgraded to dispatch based on encoding.
Thinking about compatibility with existing software, RLE could possibl

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-01 Thread Neal Richardson
Would it make sense to make a draft PR with your branch so that folks can comment on specific parts of it?
Neal
On Wed, Jun 1, 2022 at 10:20 AM Tobias Zagorni wrote:
> On Tuesday, 31.05.2022 at 12:41 -0700, Micah Kornfield wrote:
> > > > - Should we allow multiple runs of the same value f

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-01 Thread Tobias Zagorni
On Tuesday, 31.05.2022 at 12:41 -0700, Micah Kornfield wrote:
> > - Should we allow multiple runs of the same value following each other? Otherwise we would either need a pass to correct this after a lot of operations, or make RLE-aware versions of their kernels.
> > Is there
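The "pass to correct this" mentioned in the quoted question could look roughly like the following sketch in plain C++ (no Arrow APIs). `RleInt64` and `MergeAdjacentRuns` are hypothetical names for this example, and the separate run-length/value layout with int32 run lengths is an assumption. The pass walks the runs once and folds each run into its predecessor when both carry the same value.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container (not an Arrow type).
struct RleInt64 {
  std::vector<int32_t> run_lengths;  // run_lengths[i] >= 1
  std::vector<int64_t> values;       // values[i] repeats run_lengths[i] times
};

// Normalization pass: merge neighbouring runs that hold the same value so
// that no two consecutive runs in the output are equal.
// Assumption for this sketch: a merged run length still fits in int32_t.
RleInt64 MergeAdjacentRuns(const RleInt64& in) {
  RleInt64 out;
  for (std::size_t i = 0; i < in.values.size(); ++i) {
    if (!out.values.empty() && out.values.back() == in.values[i]) {
      out.run_lengths.back() += in.run_lengths[i];  // extend the previous run
    } else {
      out.values.push_back(in.values[i]);
      out.run_lengths.push_back(in.run_lengths[i]);
    }
  }
  return out;
}
```

The cost is linear in the number of runs, which is the overhead that would be paid after operations that can produce equal neighbouring runs if the format forbade them.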

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Weston Pace
> I don't think replacing Scalar compute paths with dedicated paths for RLE-encoded data would ever be a simplification. Also, when a kernel hasn't been upgraded with a native path for RLE data, former Scalar Datums would now be expanded to the full RLE-decoded version before running the ke

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Wes McKinney
I haven't had a chance to look at the branch in detail, but if you can provide a pointer to a specification or other details about the proposed memory format for RLE (basically: what would be added to the columnar documentation as well as the Flatbuffers schema files), it would be helpful so it can

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Tobias Zagorni
Hi,
On Tuesday, 31.05.2022 at 21:12 +0200, Antoine Pitrou wrote:
> > Hi,
> > On 31/05/2022 at 20:24, Tobias Zagorni wrote:
> > Hi, I'm currently working on adding Run-Length encoding to arrow. I created a function to dictionary-encode arrays here (currently only for fixed le

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Antoine Pitrou
On 31/05/2022 at 21:41, Micah Kornfield wrote: I'm currently working on adding Run-Length encoding to arrow. Nice What are the intended use cases for this: - external engines want to provide run-length encoded data to work on using arrow? It is more than just external engines. Many p

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Micah Kornfield
> > I'm currently working on adding Run-Length encoding to arrow.
Nice
> What are the intended use cases for this:
> - external engines want to provide run-length encoded data to work on using arrow?
> It is more than just external engines. Many popular file formats support RLE encoding. Bei

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Antoine Pitrou
Hi, On 31/05/2022 at 20:24, Tobias Zagorni wrote: Hi, I'm currently working on adding Run-Length encoding to arrow. I created a function to dictionary-encode arrays here (currently only for fixed length types): https://github.com/apache/arrow/compare/master...zagto:rle?expand=1 The general

[C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Tobias Zagorni
Hi, I'm currently working on adding Run-Length encoding to arrow. I created a function to dictionary-encode arrays here (currently only for fixed length types): https://github.com/apache/arrow/compare/master...zagto:rle?expand=1 The general idea is that RLE data will be a nested data type, with a
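For readers unfamiliar with the encoding under discussion, here is a minimal sketch of run-length encoding for a fixed-width type in plain C++. It is not the code from the linked branch and uses no Arrow APIs. `RleInt64`, `RleEncode` and `RleDecode` are hypothetical names, and the layout (separate run-length and value buffers, int32 run lengths) is an assumption made for illustration, since the description of the proposed layout is cut off above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container (not an Arrow type).
struct RleInt64 {
  std::vector<int32_t> run_lengths;  // run_lengths[i] >= 1
  std::vector<int64_t> values;       // values[i] repeats run_lengths[i] times
};

// Collapse consecutive equal elements into (run length, value) pairs.
// Assumption for this sketch: no run is longer than INT32_MAX elements.
RleInt64 RleEncode(const std::vector<int64_t>& input) {
  RleInt64 out;
  for (int64_t v : input) {
    if (!out.values.empty() && out.values.back() == v) {
      ++out.run_lengths.back();  // same value: extend the current run
    } else {
      out.values.push_back(v);   // new value: start a run of length 1
      out.run_lengths.push_back(1);
    }
  }
  return out;
}

// Expand every run back into its repeated values.
std::vector<int64_t> RleDecode(const RleInt64& rle) {
  std::vector<int64_t> out;
  for (std::size_t i = 0; i < rle.values.size(); ++i) {
    out.insert(out.end(), static_cast<std::size_t>(rle.run_lengths[i]),
               rle.values[i]);
  }
  return out;
}
```

Encoding the six elements 7, 7, 7, 2, 2, 9 with this sketch yields the values 7, 2, 9 with run lengths 3, 2, 1, and decoding restores the original sequence.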