> The downside is that we run into similar issues with List/LargeList > where the total number of elements is limited by bit-width (which can also > cause space wastage, e.g. with run ends it might be reasonable to limit > bit-width to 16).
True. I guess another question would be whether we want to parameterize the type for run ends. The 32-bit/64-bit case isn't terribly interesting to me personally but I suppose we have LargeList for a reason. The 16-bit/32-bit case is more compelling I think. On Wed, Sep 14, 2022 at 1:44 PM Tobias Zagorni <tob...@zagorni.eu.invalid> wrote: > Am Mittwoch, dem 14.09.2022 um 14:33 -0400 schrieb Matthew Topol: > > Doesn't this explanation conflate the Logical Offset (the parent's > > offset) and the Physical Offset (the offset of the run ends / values > > children)? Or am I missing something? If you have an RLE<int64> array > > of length 200, and you slice it with {start: 10, length: 100}, I > > don't > > understand how you can represent that while keeping the parent's > > offset > > as "0" because the offset of "10" in the slice could represent > > anywhere > > from 0 - 10 physical values you have to offset by. You'd still keep > > an > > offset of 10 in the parent, and potentially cache the *physical* > > offset > > too, right? > > I could see this working if we require that the whole Buffer of the run > ends array (and not just the Array) holds valid run end values. We > could look into run_ends_array[-1] to find the logical offset, and > assume 0 if run_ends_array.offset is 0. This actually seems like an > interesting property to me, even if it makes logic more complex because > another offset to consider. > > What I like about the custom Slicing approach is that it works without > a cache and for that reason also across library boundaries without > losing cache information. > > With the current format, adding an efficient Slice function with > physical length inputs would still be possible, but rely on the cache > to store the phyiscal start and end of the slice. > > > The benefit I see for having the child arrays over the run-ends being > > a > > buffer in the RLE array directly is that you can easily tell how > > large > > the buffers *should* be as the length and offset of the run ends > > (int32) array and the values array will always represent *physical* > > values while the parent's offset and length can represent the > > *logical* > > values of the RLE. If the run ends were a buffer in the RLE array > > directly, it would be much more difficult to maintain the separation > > of > > the "physical" and "logical" offset/lengths. > > > > Does that make sense? > > > > --Matt > > This makes perfect sense to me. The primary reason for me was that > Buffers do not have a length field in the Arrow C interface so thier > length is not really known, while in Arrow C++ they have a size. For > all existing data types this can be easiliy resolved from either the > length field or offsets buffer. For RLE the the run ends buffer is > physical size of the array (or larger) which cannot be easily > determined without iterating over the whole buffer. > > But we need a valid buffer size, so we can resolve logical to physical > offsets using binary search. Also I'm not sure if it is required to be > known for each buffer in Arrow C++ anyways. > > - Tobias > > > > > On Wed, Sep 14 2022 at 11:18:27 AM -0700, Weston Pace > > <weston.p...@gmail.com> wrote: > > > I will clarify the offset problem. It essentially boils down to > > > "if > > > you don't have constant access to elements then an array length > > > offset > > > does not give you constant access to buffer offsets". > > > > > > We start with an RLE<int64> array of length 200. We slice it with > > > (start=10, length=100) to get an RLE<int64> array of length 100 and > > > an > > > offset of 10. > > > > > > Now we want to write an IPC file (or access the values for whatever > > > reason). The values buffer has 400 bytes and the run ends buffer > > > has > > > 200 bytes (these numbers could be anything less than 1600/800 so > > > I'm > > > picking these at random). We need to copy a portion of the "run > > > ends" > > > buffer into the file. What bytes are these? The only way to tell > > > would be to do a binary search on the 200 bytes run ends buffer. > > > > > > On the other hand, if there were two child arrays then an > > > implementation, when slicing, could choose to always keep the > > > offset > > > of the parent array at 0 and instead put the offsets in the child > > > arrays. Now you have a parent array with offset 0, a run ends > > > (int32) > > > array with offset 74 and length 5 and a values (int64) array with > > > offset 74 and length 5. We can clearly say that we want to grab > > > bytes > > > 296-316 from the run ends buffer and bytes 592-632 from the values > > > buffer. Of course, other implementations would always be free to > > > use > > > offsets in the parent array. So I think the log(N) approach would > > > still be needed as a fallback. > > > > > > Other options: > > > > > > * Just eat the log(n) cost, it's not that expensive and any > > > application doing excessive slicing could cache the offsets > > > themselves. > > > * Add an optional buffer offset to the spec that can be populated > > > in > > > cases where random array access is not possible. > > > > > > On Wed, Sep 14, 2022 at 10:53 AM Dewey Dunnington > > > <de...@voltrondata.com.invalid > > > <mailto:de...@voltrondata.com.invalid>> wrote: > > > > > > > > > * Should we encode "run lengths" or "run ends"? > > > > > > > > In addition to the points mentioned above, this seems the most > > > > consistent > > > > with the variable-length binary/list layouts > > > > > > > > > encoding the run ends as a buffer (similar to list array for > > > > example) > > > > makes it difficult to calculate offsets > > > > > > > > I don't have a strong opinion about this, but I also don't > > > > understand the > > > > logic. Surely the implementation is just generating/reading a > > > > buffer of > > > > integers and there's some overhead/indirection to wrapping it in > > > > an > > > > Array > > > > (that must then be validated). > > > > > > > > As a matter of curiosity, was a dictionary approach ever > > > > considered? If one > > > > new layout was added (one buffer containing the run ends of a > > > > RLE > > > > 0:N int32 > > > > array), the dictionary member could be the values array and > > > > perhaps > > > > make it > > > > easier for implementations that already handle dictionaries. > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Sep 14, 2022 at 2:04 PM Matthew Topol > > > > < > > > > m...@voltrondata.com.invalid <mailto:m...@voltrondata.com.invalid> > > > > > > > > > wrote: > > > > > > > > > Just wanted to chime in here that I also have several draft > > > > PRs > > > > for > > > > > implementing the RLE arrays in Go as the second implementation > > > > (since > > > > > we use two implementations as a requirement to vote on > > > > > changes/additions to the format). > > > > > > > > > > They can be found here: > > > > > > > > > > <<https://github.com/apache/arrow/pull/14111>> > > > > > <<https://github.com/apache/arrow/pull/14114>> > > > > > <<https://github.com/apache/arrow/pull/14126>> > > > > > > > > > > --Matt > > > > > > > > > > On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield > > > > > <emkornfi...@gmail.com <mailto:emkornfi...@gmail.com>> wrote: > > > > > >> > > > > > >> * Should we encode "run lengths" or "run ends"? > > > > > > > > > > > > > > > > > > I think the project has leaned towards sublinear access, so > > > > run > > > > ends > > > > > > make > > > > > > sense. The downside is that we run into similar issues with > > > > > > List/LargeList > > > > > > where the total number of elements is limited by bit-width > > > > (which can > > > > > > also > > > > > > cause space wastage, e.g. with run ends it might be > > > > reasonable > > > > to > > > > > > limit > > > > > > bit-width to 16). > > > > > > > > > > > > The values are definitely a child array. However, encoding > > > > the > > > > run > > > > > >> ends as a buffer (similar to list array for example) makes > > > > it > > > > > >> difficult to calculate offsets. Translating an array > > > > offset > > > > to a > > > > > >> buffer offset takes O(log(N)) time. If the run ends are > > > > encoded as > > > > > >> a > > > > > >> child array (so the RLE array has no buffers and two child > > > > arrays) > > > > > >> then this problem goes away. > > > > > > > > > > > > > > > > > > I'm not sure I understand this, could you provide an example > > > > of > > > > the > > > > > > problem > > > > > > that the child array solves? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Sep 14, 2022 at 9:36 AM Weston Pace > > > > <weston.p...@gmail.com <mailto:weston.p...@gmail.com> > > > > > > <<mailto:weston.p...@gmail.com>>> wrote: > > > > > > > > > > > >> I'm going to bump this because it would be good to get > > > > feedback. In > > > > > >> particular it would be nice to get feedback on the > > > > suggested > > > > format > > > > > >> change[1]. We are currently moving forward on coming up > > > > with > > > > an IPC > > > > > >> format proposal which we will share when ready. > > > > > >> > > > > > >> The two interesting points that jump out to me are: > > > > > >> > > > > > >> * Should we encode "run lengths" or "run ends"? > > > > > >> > > > > > >> For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run > > > > lengths" > > > > > >> 3, > > > > > >> 2, 4 or "run ends" 3, 5, 9. In the proposal the latter is > > > > preferred > > > > > >> as that leads to O(log(N)) random access (instead of O(N)) > > > > and it's > > > > > >> not clear there are any disadvantages. > > > > > >> > > > > > >> * Should the run ends be encoded as a buffer or a child > > > > array? > > > > > >> > > > > > >> The values are definitely a child array. However, > > > > encoding > > > > the run > > > > > >> ends as a buffer (similar to list array for example) makes > > > > it > > > > > >> difficult to calculate offsets. Translating an array > > > > offset > > > > to a > > > > > >> buffer offset takes O(log(N)) time. If the run ends are > > > > encoded as > > > > > >> a > > > > > >> child array (so the RLE array has no buffers and two child > > > > arrays) > > > > > >> then this problem goes away. > > > > > >> > > > > > >> [1] <<https://github.com/apache/arrow/pull/13333/files>> > > > > > >> > > > > > >> On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni > > > > > >> > > > > <tob...@zagorni.eu.invalid <mailto:tob...@zagorni.eu.invalid> > > > > <<mailto:tob...@zagorni.eu.invalid>>> > > > > > >> wrote: > > > > > >> > > > > > > >> > Hello Everyone, > > > > > >> > > > > > > >> > Recently, I have implemented support for run-length > > > > encoding in > > > > > >> Arrow > > > > > >> > C++. So far my implementation is split into different > > > > subtasks of > > > > > >> > ARROW-16771 > > > > (<<https://issues.apache.org/jira/browse/ARROW-16771>>). > > > > > >> > > > > > > >> > I have (draft) PRs available for: > > > > > >> > - general handling of RLE in arrow C++, Type, Arrow, > > > > Builder > > > > > >> > subclasses, etc. > > > > > >> > (subtasks 1-9) > > > > > >> > - encode, decode kernels (fixed size only): > > > > > >> > > > > > (<<https://issues.apache.org/jira/browse/ARROW-16772>>) > > > > > >> > - filter kernel (fixed size only): > > > > > >> > > > > > (<<https://issues.apache.org/jira/browse/ARROW-16774>>) > > > > > >> > - simple benchmark for the RLE kernels > > > > > >> > > > > > (<<https://issues.apache.org/jira/browse/ARROW-17026>>) > > > > > >> > - adding RLE to Arrow Columnar format document > > > > > >> > > > > > (<<https://issues.apache.org/jira/browse/ARROW-16773>>) > > > > > >> > > > > > > >> > What is not yet implemented: > > > > > >> > - converting RLE to formats like Parquet, JSON, IPC. > > > > > >> > - caching of physical offsets when working with sliced > > > > arrays - > > > > > >> finding > > > > > >> > these offsets is an O(log(n)) binary search which could > > > > be > > > > > >> avoided in > > > > > >> > a lot of cases > > > > > >> > > > > > > >> > I'm interested in any feedback on the code and I'm > > > > wondering what > > > > > >> would > > > > > >> > be the best way to get this merged. > > > > > >> > > > > > > >> > A lot of the PRs depend on earlier ones. I ordered the > > > > subtasks > > > > > >> in a > > > > > >> > way they could be merged. The first would be: > > > > > >> > 1. Handling of array-only types using VisitTypeInline: > > > > > >> > <<https://issues.apache.org/jira/browse/ARROW-17258>> > > > > > >> > 2. Adding RLE type / array class (only builds on #1): > > > > > >> > <<https://issues.apache.org/jira/browse/ARROW-17261>> > > > > > >> > - also, since it has no dependencies: adding RLE to > > > > Arrow > > > > > >> Columnar > > > > > >> > format document > > > > > >> > <<https://issues.apache.org/jira/browse/ARROW-16773>> > > > > > >> > > > > > > >> > Best, > > > > > >> > Tobias > > > > > >> > > > > > > > > > > > > >