felipecrv commented on code in PR #37877: URL: https://github.com/apache/arrow/pull/37877#discussion_r1990201006
##########
docs/source/format/Columnar.rst:
##########
@@ -487,6 +499,103 @@ will be represented as follows: ::
     |-------------------------------|-----------------------|
     | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | unspecified (padding) |
 
+ListView Layout
+~~~~~~~~~~~~~~~
+
+The ListView layout is defined by three buffers: a validity bitmap, an offsets
+buffer, and an additional sizes buffer. Sizes and offsets have the identical bit
+width and both 32-bit and 64-bit signed integer options are supported.
+
+As in the List layout, the offsets encode the start position of each slot in the
+child array. In contrast to the List layout, list lengths are stored explicitly
+in the sizes buffer instead of inferred. This allows offsets to be out of order.
+Elements of the child array do not have to be stored in the same order they
+logically appear in the list elements of the parent array.
+
+Every list-view value, including null values, has to guarantee the following
+invariants: ::
+
+    0 <= offsets[i] <= length of the child array
+    0 <= offsets[i] + size[i] <= length of the child array
+
+A list-view type is specified like ``ListView<T>``, where ``T`` is any type
+(primitive or nested). In these examples we use 32-bit offsets and sizes where
+the 64-bit version would be denoted by ``LargeListView<T>``.
+
+**Example Layout: ``ListView<Int8>`` Array**
+
+We illustrate an example of ``ListView<Int8>`` with length 4 having values::
+
+    [[12, -7, 25], null, [0, -127, 127, 50], []]

Review Comment:
@adriangb anything can happen: they can be duplicated in the data or entries can point to the same data.

Compact representation:

```
buffers:
  offsets: [0, _, 3, _, 0]
  sizes: [3, _, 4, 0, 3]
children:
  values: [12, -7, 25, 0, -127, 127, 12]
```

Common representation:

```
buffers:
  offsets: [0, _, 3, _, 7]
  sizes: [3, _, 4, 0, 3]
children:
  values: [12, -7, 25, 0, -127, 127, 12, 12, -7, 25]
```

*using `_` to indicate that the value doesn't matter*

De-duplication is an expensive operation, but you can imagine some kernel producing a compact list-view array by construction. Imagine a function that generates an array of prefixes of another array given sizes: every offset would be `0` and only the sizes would vary.

The main practical consequence of the `ListViewArray` layout is that lists can be written to the array in any order. If you need to set `array[i]` to the logical value `[a, b, c]`, all you have to do is append `[a, b, c]` to the child array and set `offsets[i]` and `sizes[i]` to the appropriate values. This is not possible with `ListArray`, since writing a list at a random position `i` forces all the following values of the child array to move further.
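
To make the two points of this comment concrete, here is a minimal sketch in plain Python (not the Arrow API). The helper names `decode_list_view` and `set_list` are hypothetical, the validity bitmap is modelled as a list of booleans, and the `_` "don't care" entries are filled with `0` purely for illustration. It shows that the compact and common representations decode to the same logical lists, and that slots can be filled in any order by appending to the child values.

```python
from typing import List, Optional


def decode_list_view(
    offsets: List[int],
    sizes: List[int],
    validity: List[bool],
    values: List[int],
) -> List[Optional[List[int]]]:
    """Materialize the logical lists described by a list-view layout."""
    out: List[Optional[List[int]]] = []
    for off, size, valid in zip(offsets, sizes, validity):
        # A null slot yields None; its offset and size are ignored here.
        out.append(values[off:off + size] if valid else None)
    return out


def set_list(
    offsets: List[int],
    sizes: List[int],
    validity: List[bool],
    values: List[int],
    i: int,
    new_list: List[int],
) -> None:
    """Write new_list at slot i by appending it to the child values."""
    offsets[i] = len(values)   # the new list starts at the current end
    sizes[i] = len(new_list)
    validity[i] = True
    values.extend(new_list)    # no existing child values have to move


# Slot 1 is null; 0 stands in for the "_" entries of the comment.
validity = [True, False, True, True, True]

compact = decode_list_view(
    [0, 0, 3, 0, 0], [3, 0, 4, 0, 3], validity,
    [12, -7, 25, 0, -127, 127, 12],
)
common = decode_list_view(
    [0, 0, 3, 0, 7], [3, 0, 4, 0, 3], validity,
    [12, -7, 25, 0, -127, 127, 12, 12, -7, 25],
)

# Filling an initially all-null array of 3 slots out of order.
offsets, sizes, valid, child = [0, 0, 0], [0, 0, 0], [False] * 3, []
set_list(offsets, sizes, valid, child, 2, [7, 8, 9])
set_list(offsets, sizes, valid, child, 0, [1])
written = decode_list_view(offsets, sizes, valid, child)
```

Here `compact` and `common` compare equal, while `child` ends up as `[7, 8, 9, 1]`: the child values follow write order, not logical order. Doing the same with the offsets-only `ListArray` layout would require shifting the child values of every later slot.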