tustvold opened a new issue, #1799:
URL: https://github.com/apache/arrow-rs/issues/1799
**TLDR**
Make ArrayData layout explicit so that we can eventually push offsets down
into the underlying buffers/bitmaps, instead of tracking them as a top-level
concept which has proven to be rather error prone.
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
Currently `ArrayData` is defined as follows.
```
pub struct ArrayData {
    /// The data type for this array data
    data_type: DataType,
    /// The number of elements in this array data
    len: usize,
    /// The number of null elements in this array data
    null_count: usize,
    /// The offset into this array data, in number of items
    offset: usize,
    /// The buffers for this array data. Note that depending on the array types, this
    /// could hold different kinds of buffers (e.g., value buffer, value offset buffer)
    /// at different positions.
    buffers: Vec<Buffer>,
    /// The child(ren) of this array. Only non-empty for nested types, currently
    /// `ListArray` and `StructArray`.
    child_data: Vec<ArrayData>,
    /// The null bitmap. A `None` value for this indicates all values are non-null in
    /// this array.
    null_bitmap: Option<Bitmap>,
}
```
This is simple, but has a couple of caveats:
* It isn't clear which buffers and children are present for a given layout type
* There is no clear path to storing `BooleanArray` values as a `Bitmap` rather
than a `Buffer`, which would allow removing `offset`
* A `Vec` is allocated for just one or two elements (the C++ implementation
inlines these)
* There is potential for accidentally interpreting a buffer incorrectly
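
To make the last point concrete, here is a minimal sketch of how positional buffer access can go wrong. `Buffer`, `PositionalData`, `read_i32` and `string_like` are simplified stand-ins invented for illustration; they are not the arrow-rs types or API.

```rust
/// Simplified stand-in for arrow's `Buffer`: just owned bytes (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct Buffer(pub Vec<u8>);

/// Simplified stand-in for the current `ArrayData`, keeping only the
/// positional `buffers` field (hypothetical).
pub struct PositionalData {
    pub buffers: Vec<Buffer>,
}

/// Reinterprets buffer `index` as little-endian `i32` values. Nothing here
/// knows whether that buffer really holds `i32`s.
pub fn read_i32(data: &PositionalData, index: usize) -> Vec<i32> {
    data.buffers[index]
        .0
        .chunks_exact(4)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// Builds a string-like array ["hi", "you!!!"]:
/// buffers[0] holds the i32 offsets, buffers[1] holds the UTF-8 bytes.
pub fn string_like() -> PositionalData {
    let offsets: Vec<u8> = [0i32, 2, 8]
        .iter()
        .flat_map(|o| o.to_le_bytes())
        .collect();
    PositionalData {
        buffers: vec![Buffer(offsets), Buffer(b"hiyou!!!".to_vec())],
    }
}
```

With this sketch, `read_i32(&string_like(), 0)` yields the offsets `[0, 2, 8]`, while `read_i32(&string_like(), 1)` compiles and runs just as happily but reinterprets the string bytes as integers. Only the index distinguishes the two calls.
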
**Describe the solution you'd like**
Introduce a new `ArrayDataLayout` enumeration:
```
pub enum ArrayDataLayout {
    Boolean { values: Buffer },
    Primitive { values: Buffer },
    Offsets { offsets: Buffer, values: Buffer },
    // Single children are boxed: `ArrayData` holds an `ArrayDataLayout` by value,
    // so an inline `ArrayData` here would make the type infinitely sized.
    Dictionary { keys: Buffer, values: Box<ArrayData> },
    List { offsets: Buffer, elements: Box<ArrayData> },
    Struct { children: Vec<ArrayData> },
    Union { offsets: Option<Buffer>, types: Buffer, children: Vec<ArrayData> },
}
```
```
pub struct ArrayData {
    /// The data type for this array data
    data_type: DataType,
    /// The number of elements in this array data
    len: usize,
    /// The number of null elements in this array data
    null_count: usize,
    /// The offset into this array data, in number of items
    offset: usize,
    /// The null bitmap. A `None` value for this indicates all values are non-null in
    /// this array.
    null_bitmap: Option<Bitmap>,
    /// The array data layout
    layout: ArrayDataLayout,
}
```
We could then progressively deprecate the methods that explicitly refer to
buffers by index, etc...
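
As a rough sketch of what replacing index-based access could look like, the following uses a cut-down version of the enum above. `Buffer` is again a simplified stand-in, only three variants are included, and the `offsets` and `buffer_bytes` methods are hypothetical names chosen for illustration, not existing or proposed arrow-rs API.

```rust
/// Simplified stand-in for arrow's `Buffer`: just owned bytes (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct Buffer(pub Vec<u8>);

/// Cut-down version of the proposed layout enum (three variants only).
pub enum ArrayDataLayout {
    Primitive { values: Buffer },
    Offsets { offsets: Buffer, values: Buffer },
    Struct { children: Vec<ArrayData> },
}

/// Cut-down `ArrayData` carrying only a length and a layout.
pub struct ArrayData {
    pub len: usize,
    pub layout: ArrayDataLayout,
}

impl ArrayData {
    /// Hypothetical accessor: returns the offsets buffer only for layouts
    /// that actually have one, instead of trusting a caller-supplied index.
    pub fn offsets(&self) -> Option<&Buffer> {
        match &self.layout {
            ArrayDataLayout::Offsets { offsets, .. } => Some(offsets),
            ArrayDataLayout::Primitive { .. } | ArrayDataLayout::Struct { .. } => None,
        }
    }

    /// Hypothetical helper: total bytes held by this array and its children.
    /// The exhaustive match forces every layout to be handled explicitly.
    pub fn buffer_bytes(&self) -> usize {
        match &self.layout {
            ArrayDataLayout::Primitive { values } => values.0.len(),
            ArrayDataLayout::Offsets { offsets, values } => offsets.0.len() + values.0.len(),
            ArrayDataLayout::Struct { children } => {
                children.iter().map(ArrayData::buffer_bytes).sum()
            }
        }
    }
}
```

Because the `match` is exhaustive, adding a new layout variant would produce a compile error at each such site rather than a silently misread buffer.
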
**Describe alternatives you've considered**
We could leave `ArrayData` as it is.
**Additional context**
This could be seen as an evolution of @HaoYang670's proposal in
https://github.com/apache/arrow-rs/issues/1640
It also relates to @jhorstmann's proposal in
https://github.com/apache/arrow-rs/pull/1499#issuecomment-1096878229
It could also be seen as an interpretation of the arrow2 physical vs logical
type separation.