[Discuss][Rust] Support for Dictionary Array types

Andy Thomason Fri, 29 Nov 2019 04:37:11 -0800

I am noodling with the Dictionary implementation and would like to agree on 
the data design and invite edits. Forgive my unfamiliarity with this mailing list.


Given your current design, it would seem best to add a DataType of Dictionary 
with two sub-types, one for the keys and one for the values.
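
As a rough sketch of what such a variant could look like: the enum below is a 
trimmed-down, hypothetical stand-in for the crate's DataType, and the helper 
names `is_valid_dictionary_key` and `dictionary` are assumptions made for 
illustration, not existing API.

```rust
/// Hypothetical, trimmed-down stand-in for the crate's `DataType` enum.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int8,
    Int32,
    Utf8,
    /// Dictionary(key type, value type).
    Dictionary(Box<DataType>, Box<DataType>),
}

impl DataType {
    /// Returns true if this type may be used as a dictionary key.
    /// Assumption: only the integer types listed here are allowed.
    pub fn is_valid_dictionary_key(&self) -> bool {
        matches!(self, DataType::Int8 | DataType::Int32)
    }
}

/// Builds a dictionary type, rejecting non-integer key types.
pub fn dictionary(key: DataType, value: DataType) -> Option<DataType> {
    if key.is_valid_dictionary_key() {
        Some(DataType::Dictionary(Box::new(key), Box::new(value)))
    } else {
        None
    }
}
```

Boxing the two sub-types keeps the enum a fixed size while still allowing 
nested value types.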

An array type like this may be sufficient for a reference implementation.

```
/// A dictionary where integer keys index an array in the `DictionaryBatch`
pub struct DictionaryArray {
    keys: ArrayRef,
    values: Vec<ArrayDataRef>,
}
```

Note that in the `RecordBatch`, the keys are owned by the `RecordBatch` and the 
values they index are owned by one or more `DictionaryBatch`. The multiple 
entries for values allow for delta DictionaryBatches.
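
To illustrate how a key might be resolved when the values arrive in several 
pieces, here is a minimal sketch. It assumes each delta batch appends entries, 
so a key indexes the logical concatenation of all chunks; plain `Vec<String>` 
chunks stand in for `ArrayDataRef`, and `lookup` is a hypothetical helper.

```rust
/// Resolves `key` against a dictionary stored as an ordered list of chunks,
/// where each later chunk is treated as a delta appended to the earlier ones.
/// Returns `None` if the key is past the end of the combined dictionary.
pub fn lookup<'a>(chunks: &'a [Vec<String>], key: usize) -> Option<&'a str> {
    let mut remaining = key;
    for chunk in chunks {
        if remaining < chunk.len() {
            return Some(chunk[remaining].as_str());
        }
        // The key lies beyond this chunk; skip over its entries.
        remaining -= chunk.len();
    }
    None
}
```

With chunks `["a", "b"]` and `["c"]`, key 2 resolves to the first entry of the 
second chunk.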

The most conceptually similar existing array is the List type, except that the 
index type can be something other than i32 and each lookup yields a single row.

In practice, there will only ever be one dictionary batch shared amongst all 
the record batches and so we would just get a pair of slices and use one to 
index the other.
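
The single-dictionary case could be sketched roughly as follows, assuming i32 
keys and no null handling. `decode` is a hypothetical helper, not part of the 
existing crate.

```rust
/// Uses a slice of keys to index a slice of dictionary values.
/// Returns `None` if any key is negative or out of range.
pub fn decode<'a, T>(keys: &[i32], values: &'a [T]) -> Option<Vec<&'a T>> {
    keys.iter()
        .map(|&k| {
            if k < 0 {
                None
            } else {
                values.get(k as usize)
            }
        })
        // Collecting into Option<Vec<_>> stops at the first invalid key.
        .collect()
}
```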

Another common case is reducing the size of string arrays that draw from a 
limited set of distinct strings; some acceleration for this case would be 
welcome.
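
For the string case, the encoding step might look something like this sketch. 
It assumes i32 keys and assigns them in order of first appearance; `encode` is 
a hypothetical name, and the hash map is only one possible way to deduplicate.

```rust
use std::collections::HashMap;

/// Dictionary-encodes a column of strings.
/// Returns one key per input row plus the list of distinct values.
pub fn encode(input: &[&str]) -> (Vec<i32>, Vec<String>) {
    let mut index: HashMap<&str, i32> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys = Vec::with_capacity(input.len());
    for &s in input {
        // Reuse the existing key, or append a new dictionary entry.
        let key = *index.entry(s).or_insert_with(|| {
            values.push(s.to_string());
            (values.len() - 1) as i32
        });
        keys.push(key);
    }
    (keys, values)
}
```

Each distinct string is stored once, and every row costs only the width of a 
key.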
