I am noodling with the Dictionary implementation and would like to propose the
data design and invite edits. Forgive my unfamiliarity with this mailing list.
Given your current design, it would seem best to add a DataType of Dictionary
with two sub-types, one for the keys and one for the values.
An array type like this may be sufficient for a reference implementation.
```
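/// A dictionary where integer keys index an array in the `DictionaryBatch`
pub struct DictionaryArray {
    /// Integer keys, one per row; owned by the `RecordBatch`
    keys: ArrayRef,
    /// Dictionary values; more than one entry allows for delta batches
    values: Vec<ArrayDataRef>,
}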
```
Note that the keys are owned by the `RecordBatch`, while the values they index
are owned by one or more `DictionaryBatch`es. Allowing multiple entries for
values accommodates delta DictionaryBatches.
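To make the delta case concrete, here is a minimal sketch of how a key might be resolved when the values are split across several chunks. It uses plain `Vec`s rather than the real Arrow types, the name `lookup` is hypothetical, and it assumes that delta batches simply append entries, so a key indexes the logical concatenation of all chunks.

```rust
/// Resolve `key` against a dictionary stored as several chunks: an initial
/// batch followed by zero or more delta batches.
/// Assumption: deltas append, so the key indexes the concatenated chunks.
fn lookup<'a, T>(chunks: &'a [Vec<T>], key: usize) -> Option<&'a T> {
    let mut offset = key;
    for chunk in chunks {
        if offset < chunk.len() {
            return Some(&chunk[offset]);
        }
        // The key lies beyond this chunk; skip past it.
        offset -= chunk.len();
    }
    // The key is out of range for every chunk.
    None
}
```

With this layout a lookup costs time linear in the number of chunks, which is one reason an implementation might prefer to concatenate deltas eagerly.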
The most conceptually similar existing array is the List type, except that the
index can be something other than i32 and each lookup yields a single row.
In practice, there will only ever be one dictionary batch shared amongst all
the record batches and so we would just get a pair of slices and use one to
index the other.
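The single-dictionary case might look roughly like the sketch below. It works on plain slices instead of Arrow arrays, the name `decode` is hypothetical, and `u32` keys are an arbitrary choice for illustration.

```rust
/// Decode dictionary-encoded data by using each entry of `keys` to index
/// into `values`. Returns `None` if any key is out of bounds.
fn decode<'a, T>(keys: &[u32], values: &'a [T]) -> Option<Vec<&'a T>> {
    keys.iter()
        .map(|&k| values.get(k as usize))
        .collect()
}
```

Returning references keeps the decode step free of copies; a real implementation would also have to account for null keys, which this sketch ignores.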
Another common case is reducing the size of string arrays where the strings
are drawn from a limited set of distinct values; some acceleration for this
case would be welcome.
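For illustration, the string case could be encoded along these lines. This is only a sketch using the standard library, not the Arrow builders; the name `encode` and the choice of `u32` keys are assumptions.

```rust
use std::collections::HashMap;

/// Dictionary-encode `input`: each distinct string is stored once in
/// `values`, and `keys` holds one index into `values` per input row.
fn encode(input: &[&str]) -> (Vec<u32>, Vec<String>) {
    // Maps each string seen so far to its position in `values`.
    let mut index: HashMap<&str, u32> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys = Vec::with_capacity(input.len());
    for &s in input {
        let key = *index.entry(s).or_insert_with(|| {
            // First occurrence: append to the dictionary.
            values.push(s.to_string());
            (values.len() - 1) as u32
        });
        keys.push(key);
    }
    (keys, values)
}
```

The saving comes from storing each distinct string once and replacing every row with a small integer key.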