Dimitri Vorona created ARROW-2176:
-------------------------------------
Summary: [C++] Extend DictionaryBuilder to support delta dictionaries
Key: ARROW-2176
URL: https://issues.apache.org/jira/browse/ARROW-2176
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Dimitri Vorona
Fix For: 0.9.0
[The IPC format|https://arrow.apache.org/docs/ipc.html] allows sending
additional dictionary batches that carry a previously seen id and an isDelta
flag, which extend the existing dictionary with new entries. Right now, the
DictionaryBuilder (as well as the IPC writer and reader) does not support
generating delta dictionaries.
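To make the receiving side concrete, here is a minimal sketch of how a reader could apply such batches. ToyDictionaryBatch and ToyDictionaryMemo are hypothetical names made up for illustration, not Arrow classes, and plain strings stand in for Arrow arrays. Treating a non-delta batch as a replacement of the stored dictionary is an assumption of this sketch.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an IPC dictionary batch (not an Arrow type).
struct ToyDictionaryBatch {
  int64_t id;                       // dictionary id the batch refers to
  bool is_delta;                    // true: append to the existing dictionary
  std::vector<std::string> values;  // entries carried by this batch
};

// Hypothetical reader-side store of dictionaries keyed by id.
class ToyDictionaryMemo {
 public:
  void Apply(const ToyDictionaryBatch& batch) {
    std::vector<std::string>& dict = dictionaries_[batch.id];
    if (!batch.is_delta) {
      // Assumption: a non-delta batch replaces whatever was stored before.
      dict.clear();
    }
    // Delta entries go at the end, so previously sent indices stay valid.
    dict.insert(dict.end(), batch.values.begin(), batch.values.end());
  }

  // Throws std::out_of_range if no batch with this id has been applied.
  const std::vector<std::string>& Get(int64_t id) const {
    return dictionaries_.at(id);
  }

 private:
  std::map<int64_t, std::vector<std::string>> dictionaries_;
};
```

Because delta entries are only ever appended, indices from earlier record batches keep pointing at the same values after a delta is applied.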
This pull request contains a basic implementation of the DictionaryBuilder with
delta dictionary support. How the API is used can be seen in the dictionary
tests (e.g.
[here|https://github.com/alendit/arrow/blob/delta_dictionary_builder/cpp/src/arrow/array-test.cc#L1773]).
The basic idea is that the user simply reuses the builder object after calling
Finish(Array*) for the first time. Subsequent calls to Append create new
entries only for unseen elements and reuse the ids from previous dictionaries
for elements that were already seen.
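The reuse pattern can be illustrated with a toy model. This is only a sketch of the idea, not the code from the pull request and not the Arrow API: ToyBatch, ToyDeltaDictionaryBuilder and RunToyExample are hypothetical names, and std::string values stand in for Arrow arrays.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical result of one Finish call (not an Arrow type).
struct ToyBatch {
  std::vector<std::string> dictionary;  // entries new since the last Finish
  std::vector<int32_t> indices;         // ids into the accumulated dictionary
  bool is_delta;                        // false only for the first Finish
};

class ToyDeltaDictionaryBuilder {
 public:
  void Append(const std::string& value) {
    auto it = ids_.find(value);
    if (it == ids_.end()) {
      // Unseen value: assign the next id and queue it for the next batch.
      const int32_t id = static_cast<int32_t>(ids_.size());
      it = ids_.emplace(value, id).first;
      pending_entries_.push_back(value);
    }
    // Seen values reuse the id assigned before any earlier Finish.
    pending_indices_.push_back(it->second);
  }

  ToyBatch Finish() {
    ToyBatch out{std::move(pending_entries_), std::move(pending_indices_),
                 finished_once_};
    pending_entries_.clear();  // reset moved-from vectors to a known state
    pending_indices_.clear();
    finished_once_ = true;     // ids_ is kept, so later batches are deltas
    return out;
  }

 private:
  std::unordered_map<std::string, int32_t> ids_;
  std::vector<std::string> pending_entries_;
  std::vector<int32_t> pending_indices_;
  bool finished_once_ = false;
};

// Usage: the second Finish yields only the entry that was not seen before.
inline void RunToyExample() {
  ToyDeltaDictionaryBuilder builder;
  builder.Append("a");
  builder.Append("b");
  builder.Append("a");
  ToyBatch first = builder.Finish();   // dictionary {a, b}, indices {0, 1, 0}
  builder.Append("b");
  builder.Append("c");
  ToyBatch second = builder.Finish();  // dictionary {c}, indices {1, 2}
  assert(!first.is_delta && second.is_delta);
  assert(second.dictionary.size() == 1 && second.indices[1] == 2);
}
```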
Some considerations:
# The API is pretty implicit. An additional flag for Finish that explicitly
indicates the intent to use the builder for delta dictionary generation might
help avoid errors.
# Right now the implementation uses an additional "overflow dictionary" to
store the seen items. This adds a copy on each Finish call and an additional
lookup on each GetItem or Append call. I assume we might get away with
returning Array slices from Finish, which would remove the need for the
overflow dictionary. If the gist of the PR is approved, I can look into further
optimizations.
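The slice idea from the second point could look roughly like the sketch below. It is a hypothetical illustration under simplifying assumptions, not a proposal for the actual implementation: ToySlice and ToySlicingDictionary are made-up names, a std::vector stands in for the dictionary array, and a slice is just an (offset, length) pair over that single append-only storage.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical view of the entries added since the previous Finish.
struct ToySlice {
  std::size_t offset;
  std::size_t length;
};

class ToySlicingDictionary {
 public:
  // Returns the id of value, inserting it if unseen. There is one table and
  // one lookup, with no separate overflow dictionary.
  std::size_t GetOrInsert(const std::string& value) {
    auto result = ids_.emplace(value, entries_.size());
    if (result.second) {
      entries_.push_back(value);
    }
    return result.first->second;
  }

  // Hands out the not yet finished tail as a slice; nothing is copied.
  ToySlice FinishSlice() {
    ToySlice slice{finished_, entries_.size() - finished_};
    finished_ = entries_.size();
    return slice;
  }

  const std::string& At(std::size_t id) const { return entries_.at(id); }

 private:
  std::unordered_map<std::string, std::size_t> ids_;
  std::vector<std::string> entries_;  // append-only, shared by all slices
  std::size_t finished_ = 0;          // number of entries already handed out
};
```

In this model each Finish costs constant time, and Append performs a single hash lookup regardless of how many deltas have been produced.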
The Writer and Reader extensions would be pretty simple, since the
DictionaryBuilder API remains basically the same.