tustvold opened a new issue #209: URL: https://github.com/apache/arrow-datafusion/issues/209
The dictionary support added in #1262 hydrates dictionaries for Arrow Flight. In some situations it is possible to do better than this. This is somewhat complicated because a dictionary may be shared across columns in some record batches, whereas the dictionary ID is encoded in the schema and must be constant for a given column. A very basic protocol would assign each column in the schema a unique dictionary ID and, before sending each record batch, send a non-differential dictionary update containing that column's dictionary for the batch. This is potentially wasteful and will likely need heuristics for when it is better to hydrate the values and/or re-encode the dictionary, but it should be easy to implement.
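To make the basic protocol concrete, here is a minimal sketch in plain Rust. It does not use the arrow-rs or Flight APIs: `DictColumn`, `Batch`, `Message`, `assign_dictionary_ids`, and `encode_stream` are all hypothetical names invented for illustration, and string dictionaries with `u32` keys are assumed purely to keep the model small. The only thing it is meant to show is the message ordering: one non-delta dictionary message per column, followed by the record batch that refers to them.

```rust
// Hypothetical, simplified model of the protocol. None of these types are
// arrow-rs or Arrow Flight APIs; they only illustrate the message ordering.

/// A dictionary-encoded column as it appears within one record batch.
#[derive(Debug, Clone, PartialEq)]
pub struct DictColumn {
    /// Dictionary values for this column in this batch.
    pub values: Vec<String>,
    /// Keys indexing into `values`, one per row.
    pub keys: Vec<u32>,
}

/// A record batch made up solely of dictionary-encoded columns (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub columns: Vec<DictColumn>,
}

/// Messages placed on the wire, loosely mirroring Arrow IPC message kinds.
#[derive(Debug, Clone, PartialEq)]
pub enum Message {
    /// `is_delta: false` means the dictionary replaces any earlier one with the same ID.
    DictionaryBatch { id: i64, is_delta: bool, values: Vec<String> },
    /// Only the keys travel in the record batch; values come from the dictionaries.
    RecordBatch { keys: Vec<Vec<u32>> },
}

/// Assign every column a unique dictionary ID. Here the ID is simply the
/// column index, so it stays constant for that column across all batches.
pub fn assign_dictionary_ids(num_columns: usize) -> Vec<i64> {
    (0..num_columns as i64).collect()
}

/// Before each record batch, emit a non-differential dictionary update for
/// every column, then emit the batch itself.
pub fn encode_stream(batches: &[Batch], ids: &[i64]) -> Vec<Message> {
    let mut out = Vec::new();
    for batch in batches {
        assert_eq!(batch.columns.len(), ids.len(), "batch does not match schema");
        for (column, &id) in batch.columns.iter().zip(ids) {
            out.push(Message::DictionaryBatch {
                id,
                is_delta: false,
                values: column.values.clone(),
            });
        }
        out.push(Message::RecordBatch {
            keys: batch.columns.iter().map(|c| c.keys.clone()).collect(),
        });
    }
    out
}
```

In this sketch a dictionary shared by two columns is sent twice, once under each column's ID, and every dictionary is resent with every batch even when it has not changed. That is the waste the heuristics mentioned above would aim to reduce.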
