One more thing to consider: in C++ at least, the dictionary id is
treated as an internal detail of the IPC message production process.
When the Schema is serialized, an id is assigned to each
dictionary-encoded field in the DictionaryMemo object; see

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/dictionary.h
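
To make the id assignment concrete, here is a rough sketch in plain
Python. It is not the Arrow C++ API: the class name, the method name,
and the (name, is_dictionary_encoded) schema representation are all
made up for illustration. The only idea it tries to show is that ids
are handed out by the memo during schema serialization rather than
being stored on the field by the user.

```python
class DictionaryMemoSketch:
    """Hypothetical stand-in for a dictionary memo (not the Arrow API)."""

    def __init__(self):
        self._field_to_id = {}
        self._next_id = 0

    def get_or_assign_id(self, field_name):
        # Reuse the id if this field was already seen; otherwise hand out
        # the next unused one.
        if field_name not in self._field_to_id:
            self._field_to_id[field_name] = self._next_id
            self._next_id += 1
        return self._field_to_id[field_name]


def assign_dictionary_ids(schema, memo):
    """Assign ids to dictionary-encoded fields only.

    `schema` is assumed to be a list of (name, is_dictionary_encoded)
    tuples, which is a simplification of a real Arrow schema.
    """
    return {
        name: memo.get_or_assign_id(name)
        for name, is_dict in schema
        if is_dict
    }


memo = DictionaryMemoSketch()
schema = [("host", True), ("value", False), ("region", True)]
ids = assign_dictionary_ids(schema, memo)
print(ids)  # {'host': 0, 'region': 1}
```

Sequential ids are just one possible policy; the point is that the
caller never has to pick them.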

When record batches are reconstructed, the dictionary corresponding to
an id at the time of reconstruction is set in the Array's internal
data -- that's the "dictionary" member of the ArrayData object
(https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L231).
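
The reader side can be sketched the same way. Again this is plain
Python with hypothetical names, not the Arrow API, and it ignores
nulls. It assumes the reader keeps a map from dictionary id to the
current dictionary, applies deltas and replacements as described in
the quoted message below, and attaches whatever dictionary is current
for an id when a record batch is reconstructed.

```python
class DictionaryTrackerSketch:
    """Hypothetical reader-side state: dictionary id -> current values."""

    def __init__(self):
        self._by_id = {}

    def on_dictionary_message(self, dict_id, values, is_delta=False):
        if is_delta:
            if dict_id not in self._by_id:
                raise KeyError(f"delta for unknown dictionary id {dict_id}")
            # Delta: append to the existing dictionary. A new tuple is
            # built, so batches reconstructed earlier keep their old one.
            self._by_id[dict_id] = self._by_id[dict_id] + tuple(values)
        else:
            # First dictionary for this id, or a replacement.
            self._by_id[dict_id] = tuple(values)

    def reconstruct(self, dict_id, indices):
        # Attach the dictionary that is current at reconstruction time.
        return {"indices": list(indices), "dictionary": self._by_id[dict_id]}


def decode(column):
    """Expand a reconstructed column back into plain values."""
    return [column["dictionary"][i] for i in column["indices"]]


tracker = DictionaryTrackerSketch()
tracker.on_dictionary_message(0, ["a", "b"])
first = tracker.reconstruct(0, [0, 1, 1])

tracker.on_dictionary_message(0, ["c"], is_delta=True)
second = tracker.reconstruct(0, [2, 0])

tracker.on_dictionary_message(0, ["x"])  # replacement
third = tracker.reconstruct(0, [0])

print(decode(first), decode(second), decode(third))
```

Each reconstructed column keeps the dictionary it was given, so a later
delta or replacement does not change batches that were already read.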

On Tue, Apr 7, 2020 at 1:22 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> hey Paul,
>
> Take a look at how dictionaries work in the IPC protocol
>
> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#serialization-and-interprocess-communication-ipc
>
> Dictionaries are sent as separate messages. When a field is tagged as
> dictionary encoded in the schema, the IPC reader must keep track of
> the dictionaries it's seen come across the protocol and then set them
> in the reconstructed record batches when a record batch comes through.
>
> Note that the protocol now supports dictionary deltas (dictionaries
> can be appended to by subsequent messages for the same dictionary id)
> and replacements (new dictionary for an id).
>
> I don't know the status of dictionary handling in the Rust IPC
> implementation, but it would be a good idea to take the above details
> into account.
>
> Finally, note that Rust is not participating in either the regular IPC
> or the Flight integration tests. Participating is an important milestone
> toward being able to depend on the Rust library in production.
>
> Thanks
> Wes
>
> On Tue, Apr 7, 2020 at 10:36 AM Paul Dix <p...@influxdata.com> wrote:
> >
> > Hello,
> > I'm trying to build a Rust-based Flight server and I'd like to use
> > Dictionary encoding for a number of string columns in my data. I've seen
> > that StringDictionary was recently added to Rust here:
> > https://github.com/apache/arrow/commit/c7a7d2dcc46ed06593b994cb54c5eaf9ccd1d21d#diff-72812e30873455dcee2ce2d1ee26e4ab.
> >
> > However, that doesn't seem to reach down into Flight. When I attempt to
> > send a schema through flight that has a Dictionary<UInt8, Utf8> it throws
> > an error when attempting to convert from the Rust type to the Flatbuffer
> > field type. I figured I'd take a swing at adding that to convert.rs here:
> > https://github.com/apache/arrow/blob/master/rust/arrow/src/ipc/convert.rs#L319
> >
> > However, when I look at the definitions in Schema.fbs and the related
> > generated Rust file, Dictionary isn't a type there. Should I be sending
> > this down as some other composed type? And if so, how does this look on the
> > client side of things? In my test I'm connecting to the Flight server via
> > PyArrow and working with it in Pandas so I'm hoping that it will be able to
> > consume Dictionary fields.
> >
> > Separately, the Rust field type doesn't have a spot for the dictionary ID,
> > which I assume I'll need to send down so it can be consumed on the client.
> > Would appreciate any thoughts on that. A little push in the right direction
> > and I'll be happy to submit a PR to help push the Rust Flight
> > implementation farther along.
> >
> > Thanks,
> > Paul
