Re: [Format] Semantics for dictionary batches in streams

Wes McKinney Tue, 27 Aug 2019 14:10:48 -0700

On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou <[email protected]> wrote:
>
>
> On 27/08/2019 at 22:31, Wes McKinney wrote:
> > So the current situation we have right now in C++ is that if we tried
> > to create an IPC stream from a sequence of record batches that don't
> > all have the same dictionary, we'd run into two scenarios:
> >
> > * Batches whose dictionary is either a prefix of a previously observed
> > dictionary, or has the prior dictionary as a prefix. For example,
> > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
> > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
> > such a case we could compute and send a delta batch.
> >
> > * Batches with a dictionary that is a permutation of values, and
> > possibly new unique values.
> >
> > In this latter case, without the option of replacing an existing ID in
> > the stream, we would have to do a unification / permutation of indices
> > and then also possibly send a delta batch. We should probably have
> > code at some point that deals with both cases, but in the meantime I
> > would like to allow dictionaries to be redefined in this case. Seems
> > like we might need a vote to formalize this?
>
> Isn't the stream format deviating from the file format then?  In the
> file format, IIUC, dictionaries can appear after the respective record
> batches, so there's no way to tell whether the original or redefined
> version of a dictionary is being referred to.


You make a good point -- we can consider changes to the file format to
allow for record batches to have different dictionaries. Even handling
delta dictionaries with the current file format would be a bit tedious
(though not indeterminate).
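
To make the two scenarios quoted above concrete, here is a rough sketch in plain Python (not the Arrow C++ code, and not any existing Arrow API). The function names `delta_if_prefix` and `unify` are hypothetical, and dictionaries are modeled as simple lists of strings. The first function covers the prefix case, where only a delta needs to be sent; the second covers the permutation case, where the indices have to be remapped onto a unified dictionary and a delta may still be needed.

```python
from typing import List, Optional, Sequence, Tuple


def delta_if_prefix(prior: Sequence[str], new: Sequence[str]) -> Optional[List[str]]:
    """Return the values a delta batch would carry, or None if no prefix relation holds.

    An empty list means `new` is itself a prefix of `prior`, so nothing new
    has to be sent. (Hypothetical helper, for illustration only.)
    """
    prior, new = list(prior), list(new)
    if new[:len(prior)] == prior:
        # The prior dictionary is a prefix of the new one: send only the tail.
        return new[len(prior):]
    if prior[:len(new)] == new:
        # The new dictionary is a prefix of the prior one: existing indices stay valid.
        return []
    return None


def unify(prior: Sequence[str], new: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Return (delta, transpose) for a permuted and/or extended dictionary.

    transpose[i] is the position of new[i] in the unified dictionary
    `prior + delta`, so batch indices can be rewritten against it.
    """
    position = {value: i for i, value in enumerate(prior)}
    delta: List[str] = []
    transpose: List[int] = []
    for value in new:
        if value not in position:
            position[value] = len(prior) + len(delta)
            delta.append(value)
        transpose.append(position[value])
    return delta, transpose


# Prefix case from the example in the thread.
print(delta_if_prefix(["A", "B", "C"], ["A", "B", "C", "D", "E"]))  # ['D', 'E']

# Permutation plus one new value: remap the batch's indices, then send the delta.
delta, transpose = unify(["A", "B", "C"], ["C", "A", "D"])
batch_indices = [0, 1, 2, 0]  # indices into ['C', 'A', 'D']
print(delta, [transpose[i] for i in batch_indices])  # ['D'] [2, 0, 3, 2]
```

Allowing a dictionary id to be redefined in the stream avoids the second path altogether, at the cost of resending the whole dictionary.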

> Regards
>
> Antoine.
