The current situation in C++ is that if we tried to create an IPC stream from a sequence of record batches that don't all have the same dictionary, we'd run into two scenarios:

* Batches whose dictionary is either a prefix of a previously observed dictionary, or has the prior dictionary as a prefix. For example, suppose that the dictionary sent for an id was ['A', 'B', 'C'] and then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In such a case we could compute and send a delta batch.

* Batches with a dictionary that is a permutation of values, and possibly new unique values.

In this latter case, without the option of replacing an existing ID in the stream, we would have to do a unification / permutation of indices and then also possibly send a delta batch. We should probably have code at some point that deals with both cases, but in the meantime I would like to allow dictionaries to be redefined in this case. Seems like we might need a vote to formalize this?

Independent from this decision, I would strongly recommend that all implementations handle dictionaries in-memory as data and not metadata (i.e. do not have dictionaries in the schema). It was lucky (see ARROW-3144) that this problematic early design in the C++ library could be fixed with less than a week of work.

Thanks
Wes

On Sun, Aug 11, 2019 at 9:17 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> I'm not sure what you mean by record-in-dictionary-id, so it is possible
> this is a solution that I just don't understand :)
>
> The only two references to dictionary IDs that I could find are one in
> schema.fbs [1], which is attached to a column in a schema, and the one
> referenced above in DictionaryBatches defined in Message.fbs [2] for
> transmitting dictionaries. It is quite possible I missed something.
>
> The indices into the dictionary are Int Arrays in a normal record batch.
> I suppose the other option is to reset the stream by sending a new schema,
> but I don't think that is supported either. This is what led to my
> original question.
>
> Does no one do this today?
>
> I think Wes did some recent work on the C++ Parquet in reading
> dictionaries, and might have faced some of these issues, I'm not sure how
> he dealt with it (haven't gotten back to the Parquet code yet).
>
> [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L271
> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
>
> On Sun, Aug 11, 2019 at 6:32 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
> > Wow, you've shown how little I've thought about Arrow dictionaries for a
> > while. I thought we had a dictionary id and a record-in-dictionary-id.
> > Wouldn't that approach make more sense? Does no one do this today? (We
> > frequently use compound values for this type of scenario...)
> >
> > On Sat, Aug 10, 2019 at 4:20 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >> Reading data from two different parquet files sequentially with different
> >> dictionaries for the same column. This could be handled by re-encoding
> >> data but that seems potentially sub-optimal.
> >>
> >> On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau <jacq...@apache.org>
> >> wrote:
> >>
> >>> What situation are you anticipating where you're going to be restating ids
> >>> mid stream?
> >>>
> >>> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield <emkornfi...@gmail.com>
> >>> wrote:
> >>>
> >>>> The IPC specification [1] defines behavior when isDelta on a
> >>>> DictionaryBatch [2] is "true". I might have missed it in the
> >>>> specification, but I couldn't find the interpretation for what the
> >>>> expected behavior is when isDelta=false and two dictionary batches
> >>>> with the same ID are sent.
> >>>>
> >>>> It seems like there are two options:
> >>>> 1. Interpret the new dictionary batch as replacing the old one.
> >>>> 2. Regard this as an error condition.
> >>>>
> >>>> Based on the fact that in the "file format" dictionaries are allowed to
> >>>> be placed in any order relative to the record batches, I assume it is
> >>>> the second, but just wanted to make sure.
> >>>>
> >>>> Thanks,
> >>>> Micah
> >>>>
> >>>> [1] https://arrow.apache.org/docs/ipc.html
> >>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
> >>>>
> >>>
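
For readers following the thread, the two scenarios described at the top of this message (a dictionary that extends a previously sent one, versus a permuted dictionary with possibly new values) can be sketched roughly as below. This is only an illustration using plain Python lists, not the Arrow C++ implementation; the function names `classify_dictionary` and `unify_dictionary` are hypothetical and do not correspond to any Arrow API.

```python
from typing import List, Optional, Tuple


def classify_dictionary(prior: List[str], new: List[str]) -> str:
    """Classify how `new` relates to the dictionary already sent for an id."""
    if new == prior:
        return "same"
    if len(new) > len(prior) and new[:len(prior)] == prior:
        # The prior dictionary is a prefix of the new one: only the tail
        # needs to be sent, as a delta batch.
        return "delta"
    if len(new) < len(prior) and prior[:len(new)] == new:
        # The new dictionary is a prefix of the prior one: existing indices
        # remain valid against what was already sent.
        return "prefix"
    # Permutation of values and/or new unique values.
    return "needs_unification"


def unify_dictionary(
    prior: List[str], new: List[str], indices: List[Optional[int]]
) -> Tuple[List[str], List[Optional[int]]]:
    """Remap `indices` (which point into `new`) so they point into
    `prior + delta`, and return (delta, remapped_indices)."""
    position = {value: i for i, value in enumerate(prior)}
    delta: List[str] = []
    for value in new:
        if value not in position:
            position[value] = len(prior) + len(delta)
            delta.append(value)
    # None stands in for a null slot and is passed through unchanged.
    remapped = [None if i is None else position[new[i]] for i in indices]
    return delta, remapped


if __name__ == "__main__":
    prior = ["A", "B", "C"]
    print(classify_dictionary(prior, ["A", "B", "C", "D", "E"]))
    print(unify_dictionary(prior, ["C", "A", "D"], [0, 1, 2, None, 0]))
```

With the option of replacing an existing dictionary id in the stream, the `needs_unification` branch could instead simply resend the new dictionary, which is the redefinition behavior proposed above.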