I'm not sure what you mean by record-in-dictionary-id, so it is possible this is a solution that I just don't understand :)
The only two references to dictionary IDs that I could find are one in Schema.fbs [1], which is attached to a column in a schema, and the one referenced above in DictionaryBatch, defined in Message.fbs [2], for transmitting dictionaries. It is quite possible I missed something.

The indices into the dictionary are Int arrays in a normal record batch. I suppose the other option is to reset the stream by sending a new schema, but I don't think that is supported either. This is what led to my original question. Does no one do this today?

I think Wes did some recent work on reading dictionaries in the C++ Parquet code and might have faced some of these issues. I'm not sure how he dealt with it (I haven't gotten back to the Parquet code yet).

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L271
[2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72

On Sun, Aug 11, 2019 at 6:32 PM Jacques Nadeau <jacq...@apache.org> wrote:

> Wow, you've shown how little I've thought about Arrow dictionaries for a
> while. I thought we had a dictionary id and a record-in-dictionary-id.
> Wouldn't that approach make more sense? Does no one do this today? (We
> frequently use compound values for this type of scenario...)
>
> On Sat, Aug 10, 2019 at 4:20 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Reading data from two different parquet files sequentially with different
>> dictionaries for the same column. This could be handled by re-encoding
>> data but that seems potentially sub-optimal.
>>
>> On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau <jacq...@apache.org>
>> wrote:
>>
>>> What situation are you anticipating where you're going to be restating
>>> ids mid stream?
>>>
>>> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> The IPC specification [1] defines behavior when isDelta on a
>>>> DictionaryBatch [2] is "true". I might have missed it in the
>>>> specification, but I couldn't find the interpretation for what the
>>>> expected behavior is when isDelta=false and two dictionary batches
>>>> with the same ID are sent.
>>>>
>>>> It seems like there are two options:
>>>> 1. Interpret the new dictionary batch as replacing the old one.
>>>> 2. Regard this as an error condition.
>>>>
>>>> Based on the fact that in the "file format" dictionaries are allowed to
>>>> be placed in any order relative to the record batches, I assume it is
>>>> the second, but just wanted to make sure.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> [1] https://arrow.apache.org/docs/ipc.html
>>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
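
To make the two options from the original question concrete, here is a minimal Python sketch of how a stream reader might track dictionaries by ID. It is a toy model, not the Arrow API: the DictionaryStore class, its allow_replacement flag, and the method names are all hypothetical, and raising an error for a delta sent against an unknown ID is an assumption of the sketch rather than something the thread settles.

```python
from typing import Dict, List


class DictionaryStore:
    """Toy model of a stream reader's dictionary state (hypothetical, not Arrow's API)."""

    def __init__(self, allow_replacement: bool) -> None:
        # True models option 1 (replace); False models option 2 (error).
        self.allow_replacement = allow_replacement
        self.dictionaries: Dict[int, List[str]] = {}

    def on_dictionary_batch(self, dict_id: int, values: List[str], is_delta: bool) -> None:
        if is_delta:
            # Delta batches append to the dictionary already held for this ID.
            # Assumption: a delta for an unknown ID is rejected (KeyError).
            self.dictionaries[dict_id].extend(values)
        elif dict_id in self.dictionaries and not self.allow_replacement:
            # Option 2: a repeated non-delta batch for the same ID is an error.
            raise ValueError(f"dictionary {dict_id} sent twice with isDelta=false")
        else:
            # First batch for this ID, or option 1: the new batch replaces the old one.
            self.dictionaries[dict_id] = list(values)

    def decode(self, dict_id: int, indices: List[int]) -> List[str]:
        # Record batches carry integer indices into the current dictionary.
        dictionary = self.dictionaries[dict_id]
        return [dictionary[i] for i in indices]
```

Under option 1, reading two Parquet files with different dictionaries for the same column would mean sending a fresh non-delta batch before the second file's record batches, and decode would resolve later indices against the new values. Under option 2 the same sequence raises, so the writer would have to re-encode instead.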