Re: [Format] Semantics for dictionary batches in streams

Micah Kornfield Sun, 11 Aug 2019 19:17:54 -0700

I'm not sure what you mean by record-in-dictionary-id, so it is possible
this is a solution that I just don't understand :)

The only two references to dictionary IDs that I could find are the one in
Schema.fbs [1], which is attached to a column in a schema, and the one
referenced above in DictionaryBatch, defined in Message.fbs [2] for
transmitting dictionaries.  It is quite possible I missed something.

The indices into the dictionary are Int arrays in a normal record batch.
I suppose the other option is to reset the stream by sending a new schema,
but I don't think that is supported either. This is what led to my
original question.
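
To make the scenario concrete, here is a rough sketch in Python of a stream
reader that tracks dictionaries by ID. It is a toy model and not the Arrow
API: the names DictionaryBatch, StreamReader, and replace_on_repeat are made
up for illustration, and the handling of a delta with no prior dictionary is
an assumption. It shows the two possible readings of a repeated non-delta
dictionary batch (replace the old dictionary, or treat it as an error).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DictionaryBatch:
    """Hypothetical stand-in for the DictionaryBatch message in Message.fbs."""
    id: int
    values: List[str]
    is_delta: bool = False


@dataclass
class StreamReader:
    """Toy reader that keeps the current dictionary for each dictionary ID."""
    # True  -> a repeated non-delta batch replaces the old dictionary
    # False -> a repeated non-delta batch is an error
    replace_on_repeat: bool = True
    dictionaries: Dict[int, List[str]] = field(default_factory=dict)

    def on_dictionary_batch(self, batch: DictionaryBatch) -> None:
        if batch.is_delta:
            # isDelta=true: append the new values to the existing dictionary.
            # Assumption: a delta with no prior dictionary starts a new one.
            self.dictionaries.setdefault(batch.id, []).extend(batch.values)
        elif batch.id in self.dictionaries and not self.replace_on_repeat:
            raise ValueError(f"dictionary id {batch.id} was already sent")
        else:
            # First batch for this ID, or replacement of the old dictionary.
            self.dictionaries[batch.id] = list(batch.values)

    def decode(self, dictionary_id: int, indices: List[int]) -> List[str]:
        """Resolve integer indices against the current dictionary."""
        current = self.dictionaries[dictionary_id]
        return [current[i] for i in indices]


# Two Parquet files with different dictionaries for the same column.
reader = StreamReader(replace_on_repeat=True)
reader.on_dictionary_batch(DictionaryBatch(id=0, values=["a", "b"]))
first = reader.decode(0, [1, 0])
reader.on_dictionary_batch(DictionaryBatch(id=0, values=["x", "y"]))
second = reader.decode(0, [1, 0])
```

With replace_on_repeat=True the same indices resolve against whichever
dictionary was sent most recently, so no re-encoding is needed. With
replace_on_repeat=False the second non-delta batch raises instead.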

Does no one do this today?

I think Wes did some recent work on reading dictionaries in the C++ Parquet
code and might have faced some of these issues; I'm not sure how he dealt
with it (I haven't gotten back to the Parquet code yet).

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L271
[2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72

On Sun, Aug 11, 2019 at 6:32 PM Jacques Nadeau <jacq...@apache.org> wrote:

> Wow, you've shown how little I've thought about Arrow dictionaries for a
> while. I thought we had a dictionary id and a record-in-dictionary-id.
> Wouldn't that approach make more sense? Does no one do this today? (We
> frequently use compound values for this type of scenario...)
>
> On Sat, Aug 10, 2019 at 4:20 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Reading data from two different parquet files sequentially with different
>> dictionaries for the same column.  This could be handled by re-encoding
>> data but that seems potentially sub-optimal.
>>
>> On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau <jacq...@apache.org>
>> wrote:
>>
>>> What situation are you anticipating where you're going to be restating
>>> ids mid stream?
>>>
>>> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> The IPC specification [1] defines behavior when isDelta on a
>>>> DictionaryBatch [2] is "true".  I might have missed it in the
>>>> specification, but I couldn't find the interpretation for what the
>>>> expected behavior is when isDelta=false and two dictionary batches
>>>> with the same ID are sent.
>>>>
>>>> It seems like there are two options:
>>>> 1.  Interpret the new dictionary batch as replacing the old one.
>>>> 2.  Regard this as an error condition.
>>>>
>>>> Based on the fact that in the "file format" dictionaries are allowed
>>>> to be placed in any order relative to the record batches, I assume it
>>>> is the second, but just wanted to make sure.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> [1] https://arrow.apache.org/docs/ipc.html
>>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
>>>>
>>>
