Re: [Format] Semantics for dictionary batches in streams

Wes McKinney Tue, 27 Aug 2019 13:32:08 -0700

In C++ today, if we tried to create an IPC stream from a sequence of
record batches that don't all share the same dictionary, we'd run into
two scenarios:


* Batches whose dictionary is a prefix of a previously observed
dictionary, or for which the prior dictionary is a prefix of their
dictionary. For example, suppose that the dictionary sent for an id was
['A', 'B', 'C'] and then there's a subsequent batch with
['A', 'B', 'C', 'D', 'E']. In such a case we could compute and send a
delta batch.

* Batches whose dictionary is a permutation of previously sent values,
possibly with new unique values.

In this latter case, without the option of replacing an existing ID in
the stream, we would have to do a unification / permutation of indices
and then also possibly send a delta batch. We should probably have
code at some point that deals with both cases, but in the meantime I
would like to allow dictionaries to be redefined in this case. Seems
like we might need a vote to formalize this?
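
To make the two cases concrete, here is a rough sketch in plain Python
(not Arrow code). The function names `delta_if_prefix` and `unify` are
made up for illustration, dictionaries are assumed to be lists of unique
strings, and null indices are not handled.

```python
from typing import List, Optional, Sequence, Tuple


def delta_if_prefix(prior: Sequence[str], new: Sequence[str]) -> Optional[List[str]]:
    """Case 1: one dictionary is a prefix of the other.

    Returns the values that would go into a delta batch (empty if `new`
    is itself a prefix of `prior`), or None if neither is a prefix.
    """
    n = min(len(prior), len(new))
    if list(prior[:n]) != list(new[:n]):
        return None
    return list(new[len(prior):])


def unify(
    prior: Sequence[str], new: Sequence[str], indices: Sequence[int]
) -> Tuple[List[str], List[int]]:
    """Case 2: `new` is a permutation of known values plus possibly new ones.

    Returns (delta, remapped), where `delta` holds values to append to
    `prior` and `remapped` rewrites `indices` (positions in `new`) as
    positions in prior + delta.
    """
    position = {value: i for i, value in enumerate(prior)}
    delta: List[str] = []
    for value in new:
        if value not in position:
            position[value] = len(prior) + len(delta)
            delta.append(value)
    remapped = [position[new[i]] for i in indices]
    return delta, remapped
```

With a prior dictionary of ['A', 'B', 'C'], the first function yields
['D', 'E'] for ['A', 'B', 'C', 'D', 'E']. The second has to touch every
index in the batch, which is the cost that replacing the dictionary for
an existing id would avoid.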

Independent from this decision, I would strongly recommend that all
implementations handle dictionaries in-memory as data and not metadata
(i.e. do not have dictionaries in the schema). It was lucky (see
ARROW-3144) that this problematic early design in the C++ library
could be fixed with less than a week of work.

Thanks
Wes

On Sun, Aug 11, 2019 at 9:17 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> I'm not sure what you mean by record-in-dictionary-id, so it is possible
> this is a solution that I just don't understand :)
>
> The only two references to dictionary IDs that I could find are one in
> Schema.fbs [1], which is attached to a column in a schema, and the one
> referenced above in DictionaryBatch, defined in Message.fbs [2], for
> transmitting dictionaries.  It is quite possible I missed something.
>
>  The indices into the dictionary are Int Arrays in a normal record batch.
> I suppose the other option is to reset the stream by sending a new schema,
> but I don't think that is supported either. This is what led to my
> original question.
>
> Does no one do this today?
>
> I think Wes did some recent work in the C++ Parquet code on reading
> dictionaries, and might have faced some of these issues. I'm not sure how
> he dealt with it (haven't gotten back to the Parquet code yet).
>
> [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L271
> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
>
> On Sun, Aug 11, 2019 at 6:32 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
> > Wow, you've shown how little I've thought about Arrow dictionaries for a
> > while. I thought we had a dictionary id and a record-in-dictionary-id.
> > Wouldn't that approach make more sense? Does no one do this today? (We
> > frequently use compound values for this type of scenario...)
> >
> > On Sat, Aug 10, 2019 at 4:20 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >> Reading data from two different parquet files sequentially with different
> >> dictionaries for the same column.  This could be handled by re-encoding
> >> data but that seems potentially sub-optimal.
> >>
> >> On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau <jacq...@apache.org>
> >> wrote:
> >>
> >>> What situation are you anticipating where you're going to be
> >>> restating ids mid stream?
> >>>
> >>> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield <emkornfi...@gmail.com>
> >>> wrote:
> >>>
> >>>> The IPC specification [1] defines behavior when isDelta on a
> >>>> DictionaryBatch [2] is "true".  I might have missed it in the
> >>>> specification, but I couldn't find the interpretation for what the
> >>>> expected behavior is when isDelta=false and two dictionary batches
> >>>> with the same ID are sent.
> >>>>
> >>>> It seems like there are two options:
> >>>> 1.  Interpret the new dictionary batch as replacing the old one.
> >>>> 2.  Regard this as an error condition.
> >>>>
> >>>> Based on the fact that in the "file format" dictionaries are allowed
> >>>> to be placed in any order relative to the record batches, I assume it
> >>>> is the second, but just wanted to make sure.
> >>>>
> >>>> Thanks,
> >>>> Micah
> >>>>
> >>>> [1] https://arrow.apache.org/docs/ipc.html
> >>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
> >>>>
> >>>
