Re: No replacement dictionaries supported in pyarrow?

Nate Bauernfeind Fri, 19 Mar 2021 09:25:41 -0700

Actually, I want to rephrase my claim slightly. I see the footer is defined
as:


table Footer {
  version: org.apache.arrow.flatbuf.MetadataVersion;

  schema: org.apache.arrow.flatbuf.Schema;

  dictionaries: [ Block ];

  recordBatches: [ Block ];

  /// User-defined metadata
  custom_metadata: [ KeyValue ];
}

So, the footer does not contain the dictionary batch definitions, but
rather points to them. Since dictionary deltas are append-only, one can
build the dictionary up front and then support O(1) random access on the
record batches.
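
To make that concrete, here is a minimal sketch in plain Python (not Arrow
code; the function names and the sample values are made up for illustration).
It assumes the base dictionary and its deltas have already been read in footer
order, and shows that once they are concatenated, any record batch can be
decoded on its own:

```python
from typing import List, Sequence


def build_dictionary(base: Sequence[str],
                     deltas: Sequence[Sequence[str]]) -> List[str]:
    """Concatenate the base dictionary and its deltas, in footer order."""
    values = list(base)
    for delta in deltas:
        # Deltas only append, so indices handed out earlier stay valid.
        values.extend(delta)
    return values


def decode_batch(indices: Sequence[int],
                 dictionary: Sequence[str]) -> List[str]:
    """Decode one record batch's dictionary indices into values."""
    return [dictionary[i] for i in indices]


# Hypothetical file contents: one base dictionary, two deltas, three batches.
dictionary = build_dictionary(["a", "b"], [["c"], ["d", "e"]])
batches = [[0, 1, 0], [2, 0], [4, 3, 1]]

# Random access: decode the last batch without touching the earlier ones.
decoded_last = decode_batch(batches[2], dictionary)
```

The up-front cost is one pass over the dictionary batches; after that, no
batch depends on the position of any dictionary batch in the file.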

You could support dictionary replacement by keeping an "effective"
dictionary for each contiguous block of record batches. Finding the
effective dictionary for a given record batch then becomes a binary search,
which is O(log(n_k)) for the dictionary with id k that has n_k replacements.
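
A rough sketch of that lookup, again in plain Python rather than Arrow code.
The class name, its methods, and the idea of keying each replacement by the
index of the first record batch it applies to are all assumptions made for
illustration; nothing like this exists in the file format today:

```python
from bisect import bisect_right
from typing import List, Sequence


class EffectiveDictionaries:
    """Hypothetical index for one dictionary id: maps a record batch
    index to the replacement dictionary in effect for that batch."""

    def __init__(self) -> None:
        self._starts: List[int] = []        # first batch covered, ascending
        self._dicts: List[List[str]] = []   # dictionary values per block

    def add_replacement(self, first_batch: int,
                        values: Sequence[str]) -> None:
        """Register a dictionary that applies from first_batch onward."""
        if self._starts and first_batch <= self._starts[-1]:
            raise ValueError("replacements must be added in batch order")
        self._starts.append(first_batch)
        self._dicts.append(list(values))

    def lookup(self, batch_index: int) -> List[str]:
        """Binary search over the n_k block starts: O(log(n_k))."""
        pos = bisect_right(self._starts, batch_index) - 1
        if pos < 0:
            raise KeyError(f"no dictionary in effect for batch {batch_index}")
        return self._dicts[pos]


# Three replacements covering batches 0-2, 3-6, and 7 onward.
index = EffectiveDictionaries()
index.add_replacement(0, ["a", "b"])
index.add_replacement(3, ["x", "y"])
index.add_replacement(7, ["p"])

effective_for_batch_5 = index.lookup(5)
```

Building such an index would still require knowing where each replacement
sits relative to the record batches, which the footer does not record.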

However, the documentation's claim that you can use a dictionary and
dictionary key that has not yet been defined doesn't leave a lot of wiggle
room for alternatives.

On Fri, Mar 19, 2021 at 10:03 AM Nate Bauernfeind <
nate.bauernfe...@gmail.com> wrote:

> The dictionary is not allowed to change throughout the file, and changing it
> is ultimately OP's request. This is because all of the dictionary definition
> is in the footer of the file, which was clearly done to support random
> access of record batches.
>
> To quote the documentation:
>
> > We define a “file format” supporting random access that is build with
> the stream format.
> >
> > [...]
> >
> > In the file format, there is no requirement that dictionary keys should
> be defined in a DictionaryBatch before they are used in a RecordBatch, as
> long as the keys are defined somewhere in the file. Further more, it is
> invalid to have more than one non-delta dictionary batch per dictionary ID
> (i.e. dictionary replacement is not supported). Delta dictionaries are
> applied in the order they appear in the file footer.
>
>
> On Fri, Mar 19, 2021 at 6:37 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> I am also under the impression that the file format is supposed to support
>> deltas, but not replacements. Is this not implemented in C++?
>>
>> On Thu, Mar 18, 2021 at 9:57 PM Nate Bauernfeind <
>> nate.bauernfe...@gmail.com>
>> wrote:
>>
>> > If dictionary replacements were supported, then the IPC file format
>> > couldn't guarantee random access reads.
>> >
>> > Personally, I would like to support a stream-based file format that is a
>> > series of the Flight protobufs. In my extension of arrow flight, by
>> > stuffing our state-based data into the app_metadata field on the
>> FlightData
>> > object, we can't write down a stream natively in the IPC based file
>> format
>> > (for testing, or sharing the reproduction of an error). In particular,
>> the
>> > IPC format is based around the flatbuffer payloads instead of the Flight
>> > protobuf payloads. It might be nice to support an additional type of IPC
>> > file for stateful streams. If interested, it would be easy to integrate
>> > with the existing code using a different magic field in the footer
>> (such as
>> > 'FLGHT1', instead of 'ARROW1'). In addition to the offsets and sizes of
>> > payloads, it might be nice to indicate the type of payload (RecordBatch
>> vs
>> > DictionaryBatch, etc). We wouldn't have O(1) random access, but I think
>> in
>> > the "replay of a stream" scenario, one probably isn't looking for random
>> > access anyways.
>> >
>> > On Thu, Mar 18, 2021 at 9:07 PM Micah Kornfield <emkornfi...@gmail.com>
>> > wrote:
>> >
>> > > Hmm, I noticed this "The IPC file format doesn't support dictionary
>> > > replacements or deltas." I was under the impression we aimed to
>> support
>> > > dictionary deltas in the file format.  If not we should remove "Delta
>> > > dictionaries are applied in the order they appear in the file footer."
>> > from
>> > > the specification.
>> > >
>> > > On Thu, Mar 18, 2021 at 8:48 AM Antoine Pitrou <anto...@python.org>
>> > wrote:
>> > >
>> > > >
>> > > > It's a bit more configurable, but basically yes.  See the IPC write
>> > > > options:
>> > > >
>> > >
>> >
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/options.h#L73
>> > > >
>> > > > Regards
>> > > >
>> > > > Antoine.
>> > > >
>> > > >
>> > > > Le 18/03/2021 à 16:37, Jacob Quinn a écrit :
>> > > > > Ah, interesting. So to make sure I understand correctly, the C++
>> > write
>> > > > > implementation will scan all "batches" and unify all dictionary
>> > values
>> > > > > before writing out the schema + dictionary messages? But only when
>> > > > writing
>> > > > > the file format? In the streaming case, it would still write
>> > > > > replacement/delta dictionary messages as needed.
>> > > > >
>> > > > > -Jacob
>> > > > >
>> > > > > On Thu, Mar 18, 2021 at 9:10 AM Neal Richardson <
>> > > > neal.p.richard...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > >> Somewhat related issue:
>> > > > https://issues.apache.org/jira/browse/ARROW-10406
>> > > > >>
>> > > > >> On Wed, Mar 17, 2021 at 11:22 PM Micah Kornfield <
>> > > emkornfi...@gmail.com
>> > > > >
>> > > > >> wrote:
>> > > > >>
>> > > > >>> BTW, this nuance always felt a little strange to me, but would
>> have
>> > > > >>> required adding additional information to the file format, to
>> > > > >> disambiguate
>> > > > >>> when exactly a dictionary was intended to be replaced.
>> > > > >>>
>> > > > >>> On Wed, Mar 17, 2021 at 11:19 PM Micah Kornfield <
>> > > > emkornfi...@gmail.com>
>> > > > >>> wrote:
>> > > > >>>
>> > > > >>>> Hi Jacob,
>> > > > >>>> There is nuance.  The file format does not support dictionary
>> > > > >>> replacement,
>> > > > >>>> the specification [1] explains why that is currently the case.  Only the
>> > > > "stream
>> > > > >>>> format" supports replacement (i.e. no magic number, only schema
>> > > > >> followed
>> > > > >>> by
>> > > > >>>> one or more dictionary/record-batch messages).
>> > > > >>>>
>> > > > >>>> -Micah
>> > > > >>>>
>> > > > >>>> [1]
>> > > > https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
>> > > > >>>>
>> > > > >>>> On Wed, Mar 17, 2021 at 11:04 PM Jacob Quinn <
>> > > quinn.jac...@gmail.com>
>> > > > >>>> wrote:
>> > > > >>>>
>> > > > >>>>> Had an issue come up here:
>> > > > >>>>>
>> > > > >>
>> > >
>> https://github.com/JuliaData/Arrow.jl/issues/129#issuecomment-777350450
>> > > > >>> .
>> > > > >>>>>  From the implementation status page, it says C++ supports
>> > > > replacement
>> > > > >>>>> dictionaries and that python tracks the C++ implementation. Is
>> > this
>> > > > >>> just a
>> > > > >>>>> pyarrow issue where it specifically doesn't support
>> replacement
>> > > > >>>>> dictionaries? Or it's not "hooked in" properly?
>> > > > >>>>>
>> > > > >>>>> -Jacob
>> > > > >>>>>
>> > > > >>>>
>> > > > >>>
>> > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>
>
