Hi Wes,
Yes, that is exactly it. For the file format, the spec dictates that it should
be possible to output deltas, but currently this is not possible: an
`ArrowInvalid` error is thrown.
Example code:
```
import pandas as pd
import pyarrow as pa

print(pa.__version__)

schema = pa.schema([
    ("foo", pa.dictionary(pa.int16(), pa.string()))
])

# First batch: the dictionary is ["a", "b"].
pd1 = pd.DataFrame({"foo": pd.Categorical(["a"], categories=["a", "b"])})
b1 = pa.RecordBatch.from_pandas(pd1, schema=schema)

# Second batch: the dictionary extends the first one with "bbbb".
pd2 = pd.DataFrame(
    {"foo": pd.Categorical(["a", "bbbb"], categories=["a", "b", "bbbb"])}
)
b2 = pa.RecordBatch.from_pandas(pd2, schema=schema)

# Opt in to dictionary deltas; writing to the file format still fails
# with ArrowInvalid.
options = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)
with pa.ipc.new_file(
    "/tmp/sdavis_tmp.arrow", schema=schema, options=options
) as writer:
    writer.write(b1)
    writer.write(b2)
```
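For illustration only, here is a minimal pure-Python sketch of the bookkeeping a user would do to keep the full set of categories seen so far and work out the delta for each batch. The class `GrowingDictionary` and its `encode` method are hypothetical names, not part of PyArrow, and the sketch does not write any Arrow data. It reuses the strings from the spec's delta example.

```python
class GrowingDictionary:
    """Hypothetical helper: tracks every value seen so far, in first-seen order."""

    def __init__(self):
        self.values = []   # full dictionary so far; only ever appended to
        self._index = {}   # value -> position in self.values

    def encode(self, batch):
        """Return (indices, delta) for one batch of values.

        `indices` refer to the full dictionary; `delta` holds only the
        values that were appended while encoding this batch.
        """
        start = len(self.values)
        indices = []
        for value in batch:
            if value not in self._index:
                self._index[value] = len(self.values)
                self.values.append(value)
            indices.append(self._index[value])
        return indices, self.values[start:]


encoder = GrowingDictionary()
first_indices, first_delta = encoder.encode(["A", "B", "C", "B"])
second_indices, second_delta = encoder.encode(["D", "C", "E", "A"])
print(first_indices, first_delta)
print(second_indices, second_delta)
```

Because values are only ever appended, each batch's dictionary is an extension of the previous one, which is the shape the delta mechanism expects.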
Best,
Sam
________________________________
From: Wes McKinney <[email protected]>
Sent: 24 July 2021 01:43
To: [email protected] <[email protected]>
Subject: Re: [PyArrow] DictionaryArray isDelta Support
If I'm interpreting you correctly, the issue is that every dictionary
must be a prefix of a common dictionary for the delta logic to work.
So if the first batch has
"a", "b"
then in the next batch
"a", "b", "c" is OK and will emit a delta
"b", "a", "c" is not and will trigger this error
If we wanted to allow for deltas coming from unordered dictionaries as
an option, that could be implemented in theory, but it is not super
trivial.
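The prefix rule above can be sketched in plain Python. This is an illustration of the rule as described in this message, not the actual check in the C++ writer; the function name `classify_dictionary_update` and its return labels are made up for the example.

```python
def classify_dictionary_update(previous, current):
    """Classify `current` relative to `previous` under the prefix rule.

    Returns a (label, new_entries) pair, where label is one of
    "unchanged", "delta" or "replacement".
    """
    previous, current = list(previous), list(current)
    if current == previous:
        return "unchanged", []
    # A delta is only possible when the old dictionary is a strict prefix
    # of the new one; the delta is then the appended tail.
    if len(current) > len(previous) and current[:len(previous)] == previous:
        return "delta", current[len(previous):]
    return "replacement", []


ok = classify_dictionary_update(["a", "b"], ["a", "b", "c"])
bad = classify_dictionary_update(["a", "b"], ["b", "a", "c"])
print(ok)
print(bad)
```

The first call corresponds to the "OK" case and yields the single new entry "c"; the second has the same set of values in a different order and is therefore treated as a replacement.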
On Fri, Jul 23, 2021 at 9:25 AM Sam Davis <[email protected]> wrote:
>
> For reference, I think this check in the C++ code triggers regardless of
> whether the delta option is turned on:
>
> https://github.com/apache/arrow/blob/e0401123736c85283e527797a113a3c38c0915f2/cpp/src/arrow/ipc/writer.cc#L1066
> ________________________________
> From: Sam Davis <[email protected]>
> Sent: 23 July 2021 14:43
> To: [email protected] <[email protected]>
> Subject: Re: [PyArrow] DictionaryArray isDelta Support
>
> Yes, I know this, as quoted in the spec. What I am wondering is: for the file
> format, how can I write deltas out using PyArrow?
>
> The previous example was a trivial version of reality.
>
> More concretely, say I want to write 100e6 rows out in multiple RecordBatches
> to a non-streaming file format using PyArrow. I do not want to do a complete
> pass ahead of time to compute the full set of strings for the relevant
> columns and would therefore like to dump out deltas when new strings appear
> in a given column. Is this possible?
>
> In the example code ideally this would "just" add on the delta containing the
> dictionary difference of it and the previous batches. I'm happy as a user to
> maintain the full set of categories seen thus far and tell PyArrow what the
> delta is if necessary.
> ________________________________
> From: Wes McKinney <[email protected]>
> Sent: 23 July 2021 14:36
> To: [email protected] <[email protected]>
> Subject: Re: [PyArrow] DictionaryArray isDelta Support
>
> Dictionary replacements aren't supported in the file format, only
> deltas. Your use case is a replacement, not a delta. You could use the
> stream format instead.
>
> On Fri, Jul 23, 2021 at 8:32 AM Sam Davis <[email protected]> wrote:
> >
> > Hey Wes,
> >
> > Thanks, I had not spotted this before! It doesn't seem to change the
> > behaviour with `pa.ipc.new_file` however. Maybe I'm using it incorrectly?
> >
> > ```
> > import pandas as pd
> > import pyarrow as pa
> >
> > print(pa.__version__)
> >
> > schema = pa.schema([
> > ("foo", pa.dictionary(pa.int16(), pa.string()))
> > ])
> >
> > pd1 = pd.DataFrame(
> >     {"foo": pd.Categorical(["aaaa"], categories=["a"*i for i in range(64)])}
> > )
> > b1 = pa.RecordBatch.from_pandas(pd1, schema=schema)
> >
> > pd2 = pd.DataFrame(
> >     {"foo": pd.Categorical(["aaaa"], categories=["b"*i for i in range(64)])}
> > )
> > b2 = pa.RecordBatch.from_pandas(pd2, schema=schema)
> >
> > options = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)
> >
> > with pa.ipc.new_file("/tmp/sdavis_tmp.arrow", schema=b1.schema,
> >                      options=options) as writer:
> >     writer.write(b1)
> >     writer.write(b2)
> > ```
> >
> > Version printed: 4.0.1
> >
> > Sam
> > ________________________________
> > From: Wes McKinney <[email protected]>
> > Sent: 23 July 2021 14:24
> > To: [email protected] <[email protected]>
> > Subject: Re: [PyArrow] DictionaryArray isDelta Support
> >
> > hi Sam
> >
> > On Fri, Jul 23, 2021 at 8:15 AM Sam Davis <[email protected]>
> > wrote:
> > >
> > > Hi,
> > >
> > > We want to write out RecordBatches of data where one or more `pa.string()`
> > > columns in a batch are encoded as `pa.dictionary(pa.intX(), pa.string())`,
> > > since each such column only contains a handful of unique values.
> > >
> > > However, PyArrow seems to lack support for writing these batches out to
> > > either the streaming or (non-streaming) file format.
> > >
> > > When attempting to write two distinct batches the following error message
> > > is triggered:
> > >
> > > > ArrowInvalid: Dictionary replacement detected when writing IPC file
> > > > format. Arrow IPC files only support a single dictionary for a given
> > > > field across all batches.
> > >
> > > I believe this message is false and that support is possible based on
> > > reading the spec:
> > >
> > > > Dictionaries are written in the stream and file formats as a sequence
> > > > of record batches...
> > > > ...
> > > > The dictionary isDelta flag allows existing dictionaries to be expanded
> > > > for future record batch materializations. A dictionary batch with
> > > > isDelta set indicates that its vector should be concatenated with those
> > > > of any previous batches with the same id. In a stream which encodes one
> > > > column, the list of strings ["A", "B", "C", "B", "D", "C", "E", "A"],
> > > > with a delta dictionary batch could take the form:
> > >
> > > ```
> > > <SCHEMA>
> > > <DICTIONARY 0>
> > > (0) "A"
> > > (1) "B"
> > > (2) "C"
> > >
> > > <RECORD BATCH 0>
> > > 0
> > > 1
> > > 2
> > > 1
> > >
> > > <DICTIONARY 0 DELTA>
> > > (3) "D"
> > > (4) "E"
> > >
> > > <RECORD BATCH 1>
> > > 3
> > > 2
> > > 4
> > > 0
> > > EOS
> > > ```
> > >
> > > > Alternatively, if isDelta is set to false, then the dictionary replaces
> > > > the existing dictionary for the same ID. Using the same example as
> > > > above, an alternate encoding could be:
> > >
> > > ```
> > > <SCHEMA>
> > > <DICTIONARY 0>
> > > (0) "A"
> > > (1) "B"
> > > (2) "C"
> > >
> > > <RECORD BATCH 0>
> > > 0
> > > 1
> > > 2
> > > 1
> > >
> > > <DICTIONARY 0>
> > > (0) "A"
> > > (1) "C"
> > > (2) "D"
> > > (3) "E"
> > >
> > > <RECORD BATCH 1>
> > > 2
> > > 1
> > > 3
> > > 0
> > > EOS
> > > ```
> > >
> > > It also specifies in the IPC File Format (non-streaming) section:
> > >
> > > > In the file format, there is no requirement that dictionary keys should
> > > > be defined in a DictionaryBatch before they are used in a RecordBatch,
> > > > as long as the keys are defined somewhere in the file. Further more, it
> > > > is invalid to have more than one non-delta dictionary batch per
> > > > dictionary ID (i.e. dictionary replacement is not supported). Delta
> > > > dictionaries are applied in the order they appear in the file footer.
> > >
> > > So for the non-streaming format multiple non-delta dictionaries are not
> > > supported but one non-delta followed by delta dictionaries should be.
> > >
> > > Is it possible to do this in PyArrow? If so, how? If not, how easy would
> > > it be to add? Is it currently possible via C++ and therefore can I write
> > > a Cython or similar extension that will let me do this now without
> > > waiting for a release?
> > >
> >
> > In pyarrow (3.0.0 or later), you need to opt into emitting dictionary
> > deltas using pyarrow.ipc.IpcWriteOptions. Can you show your code?
> >
> > https://github.com/apache/arrow/commit/8d76312dd397ebe07b71531f6d23b8caa76703dc
> >
> > > Best,
> > >
> > > Sam
> > > IMPORTANT NOTICE: The information transmitted is intended only for the
> > > person or entity to which it is addressed and may contain confidential
> > > and/or privileged material. Any review, re-transmission, dissemination or
> > > other use of, or taking of any action in reliance upon, this information
> > > by persons or entities other than the intended recipient is prohibited.
> > > If you received this in error, please contact the sender and delete the
> > > material from any computer. Although we routinely screen for viruses,
> > > addressees should check this e-mail and any attachment for viruses. We
> > > make no warranty as to absence of viruses in this e-mail or any
> > > attachments.