Hey Wes,
Thanks, I had not spotted this before! It doesn't seem to change the behaviour
with `pa.ipc.new_file` however. Maybe I'm using it incorrectly?
```
import pandas as pd
import pyarrow as pa

print(pa.__version__)

schema = pa.schema([
    ("foo", pa.dictionary(pa.int16(), pa.string()))
])

# Two single-row batches whose categorical columns use different dictionaries.
pd1 = pd.DataFrame({"foo": pd.Categorical(["aaaa"], categories=["a" * i for i in range(64)])})
b1 = pa.RecordBatch.from_pandas(pd1, schema=schema)

pd2 = pd.DataFrame({"foo": pd.Categorical(["aaaa"], categories=["b" * i for i in range(64)])})
b2 = pa.RecordBatch.from_pandas(pd2, schema=schema)

# Opt in to dictionary deltas, then write both batches to the IPC file format.
options = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)
with pa.ipc.new_file("/tmp/sdavis_tmp.arrow", schema=b1.schema, options=options) as writer:
    writer.write(b1)
    writer.write(b2)
```
Version printed: 4.0.1
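One possible explanation, which I have not confirmed against the writer's source: going by the spec's wording that a delta is concatenated onto the previous dictionary, the second dictionary only counts as a delta if it starts with the first one. A rough pure-Python sketch of that check is below; `classify_dictionary_update` is a hypothetical helper of mine, not a pyarrow function.

```python
def classify_dictionary_update(previous, new):
    """Classify how `new` relates to `previous` under the spec's delta rule.

    Hypothetical helper for illustration only; it mirrors the spec's
    "concatenate onto the previous dictionary" wording, not pyarrow internals.
    """
    if new == previous:
        return "unchanged"
    # A delta can only append entries, so the old dictionary must be a prefix.
    if len(new) > len(previous) and new[:len(previous)] == previous:
        return "delta"
    return "replacement"


# The same category lists as in my example above.
cats_a = ["a" * i for i in range(64)]
cats_b = ["b" * i for i in range(64)]

print(classify_dictionary_update(cats_a, cats_b))
print(classify_dictionary_update(cats_a, cats_a + ["extra"]))
```

If that reading is right, my two batches amount to a dictionary replacement rather than a delta, which the file format does not allow.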
Sam
________________________________
From: Wes McKinney <[email protected]>
Sent: 23 July 2021 14:24
To: [email protected] <[email protected]>
Subject: Re: [PyArrow] DictionaryArray isDelta Support
hi Sam
On Fri, Jul 23, 2021 at 8:15 AM Sam Davis <[email protected]> wrote:
>
> Hi,
>
> We want to write out RecordBatches of data, where one or more columns in a
> batch is a `pa.string()` column encoded as a `pa.dictionary(pa.intX(),
> pa.string())`, as the column only contains a handful of unique values.
>
> However, PyArrow seems to lack support for writing these batches out to
> either the streaming or (non-streaming) file format.
>
> When attempting to write two distinct batches the following error message is
> triggered:
>
> > ArrowInvalid: Dictionary replacement detected when writing IPC file format.
> > Arrow IPC files only support a single dictionary for a given field across
> > all batches.
>
> I believe this message is incorrect and that support is possible, based on
> my reading of the spec:
>
> > Dictionaries are written in the stream and file formats as a sequence of
> > record batches...
> > ...
> > The dictionary isDelta flag allows existing dictionaries to be expanded for
> > future record batch materializations. A dictionary batch with isDelta set
> > indicates that its vector should be concatenated with those of any previous
> > batches with the same id. In a stream which encodes one column, the list of
> > strings ["A", "B", "C", "B", "D", "C", "E", "A"], with a delta dictionary
> > batch could take the form:
>
> ```
> <SCHEMA>
> <DICTIONARY 0>
> (0) "A"
> (1) "B"
> (2) "C"
>
> <RECORD BATCH 0>
> 0
> 1
> 2
> 1
>
> <DICTIONARY 0 DELTA>
> (3) "D"
> (4) "E"
>
> <RECORD BATCH 1>
> 3
> 2
> 4
> 0
> EOS
> ```
>
> > Alternatively, if isDelta is set to false, then the dictionary replaces the
> > existing dictionary for the same ID. Using the same example as above, an
> > alternate encoding could be:
>
> ```
> <SCHEMA>
> <DICTIONARY 0>
> (0) "A"
> (1) "B"
> (2) "C"
>
> <RECORD BATCH 0>
> 0
> 1
> 2
> 1
>
> <DICTIONARY 0>
> (0) "A"
> (1) "C"
> (2) "D"
> (3) "E"
>
> <RECORD BATCH 1>
> 2
> 1
> 3
> 0
> EOS
> ```
>
> It also specifies in the IPC File Format (non-streaming) section:
>
> > In the file format, there is no requirement that dictionary keys should be
> > defined in a DictionaryBatch before they are used in a RecordBatch, as long
> > as the keys are defined somewhere in the file. Further more, it is invalid
> > to have more than one non-delta dictionary batch per dictionary ID (i.e.
> > dictionary replacement is not supported). Delta dictionaries are applied in
> > the order they appear in the file footer.
>
> So, for the non-streaming format, multiple non-delta dictionaries are not
> supported, but one non-delta dictionary followed by delta dictionaries should be.
>
> Is it possible to do this in PyArrow? If so, how? If not, how easy would it
> be to add? Is it currently possible via C++ and therefore can I write a
> Cython or similar extension that will let me do this now without waiting for
> a release?
>
In pyarrow (3.0.0 or later), you need to opt into emitting dictionary
deltas using pyarrow.ipc.IpcWriteOptions. Can you show your code?
https://github.com/apache/arrow/commit/8d76312dd397ebe07b71531f6d23b8caa76703dc
> Best,
>
> Sam
> IMPORTANT NOTICE: The information transmitted is intended only for the person
> or entity to which it is addressed and may contain confidential and/or
> privileged material. Any review, re-transmission, dissemination or other use
> of, or taking of any action in reliance upon, this information by persons or
> entities other than the intended recipient is prohibited. If you received
> this in error, please contact the sender and delete the material from any
> computer. Although we routinely screen for viruses, addressees should check
> this e-mail and any attachment for viruses. We make no warranty as to absence
> of viruses in this e-mail or any attachments.