hi Sam

On Fri, Jul 23, 2021 at 8:15 AM Sam Davis <[email protected]> wrote:
>
> Hi,
>
> We want to write out RecordBatches of data, where one or more columns in a
> batch is a `pa.string()` column encoded as `pa.dictionary(pa.intX(),
> pa.string())`, because the column only contains a handful of unique values.
>
> However, PyArrow seems to lack support for writing these batches out to
> either the streaming or the (non-streaming) file format.
>
> When attempting to write two distinct batches, the following error message
> is triggered:
>
> > ArrowInvalid: Dictionary replacement detected when writing IPC file format.
> > Arrow IPC files only support a single dictionary for a given field across
> > all batches.
>
> I believe this message is false and that support is possible, based on
> reading the spec:
>
> > Dictionaries are written in the stream and file formats as a sequence of
> > record batches...
> >
> > ...
> >
> > The dictionary isDelta flag allows existing dictionaries to be expanded for
> > future record batch materializations. A dictionary batch with isDelta set
> > indicates that its vector should be concatenated with those of any previous
> > batches with the same id. In a stream which encodes one column, the list of
> > strings ["A", "B", "C", "B", "D", "C", "E", "A"], with a delta dictionary
> > batch could take the form:
>
> ```
> <SCHEMA>
> <DICTIONARY 0>
> (0) "A"
> (1) "B"
> (2) "C"
>
> <RECORD BATCH 0>
> 0
> 1
> 2
> 1
>
> <DICTIONARY 0 DELTA>
> (3) "D"
> (4) "E"
>
> <RECORD BATCH 1>
> 3
> 2
> 4
> 0
> EOS
> ```
>
> > Alternatively, if isDelta is set to false, then the dictionary replaces the
> > existing dictionary for the same ID.
> > Using the same example as above, an alternate encoding could be:
>
> ```
> <SCHEMA>
> <DICTIONARY 0>
> (0) "A"
> (1) "B"
> (2) "C"
>
> <RECORD BATCH 0>
> 0
> 1
> 2
> 1
>
> <DICTIONARY 0>
> (0) "A"
> (1) "C"
> (2) "D"
> (3) "E"
>
> <RECORD BATCH 1>
> 2
> 1
> 3
> 0
> EOS
> ```
>
> The spec also says, in the IPC File Format (non-streaming) section:
>
> > In the file format, there is no requirement that dictionary keys should be
> > defined in a DictionaryBatch before they are used in a RecordBatch, as long
> > as the keys are defined somewhere in the file. Furthermore, it is invalid
> > to have more than one non-delta dictionary batch per dictionary ID (i.e.
> > dictionary replacement is not supported). Delta dictionaries are applied in
> > the order they appear in the file footer.
>
> So for the non-streaming format, multiple non-delta dictionaries are not
> supported, but one non-delta dictionary followed by delta dictionaries
> should be.
>
> Is it possible to do this in PyArrow? If so, how? If not, how easy would it
> be to add? Is it currently possible via C++, and could I therefore write a
> Cython or similar extension that lets me do this now, without waiting for a
> release?
> Best,
>
> Sam

In pyarrow (3.0.0 or later), you need to opt into emitting dictionary deltas
using pyarrow.ipc.IpcWriteOptions. Can you show your code?

https://github.com/apache/arrow/commit/8d76312dd397ebe07b71531f6d23b8caa76703dc
