>
> I guess since the keys are only additive then you just create the master
> dictionary before allowing random access to the data.


Yes, this is what the implementation does.

At some point we might want to create an updated file format that can
handle replacements also, but this hasn't been a priority for anyone.
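
To make the point above concrete, here is a minimal pure-Python sketch of why
additive (delta) dictionaries allow random access. It is a toy model, not
Arrow's actual reader code: the class name DeltaDictFile, its fields, and the
sample data are all hypothetical. The assumption is that each delta only
appends values, so concatenating every delta in file order gives one master
dictionary that is valid for every record batch.

```python
from typing import List, Sequence


class DeltaDictFile:
    """Toy model (hypothetical, not the Arrow API) of a file whose
    dictionary only ever grows through appended deltas."""

    def __init__(self, deltas: Sequence[Sequence[str]],
                 batches: Sequence[Sequence[int]]) -> None:
        # Deltas only append, so indices already handed out never change.
        # Concatenating all deltas up front yields a master dictionary
        # that can decode any batch, regardless of read order.
        self.master: List[str] = [v for delta in deltas for v in delta]
        self.batches = batches

    def read_batch(self, i: int) -> List[str]:
        # Random access: decode batch i without touching earlier batches.
        return [self.master[idx] for idx in self.batches[i]]


# Three deltas, each written just before the batch that first needs it.
deltas = [["a", "b"], ["c"], ["d", "e"]]
# Dictionary-encoded batches; each holds indices into the master dictionary.
batches = [[0, 1, 0], [2, 0], [4, 3, 1]]

f = DeltaDictFile(deltas, batches)
last = f.read_batch(2)   # read the last batch first
first = f.read_batch(0)
```

With replacement instead of deltas, index 0 could mean a different value in
each batch, and the reader would need to know which dictionary applies to
which batch. That is the association the current file metadata makes difficult.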

On Tue, Feb 22, 2022 at 10:12 AM Chris Nuernberger <[email protected]>
wrote:

> I guess since the keys are only additive then you just create the master
> dictionary before allowing random access to the data.
>
> On Tue, Feb 22, 2022 at 11:08 AM Chris Nuernberger <[email protected]>
> wrote:
>
>> OK, thanks, I will work with delta dictionaries.
>>
>> How do delta dictionaries solve the random access issue?
>>
>> On Tue, Feb 22, 2022 at 9:51 AM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> Dictionary replacement isn't supported in the file format because the
>>> metadata makes it difficult to associate a particular dictionary with a
>>> record batch for random access.
>>>
>>> Delta dictionaries are supported, but there was a long-standing bug that
>>> prevented their use in Python (
>>> https://issues.apache.org/jira/browse/ARROW-13467).  If you are still
>>> seeing issues in pyarrow 7.0 please open a bug.
>>>
>>> As for the usefulness of the file format without these features,
>>> that is really use-case dependent.
>>>
>>> Cheers,
>>> Micah
>>>
>>> On Tuesday, February 22, 2022, Chris Nuernberger <[email protected]>
>>> wrote:
>>>
>>>> How are dictionaries intended to be used in a file with multiple record
>>>> batches?
>>>>
>>>> I tried saving record-batch-specific dictionaries and got this error
>>>> from python:
>>>>
>>>>  > pyarrow.lib.ArrowInvalid: Unsupported dictionary replacement or
>>>> dictionary delta in IPC file
>>>>
>>>> This seems to defeat the purpose of having multiple record batches in a
>>>> single arrow file; the workaround appears to be either to preprocess the
>>>> entire sequence of datasets to unify the dictionaries or to save multiple
>>>> arrow files.
>>>>
>>>
