[ https://issues.apache.org/jira/browse/ARROW-10406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261574#comment-17261574 ]

Wes McKinney commented on ARROW-10406:
--------------------------------------

I think it's reasonable for it to fail when writing the file format in a 
streaming fashion, but when we have the entire table up front, it seems 
reasonable (though certainly tedious...) to scan the dictionaries in each chunk 
and, if there are any differences, to do a unification. I reckon some 
refactoring would be necessary, but if it is not too gory, this seems like it 
would be worth doing in the coming months. 
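The chunk-scan-and-unify approach described above can be sketched in plain Python. This is only an illustration of the idea, not Arrow's actual API; the real work would happen in Arrow's C++ dictionary machinery, and the function name and data shapes here are assumptions made for the example:

```python
def unify_dictionaries(chunks):
    """Unify per-chunk dictionaries into one shared dictionary.

    `chunks` is a list of (dictionary, indices) pairs, where `dictionary`
    is a list of values and `indices` is a list of ints into it.
    Returns (unified_dictionary, remapped_chunks).
    """
    unified = []       # combined dictionary, in order of first appearance
    positions = {}     # value -> its index in `unified`
    remapped = []
    for dictionary, indices in chunks:
        # Transposition table: old index in this chunk -> unified index.
        transpose = []
        for value in dictionary:
            if value not in positions:
                positions[value] = len(unified)
                unified.append(value)
            transpose.append(positions[value])
        # Rewrite this chunk's indices against the unified dictionary.
        remapped.append([transpose[i] for i in indices])
    return unified, remapped
```

For two chunks encoded as (["a", "b"], [0, 1, 0]) and (["b", "c"], [0, 1]), this yields the unified dictionary ["a", "b", "c"] with remapped indices [[0, 1, 0], [1, 2]], so a single dictionary batch could serve every record batch in the file.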

> [Format] Support dictionary replacement in the IPC file format
> --------------------------------------------------------------
>
>                 Key: ARROW-10406
>                 URL: https://issues.apache.org/jira/browse/ARROW-10406
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Format
>            Reporter: Neal Richardson
>            Priority: Major
>
> I read a big (taxi) csv file and specified that I wanted to dictionary-encode 
> some columns. The resulting Table has ChunkedArrays with 1604 chunks. When I 
> go to write this Table to the IPC file format (write_feather), I get an 
> error: 
> {code}
>   Invalid: Dictionary replacement detected when writing IPC file format. 
> Arrow IPC files only support a single dictionary for a given field across 
> all batches.
> {code}
> I can write this to Parquet and read it back in, and the roundtrip of the 
> data is correct. We should be able to do this in IPC too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
