Rok Mihevc commented on ARROW-5140:
-----------------------------------

This issue has been migrated to [issue #21622|https://github.com/apache/arrow/issues/21622] on GitHub. Please see the [migration documentation|https://github.com/apache/arrow/issues/14542] for further details.

[Bug?][Parquet] Can write a jagged array column of strings to disk, but hit ArrowNotImplementedError on read
-------------------------------------------------------------------------------------------------------------

Key: ARROW-5140
URL: https://issues.apache.org/jira/browse/ARROW-5140
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.12.0
Environment: Debian 8
Reporter: Zachary Jablons
Priority: Blocker
Fix For: 0.14.0

h1. Description

I encountered an issue on a proprietary dataset with a schema that looks roughly like:

{{|-- ids: array (nullable = true)
|    |-- element: string (containsNull = true)}}

I was able to write this dataset to Parquet without a problem (using {{pq.write_table}}), but on reading it back (using {{pq.read_table}}) I hit the following error: {{ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs}} (a full stack trace is linked below).

This is confusing because I was able to serialize the table but not deserialize it. I also found that the failure depends on the size of the dataset: a smaller sample did not hit the issue. So I built a small reproduction harness and investigated where it could happen.

h2. Further investigation

* If I cap the number of elements per row of {{ids}}, lowering the cap lets me serialize and deserialize more rows.
* With at most 15 elements per row, each element at most 20 characters, the read fails at roughly 1.3e5 rows.
* Within the limit of my willingness to keep building giant dataframes to investigate this, I have not been able to reproduce the issue with e.g. longs instead of strings.
* Another column in this dataset consists of much longer strings than this column's; the total number of characters (with each row's strings simply concatenated) is ~3x that of the troublesome column, yet it serializes and deserializes without issue.
* The fact that each array has a different length does not seem to matter: forcing every row to ~14 elements fails with the same error even at 1e5 rows.

h1. Reproduction code

This [gist|https://gist.github.com/zmjjmz/1bf738966d2df147a4fae7268ee3d812] has both a stack trace and reproduction code; a rough sketch of the same approach is included at the end of this comment.

h2. Version info

{{pyarrow==0.12.0
parquet==1.2}}

h1. Mea culpa

I copy-pasted this from GitHub on request ([https://github.com/apache/arrow/issues/4115]), and Jira formatting is a nightmare compared to Markdown, so I apologize.
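Since the gist is only linked, not inlined, the following is a minimal sketch of what the reproduction described above could look like. It is not the author's code: the row count, per-row element cap, and string length come from the numbers in the issue, while the file name, the {{random_token}} helper, and the exact data distribution are assumptions made for illustration.

{code:python}
# Sketch: build a jagged list<string> column, write it to Parquet, read it back.
# On pyarrow 0.12.0 the read was reported to raise ArrowNotImplementedError;
# the issue's Fix Version marks this as resolved for 0.14.0.
import random
import string

import pyarrow as pa
import pyarrow.parquet as pq

random.seed(0)

N_ROWS = 200_000      # above the ~1.3e5 rows where the failure was observed
MAX_ELEMS = 15        # at most 15 elements per row
MAX_CHARS = 20        # each element at most 20 characters

def random_token():
    # Hypothetical helper: a short random lowercase string.
    length = random.randint(1, MAX_CHARS)
    return "".join(random.choices(string.ascii_lowercase, k=length))

# Jagged column: each row holds a different number of string elements.
ids = [
    [random_token() for _ in range(random.randint(0, MAX_ELEMS))]
    for _ in range(N_ROWS)
]

arr = pa.array(ids, type=pa.list_(pa.string()))
table = pa.Table.from_arrays([arr], names=["ids"])

# Writing succeeds...
pq.write_table(table, "jagged_strings.parquet")

# ...but on the affected version the read fails with:
# ArrowNotImplementedError: Nested data conversions not implemented
# for chunked array outputs
pq.read_table("jagged_strings.parquet")
{code}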