Rok Mihevc commented on ARROW-5140:
-----------------------------------

This issue has been migrated to [issue #21622|https://github.com/apache/arrow/issues/21622] on GitHub. Please see the [migration documentation|https://github.com/apache/arrow/issues/14542] for further details.

[Bug?][Parquet] Can write a jagged array column of strings to disk, but hit ArrowNotImplementedError on read
-------------------------------------------------------------------------------------------------------------

Key: ARROW-5140
URL: https://issues.apache.org/jira/browse/ARROW-5140
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.12.0
Environment: Debian 8
Reporter: Zachary Jablons
Priority: Blocker
Fix For: 0.14.0

h1. Description

I encountered an issue on a proprietary dataset with a schema that looks roughly like:

{{|-- ids: array (nullable = true)
|    |-- element: string (containsNull = true)}}

I was able to write this dataset to Parquet without a problem (using {{pq.write_table}}), but on reading it back (using {{pq.read_table}}) I hit the following error: {{ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs}} (a full stack trace is linked below).

This is confusing because I was able to serialize the table but not deserialize it. I also found that the failure depends on the size of the dataset: a smaller sample did not hit the issue. So I built a small reproduction harness and investigated where it could happen.

h2. Further investigation

* If I cap the number of elements per row of {{ids}}, lowering the cap lets me serialize and deserialize more rows.
* With at most 15 elements per row, each element at most 20 characters, the read fails at roughly 1.3e5 rows.
* Within the limit of my willingness to keep building giant dataframes to investigate this, I have not been able to reproduce the issue with e.g. longs instead of strings.
* Another column in this dataset consists of much longer strings than this column's; the total number of characters (with each row's strings simply concatenated) is ~3x that of the troublesome column, yet it serializes and deserializes without issue.
* The fact that each array has a different length does not seem to matter: forcing every row to ~14 elements fails with the same error even at 1e5 rows.

h1. Reproduction code

This [gist|https://gist.github.com/zmjjmz/1bf738966d2df147a4fae7268ee3d812] has both a stack trace and reproduction code; a rough sketch of the same approach is included at the end of this comment.

h2. Version info

{{pyarrow==0.12.0
parquet==1.2}}

h1. Mea culpa

I copy-pasted this from GitHub on request ([https://github.com/apache/arrow/issues/4115]), and Jira formatting is a nightmare compared to Markdown, so I apologize.
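Since the gist is only linked, not inlined, the following is a minimal sketch of what the reproduction described above could look like. It is not the author's code: the row count, per-row element cap, and string length come from the numbers in the issue, while the file name, the {{random_token}} helper, and the exact data distribution are assumptions made for illustration.

{code:python}
# Sketch: build a jagged list<string> column, write it to Parquet, read it back.
# On pyarrow 0.12.0 the read was reported to raise ArrowNotImplementedError;
# the issue's Fix Version marks this as resolved for 0.14.0.
import random
import string

import pyarrow as pa
import pyarrow.parquet as pq

random.seed(0)

N_ROWS = 200_000      # above the ~1.3e5 rows where the failure was observed
MAX_ELEMS = 15        # at most 15 elements per row
MAX_CHARS = 20        # each element at most 20 characters

def random_token():
    # Hypothetical helper: a short random lowercase string.
    length = random.randint(1, MAX_CHARS)
    return "".join(random.choices(string.ascii_lowercase, k=length))

# Jagged column: each row holds a different number of string elements.
ids = [
    [random_token() for _ in range(random.randint(0, MAX_ELEMS))]
    for _ in range(N_ROWS)
]

arr = pa.array(ids, type=pa.list_(pa.string()))
table = pa.Table.from_arrays([arr], names=["ids"])

# Writing succeeds...
pq.write_table(table, "jagged_strings.parquet")

# ...but on the affected version the read fails with:
# ArrowNotImplementedError: Nested data conversions not implemented
# for chunked array outputs
pq.read_table("jagged_strings.parquet")
{code}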