[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283731#comment-17283731 ]
Truc Lam Nguyen commented on ARROW-11497:
-----------------------------------------

[~apitrou] [~emkornfield] I think we can make a final decision on this. I'm OK with the option that gives end users some level of control to preserve the current behaviour. Please let me know your thoughts, thanks :)

> [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11497
>                 URL: https://issues.apache.org/jira/browse/ARROW-11497
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0
>            Reporter: Truc Lam Nguyen
>            Priority: Major
>         Attachments: parquet-tools-meta.log
>
>
> Apologies if this behaviour is deliberate, but it looks like the parquet writer for the list data type does not conform to the Apache Parquet list logical type specification.
> According to this page: [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists], a list type contains 3 levels, where the middle level, named {{list}}, must be a repeated group with a single field named _{{element}}_.
> However, in the parquet file produced by the pyarrow writer, that single field is named _item_ instead.
> Please find below the example Python code that produces a parquet file (I use pandas version 1.2.1 and pyarrow version 3.0.0):
> {code:python}
> import pandas as pd
>
> df = pd.DataFrame(data=[
>     {'studio': 'blizzard', 'games': [{'name': 'diablo', 'version': '3'}, {'name': 'star craft', 'version': '2'}]},
>     {'studio': 'ea', 'games': [{'name': 'fifa', 'version': '21'}]},
> ])
> df.to_parquet('/tmp/test.parquet', engine='pyarrow')
> {code}
> Then I use parquet-tools from [https://formulae.brew.sh/formula/parquet-tools] to check the metadata of the parquet file via this command:
> parquet-tools meta /tmp/test.parquet
> The full metadata is included in the attachment; here is only the excerpt for the list type column:
> games: OPTIONAL F:1
> .list: REPEATED F:1
> ..item: OPTIONAL F:2
> ...name: OPTIONAL BINARY L:STRING R:1 D:4
> ...version: OPTIONAL BINARY L:STRING R:1 D:4
> As can be seen, under {{list}} there is a single field named _item_.
> I think this should be renamed to _element_ to conform with the Apache Parquet specification.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
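
The naming check described in the report can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function name `nonconforming_lists` is hypothetical, the input is assumed to be the dot-indented column excerpt from `parquet-tools meta` quoted in the report, and any group named `list` is treated as the middle level of a LIST type (a field that merely happens to be called `list` would be misreported). It does not read Parquet files and is not part of pyarrow or parquet-tools.

```python
def nonconforming_lists(meta_lines):
    """Return (path_to_list_group, child_name) pairs whose child is not 'element'.

    Assumes each line looks like '<dots><name>: <details>', where the number
    of leading dots gives the nesting depth (as in the quoted excerpt).
    """
    problems = []
    stack = []  # field names from the root down to the previous line's field
    for raw in meta_lines:
        line = raw.strip()
        if not line or ":" not in line:
            continue
        name_part = line.split(":", 1)[0]
        depth = len(name_part) - len(name_part.lstrip("."))
        name = name_part.lstrip(".")
        del stack[depth:]  # drop fields that are not ancestors of this line
        # Assumption: a parent named 'list' is the repeated middle level.
        if stack and stack[-1] == "list" and name != "element":
            problems.append((".".join(stack), name))
        stack.append(name)
    return problems


# The excerpt quoted in the report.
excerpt = """\
games: OPTIONAL F:1
.list: REPEATED F:1
..item: OPTIONAL F:2
...name: OPTIONAL BINARY L:STRING R:1 D:4
...version: OPTIONAL BINARY L:STRING R:1 D:4
""".splitlines()

print(nonconforming_lists(excerpt))  # [('games.list', 'item')]
```

An empty result would mean that every `list` group in the excerpt has a child named `element`, which is the naming the report asks for.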