[jira] [Updated] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11497:
---
Labels: pull-request-available  (was: )

> [Python] pyarrow parquet writer for list does not conform with Apache Parquet 
> specification
> ---
>
> Key: ARROW-11497
> URL: https://issues.apache.org/jira/browse/ARROW-11497
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
>Reporter: Truc Lam Nguyen
>Priority: Major
>  Labels: pull-request-available
> Attachments: parquet-tools-meta.log
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Apologies if this behavior is deliberate, but the Parquet writer for the 
> list data type does not appear to conform to the Apache Parquet list logical 
> type specification.
> According to 
> [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists], 
> a list type contains 3 levels, where the middle level, named {{list}}, must 
> be a repeated group with a single field named _{{element}}_.
> However, in the Parquet file produced by the pyarrow writer, that single 
> field is named _item_ instead.
> Below is example Python code that produces such a Parquet file (using 
> pandas 1.2.1 and pyarrow 3.0.0): 
> {code:python}
> import pandas as pd
>
> # Each row holds a studio name and a list of game structs.
> df = pd.DataFrame(data=[
>     {'studio': 'blizzard', 'games': [{'name': 'diablo', 'version': '3'},
>                                      {'name': 'star craft', 'version': '2'}]},
>     {'studio': 'ea', 'games': [{'name': 'fifa', 'version': '21'}]},
> ])
> # Write with the pyarrow engine; 'games' becomes a Parquet list column.
> df.to_parquet('/tmp/test.parquet', engine='pyarrow')
> {code}
> I then used parquet-tools from 
> [https://formulae.brew.sh/formula/parquet-tools] to inspect the metadata of 
> the Parquet file with this command:
> parquet-tools meta /tmp/test.parquet
> The full metadata is in the attachment; here is only the excerpt for the 
> list-typed column:
> games: OPTIONAL F:1 
>  .list: REPEATED F:1 
>  ..item: OPTIONAL F:2 
>  ...name: OPTIONAL BINARY L:STRING R:1 D:4
>  ...version: OPTIONAL BINARY L:STRING R:1 D:4
> As can be seen, the single field under {{list}} is named _item_.
> I think it should be named _element_ to conform to the Apache Parquet 
> specification.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Truc Lam Nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Truc Lam Nguyen updated ARROW-11497:

Description: 
Apologies if this behavior is deliberate, but the Parquet writer for the list 
data type does not appear to conform to the Apache Parquet list logical type 
specification.

According to 
[https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists], 
a list type contains 3 levels, where the middle level, named {{list}}, must be 
a repeated group with a single field named _{{element}}_.

However, in the Parquet file produced by the pyarrow writer, that single field 
is named _item_ instead.
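
The three-level rule quoted above can be sketched as a small check. This is only an illustration, not pyarrow or parquet-format code: the {{Node}} class and the {{is_standard_list}} and {{make_list}} helpers are hypothetical names, and the check covers only the writer-side layout (outer group annotated LIST, one repeated group named {{list}}, one field named {{element}}), not the backward-compatibility rules that readers follow.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A simplified Parquet schema node (hypothetical, for illustration only)."""
    name: str
    repetition: str                      # "required", "optional" or "repeated"
    logical_type: Optional[str] = None   # e.g. "LIST" on the outer group
    children: List["Node"] = field(default_factory=list)


def is_standard_list(node: Node) -> bool:
    """Return True if `node` has the three-level LIST layout from the spec."""
    # Outer level: a LIST-annotated group that is itself not repeated.
    if node.logical_type != "LIST" or node.repetition == "repeated":
        return False
    if len(node.children) != 1:
        return False
    # Middle level: exactly one repeated group named "list".
    middle = node.children[0]
    if middle.name != "list" or middle.repetition != "repeated":
        return False
    if len(middle.children) != 1:
        return False
    # Inner level: a single required or optional field named "element".
    inner = middle.children[0]
    return inner.name == "element" and inner.repetition in ("required", "optional")


def make_list(column: str, element_name: str) -> Node:
    """Build an optional LIST column whose inner field has the given name."""
    leaf = Node(element_name, "optional")
    middle = Node("list", "repeated", children=[leaf])
    return Node(column, "optional", "LIST", [middle])


print(is_standard_list(make_list("games", "element")))  # True
print(is_standard_list(make_list("games", "item")))     # False
```

Under these assumptions, the layout reported for pyarrow 3.0.0 fails only on the last condition, the name of the inner field.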

Below is example Python code that produces such a Parquet file (using 
pandas 1.2.1 and pyarrow 3.0.0): 
{code:python}
import pandas as pd

# Each row holds a studio name and a list of game structs.
df = pd.DataFrame(data=[
    {'studio': 'blizzard', 'games': [{'name': 'diablo', 'version': '3'},
                                     {'name': 'star craft', 'version': '2'}]},
    {'studio': 'ea', 'games': [{'name': 'fifa', 'version': '21'}]},
])
# Write with the pyarrow engine; 'games' becomes a Parquet list column.
df.to_parquet('/tmp/test.parquet', engine='pyarrow')
{code}
I then used parquet-tools from [https://formulae.brew.sh/formula/parquet-tools] 
to inspect the metadata of the Parquet file with this command:

parquet-tools meta /tmp/test.parquet

The full metadata is in the attachment; here is only the excerpt for the 
list-typed column:

games: OPTIONAL F:1 
 .list: REPEATED F:1 
 ..item: OPTIONAL F:2 
 ...name: OPTIONAL BINARY L:STRING R:1 D:4
 ...version: OPTIONAL BINARY L:STRING R:1 D:4

As can be seen, the single field under {{list}} is named _item_.

I think it should be named _element_ to conform to the Apache Parquet 
specification.
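
For anyone who wants to run the same check on a longer metadata dump, here is a minimal sketch that scans text in the format of the excerpt above. It assumes that nesting depth is given by the number of leading dots and that the first token after the colon is the repetition. The names {{META_EXCERPT}}, {{LINE_RE}}, {{parse_meta}} and {{nonconforming_list_fields}} are hypothetical helpers for this illustration and are not part of parquet-tools or pyarrow.

```python
import re

# Excerpt of the `parquet-tools meta` output copied from the report above.
META_EXCERPT = """\
games: OPTIONAL F:1
 .list: REPEATED F:1
 ..item: OPTIONAL F:2
 ...name: OPTIONAL BINARY L:STRING R:1 D:4
 ...version: OPTIONAL BINARY L:STRING R:1 D:4
"""

# Leading dots encode depth; the text before the first colon is the field name.
LINE_RE = re.compile(r"^\s*(?P<dots>\.*)(?P<name>[^:.\s][^:]*):\s*(?P<rest>.*)$")


def parse_meta(text):
    """Return one (depth, name, repetition) tuple per schema line."""
    nodes = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        tokens = match.group("rest").split()
        repetition = tokens[0] if tokens else ""
        nodes.append((len(match.group("dots")), match.group("name"), repetition))
    return nodes


def nonconforming_list_fields(text):
    """Return (column path, field name) pairs where the single field of a
    repeated `list` group is not named `element`."""
    nodes = parse_meta(text)
    problems = []
    path = []
    for index, (depth, name, repetition) in enumerate(nodes):
        # Keep `path` equal to the ancestors of the current node plus itself.
        del path[depth:]
        path.append(name)
        if name != "list" or repetition != "REPEATED":
            continue
        # Walk the following lines until we leave this group's subtree.
        for child_depth, child_name, _ in nodes[index + 1:]:
            if child_depth <= depth:
                break
            if child_depth == depth + 1 and child_name != "element":
                problems.append((".".join(path[:-1]), child_name))
    return problems


print(nonconforming_list_fields(META_EXCERPT))  # [('games', 'item')]
```

Because the excerpt does not show a LIST annotation on {{games}}, the sketch treats any repeated group named {{list}} as a list's middle level, which is a heuristic rather than a full schema check.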

  was:
Sorry if I don't know this feature is done deliberately, but it looks like the 
parquet writer for list data type does not confirm to Apache Parquet list 
logical type specification,

According to this page: 
[https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists,] 
list type contains 3 level where the middle level, named {{list}}, must be a 
repeated group with a single field named _{{element}}_

However, in the parquet file from pyarrow writer, that single field is named 
_item_ instead,

Please find below the example python code that produce a parquet file (I use 
pandas version 1.2.1 and pyarrow version 3.0.0)

 
{code:java}
import pandas as pd
 
df = pd.DataFrame(data=[ {'studio': 'blizzard', 'games': [{'name': 'diablo', 
'version': '3'}, {'name': 'star craft', 'version': '2'}]}, {'studio': 'ea', 
'games': [{'name': 'fifa', 'version': '21'}]}, ])
df.to_parquet('/tmp/test.parquet', engine='pyarrow')
{code}
 

Then I use parquet-tools from [https://formulae.brew.sh/formula/parquet-tools] 
to check the metadata of parquet file via this command

parquet-tools meta /tmp/test.parquet

The full meta is included in attached, here is only an extraction of list type 
column

games: OPTIONAL F:1 
.list: REPEATED F:1 
..item: OPTIONAL F:2 
...name: OPTIONAL BINARY L:STRING R:1 D:4
...version: OPTIONAL BINARY L:STRING R:1 D:4

as can be seen, under list, it is single field named _item_

I think this should be made to be name _element_ to conform with Apache Parquet 
specification.



[jira] [Updated] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Truc Lam Nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Truc Lam Nguyen updated ARROW-11497:

Summary: [Python] pyarrow parquet writer for list does not conform with 
Apache Parquet specification  (was: [Python] pyarrow parquet writer for list 
does not conform with Apache Parquet sepecification)



