[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-13 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284263#comment-17284263
 ] 

Truc Lam Nguyen commented on ARROW-11497:
-

[~emkornfield] I've made a PR 
[here|https://github.com/apache/arrow/pull/9489]. Please have a look and let 
me know if I'm missing something; I'll try to address as much as I can. Thanks 
:) 

> [Python] pyarrow parquet writer for list does not conform with Apache Parquet 
> specification
> ---
>
> Key: ARROW-11497
> URL: https://issues.apache.org/jira/browse/ARROW-11497
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 3.0.0
> Reporter: Truc Lam Nguyen
> Priority: Major
> Attachments: parquet-tools-meta.log
>
>
> Apologies if this behaviour is deliberate, but it looks like the Parquet 
> writer for the list data type does not conform to the Apache Parquet list 
> logical type specification.
> According to this page: 
> [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists], 
> a list type contains 3 levels, where the middle level, named {{list}}, must be a 
> repeated group with a single field named _{{element}}_.
> However, in the Parquet file produced by the pyarrow writer, that single field 
> is named _item_ instead.
> Below is example Python code that produces such a Parquet file (I use 
> pandas version 1.2.1 and pyarrow version 3.0.0):
> {code:python}
> import pandas as pd
>
> df = pd.DataFrame(data=[
>     {'studio': 'blizzard', 'games': [{'name': 'diablo', 'version': '3'}, {'name': 'star craft', 'version': '2'}]},
>     {'studio': 'ea', 'games': [{'name': 'fifa', 'version': '21'}]},
> ])
> df.to_parquet('/tmp/test.parquet', engine='pyarrow')
> {code}
> I then used parquet-tools from 
> [https://formulae.brew.sh/formula/parquet-tools] to check the metadata of the 
> Parquet file with this command:
> parquet-tools meta /tmp/test.parquet
> The full metadata is in the attachment; here is only the extract for the list 
> type column:
> games: OPTIONAL F:1 
>  .list: REPEATED F:1 
>  ..item: OPTIONAL F:2 
>  ...name: OPTIONAL BINARY L:STRING R:1 D:4
>  ...version: OPTIONAL BINARY L:STRING R:1 D:4
> As can be seen, the single field under list is named _item_.
> I think it should be named _element_ to conform to the Apache 
> Parquet specification.
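
The mismatch reported above can be spotted mechanically from dotted column paths such as `games.list.item.name`. The following is only a minimal sketch: the helper name `nonconforming_list_fields` is hypothetical (it is not part of pyarrow or parquet-tools), and it assumes that any path segment called `list` is the repeated middle group of a list type, which would misfire on a user column that happens to be named `list`.

```python
def nonconforming_list_fields(column_paths):
    """Return the names found where the spec expects `element`.

    Each path is a dotted Parquet column path, for example
    "games.list.item.name". The segment right after a `list`
    segment is treated as the repeated group's single field.
    """
    found = set()
    for path in column_paths:
        parts = path.split(".")
        # Stop one short of the end so parts[i + 1] always exists.
        for i, part in enumerate(parts[:-1]):
            if part == "list" and parts[i + 1] != "element":
                found.add(parts[i + 1])
    return sorted(found)


# Column paths matching the metadata extract in the report (assumed layout).
paths = ["studio", "games.list.item.name", "games.list.item.version"]
print(nonconforming_list_fields(paths))  # ['item']
```

An empty result would mean every list in the file uses the `element` name that the specification asks for.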



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-13 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284230#comment-17284230
 ] 

Truc Lam Nguyen commented on ARROW-11497:
-

[~emkornfield] Thanks for the confirmation; I'm working on a PR at the moment.



[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-12 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284083#comment-17284083
 ] 

Micah Kornfield commented on ARROW-11497:
-

My thought: I think in the short term we can expose the flag.  We can figure out a 
longer-term plan for migrating all users to a conformant writer/reader.

[~trucnguyenlam] do you want to provide a PR?



[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-12 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283731#comment-17283731
 ] 

Truc Lam Nguyen commented on ARROW-11497:
-

[~apitrou] [~emkornfield] I think we can make a final decision on this. I'm OK 
with the option that gives end users some level of control to preserve the old 
behaviour.

Please let me know your thoughts, thanks :)



[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279233#comment-17279233
 ] 

Micah Kornfield commented on ARROW-11497:
-

Backwards compatibility?  It might be possible to make some inferences (I haven't 
thought about it deeply).  But I think that if we were reading a conforming, 
Java-produced Parquet file, we would get different column names if we 
transformed at the boundary (maybe there can be some rules around Arrow metadata 
being present).  I think we can make conforming behavior the default, but 
we should give users some level of control to preserve the old behavior.



[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279193#comment-17279193
 ] 

Antoine Pitrou commented on ARROW-11497:


> This should be possible but it potentially needs another flag

I don't understand why that would be necessary. It should simply be the default 
(and obviously right) behaviour. Am I missing something?



[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279186#comment-17279186
 ] 

Truc Lam Nguyen commented on ARROW-11497:
-

(y) makes sense 



[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279178#comment-17279178
 ] 

Micah Kornfield commented on ARROW-11497:
-

{quote}Perhaps we could convert the field name at the Arrow<->Parquet boundary.
{quote}
This should be possible, but it potentially needs another flag.  I think in the 
short term, plumbing the additional flag through to Python makes sense, and we 
can figure out a longer-term solution if this becomes a larger problem.



[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279166#comment-17279166
 ] 

Truc Lam Nguyen commented on ARROW-11497:
-

[~emkornfield]: I understand the reason now.

[~apitrou] my understanding of Apache Arrow is not enough to suggest any 
technical solution, but I think it might be a good idea.

Also, for my use case as an end user, it would be good if we were allowed to 
produce Parquet files that are compliant with the official specification.



[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279145#comment-17279145
 ] 

Micah Kornfield commented on ARROW-11497:
-

It isn't about round-tripping performance, it's about data equality. The default 
element name for nested columns in Arrow is "item", which is why it gets 
propagated to Parquet when the flag is not set.
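
To make the data-equality point concrete, here is a toy sketch in plain Python. It does not use pyarrow: `ListType`, `write_compliant`, and `read_back` are hypothetical stand-ins for an Arrow list type, a writer that emits the spec-mandated name, and a reader that takes the name as stored in the file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListType:
    """Toy stand-in for a list type: inner field name plus value type."""
    field_name: str
    value_type: str


def write_compliant(arrow_type):
    # Hypothetical writer: renames the inner field to the spec's `element`.
    return ListType("element", arrow_type.value_type)


def read_back(parquet_type):
    # Hypothetical reader: keeps whatever name is stored in the file.
    return ListType(parquet_type.field_name, parquet_type.value_type)


original = ListType("item", "string")  # "item" is the default name per this thread
round_tripped = read_back(write_compliant(original))

# The values are untouched, but the types no longer compare equal.
print(original == round_tripped)  # False
```

Under these assumptions the values survive the round trip, yet the schema read back differs from the one written, which is the equality break described here.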



[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279149#comment-17279149
 ] 

Antoine Pitrou commented on ARROW-11497:


Perhaps we could convert the field name at the Arrow<->Parquet boundary.



[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279141#comment-17279141
 ] 

Truc Lam Nguyen commented on ARROW-11497:
-

[~emkornfield] thanks for looking at this ticket. It is causing a problem at my 
site, as I have a mixture of Parquet list data types:
 * some are produced by the parquet-mr writer, which conforms to the Apache Parquet spec
 * the others are produced by pyarrow and have the reported structure

Processing both of them is difficult for me. It would be really good if this 
option were exposed in Python.
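
Until such an option is available, one possible stopgap for a mixed set of files is to normalize column paths before processing them. This is only a sketch: `normalize_list_path` is a hypothetical helper, not a pyarrow or parquet-mr API, and it assumes that a `list` segment followed by `item` always comes from the pyarrow layout reported in this ticket.

```python
def normalize_list_path(path, legacy_name="item"):
    """Rewrite `<col>.list.item...` to `<col>.list.element...`.

    Paths that already use `element`, or that contain no list group,
    are returned unchanged.
    """
    parts = path.split(".")
    for i in range(1, len(parts)):
        if parts[i - 1] == "list" and parts[i] == legacy_name:
            parts[i] = "element"
    return ".".join(parts)


# One path in the pyarrow layout, one in the parquet-mr layout.
print(normalize_list_path("games.list.item.name"))     # games.list.element.name
print(normalize_list_path("games.list.element.name"))  # games.list.element.name
```

With both layouts mapped to the same path, downstream code can address the column under a single name.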

Could you also explain a bit more about how this might affect round-tripping 
performance, if applicable?

Thanks.



[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279080#comment-17279080
 ] 

Micah Kornfield commented on ARROW-11497:
-

Is this causing a problem in practice?  There is a C++ option 
[https://github.com/apache/arrow/blob/1c18706ac9e49e3e9b4998354f213a304e82d367/cpp/src/parquet/properties.h#L689]
 that will write out {{element}}: 
[https://github.com/apache/arrow/blob/9b195493409ad434cbc42b0e03c6471a9bae/cpp/src/parquet/arrow/schema.cc#L82]

We could expose this in Python.

I think the main reason it isn't enabled by default is that it breaks round trips 
for Arrow data.  This could potentially be fixed on the reader side as well.  I 
can't find a reference, but I think this might also have some impact on 
Pandas<->Parquet round tripping.
