[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284263#comment-17284263 ]

Truc Lam Nguyen commented on ARROW-11497:
-----------------------------------------

[~emkornfield] I've made a PR [here|https://github.com/apache/arrow/pull/9489], please have a look and let me know if I'm missing something. I'll try to address as much as I can, thanks :)

> [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11497
>                 URL: https://issues.apache.org/jira/browse/ARROW-11497
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0
>            Reporter: Truc Lam Nguyen
>            Priority: Major
>         Attachments: parquet-tools-meta.log
>
>
> Sorry if this behaviour is deliberate, but it looks like the parquet writer for the list data type does not conform to the Apache Parquet list logical type specification.
> According to this page: [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists], a list type contains 3 levels, where the middle level, named {{list}}, must be a repeated group with a single field named _{{element}}_.
> However, in the parquet file from the pyarrow writer, that single field is named _item_ instead.
> Please find below the example Python code that produces a parquet file (I use pandas version 1.2.1 and pyarrow version 3.0.0):
> {code:python}
> import pandas as pd
>
> df = pd.DataFrame(data=[
>     {'studio': 'blizzard', 'games': [{'name': 'diablo', 'version': '3'}, {'name': 'star craft', 'version': '2'}]},
>     {'studio': 'ea', 'games': [{'name': 'fifa', 'version': '21'}]},
> ])
> df.to_parquet('/tmp/test.parquet', engine='pyarrow')
> {code}
> Then I use parquet-tools from [https://formulae.brew.sh/formula/parquet-tools] to check the metadata of the parquet file via this command:
> parquet-tools meta /tmp/test.parquet
> The full metadata is in the attached file; here is only the extract for the list type column:
> games: OPTIONAL F:1
> .list: REPEATED F:1
> ..item: OPTIONAL F:2
> ...name: OPTIONAL BINARY L:STRING R:1 D:4
> ...version: OPTIONAL BINARY L:STRING R:1 D:4
> As can be seen, under list there is a single field named _item_.
> I think this should be renamed to _element_ to conform with the Apache Parquet specification.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
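The check described in the issue can be automated. The following is only a rough sketch, assuming the leading-dot nesting notation seen in the parquet-tools excerpt quoted in the issue; the helper names `parse_schema_lines` and `nonconformant_list_fields` are hypothetical and not part of parquet-tools or pyarrow.

```python
# Excerpt copied from the issue description (parquet-tools meta output).
META_EXCERPT = """\
games: OPTIONAL F:1
.list: REPEATED F:1
..item: OPTIONAL F:2
...name: OPTIONAL BINARY L:STRING R:1 D:4
...version: OPTIONAL BINARY L:STRING R:1 D:4
"""


def parse_schema_lines(text):
    """Return (depth, name, repetition) tuples; depth is the number of leading dots."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        stripped = line.lstrip(".")
        depth = len(line) - len(stripped)
        name, _, rest = stripped.partition(":")
        rows.append((depth, name.strip(), rest.split()[0]))
    return rows


def nonconformant_list_fields(rows):
    """Names of direct children of a repeated group called 'list' that are not 'element'."""
    problems = []
    for (depth, name, repetition), nxt in zip(rows, rows[1:]):
        if name == "list" and repetition == "REPEATED":
            child_depth, child_name, _ = nxt
            if child_depth == depth + 1 and child_name != "element":
                problems.append(child_name)
    return problems


print(nonconformant_list_fields(parse_schema_lines(META_EXCERPT)))
```

For the excerpt from the issue this reports `item` as the offending field name. The sketch treats any repeated group named `list` as a list wrapper, so a user column that happens to be called `list` would need extra handling.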
[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284230#comment-17284230 ]

Truc Lam Nguyen commented on ARROW-11497:
-----------------------------------------

[~emkornfield] thanks for the confirmation, I'm working on a PR atm.
[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284083#comment-17284083 ]

Micah Kornfield commented on ARROW-11497:
-----------------------------------------

My thought: I think in the short term we can expose the flag. We can figure out a longer term plan for migrating all users to a conformant writer/reader.

[~trucnguyenlam] do you want to provide a PR?
[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283731#comment-17283731 ]

Truc Lam Nguyen commented on ARROW-11497:
-----------------------------------------

[~apitrou] [~emkornfield] I think we can make a final decision on this. I'm OK with the option that end users have some level of control to preserve the old behaviour. Please let me know your thoughts, thanks :)
[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279233#comment-17279233 ]

Micah Kornfield commented on ARROW-11497:
-----------------------------------------

Backwards compatibility? It might be possible to make some inferences (haven't thought about it deeply). But I think if we were reading a conforming Java-produced parquet file then we would get different column names if we transformed on the border (maybe there can be some rules around Arrow metadata being present).

I think we can make conforming behavior the default, but we should give users some level of control to preserve the old behavior.
[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279193#comment-17279193 ]

Antoine Pitrou commented on ARROW-11497:
----------------------------------------

> This should be possible but it potentially needs another flag

I don't understand why that would be necessary. It should simply be the default (and obviously right) behaviour. Am I missing something?
[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279186#comment-17279186 ]

Truc Lam Nguyen commented on ARROW-11497:
-----------------------------------------

(y) makes sense
[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279178#comment-17279178 ]

Micah Kornfield commented on ARROW-11497:
-----------------------------------------

{quote}Perhaps we could convert the field name at the Arrow<->Parquet boundary.{quote}

This should be possible but it potentially needs another flag. I think in the short term plumbing the additional flag through to Python makes sense, and we can figure out a longer term solution if this becomes a larger problem.
[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279166#comment-17279166 ]

Truc Lam Nguyen commented on ARROW-11497:
-----------------------------------------

[~emkornfield]: I got the reason now.

[~apitrou] my understanding of Apache Arrow is not enough to suggest a technical solution, but I think it might be a good idea.

Also, for my use case as an end user, it would be good if we were allowed to produce parquet files that are compliant with the official specification.
[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279145#comment-17279145 ]

Micah Kornfield commented on ARROW-11497:
-----------------------------------------

It isn't about round-tripping performance, it's about data equality. The default element name for nested columns in Arrow is "item", which is why it gets propagated to Parquet without the flag being set.
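The data-equality concern can be illustrated with a toy model. This is a sketch only, assuming (as the comment implies) that the child field name counts as part of a list type's identity; `ListType` and `write_then_read` are hypothetical stand-ins and not Arrow or pyarrow APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListType:
    """Toy stand-in for a list type whose child field name is part of its identity."""
    child_name: str
    value_type: str


def write_then_read(list_type, compliant):
    """Model a Parquet round trip: a compliant writer stores the child as 'element'."""
    stored_name = "element" if compliant else list_type.child_name
    # The reader rebuilds the type from whatever name is stored in the file.
    return ListType(stored_name, list_type.value_type)


original = ListType("item", "string")  # "item" is Arrow's default child name

print(write_then_read(original, compliant=False) == original)  # True
print(write_then_read(original, compliant=True) == original)   # False
```

In this model the non-compliant write round-trips exactly, while the compliant write returns a type that no longer compares equal to the original, even though the values themselves are unchanged.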
[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279149#comment-17279149 ]

Antoine Pitrou commented on ARROW-11497:
----------------------------------------

Perhaps we could convert the field name at the Arrow<->Parquet boundary.
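A conversion of the field name at the Arrow<->Parquet boundary could look roughly like the sketch below. It works on dotted column paths such as `games.list.item.name` and is purely illustrative: the function names are hypothetical, and this is not how the Arrow C++ writer or reader is implemented.

```python
ARROW_DEFAULT = "item"    # Arrow's default child name, per the thread
PARQUET_SPEC = "element"  # name required by the Parquet LogicalTypes spec


def _swap_list_child(path, old, new):
    """Rename path components that directly follow a 'list' component."""
    parts = path.split(".")
    for i in range(1, len(parts)):
        if parts[i - 1] == "list" and parts[i] == old:
            parts[i] = new
    return ".".join(parts)


def to_parquet_path(arrow_path):
    """Writer side: emit the spec-compliant name."""
    return _swap_list_child(arrow_path, ARROW_DEFAULT, PARQUET_SPEC)


def to_arrow_path(parquet_path):
    """Reader side: restore Arrow's default name."""
    return _swap_list_child(parquet_path, PARQUET_SPEC, ARROW_DEFAULT)


print(to_parquet_path("games.list.item.name"))
print(to_arrow_path("games.list.element.name"))
```

One limitation of an unconditional rename: a file whose list child was deliberately named `element` by its writer would come back as `item` on the Arrow side, so the reader cannot tell the two cases apart without extra information.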
[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279141#comment-17279141 ]

Truc Lam Nguyen commented on ARROW-11497:
-----------------------------------------

[~emkornfield] thanks for looking at this ticket. It is causing a problem on my side, as I have a mixture of parquet list data types:
* some are produced by the parquet-mr writer, which conforms to the Apache Parquet spec
* the others are produced by pyarrow and have the reported structure

Processing both of them is difficult for me. It would be really good if this option were exposed in Python.

Could you also explain a bit more about how this might affect round-tripping performance, if applicable? Thanks.
[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification
[ https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279080#comment-17279080 ]

Micah Kornfield commented on ARROW-11497:
-----------------------------------------

Is this causing a problem in practice?

There is a C++ option [https://github.com/apache/arrow/blob/1c18706ac9e49e3e9b4998354f213a304e82d367/cpp/src/parquet/properties.h#L689] that will write out element [https://github.com/apache/arrow/blob/9b195493409ad434cbc42b0e03c6471a9bae/cpp/src/parquet/arrow/schema.cc#L82]

We could expose this in Python. I think the main reason it isn't enabled by default is that it breaks round trips for Arrow data. This could potentially be fixed on the reader side as well. I can't find a reference, but I think this might also have some impact on Pandas<->Parquet round tripping.