[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-02 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423486#comment-17423486
 ] 

Truc Lam Nguyen commented on ARROW-14196:
-

[~judah.rand] sorry, I don't really understand your comment. Could you 
please explain a little more? Thanks.

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-12682) Airflow webserver 2.0.2 does not remember user session any longer

2021-05-07 Thread Truc Lam Nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Truc Lam Nguyen closed ARROW-12682.
---
Resolution: Fixed

Wrong project (should be airflow), sorry.

> Airflow webserver 2.0.2 does not remember user session any longer
> -
>
> Key: ARROW-12682
> URL: https://issues.apache.org/jira/browse/ARROW-12682
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Truc Lam Nguyen
>Priority: Minor
>
> I recently migrated from Airflow version 1.10.10 to 2.0.0. One thing I notice 
> is that the user's session is no longer remembered by the Airflow webserver.
> The authentication model I use is basic username and password. The Airflow 
> webserver (and scheduler) are deployed as pods in a Kubernetes cluster.
> As a Kubernetes pod, it can go up and down. This was normally not a big issue 
> for me: with version 1.10.10, my login session was still remembered and I 
> didn't need to log in again when a pod was restarted. With version 2.0.2, 
> however, I am forced to log in again every time the pod is restarted.
> Could you please let me know whether that is intended behaviour in Airflow 2?
> Thanks.





[jira] [Created] (ARROW-12682) Airflow webserver 2.0.2 does not remember user session any longer

2021-05-07 Thread Truc Lam Nguyen (Jira)
Truc Lam Nguyen created ARROW-12682:
---

 Summary: Airflow webserver 2.0.2 does not remember user session 
any longer
 Key: ARROW-12682
 URL: https://issues.apache.org/jira/browse/ARROW-12682
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Truc Lam Nguyen


I recently migrated from Airflow version 1.10.10 to 2.0.0. One thing I notice 
is that the user's session is no longer remembered by the Airflow webserver.

The authentication model I use is basic username and password. The Airflow 
webserver (and scheduler) are deployed as pods in a Kubernetes cluster.

As a Kubernetes pod, it can go up and down. This was normally not a big issue 
for me: with version 1.10.10, my login session was still remembered and I 
didn't need to log in again when a pod was restarted. With version 2.0.2, 
however, I am forced to log in again every time the pod is restarted.

Could you please let me know whether that is intended behaviour in Airflow 2?

Thanks.





[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-13 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284263#comment-17284263
 ] 

Truc Lam Nguyen commented on ARROW-11497:
-

[~emkornfield] I've made a PR 
[here|https://github.com/apache/arrow/pull/9489]. Please have a look and let 
me know if I'm missing something; I'll try to address as much as I can. Thanks 
:) 

> [Python] pyarrow parquet writer for list does not conform with Apache Parquet 
> specification
> ---
>
> Key: ARROW-11497
> URL: https://issues.apache.org/jira/browse/ARROW-11497
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
>Reporter: Truc Lam Nguyen
>Priority: Major
> Attachments: parquet-tools-meta.log
>
>
> Sorry if this behaviour is deliberate, but it looks like the Parquet writer 
> for the list data type does not conform to the Apache Parquet list logical 
> type specification.
> According to 
> [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists], 
> a list type contains 3 levels, where the middle level, named {{list}}, must be 
> a repeated group with a single field named _{{element}}_.
> However, in the Parquet file written by pyarrow, that single field is named 
> _item_ instead.
> Below is example Python code that produces such a Parquet file (I use 
> pandas version 1.2.1 and pyarrow version 3.0.0):
> {code:python}
> import pandas as pd
>  
> # Each row has a 'games' column holding a list of structs.
> df = pd.DataFrame(data=[
>     {'studio': 'blizzard', 'games': [{'name': 'diablo', 'version': '3'},
>                                      {'name': 'star craft', 'version': '2'}]},
>     {'studio': 'ea', 'games': [{'name': 'fifa', 'version': '21'}]},
> ])
> df.to_parquet('/tmp/test.parquet', engine='pyarrow')
> {code}
> I then use parquet-tools from 
> [https://formulae.brew.sh/formula/parquet-tools] to check the metadata of 
> the Parquet file with this command:
> parquet-tools meta /tmp/test.parquet
> The full metadata is in the attachment; here is only the excerpt for the list 
> type column:
> games: OPTIONAL F:1 
>  .list: REPEATED F:1 
>  ..item: OPTIONAL F:2 
>  ...name: OPTIONAL BINARY L:STRING R:1 D:4
>  ...version: OPTIONAL BINARY L:STRING R:1 D:4
> As can be seen, under {{list}} there is a single field named _item_.
> I think this field should be named _element_ to conform to the Apache 
> Parquet specification.





[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-13 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284230#comment-17284230
 ] 

Truc Lam Nguyen commented on ARROW-11497:
-

[~emkornfield] thanks for the confirmation, I'm working on a PR atm.



[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-12 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283731#comment-17283731
 ] 

Truc Lam Nguyen commented on ARROW-11497:
-

[~apitrou] [~emkornfield] I think we can make a final decision on this. I'm OK 
with the option of giving end users some level of control to preserve the 
current behaviour.

Please let me know your thoughts, thanks :)



[jira] [Comment Edited] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279186#comment-17279186
 ] 

Truc Lam Nguyen edited comment on ARROW-11497 at 2/4/21, 9:39 PM:
--

{quote}I think in the short term plumbing the additional flag through to python 
makes sense and we can figure out a longer term solution if this becomes a 
larger problem.
{quote}
(y) makes sense 


was (Author: trucnguyenlam):
(y) make sense 



[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279186#comment-17279186
 ] 

Truc Lam Nguyen commented on ARROW-11497:
-

(y) makes sense 



[jira] [Comment Edited] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279166#comment-17279166
 ] 

Truc Lam Nguyen edited comment on ARROW-11497 at 2/4/21, 9:14 PM:
--

[~emkornfield]: I got the reason now :)

[~apitrou] my understanding of Apache Arrow is not enough to suggest any 
technical solution, but I think it might be a good idea.

Also, for my use case as an end user, it would be good if we were allowed to 
produce Parquet files that comply with the official specification.


was (Author: trucnguyenlam):
[~emkornfield]: I got the reason now.

[~apitrou] my understanding on apache arrow is not enough to suggest any 
technical solution but I think it might be a good idea.

Also for my use case as an end user, it would be good if we can be allowed to 
produce parquet files that are compliant to official specification.



[jira] [Comment Edited] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279166#comment-17279166
 ] 

Truc Lam Nguyen edited comment on ARROW-11497 at 2/4/21, 9:11 PM:
--

[~emkornfield]: I got the reason now.

[~apitrou] my understanding of Apache Arrow is not enough to suggest any 
technical solution, but I think it might be a good idea.

Also, for my use case as an end user, it would be good if we were allowed to 
produce Parquet files that comply with the official specification.


was (Author: trucnguyenlam):
[~emkornfield]: I got the reason now.

[~apitrou] as my understanding on apache arrow is not enough to suggest any 
technical solution but I think it might be a good idea.

Also for my use case as an end user, it would be good if we can be allowed to 
produce parquet files that are compliant to official specification.



[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279166#comment-17279166
 ] 

Truc Lam Nguyen commented on ARROW-11497:
-

[~emkornfield]: I got the reason now.

[~apitrou] my understanding of Apache Arrow is not enough to suggest any 
technical solution, but I think it might be a good idea.

Also, for my use case as an end user, it would be good if we were allowed to 
produce Parquet files that comply with the official specification.



[jira] [Commented] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279141#comment-17279141
 ] 

Truc Lam Nguyen commented on ARROW-11497:
-

[~emkornfield] thanks for looking at this ticket. It is causing a problem on my 
side, as I have a mixture of Parquet list data types:
 * some are produced by the parquet-mr writer, which conforms to the Apache Parquet spec
 * the others are produced by pyarrow and have the reported structure

Processing both of them is difficult for me. It would be really good if this 
option were exposed in Python.
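
To illustrate the difficulty, here is a rough sketch of the kind of reader-side workaround this currently requires. `normalize_list_path` is a hypothetical helper, not part of pyarrow or parquet-mr. It assumes dotted column paths such as `games.list.item.name`, and that no user column is literally named `item` directly under a group named `list`.

```python
def normalize_list_path(path):
    """Rewrite a 'list.item' level to the specification's 'list.element'."""
    parts = path.split(".")
    for i in range(1, len(parts)):
        # Only rename 'item' when it sits directly under a 'list' group.
        if parts[i - 1] == "list" and parts[i] == "item":
            parts[i] = "element"
    return ".".join(parts)


# Paths from both writers map to the same normalized form.
pyarrow_path = normalize_list_path("games.list.item.name")
parquet_mr_path = normalize_list_path("games.list.element.name")
print(pyarrow_path == parquet_mr_path)
```

This only papers over the naming difference when comparing schemas; the files themselves still differ, which is why an option on the writer would be preferable.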

Could you also explain a bit more about how this might affect round-tripping 
performance, if applicable?

Thanks.



[jira] [Updated] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Truc Lam Nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Truc Lam Nguyen updated ARROW-11497:

Description: 
Sorry if this behaviour is deliberate, but it looks like the Parquet writer for 
the list data type does not conform to the Apache Parquet list logical type 
specification.

According to 
[https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists], 
a list type contains 3 levels, where the middle level, named {{list}}, must be a 
repeated group with a single field named _{{element}}_.

However, in the Parquet file written by pyarrow, that single field is named 
_item_ instead.

Below is example Python code that produces such a Parquet file (I use 
pandas version 1.2.1 and pyarrow version 3.0.0):
{code:python}
import pandas as pd
 
# Each row has a 'games' column holding a list of structs.
df = pd.DataFrame(data=[
    {'studio': 'blizzard', 'games': [{'name': 'diablo', 'version': '3'},
                                     {'name': 'star craft', 'version': '2'}]},
    {'studio': 'ea', 'games': [{'name': 'fifa', 'version': '21'}]},
])
df.to_parquet('/tmp/test.parquet', engine='pyarrow')
{code}
I then use parquet-tools from [https://formulae.brew.sh/formula/parquet-tools] 
to check the metadata of the Parquet file with this command:

parquet-tools meta /tmp/test.parquet

The full metadata is in the attachment; here is only the excerpt for the list 
type column:

games: OPTIONAL F:1 
 .list: REPEATED F:1 
 ..item: OPTIONAL F:2 
 ...name: OPTIONAL BINARY L:STRING R:1 D:4
 ...version: OPTIONAL BINARY L:STRING R:1 D:4

As can be seen, under {{list}} there is a single field named _item_.

I think this field should be named _element_ to conform to the Apache Parquet 
specification.

  was:
Sorry if I don't know this feature is done deliberately, but it looks like the 
parquet writer for list data type does not confirm to Apache Parquet list 
logical type specification,

According to this page: 
[https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists,] 
list type contains 3 level where the middle level, named {{list}}, must be a 
repeated group with a single field named _{{element}}_

However, in the parquet file from pyarrow writer, that single field is named 
_item_ instead,

Please find below the example python code that produce a parquet file (I use 
pandas version 1.2.1 and pyarrow version 3.0.0)

 
{code:java}
import pandas as pd
 
df = pd.DataFrame(data=[ {'studio': 'blizzard', 'games': [{'name': 'diablo', 
'version': '3'}, {'name': 'star craft', 'version': '2'}]}, {'studio': 'ea', 
'games': [{'name': 'fifa', 'version': '21'}]}, ])
df.to_parquet('/tmp/test.parquet', engine='pyarrow')
{code}
 

Then I use parquet-tools from [https://formulae.brew.sh/formula/parquet-tools] 
to check the metadata of parquet file via this command

parquet-tools meta /tmp/test.parquet

The full meta is included in attached, here is only an extraction of list type 
column

games: OPTIONAL F:1 
.list: REPEATED F:1 
..item: OPTIONAL F:2 
...name: OPTIONAL BINARY L:STRING R:1 D:4
...version: OPTIONAL BINARY L:STRING R:1 D:4

as can be seen, under list, it is single field named _item_

I think this should be made to be name _element_ to conform with Apache Parquet 
specification.



[jira] [Updated] (ARROW-11497) [Python] pyarrow parquet writer for list does not conform with Apache Parquet specification

2021-02-04 Thread Truc Lam Nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Truc Lam Nguyen updated ARROW-11497:

Summary: [Python] pyarrow parquet writer for list does not conform with 
Apache Parquet specification  (was: [Python] pyarrow parquet writer for list 
does not conform with Apache Parquet sepecification)
