[jira] [Comment Edited] (ARROW-10140) [Python][C++] No data for map column of a parquet file created from pyarrow and pandas

Chen Ming (Jira) Sun, 04 Oct 2020 23:50:18 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207855#comment-17207855
 ]


Chen Ming edited comment on ARROW-10140 at 10/5/20, 6:49 AM:
-------------------------------------------------------------

Good news: I built pyarrow from the latest master (following 
[https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst]),
 and *the parquet file(s) it creates can be read correctly by parquet-tools. They can 
also be queried through Amazon Athena.* (I attached the new parquet file: 
test_map_2.0.0.parquet)
{code:java}
PyArrow Version = 2.0.0.dev403+ga6b30de87
Pandas Version = 1.1.2
{code}
(And as [~jorisvandenbossche] said, the latest pyarrow master can also read the 
parquet file, with MapType data, back into Arrow correctly.)

Thanks very much for all the help. I will change the status to _Resolved_ (fixed 
in 2.0.0).

Could you tell me when the next pyarrow release (2.0.0?) will be available?


> [Python][C++] No data for map column of a parquet file created from pyarrow 
> and pandas
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-10140
>                 URL: https://issues.apache.org/jira/browse/ARROW-10140
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Chen Ming
>            Assignee: Micah Kornfield
>            Priority: Minor
>         Attachments: pyspark.snappy.parquet, test_map.parquet, test_map.py, 
> test_map_2.0.0.parquet
>
>
> Hi,
> I'm having problems reading parquet files with a 'map' column created by 
> pyarrow.
> I followed 
> [https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries]
>  to convert a pandas DataFrame to an Arrow table, then called write_table to 
> write a parquet file
> (we also referred to https://issues.apache.org/jira/browse/ARROW-9812):
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> # Each map value in col1 is a list of (key, value) tuples.
> df = pd.DataFrame({
>     'col1': pd.Series([
>         [('id', 'something'), ('value2', 'else')],
>         [('id', 'something2'), ('value', 'else2')],
>     ]),
>     'col2': pd.Series(['foo', 'bar'])
> })
> # Explicit schema: col1 is map<string, string>, col2 is string.
> udt = pa.map_(pa.string(), pa.string())
> schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
> table = pa.Table.from_pandas(df, schema=schema)
> # Write the Arrow table to a parquet file.
> pq.write_table(table, './test_map.parquet')
> {code}
> The above code (attached as test_map.py) runs without errors on my 
> development machine:
> {code:java}
> PyArrow Version = 1.0.1
> Pandas Version = 1.1.2
> {code}
> It generated the test_map.parquet file (attached as test_map.parquet) 
> successfully.
> I then used parquet-tools (1.11.1) to read the file, but got the following 
> output, in which the key_value entries of col1 are empty:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head test_map.parquet
> col1:
> .key_value:
> .key_value:
> col2 = foo
> col1:
> .key_value:
> .key_value:
> col2 = bar
> {code}
> I also checked the schema of the parquet file:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar schema test_map.parquet
> message schema {
>   optional group col1 (MAP) {
>     repeated group key_value {
>       required binary key (STRING);
>       optional binary value (STRING);
>     }
>   }
>   optional binary col2 (STRING);
> }
> {code}
> Am I doing something wrong? 
> We need to output the data to parquet files, and query them later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
