[ https://issues.apache.org/jira/browse/ARROW-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207855#comment-17207855 ]
Chen Ming edited comment on ARROW-10140 at 10/5/20, 6:49 AM:
-------------------------------------------------------------

Good news: I built pyarrow from the latest master (following [https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst]), and *the created parquet file(s) can be read by parquet-tools correctly. They can be queried through Amazon Athena too.*

(attached the new parquet file: test_map_2.0.0.parquet)
{code:java}
PyArrow Version = 2.0.0.dev403+ga6b30de87
Pandas Version = 1.1.2
{code}
(And as [~jorisvandenbossche] noted, pyarrow latest master can also read the parquet file, with MapType data, back into Arrow correctly.)

Thanks very much for all the help.

I will change the status to _Resolved_ (by 2.0.0).

Could you kindly tell me when the next pyarrow release (2.0.0?) will be available?
> [Python][C++] No data for map column of a parquet file created from pyarrow and pandas
> --------------------------------------------------------------------------------------
>
> Key: ARROW-10140
> URL: https://issues.apache.org/jira/browse/ARROW-10140
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.1
> Reporter: Chen Ming
> Assignee: Micah Kornfield
> Priority: Minor
> Attachments: pyspark.snappy.parquet, test_map.parquet, test_map.py, test_map_2.0.0.parquet
>
>
> Hi,
> I'm having problems reading parquet files with 'map' data type created by pyarrow.
> I followed [https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries] to convert a pandas DF to an arrow table, then called write_table to output a parquet file:
> (We also referred to https://issues.apache.org/jira/browse/ARROW-9812)
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df = pd.DataFrame({
>     'col1': pd.Series([
>         [('id', 'something'), ('value2', 'else')],
>         [('id', 'something2'), ('value','else2')],
>     ]),
>     'col2': pd.Series(['foo', 'bar'])
> })
> udt = pa.map_(pa.string(), pa.string())
> schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
> table = pa.Table.from_pandas(df, schema)
> pq.write_table(table, './test_map.parquet')
> {code}
> The above code (attached as test_map.py) runs smoothly on my development machine:
> {code:java}
> PyArrow Version = 1.0.1
> Pandas Version = 1.1.2
> {code}
> and generated the test_map.parquet file (attached as test_map.parquet) successfully.
> Then I used parquet-tools (1.11.1) to read the file, but got the following output, in which the map entries under col1 are empty:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head test_map.parquet
> col1:
> .key_value:
> .key_value:
> col2 = foo
>
> col1:
> .key_value:
> .key_value:
> col2 = bar
> {code}
> I also checked the schema of the parquet file:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar schema test_map.parquet
> message schema {
>   optional group col1 (MAP) {
>     repeated group key_value {
>       required binary key (STRING);
>       optional binary value (STRING);
>     }
>   }
>   optional binary col2 (STRING);
> }{code}
> Am I doing something wrong?
> We need to output the data to parquet files, and query them later.