[ https://issues.apache.org/jira/browse/ARROW-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-10140:
-----------------------------------
    Labels: pull-request-available  (was: )

> [Python][C++] Add test for map column of a parquet file created from pyarrow and pandas
> ---------------------------------------------------------------------------------------
>
>                 Key: ARROW-10140
>                 URL: https://issues.apache.org/jira/browse/ARROW-10140
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Chen Ming
>            Assignee: Joris Van den Bossche
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 8.0.0
>
>         Attachments: pyspark.snappy.parquet, test_map.parquet, test_map.py, test_map_2.0.0.parquet, test_map_200.parquet
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hi,
> I'm having problems reading parquet files with the 'map' data type created by pyarrow.
> I followed [https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries] to convert a pandas DF to an arrow table, then called write_table to output a parquet file:
> (We also referred to https://issues.apache.org/jira/browse/ARROW-9812)
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df = pd.DataFrame({
>     'col1': pd.Series([
>         [('id', 'something'), ('value2', 'else')],
>         [('id', 'something2'), ('value', 'else2')],
>     ]),
>     'col2': pd.Series(['foo', 'bar'])
> })
> udt = pa.map_(pa.string(), pa.string())
> schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
> table = pa.Table.from_pandas(df, schema)
> pq.write_table(table, './test_map.parquet')
> {code}
> The above code (attached as test_map.py) runs without errors on my development machine:
> {code:java}
> PyArrow Version = 1.0.1
> Pandas Version = 1.1.2
> {code}
> It generated the test_map.parquet file (attached as test_map.parquet) successfully.
> Then I used parquet-tools (1.11.1) to read the file, but got the following output:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head test_map.parquet
> col1:
> .key_value:
> .key_value:
> col2 = foo
> col1:
> .key_value:
> .key_value:
> col2 = bar
> {code}
> I also checked the schema of the parquet file:
> {code:java}
> java -jar parquet-tools-1.11.1.jar schema test_map.parquet
> message schema {
>   optional group col1 (MAP) {
>     repeated group key_value {
>       required binary key (STRING);
>       optional binary value (STRING);
>     }
>   }
>   optional binary col2 (STRING);
> }
> {code}
> Am I doing something wrong?
> We need to write the data to parquet files and query them later.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)