[jira] [Updated] (ARROW-10140) [Python][C++] Add test for map column of a parquet file created from pyarrow and pandas

ASF GitHub Bot (Jira) Tue, 18 Jan 2022 01:32:06 -0800


     [ 
https://issues.apache.org/jira/browse/ARROW-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated ARROW-10140:
-----------------------------------
    Labels: pull-request-available  (was: )

> [Python][C++] Add test for map column of a parquet file created from pyarrow 
> and pandas
> ---------------------------------------------------------------------------------------
>
>                 Key: ARROW-10140
>                 URL: https://issues.apache.org/jira/browse/ARROW-10140
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Chen Ming
>            Assignee: Joris Van den Bossche
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 8.0.0
>
>         Attachments: pyspark.snappy.parquet, test_map.parquet, test_map.py, 
> test_map_2.0.0.parquet, test_map_200.parquet
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hi,
> I'm having problems reading Parquet files that contain a 'map' column 
> created by pyarrow.
> I followed 
> [https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries]
>  (and also consulted https://issues.apache.org/jira/browse/ARROW-9812) to 
> convert a pandas DataFrame to an Arrow table, then called write_table to 
> write a Parquet file:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
>
> # Each row of col1 is a list of (key, value) tuples, i.e. one map per row.
> df = pd.DataFrame({
>     'col1': pd.Series([
>         [('id', 'something'), ('value2', 'else')],
>         [('id', 'something2'), ('value', 'else2')],
>     ]),
>     'col2': pd.Series(['foo', 'bar']),
> })
>
> # Declare col1 explicitly as map<string, string>.
> udt = pa.map_(pa.string(), pa.string())
> schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
> table = pa.Table.from_pandas(df, schema=schema)
> pq.write_table(table, './test_map.parquet')
> {code}
> The above code (attached as test_map.py) runs without errors on my 
> development machine, with these versions:
> {code:java}
> PyArrow Version = 1.0.1
> Pandas Version = 1.1.2
> {code}
> It generated the test_map.parquet file (attached as test_map.parquet) 
> successfully.
> However, when I read the file with parquet-tools (1.11.1), the map entries 
> in col1 are printed without any keys or values:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head test_map.parquet
> col1:
> .key_value:
> .key_value:
> col2 = foo
> col1:
> .key_value:
> .key_value:
> col2 = bar
> {code}
> I also checked the schema of the parquet file:
> {code:java}
> java -jar parquet-tools-1.11.1.jar schema test_map.parquet
> message schema {
>   optional group col1 (MAP) {
>     repeated group key_value {
>       required binary key (STRING);
>       optional binary value (STRING);
>     }
>   }
>   optional binary col2 (STRING);
> }
> {code}
> Am I doing something wrong? 
> We need to output the data to parquet files, and query them later.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
