[ https://issues.apache.org/jira/browse/ARROW-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205620#comment-17205620 ]
Micah Kornfield edited comment on ARROW-10140 at 10/1/20, 3:52 PM:
-------------------------------------------------------------------

Unfortunately, for Arrow 1.0.1 this appears to be a data-loss problem where values are not being written. I think this issue was likely resolved on master by: https://issues.apache.org/jira/browse/ARROW-9603?filter=-2

Although the data described here isn't specifically what that fix was for, I think this must have fallen into a case where the levels are written correctly but the values are not.

[~jorisvandenbossche] just to confirm: when you mentioned 1.0, did you really mean 1.0.1? 1.0.1 also had a fix for writing nested data.

> [Python][C++] No data for map column of a parquet file created from pyarrow and pandas
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-10140
>                 URL: https://issues.apache.org/jira/browse/ARROW-10140
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Chen Ming
>            Assignee: Micah Kornfield
>            Priority: Minor
>         Attachments: pyspark.snappy.parquet, test_map.parquet, test_map.py
>
> Hi,
> I'm having problems reading parquet files with 'map' data type created by pyarrow.
> I followed
> https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries
> to convert a pandas DF to an arrow table, then call write_table to output a parquet file:
> (We also referred to https://issues.apache.org/jira/browse/ARROW-9812)
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> 
> df = pd.DataFrame({
>     'col1': pd.Series([
>         [('id', 'something'), ('value2', 'else')],
>         [('id', 'something2'), ('value', 'else2')],
>     ]),
>     'col2': pd.Series(['foo', 'bar'])
> })
> udt = pa.map_(pa.string(), pa.string())
> schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
> table = pa.Table.from_pandas(df, schema)
> pq.write_table(table, './test_map.parquet')
> {code}
> The above code (attached as test_map.py) runs smoothly on my developing computer:
> {code:java}
> PyArrow Version = 1.0.1
> Pandas Version = 1.1.2
> {code}
> And generated the test_map.parquet file (attached as test_map.parquet) successfully.
> Then I use parquet-tools (1.11.1) to read the file, but get the following output:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head test_map.parquet
> col1:
> .key_value:
> .key_value:
> col2 = foo
> col1:
> .key_value:
> .key_value:
> col2 = bar
> {code}
> I also checked the schema of the parquet file:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar schema test_map.parquet
> message schema {
>   optional group col1 (MAP) {
>     repeated group key_value {
>       required binary key (STRING);
>       optional binary value (STRING);
>     }
>   }
>   optional binary col2 (STRING);
> }
> {code}
> Am I doing something wrong?
> We need to output the data to parquet files, and query them later.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)