[ https://issues.apache.org/jira/browse/ARROW-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205620#comment-17205620 ]
Micah Kornfield edited comment on ARROW-10140 at 10/1/20, 3:52 PM:
-------------------------------------------------------------------

Unfortunately, for Arrow 1.0.1 this appears to be a data-loss problem where values are not being written. I think this issue was likely resolved on master by: https://issues.apache.org/jira/browse/ARROW-9603?filter=-2

Although the data described here isn't specifically what that fix was for, I think this must have fallen into a case where the levels are written correctly but the values are not.

[~jorisvandenbossche] just to confirm: when you mentioned 1.0, did you really mean 1.0.1? 1.0.1 also had a fix for writing nested data.

> [Python][C++] No data for map column of a parquet file created from pyarrow and pandas
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-10140
>                 URL: https://issues.apache.org/jira/browse/ARROW-10140
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Chen Ming
>            Assignee: Micah Kornfield
>            Priority: Minor
>         Attachments: pyspark.snappy.parquet, test_map.parquet, test_map.py
>
> Hi,
> I'm having problems reading parquet files with 'map' data type created by pyarrow.
> I followed
> https://stackoverflow.com/questions/63553715/pyarrow-data-types-for-columns-that-have-lists-of-dictionaries
> to convert a pandas DF to an arrow table, then call write_table to output a parquet file:
> (We also referred to https://issues.apache.org/jira/browse/ARROW-9812)
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> 
> df = pd.DataFrame({
>     'col1': pd.Series([
>         [('id', 'something'), ('value2', 'else')],
>         [('id', 'something2'), ('value', 'else2')],
>     ]),
>     'col2': pd.Series(['foo', 'bar'])
> })
> udt = pa.map_(pa.string(), pa.string())
> schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
> table = pa.Table.from_pandas(df, schema)
> pq.write_table(table, './test_map.parquet')
> {code}
> The above code (attached as test_map.py) runs smoothly on my developing computer:
> {code:java}
> PyArrow Version = 1.0.1
> Pandas Version = 1.1.2
> {code}
> And generated the test_map.parquet file (attached as test_map.parquet) successfully.
> Then I use parquet-tools (1.11.1) to read the file, but get the following output:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head test_map.parquet
> col1:
> .key_value:
> .key_value:
> col2 = foo
> col1:
> .key_value:
> .key_value:
> col2 = bar
> {code}
> I also checked the schema of the parquet file:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar schema test_map.parquet
> message schema {
>   optional group col1 (MAP) {
>     repeated group key_value {
>       required binary key (STRING);
>       optional binary value (STRING);
>     }
>   }
>   optional binary col2 (STRING);
> }
> {code}
> Am I doing something wrong?
> We need to output the data to parquet files, and query them later.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)