[ https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205731#comment-17205731 ]
Bryan Cutler edited comment on ARROW-9812 at 10/1/20, 6:10 PM:
---------------------------------------------------------------

I started work on ARROW-10151 for the Pandas conversion. Let's keep this open for Parquet conversion after ARROW-1644.

was (Author: bryanc):
I started work on https://issues.apache.org/jira/browse/ARROW-10151 for the Pandas conversion. Let's keep this open for Parquet conversion after ARROW-1644.

> [Python] Map data types doesn't work from Arrow to Pandas and Parquet
> ---------------------------------------------------------------------
>
>                 Key: ARROW-9812
>                 URL: https://issues.apache.org/jira/browse/ARROW-9812
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Mayur Srivastava
>            Priority: Major
>
> Hi,
> I'm having problems using the 'map' data type in Arrow/parquet/pandas.
> I'm able to convert a pandas data frame to Arrow with a map data type.
> But, Arrow to Pandas doesn't work.
> When I write Arrow to Parquet, it seems to work, but I'm not sure if the data
> type is written correctly.
> When I read back Parquet to Arrow, it fails saying "reading list of structs"
> is not supported. It seems that map is stored as list of structs.
> There are two problems here:
> # Map data type doesn't work from Arrow -> Pandas.
> # Map data type doesn't get written to or read from Arrow -> Parquet.
> Questions:
> 1. Am I doing something wrong? Is there a way to get these to work?
> 2. If these are unsupported features, will this be fixed in a future version?
> Do you have plans or an ETA?
> The following code example (followed by output) should demonstrate the issues:
> I'm using Arrow 1.0.0 and Pandas 1.0.5.
> Thanks!
> Mayur
> {code:java}
> $ cat arrowtest.py
> import pyarrow as pa
> import pandas as pd
> import pyarrow.parquet as pq
> import traceback as tb
> import io
> print(f'PyArrow Version = {pa.__version__}')
> print(f'Pandas Version = {pd.__version__}')
> df1 = pd.DataFrame({'a': [[('b', '2')]]})
> print(f'df1')
> print(f'{df1}')
> print(f'Pandas -> Arrow')
> try:
>     t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a',
>         pa.map_(pa.string(), pa.string()))]))
>     print('PASSED')
>     print(t1)
> except:
>     print(f'FAILED')
>     tb.print_exc()
> print(f'Arrow -> Pandas')
> try:
>     t1.to_pandas()
>     print('PASSED')
> except:
>     print(f'FAILED')
>     tb.print_exc()
> print(f'Arrow -> Parquet')
> fh = io.BytesIO()
> try:
>     pq.write_table(t1, fh)
>     print('PASSED')
> except:
>     print('FAILED')
>     tb.print_exc()
>
> print(f'Parquet -> Arrow')
> try:
>     t2 = pq.read_table(source=fh)
>     print('PASSED')
>     print(t2)
> except:
>     print('FAILED')
>     tb.print_exc()
> {code}
> {code:java}
> $ python3.6 arrowtest.py
> PyArrow Version = 1.0.0
> Pandas Version = 1.0.5
> df1
>           a
> 0  [(b, 2)]
>
> Pandas -> Arrow
> PASSED
> pyarrow.Table
> a: map<string, string>
>   child 0, entries: struct<key: string not null, value: string> not null
>       child 0, key: string not null
>       child 1, value: string
>
> Arrow -> Pandas
> FAILED
> Traceback (most recent call last):
>   File "arrowtest.py", line 26, in <module>
>     t1.to_pandas()
>   File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
>   File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
>   File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 779, in table_to_blockmanager
>     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
>   File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 1115, in _table_to_blocks
>     list(extension_columns.keys()))
>   File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks
>   File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: No known equivalent Pandas block for
> Arrow data of type map<string, string> is known.
>
> Arrow -> Parquet
> PASSED
>
> Parquet -> Arrow
> FAILED
> Traceback (most recent call last):
>   File "arrowtest.py", line 43, in <module>
>     t2 = pq.read_table(source=fh)
>   File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1586, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1474, in read
>     use_threads=use_threads
>   File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet
> files not yet supported: key_value: list<key_value: struct<key: string not
> null, value: string> not null> not null
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
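
The observation in the report that a map "is stored as list of structs" can be illustrated without pyarrow. The sketch below is only a plain-Python model of the layout shown in the printed schema (entries: struct<key: string not null, value: string>). The names Entry, map_to_entries and entries_to_pairs are hypothetical helpers made up for this illustration; they are not pyarrow APIs and say nothing about how the library implements the conversion.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model of struct<key: string not null, value: string>.
# Not a pyarrow type; a plain dict stands in for one struct entry.
Entry = Dict[str, Optional[str]]


def map_to_entries(pairs: List[Tuple[str, Optional[str]]]) -> List[Entry]:
    """Model one map cell as a list of key/value structs."""
    entries: List[Entry] = []
    for key, value in pairs:
        if key is None:
            # The printed schema marks `key` as "not null"; `value` may be null.
            raise ValueError("map keys must not be null")
        entries.append({"key": key, "value": value})
    return entries


def entries_to_pairs(entries: List[Entry]) -> List[Tuple[str, Optional[str]]]:
    """Rebuild the (key, value) pairs from the list-of-structs form."""
    return [(entry["key"], entry["value"]) for entry in entries]


# The single cell of column 'a' in df1 from the report.
cell = [("b", "2")]
entries = map_to_entries(cell)
print(entries)
print(entries_to_pairs(entries) == cell)
```

Under this model the map column round-trips through the list-of-structs form without loss, which is consistent with the Parquet reader reporting the column as key_value: list<key_value: struct<...>> in the second traceback.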