[ https://issues.apache.org/jira/browse/ARROW-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Phillip Cloud reassigned ARROW-1132:
------------------------------------

    Assignee: Phillip Cloud

> [Python] Unable to write pandas DataFrame w/MultiIndex containing duplicate values to parquet
> ---------------------------------------------------------------------------------------------
>
>                 Key: ARROW-1132
>                 URL: https://issues.apache.org/jira/browse/ARROW-1132
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.4.1
>         Environment: OSx, miniconda, using pyarrow build from conda-forge
>            Reporter: Ben Mabey
>            Assignee: Phillip Cloud
>
> pandas DataFrames that have a `MultiIndex` seem to always be converted to a
> `Table` just fine. However, when writing the `Table` to disk using
> `pyarrow.parquet`, I am unable to write DataFrames whose `MultiIndex`
> contains a level with duplicate values (which is nearly always the case for
> me). Here is an example in Python with working cases and a failure case at
> the bottom:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> num_rows = 3
> example = pd.DataFrame({'strs': ['foo', 'foo', 'bar'],
>                         'nums_b': range(num_rows),
>                         'nums_a': range(num_rows)})
>
> def pq_write(df):
>     table = pa.Table.from_pandas(df)
>     pq.write_table(table, '/tmp/df.parquet')
>
> # single index works
> pq_write(example)
> pq_write(example.set_index(['nums_b']))
> # single index with duplicate values works
> pq_write(example.set_index(['strs']))
> # MultiIndex whose levels each contain only unique values works
> pq_write(example.set_index(['nums_b', 'nums_a']))
> # MultiIndex where one level contains duplicate values FAILS
> pq_write(example.set_index(['strs', 'nums_a']))
> {code}
> {noformat}
> Traceback (most recent call last):
>   File "test_arrow.py", line 26, in <module>
>     pq_write(example.set_index(['strs', 'nums_a']))
>   File "test_arrow.py", line 13, in pq_write
>     pq.write_table(table, '/tmp/df.parquet')
>   File "/Users/bmabey/anaconda/envs/test_pyarrow/lib/python3.5/site-packages/pyarrow/parquet.py", line 702, in write_table
>     writer.write_table(table, row_group_size=row_group_size)
>   File "pyarrow/_parquet.pyx", line 609, in pyarrow._parquet.ParquetWriter.write_table (/Users/travis/miniconda3/conda-bld/pyarrow_1497322770287/work/arrow-46315431aeda3b6968b3ac4c1087f6d41052b99d/python/build/temp.macosx-10.9-x86_64-3.5/_parquet.cxx:11025)
>   File "pyarrow/error.pxi", line 60, in pyarrow.lib.check_status (/Users/travis/miniconda3/conda-bld/pyarrow_1497322770287/work/arrow-46315431aeda3b6968b3ac4c1087f6d41052b99d/python/build/temp.macosx-10.9-x86_64-3.5/lib.cxx:6899)
> pyarrow.lib.ArrowIOError: IOError: Written rows: 2 != expected rows: 3in the current column chunk
> {noformat}
> Note that the number of written rows equals the number of unique values in
> the `strs` level. This has been the case every time I have hit this error
> message.
> I'm happy to write a patch for this, assuming this is a bug and you can
> point me in the right direction.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
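
The observation that the written row count matches the number of unique
values in `strs` can be reproduced on the pandas side alone. The sketch below
reuses the `example` frame from the report and compares `MultiIndex.levels`
(unique values only) with `MultiIndex.get_level_values` (one value per row).
Whether pyarrow's conversion actually reads `levels` is an assumption here,
not a diagnosis. The `reset_index()` step is only a possible workaround,
based on the report's statement that frames with a default index write fine;
it has not been checked against pyarrow 0.4.1.

```python
import pandas as pd

num_rows = 3
example = pd.DataFrame({'strs': ['foo', 'foo', 'bar'],
                        'nums_b': range(num_rows),
                        'nums_a': range(num_rows)})
df = example.set_index(['strs', 'nums_a'])

# `levels` stores each level's unique values only.
unique_strs = df.index.levels[0]
# `get_level_values` expands a level back to one entry per row.
per_row_strs = df.index.get_level_values('strs')

n_unique = len(unique_strs)
n_rows = len(per_row_strs)
print(n_unique, n_rows)

# Possible workaround (assumption, untested with pyarrow 0.4.1): move the
# index levels into ordinary columns before converting, and restore the
# index with set_index(['strs', 'nums_a']) after reading the file back.
flat = df.reset_index()
```

Here `n_unique` is 2 and `n_rows` is 3, the same pair of numbers that appears
in the `Written rows: 2 != expected rows: 3` error.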