[jira] [Assigned] (ARROW-1132) [Python] Unable to write pandas DataFrame w/MultiIndex containing duplicate values to parquet

Phillip Cloud (JIRA) Thu, 22 Jun 2017 08:28:19 -0700

     [ https://issues.apache.org/jira/browse/ARROW-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Phillip Cloud reassigned ARROW-1132:
------------------------------------

    Assignee: Phillip Cloud

> [Python] Unable to write pandas DataFrame w/MultiIndex containing duplicate 
> values to parquet
> ---------------------------------------------------------------------------------------------
>
>                 Key: ARROW-1132
>                 URL: https://issues.apache.org/jira/browse/ARROW-1132
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.4.1
>         Environment: OS X, miniconda, using the pyarrow build from conda-forge
>            Reporter: Ben Mabey
>            Assignee: Phillip Cloud
>
> pandas DataFrames with a `MultiIndex` always seem to convert to a `Table` 
> without problems. However, when writing the `Table` to disk using 
> `pyarrow.parquet`, I am unable to write DataFrames whose `MultiIndex` 
> contains a level with duplicate values (which is nearly always the case for 
> me). Here is a Python example with working cases and a failure case at the 
> bottom:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> num_rows = 3
> example = pd.DataFrame({'strs': ['foo', 'foo', 'bar'],
>                         'nums_b': range(num_rows),
>                         'nums_a': range(num_rows)})
> def pq_write(df):
>     table = pa.Table.from_pandas(df)
>     pq.write_table(table, '/tmp/df.parquet')
> # single index works
> pq_write(example)
> pq_write(example.set_index(['nums_b']))
> # single index with duplicate values works
> pq_write(example.set_index(['strs']))
> # MultiIndex whose levels each contain only unique values works
> pq_write(example.set_index(['nums_b', 'nums_a']))
> # MultiIndex in which one level contains duplicate values FAILS
> pq_write(example.set_index(['strs', 'nums_a']))
> {code}
> {noformat}
> Traceback (most recent call last):
>   File "test_arrow.py", line 26, in <module>
>     pq_write(example.set_index(['strs', 'nums_a']))
>   File "test_arrow.py", line 13, in pq_write
>     pq.write_table(table, '/tmp/df.parquet')
>   File "/Users/bmabey/anaconda/envs/test_pyarrow/lib/python3.5/site-packages/pyarrow/parquet.py", line 702, in write_table
>     writer.write_table(table, row_group_size=row_group_size)
>   File "pyarrow/_parquet.pyx", line 609, in pyarrow._parquet.ParquetWriter.write_table (/Users/travis/miniconda3/conda-bld/pyarrow_1497322770287/work/arrow-46315431aeda3b6968b3ac4c1087f6d41052b99d/python/build/temp.macosx-10.9-x86_64-3.5/_parquet.cxx:11025)
>   File "pyarrow/error.pxi", line 60, in pyarrow.lib.check_status (/Users/travis/miniconda3/conda-bld/pyarrow_1497322770287/work/arrow-46315431aeda3b6968b3ac4c1087f6d41052b99d/python/build/temp.macosx-10.9-x86_64-3.5/lib.cxx:6899)
> pyarrow.lib.ArrowIOError: IOError: Written rows: 2 != expected rows: 3in the current column chunk
> {noformat}
> Note that the number of written rows equals the number of unique values in 
> the `strs` level. This has always been the case when I've hit this error 
> message.
> I'm happy to write a patch for this, assuming this is a bug and you can 
> point me in the right direction.
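
The observation that the written row count matches the number of unique values
in `strs` is consistent with how pandas stores a `MultiIndex`: each level keeps
only its distinct values, and separate integer codes map every row to one of
them. The following is a minimal sketch, using pandas only, that shows the two
lengths diverging for the failing case. Whether pyarrow 0.4.1 actually reads
the level this way is an assumption; the report does not confirm it.

```python
import pandas as pd

num_rows = 3
example = pd.DataFrame({'strs': ['foo', 'foo', 'bar'],
                        'nums_b': range(num_rows),
                        'nums_a': range(num_rows)})

# Same index as the failing case in the report.
index = example.set_index(['strs', 'nums_a']).index

# Distinct values stored for the first level (one entry per unique string).
unique_level = index.levels[0]

# Values of the first level expanded to one entry per row.
per_row_values = index.get_level_values(0)

print(len(unique_level))    # 2: distinct values in 'strs'
print(len(per_row_values))  # 3: rows in the DataFrame
```

Code that serializes `index.levels[i]` instead of `index.get_level_values(i)`
would produce a column shorter than the table whenever a level repeats a
value, which would match the "Written rows: 2 != expected rows: 3" message.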


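Since the report shows that a DataFrame with a default index writes
successfully, one possible stopgap is to move the index levels into ordinary
columns before writing and to restore them after reading. This is only a
sketch of that idea: the helper names `flatten_index` and `restore_index` are
hypothetical, every index level is assumed to be named, and the Parquet write
itself is left out so the snippet runs with pandas alone.

```python
import pandas as pd


def flatten_index(df):
    """Move all index levels into regular columns (hypothetical helper).

    Returns the flattened DataFrame and the original index level names.
    Assumes every index level has a name.
    """
    index_names = list(df.index.names)
    return df.reset_index(), index_names


def restore_index(flat_df, index_names):
    """Rebuild the original index from the named columns (hypothetical helper)."""
    return flat_df.set_index(index_names)


num_rows = 3
example = pd.DataFrame({'strs': ['foo', 'foo', 'bar'],
                        'nums_b': range(num_rows),
                        'nums_a': range(num_rows)})

indexed = example.set_index(['strs', 'nums_a'])

# `flat` has a default index, like the first working case in the report,
# and is what would be handed to pa.Table.from_pandas and pq.write_table.
flat, names = flatten_index(indexed)

# After reading the file back, the MultiIndex can be rebuilt from the columns.
restored = restore_index(flat, names)
```

The index names have to be kept somewhere alongside the file, because the
flattened frame no longer records which columns were index levels.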

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
