[ 
https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961882#comment-16961882
 ] 

Joris Van den Bossche edited comment on ARROW-6999 at 10/29/19 10:45 AM:
-------------------------------------------------------------------------

Thanks for the reproducer! It is indeed caused by the non-range index. Reduced 
to the simpler example, I think the following is equivalent to yours:

{code}
df2 = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 2])
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema)
{code}

which indeed gives that error. In the end, it boils down to the same bug as my 
example above that uses a RangeIndex with {{preserve_index=True}} specified 
(as that forces the index to become a column, just as when you have a 
non-RangeIndex).


> [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own 
> schema
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-6999
>                 URL: https://issues.apache.org/jira/browse/ARROW-6999
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0
>         Environment: pandas==0.23.4
> pyarrow==0.15.0  # Issue also with 0.14.0, 0.13.0 & 0.12.0, but not 0.11.0
>            Reporter: Tom Goodman
>            Priority: Major
>             Fix For: 1.0.0
>
>         Attachments: test3.hdf
>
>
> Steps to reproduce:
>  # Generate any DataFrame's pyarrow Schema using Table.from_pandas
>  # Pass the generated schema as input into Table.from_pandas
>  # Causes KeyError: '__index_level_0__'
> We did not have this issue with pyarrow==0.11.0, which we used to write many 
> partitions across years. Our goal now is to use pyarrow==0.15.0 and produce 
> schemas going forward that are *backwards compatible* (i.e. also have 
> '__index_level_0__'), so that we do not need to re-generate all prior years' 
> partitions when we migrate to 0.15.0.
> We cannot set _preserve_index=False_, since that effectively drops 
> '__index_level_0__', making the schema inconsistent with earlier partitions 
> that were written using pyarrow==0.11.0.
>  
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame() 
> schema = pa.Table.from_pandas(df).schema
> pa_table = pa.Table.from_pandas(df, schema=schema)
> {code}
> {noformat}
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3078, in get_loc
>     return self._engine.get_loc(key)
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 408, in _get_columns_to_convert_given_schema
>     col = df[name]
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2688, in __getitem__
>     return self._getitem_column(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
>  line 2695, in _getitem_column
>     return self._get_item_cache(key)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py",
>  line 2489, in _get_item_cache
>     values = self._data.get(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py",
>  line 4115, in get
>     loc = self.items.get_loc(item)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
>  line 3080, in get_loc
>     return self._engine.get_loc(self._maybe_cast_indexer(key))
>   File "pandas/_libs/index.pyx", line 140, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in 
> pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in 
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/IPython/core/interactiveshell.py",
>  line 3326, in run_code
>     exec(code_obj, self.user_global_ns, self.user_ns)
>   File "<ipython-input-36-6711a2fcec96>", line 5, in <module>
>     pa_table = pa.Table.from_pandas(df, 
> schema=pa.Table.from_pandas(df).schema)
>   File "pyarrow/table.pxi", line 1057, in pyarrow.lib.Table.from_pandas
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 517, in dataframe_to_arrays
>     columns)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 337, in _get_columns_to_convert
>     return _get_columns_to_convert_given_schema(df, schema, preserve_index)
>   File 
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
>  line 426, in _get_columns_to_convert_given_schema
>     "in the columns or index".format(name))
> KeyError: "name '__index_level_0__' present in the specified schema is not 
> found in the columns or index"
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
