[ https://issues.apache.org/jira/browse/ARROW-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney reassigned ARROW-1883:
-----------------------------------

    Assignee: Joris Van den Bossche  (was: Phillip Cloud)

> [Python] BUG: Table.to_pandas metadata checking fails if columns are not present
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-1883
>                 URL: https://issues.apache.org/jira/browse/ARROW-1883
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.7.1
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>
> Found this bug in the example in the pandas documentation
> (http://pandas-docs.github.io/pandas-docs-travis/io.html#parquet), which does:
> {code}
> df = pd.DataFrame({'a': list('abc'),
>                    'b': list(range(1, 4)),
>                    'c': np.arange(3, 6).astype('u1'),
>                    'd': np.arange(4.0, 7.0, dtype='float64'),
>                    'e': [True, False, True],
>                    'f': pd.date_range('20130101', periods=3),
>                    'g': pd.date_range('20130101', periods=3, tz='US/Eastern')})
> df.to_parquet('example_pa.parquet', engine='pyarrow')
> pd.read_parquet('example_pa.parquet', engine='pyarrow', columns=['a', 'b'])
> {code}
> and this raises in the last line reading a subset of columns:
> {code}
> ...
> /home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
>     357     for i, col_meta in enumerate(pandas_metadata['columns']):
>     358         if col_meta['pandas_type'] == 'datetimetz':
> --> 359             col = table[i]
>     360             converted = col.to_pandas()
>     361             tz = col_meta['metadata']['timezone']
> table.pxi in pyarrow.lib.Table.__getitem__()
> table.pxi in pyarrow.lib.Table.column()
> IndexError: Table column index 6 is out of range
> {code}
> This is due to checking the `pandas_metadata` for all columns (and in this
> case trying to deal with a datetime tz column), while in practice not all
> columns are present in this case ('mismatch' between pandas metadata and
> actual schema).
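
To make the index in the error message concrete, here is a small pure-Python model of the positional lookup (no pyarrow involved). It is only a sketch: the metadata order is taken from the DataFrame above with the index column last, the column list of the table that was read back is an assumption, and `positional_tz_indices` is a hypothetical helper, not a pyarrow function.

```python
# Column names in the order they appear in the pandas metadata
# (assumed: DataFrame columns first, then the index column).
metadata_names = ["a", "b", "c", "d", "e", "f", "g", "__index_level_0__"]

# Only 'g' was created with tz=..., so only it is a 'datetimetz' column.
tz_names = {"g"}

# Assumed columns of the table after reading with columns=['a', 'b'].
table_names = ["a", "b", "__index_level_0__"]


def positional_tz_indices(metadata_names, tz_names):
    """Indices that a loop over enumerate(metadata) would pass to table[i]."""
    return [i for i, name in enumerate(metadata_names) if name in tz_names]


indices = positional_tz_indices(metadata_names, tz_names)

# Any index at or beyond the number of table columns cannot be looked up.
out_of_range = [i for i in indices if i >= len(table_names)]

print(indices, out_of_range)
```

The metadata position of 'g' is 6, which matches the index reported in the IndexError, while the table that was read back has far fewer columns.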
> A smaller example without parquet:
> {code}
> In [38]: df = pd.DataFrame({'a': [1, 2, 3], 'b': pd.date_range("2017-01-01", periods=3, tz='Europe/Brussels')})
> In [39]: table = pyarrow.Table.from_pandas(df)
> In [40]: table
> Out[40]:
> pyarrow.Table
> a: int64
> b: timestamp[ns, tz=Europe/Brussels]
> __index_level_0__: int64
> metadata
> --------
> {b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
>             b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
>             b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
>             b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
>             b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
>             b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
>             b': "0.22.0.dev0+277.gd61f411"}'}
> In [41]: table.to_pandas()
> Out[41]:
>    a                         b
> 0  1 2017-01-01 00:00:00+01:00
> 1  2 2017-01-02 00:00:00+01:00
> 2  3 2017-01-03 00:00:00+01:00
> In [44]: table_without_tz = table.remove_column(1)
> In [45]: table_without_tz
> Out[45]:
> pyarrow.Table
> a: int64
> __index_level_0__: int64
> metadata
> --------
> {b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
>             b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
>             b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
>             b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
>             b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
>             b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
>             b': "0.22.0.dev0+277.gd61f411"}'}
> In [46]: table_without_tz.to_pandas()  # <------ wrong output !
> Out[46]:
>                                      a
> 1970-01-01 01:00:00+01:00            1
> 1970-01-01 01:00:00.000000001+01:00  2
> 1970-01-01 01:00:00.000000002+01:00  3
> In [47]: table_without_tz2 = table_without_tz.remove_column(1)
> In [48]: table_without_tz2
> Out[48]:
> pyarrow.Table
> a: int64
> metadata
> --------
> {b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
>             b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
>             b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
>             b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
>             b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
>             b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
>             b': "0.22.0.dev0+277.gd61f411"}'}
> In [49]: table_without_tz2.to_pandas()  # <------ error !
> ---------------------------------------------------------------------------
> IndexError                                Traceback (most recent call last)
> <ipython-input-49-c82f33476c6b> in <module>()
> ----> 1 table_without_tz2.to_pandas()
> table.pxi in pyarrow.lib.Table.to_pandas()
> /home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, memory_pool, nthreads)
>     289         pandas_metadata = json.loads(metadata[b'pandas'].decode('utf8'))
>     290         index_columns = pandas_metadata['index_columns']
> --> 291         table = _add_any_metadata(table, pandas_metadata)
>     292
>     293     block_table = table
> /home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
>     357     for i, col_meta in enumerate(pandas_metadata['columns']):
>     358         if col_meta['pandas_type'] == 'datetimetz':
> --> 359             col = table[i]
>     360             converted = col.to_pandas()
>     361             tz = col_meta['metadata']['timezone']
> table.pxi in pyarrow.lib.Table.__getitem__()
> table.pxi in pyarrow.lib.Table.column()
> IndexError: Table column index 1 is out of range
> {code}
> The reason is that `_add_any_metadata` does not check if the column it is
> processing (currently only datetime tz columns need such processing) is
> actually present in the schema.
> Working on a fix, will submit a PR.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
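
One way to picture the missing check is to look columns up by name in the actual schema instead of by their position in the metadata. The sketch below is pure Python and is not the code from the eventual PR: `tz_columns_present` is a hypothetical helper, the schema is modelled as a plain list of names, and the metadata dict is a trimmed copy of the one printed in the smaller example.

```python
def tz_columns_present(schema_names, pandas_metadata):
    """Return (position, timezone) for each datetimetz column found in the schema.

    Columns listed in the metadata but absent from the schema are skipped,
    and positions refer to the schema, not to the metadata list.
    """
    positions = {name: i for i, name in enumerate(schema_names)}
    found = []
    for col_meta in pandas_metadata["columns"]:
        if col_meta["pandas_type"] != "datetimetz":
            continue
        idx = positions.get(col_meta["name"])
        if idx is None:
            continue  # column was dropped from the table; nothing to convert
        found.append((idx, col_meta["metadata"]["timezone"]))
    return found


# Trimmed copy of the metadata shown in the example above.
pandas_metadata = {
    "columns": [
        {"name": "a", "pandas_type": "int64", "metadata": None},
        {"name": "b", "pandas_type": "datetimetz",
         "metadata": {"timezone": "Europe/Brussels"}},
        {"name": "__index_level_0__", "pandas_type": "int64", "metadata": None},
    ],
    "index_columns": ["__index_level_0__"],
}

full = tz_columns_present(["a", "b", "__index_level_0__"], pandas_metadata)
without_tz = tz_columns_present(["a", "__index_level_0__"], pandas_metadata)
without_tz2 = tz_columns_present(["a"], pandas_metadata)

print(full, without_tz, without_tz2)
```

With a name-based lookup, the table without 'b' yields no columns to convert, so the index column is no longer mistaken for the tz column (the wrong output in `In [46]`) and no out-of-range index is requested (the error in `In [49]`).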